ggml-org/llama.cpp b7845


ggml-cpu: aarch64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (#18888)

  • Boilerplate for q6_K repack

  • q6_K repack to q6_Kx8 implementation (layout sketched after this list)

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

  • q6_K generic gemv and gemm

  • WIP: gemm_q6_K 8x8

  • Still WIP: loading of q8s, q6h and q6l

  • First working version of q6_K gemm (the i8mm building block is sketched after this list)

  • Moved q6 loads outside of the sb block; unrolled the inner loop

  • Replaced modulo with mask (sketched after this list)

  • First implementation of GEMV

  • ggml_vdotq_s32 -> vdotq_s32

  • Reduce width of accumulators in q6_K gemv

  • Use bsums instead of calculating the bias (sketched after this list). Preload scales to use vget_lane. Unroll.

  • Reuse scales in GEMM (same optimization as in GEMV)

  • Added TODOs for bsum and a different qh repack

  • Arch fallback

  • VSLIQ for merging qh and ql (sketched after this list)

  • Removed TODO, already tested

  • Apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

  • Removed unused import

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
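
The q6_Kx8 repack interleaves the blocks of 8 consecutive weight rows so that the GEMM micro-kernel can load operands for all 8 rows contiguously. A minimal sketch of what such a layout can look like, following the pattern of ggml's existing x8 repack types; the x8 field grouping is an assumption, not the actual ggml-cpu definition:

```c
#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half;

// Standard q6_K block: 256 quants, 6 bits each, split into low and high bits.
typedef struct {
    uint8_t   ql[QK_K / 2];      // low 4 bits of each quant
    uint8_t   qh[QK_K / 4];      // high 2 bits of each quant
    int8_t    scales[QK_K / 16]; // per-16-element sub-block scales
    ggml_half d;                 // super-block scale
} block_q6_K;

// Hypothetical repacked block covering the same column range of 8 consecutive
// rows, so one contiguous load feeds all 8 rows of the GEMM micro-kernel.
typedef struct {
    uint8_t   ql[8 * QK_K / 2];
    uint8_t   qh[8 * QK_K / 4];
    int8_t    scales[8 * QK_K / 16];
    ggml_half d[8];              // one super-block scale per row
} block_q6_Kx8;
```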
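The i8mm in the title refers to the Armv8.6 int8 matrix-multiply extension: its SMMLA instruction (the vmmlaq_s32 intrinsic) takes two registers, each holding two 8-byte rows, and accumulates all four pairwise dot products into a 2x2 int32 tile. A self-contained sketch of that building block, assuming plain row-major int8 inputs rather than the PR's repacked layout:

```c
#include <arm_neon.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)  // compile with -march=armv8.2-a+i8mm
// Computes a 2x2 int32 tile C = A (2xK) * B^T (2xK), K a multiple of 8.
static void gemm_2x2_i8mm(const int8_t *a, const int8_t *b, int k,
                          int32_t c[4]) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < k; i += 8) {
        // Pack 8 values from each of the two rows into one 16-byte register.
        int8x16_t va = vcombine_s8(vld1_s8(a + 0 * k + i),
                                   vld1_s8(a + 1 * k + i));
        int8x16_t vb = vcombine_s8(vld1_s8(b + 0 * k + i),
                                   vld1_s8(b + 1 * k + i));
        acc = vmmlaq_s32(acc, va, vb);  // SMMLA: accumulate 2x8 * 8x2 -> 2x2
    }
    vst1q_s32(c, acc);  // c = {A0.B0, A0.B1, A1.B0, A1.B1}
}
#endif
```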
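"Replaced modulo with mask" is the standard power-of-two trick, sketched here with an assumed period of 8:

```c
// Index wrap for a power-of-two period: i % 8 and i & 7 agree for i >= 0,
// and the mask form avoids an integer division in the hot loop.
static inline int wrap8(int i) {
    return i & 7;  // same as i % 8 for non-negative i
}
```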
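"Bsums instead of calc bias" relies on q6_K storing its quants with an implicit -32 offset (x = d * sc * (q - 32)) and on block_q8_K carrying per-sub-block sums of the activations in its bsums field. Since dot(q - 32, y) = dot(q, y) - 32 * sum(y), the bias can be folded in once per sub-block instead of being subtracted per element. A scalar sketch under an assumed flat parameter layout (the final d6 * d8 float scaling is omitted):

```c
#include <stdint.h>

#define QK_K 256

static int32_t q6K_dot_bsums(const uint8_t *q6,     // 256 merged 6-bit quants, no -32 applied
                             const int8_t  *q8,     // 256 q8_K quants
                             const int8_t  *scales, // 16 sub-block scales (q6_K)
                             const int16_t *bsums)  // 16 sub-block sums of q8 (q8_K)
{
    int32_t acc = 0, bias = 0;
    for (int sb = 0; sb < QK_K / 16; ++sb) {
        int32_t dot = 0;
        for (int j = 0; j < 16; ++j) {
            dot += q6[sb * 16 + j] * q8[sb * 16 + j];  // raw quants in the inner loop
        }
        acc  += scales[sb] * dot;
        bias += scales[sb] * bsums[sb];  // bsums[sb] = sum of the q8 sub-block
    }
    return acc - 32 * bias;  // fold the q6_K offset in once at the end
}
```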
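"VSLIQ for merging qh and ql" uses the NEON shift-left-and-insert intrinsic to combine the two bit halves of a 6-bit quant in one instruction: SLI overwrites the destination's high bits with the shifted source while keeping the destination's low 4 bits, so ql needs no masking. A sketch, assuming ql carries the low nibble and qh the 2 high bits in its low bits:

```c
#include <arm_neon.h>

static inline uint8x16_t merge_q6(uint8x16_t ql, uint8x16_t qh) {
    uint8x16_t hi = vandq_u8(qh, vdupq_n_u8(0x03));  // keep only the 2 high bits
    // SLI keeps bits [3:0] of ql and inserts hi << 4 into bits [7:4],
    // yielding (hi << 4) | (ql & 0x0F) without masking ql.
    return vsliq_n_u8(ql, hi, 4);
}
```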

Release binaries are available for macOS/iOS, Linux, Windows, and openEuler.
