github ggml-org/llama.cpp b9370


hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647)

  • hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now

  • hmx-mm: add support for Q4_1

  • hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot

  • hexagon: fix repack scratch buffer overflow

  • hex-mm: fix Q4_1 repack buffer sizing

  • hexagon: flip the build order for mm and fa (seems to help LTO)

  • hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1

  • hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output

  • hexagon: resurrect early-wake and add support for polling for op-batch completions

With Q4_1 support, ggml-hexagon now claims pretty much entire graphs, which gives the CPU more time to relax.
This is a good thing, but it does add extra latency in pure benchmark runs.
Early wakeup helps recover some of that latency in normal runs; op-batch polling is meant for benchmarking only.
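
The Q8_1 item in the list above can be illustrated with a small scalar model. In ggml, a Q4_1 block stores a scale `d` and a minimum `m`, so each weight is roughly `d * q + m`, while a Q8_1 block stores its scale `d` together with `s = d * sum(q)`. Because `s` is computed once when the activations are quantized, the `m` term of the dot product becomes a single multiply instead of a per-block sum inside the kernel. The sketch below is a plain C reference for that identity only, not the HVX/HMX code from this change; the struct names, the use of `float` instead of fp16, and the one-quant-per-byte layout are simplifying assumptions.

```c
#include <stdint.h>

#define QK 32  /* elements per block, as in ggml's Q4_1 / Q8_1 */

/* Simplified stand-ins for ggml's block_q4_1 / block_q8_1.
 * Assumptions: float instead of fp16, one 4-bit value per byte
 * instead of packed nibbles. */
typedef struct {
    float   d;      /* scale */
    float   m;      /* minimum (offset) */
    uint8_t q[QK];  /* quants in 0..15; x[i] ~= d * q[i] + m */
} q4_1_block;

typedef struct {
    float  d;       /* scale */
    float  s;       /* d * sum(q[i]), precomputed at quantization time */
    int8_t q[QK];   /* quants in -127..127; y[i] ~= d * q[i] */
} q8_1_block;

/* Dynamically quantize one block of activations to Q8_1, caching the sum. */
static void quantize_q8_1(const float *y, q8_1_block *out) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float a = y[i] < 0.0f ? -y[i] : y[i];
        if (a > amax) amax = a;
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    int sum = 0;
    for (int i = 0; i < QK; i++) {
        float v = y[i] * id;
        out->q[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f);  /* round to nearest */
        sum += out->q[i];
    }
    out->d = d;
    out->s = d * (float)sum;
}

/* Block dot product:
 *   sum_i (d4*q4[i] + m) * (d8*q8[i])
 *     = d4*d8 * sum_i q4[i]*q8[i]  +  m * s8
 * The second term needs no loop because s8 is already stored. */
static float vec_dot_q4_1_q8_1(const q4_1_block *x, const q8_1_block *y) {
    int sumi = 0;
    for (int i = 0; i < QK; i++) {
        sumi += (int)x->q[i] * (int)y->q[i];
    }
    return x->d * y->d * (float)sumi + x->m * y->s;
}
```

With Q8_0 activations the kernel would have to accumulate `sum(q8[i])` itself to apply the `m` offset; storing it in the activation block moves that work to the quantization step, which runs once per activation block rather than once per weight row.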


Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)
