github ggml-org/llama.cpp b9470

5 hours ago

hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989)

  • hex-mm: initial support for F32 * F32 -> F32 matmuls

  • hex-rms-norm: fix src1 stride use in fused rms_norm_mul

  • hex-ops: clear spad pointers in the ops that clobber them

This fixes an odd case where fused rms-norm-mul was failing, but only in qwen3.5-2B and only at certain op-batch sizes.

  • hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX

Decided to use the Q4_0 * F32 -> F32 matmul path for this.
Q4_0 gets dequantized and tiled into F16; here we convert F32 to F16 and tile it the same way.
Super simple and pretty efficient.
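
To make the "convert and tile into F16" step concrete, here is a minimal sketch in plain C. The release notes do not describe the real HMX tile size or layout, so everything here is an assumption for illustration: the function names `f32_to_f16` and `pack_f32_to_f16_tiles`, the caller-chosen square tile size, the row-major order inside and across tiles, and the zero padding of edge tiles. The scalar conversion also flushes values below the F16 normal range to zero, which is a simplification.

```c
#include <stdint.h>
#include <string.h>

/* Convert one F32 value to IEEE binary16 bits (round to nearest even).
 * Simplification for this sketch: results below the F16 normal range
 * are flushed to signed zero instead of producing subnormals. */
static uint16_t f32_to_f16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);                    /* type-pun without UB */
    uint32_t sign  = (x >> 16) & 0x8000u;
    uint32_t exp32 = (x >> 23) & 0xFFu;
    uint32_t mant  = x & 0x7FFFFFu;

    if (exp32 == 0xFFu) {                        /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0u));
    }
    int32_t exp16 = (int32_t)exp32 - 127 + 15;   /* re-bias the exponent */
    if (exp16 >= 31) return (uint16_t)(sign | 0x7C00u);  /* overflow -> Inf */
    if (exp16 <= 0)  return (uint16_t)sign;              /* flush to zero */

    uint32_t half = ((uint32_t)exp16 << 10) | (mant >> 13);
    uint32_t rem  = mant & 0x1FFFu;              /* the 13 dropped bits */
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) {
        half++;                                  /* a carry into the exponent is fine */
    }
    return (uint16_t)(sign | half);
}

/* Pack a row-major rows x cols F32 matrix into tile x tile F16 tiles.
 * Hypothetical layout: tiles stored in row-major order, each tile
 * row-major, edge tiles zero-padded. dst must hold
 * ceil(rows/tile) * ceil(cols/tile) * tile * tile elements. */
static void pack_f32_to_f16_tiles(const float *src, int rows, int cols,
                                  int tile, uint16_t *dst) {
    int tiles_r = (rows + tile - 1) / tile;
    int tiles_c = (cols + tile - 1) / tile;
    for (int tr = 0; tr < tiles_r; tr++) {
        for (int tc = 0; tc < tiles_c; tc++) {
            for (int r = 0; r < tile; r++) {
                for (int c = 0; c < tile; c++) {
                    int sr = tr * tile + r;
                    int sc = tc * tile + c;
                    float v = (sr < rows && sc < cols) ? src[sr * cols + sc] : 0.0f;
                    *dst++ = f32_to_f16(v);
                }
            }
        }
    }
}
```

Once both operand types end up in the same tiled F16 form, a single kernel can consume either, which appears to be the simplification the commit message is pointing at.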

  • hmx-mm: route f16 2D matmuls through the same kernel used for all other types

  • hmx-mm: re-introduce the pipelined vs non-pipelined modes that we used to have, but in a much more generic way

This update further improves matmul performance and at the same time removes most of the redundant logic
we had in different paths.

  • hmx-fa: slightly improved pipeline, similar to the matmul updates

  • hmx-mm: initial version of MUL_MAT_ID support for HMX

  • hmx-mm: fixed mxfp4 handling for MUL_MAT_ID

  • hex-gdn: optimize GATED_DELTA_NET

DMA prefetch/double-buffering, vectorize everything with HVX, in other words -- the usual :)
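
For readers unfamiliar with the pattern, the prefetch/double-buffer idea can be sketched in portable C as below. This is not the backend's code: `dma_start` and `dma_wait` are hypothetical stand-ins (a synchronous `memcpy` and a no-op) for an asynchronous DMA engine, `CHUNK` is an arbitrary size, and the sum-of-squares loop is only a placeholder for the real per-chunk work, which the notes say is vectorized with HVX.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per local buffer; arbitrary for this sketch */

/* Hypothetical stand-ins for an async DMA engine. A real engine would
 * return from dma_start immediately and finish the copy in the background. */
static void dma_start(float *dst, const float *src, size_t n) {
    memcpy(dst, src, n * sizeof(float));
}
static void dma_wait(void) { /* real code would block until the copy lands */ }

/* Sum of squares over src[0..n), processed chunk by chunk with two
 * local buffers: compute on one while the other is being filled. */
static float sum_squares_double_buffered(const float *src, size_t n) {
    float buf[2][CHUNK];
    float acc = 0.0f;
    size_t n_chunks = (n + CHUNK - 1) / CHUNK;
    if (n_chunks == 0) return acc;

    /* Prime the pipeline with the first chunk. */
    dma_start(buf[0], src, n < CHUNK ? n : CHUNK);

    for (size_t i = 0; i < n_chunks; i++) {
        size_t off = i * CHUNK;
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;

        dma_wait();  /* chunk i is now resident in buf[i & 1] */

        /* Kick off the fetch of chunk i + 1 into the other buffer
         * before touching chunk i, so copy and compute can overlap. */
        if (i + 1 < n_chunks) {
            size_t noff = off + CHUNK;
            size_t nlen = (n - noff < CHUNK) ? n - noff : CHUNK;
            dma_start(buf[(i + 1) & 1], src + noff, nlen);
        }

        const float *cur = buf[i & 1];
        for (size_t j = 0; j < len; j++) {
            acc += cur[j] * cur[j];
        }
    }
    return acc;
}
```

With a truly asynchronous engine, the time spent in the inner loop hides the latency of the next transfer; with the synchronous stand-ins above, the code only demonstrates the buffer rotation.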

  • hmx-mm: missed one more case where we can use fastmod

  • hexagon: update DCVS settings for a slight perf bump

  • hmx-fa: use fastdiv in hmx-flash-attn

  • hmx-fa: precompute slope values to avoid disrupting the inner loop

  • hvx-utils/fa: new HVX helpers for powf and logf, used to speed up FA ALiBi

  • hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty

  • hex-fa: correctly fall back to HVX if we have sinks or the dims are not quite right
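
Several of the items above mention fastdiv and fastmod. The notes do not show how the Hexagon backend implements them; the sketch below is one common formulation (a precomputed multiplier plus a shift, in the style of Granlund and Montgomery) and should be read as an illustration only. The names `fastdiv_t`, `fastdiv_init`, `fastdiv` and `fastmod` are hypothetical. The intermediate sum is kept in 64 bits so the result is exact for every 32-bit numerator and every non-zero 32-bit divisor.

```c
#include <stdint.h>

/* Precomputed constants for dividing by a fixed, non-zero 32-bit divisor. */
typedef struct {
    uint32_t mp;     /* magic multiplier */
    uint32_t shift;  /* L = ceil(log2(d)) */
    uint32_t d;      /* the divisor itself, kept for fastmod */
} fastdiv_t;

/* One-time setup per divisor (done outside the hot loop). d must be != 0. */
static fastdiv_t fastdiv_init(uint32_t d) {
    fastdiv_t f;
    uint32_t L = 0;
    while (L < 32 && ((uint64_t)1 << L) < d) {
        L++;
    }
    /* mp = floor(2^32 * (2^L - d) / d) + 1, which always fits in 32 bits */
    f.mp    = (uint32_t)((((uint64_t)1 << 32) * (((uint64_t)1 << L) - d)) / d + 1);
    f.shift = L;
    f.d     = d;
    return f;
}

/* n / d using one multiply, one add and one shift. */
static uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    uint64_t hi = ((uint64_t)n * f.mp) >> 32;   /* high half of n * mp */
    return (uint32_t)((hi + n) >> f.shift);     /* 64-bit add avoids overflow */
}

/* n % d derived from the quotient. */
static uint32_t fastmod(uint32_t n, fastdiv_t f) {
    return n - fastdiv(n, f) * f.d;
}
```

The payoff comes when the same divisor (for example a tensor dimension) is reused across many index computations: `fastdiv_init` runs once, and the inner loop avoids hardware division entirely.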

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)
