ggml-org/llama.cpp b9470 on GitHub

hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989)

hex-mm: initial support for F32 * F32 -> F32 matmuls
hex-rms-norm: fix src1 stride use in fused rms_norm_mul
hex-ops: clear spad pointers in the ops that clobber it

This fixes an odd case where fused rms-norm-mul was failing, but only in qwen3.5-2B and only at certain op-batch sizes.

hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX

Decided to use Q4_0 * F32 -> F32 matmul for this.
Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16.
Super simple and pretty efficient.
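As a rough illustration of the F32 to F16 convert-and-tile step described above, here is a minimal portable C sketch. It is not the backend's code: the 32x32 tile size, the row-major layout inside each tile, and the names `f32_to_f16` and `f32_to_f16_tiles` are assumptions made for the example, and the real HMX tile layout is likely different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TILE = 32 }; /* hypothetical tile edge, not taken from the source */

/* Convert one IEEE 754 binary32 value to binary16 bits (round to nearest even). */
static uint16_t f32_to_f16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t exp  = (x >> 23) & 0xFFu;
    uint32_t mant = x & 0x7FFFFFu;

    if (exp == 0xFFu) { /* inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0u));
    }
    int32_t e = (int32_t)exp - 127 + 15; /* re-biased exponent */
    if (e >= 31) {
        return (uint16_t)(sign | 0x7C00u); /* overflow to infinity */
    }
    if (e <= 0) { /* result is subnormal or zero in F16 */
        if (e < -10) {
            return (uint16_t)sign;
        }
        mant |= 0x800000u; /* restore the implicit leading 1 */
        uint32_t shift   = (uint32_t)(14 - e);
        uint32_t half    = mant >> shift;
        uint32_t rem     = mant & ((1u << shift) - 1u);
        uint32_t halfway = 1u << (shift - 1u);
        if (rem > halfway || (rem == halfway && (half & 1u))) {
            half++;
        }
        return (uint16_t)(sign | half);
    }
    uint32_t half = ((uint32_t)e << 10) | (mant >> 13);
    uint32_t rem  = mant & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) {
        half++; /* a carry into the exponent field is the correct result */
    }
    return (uint16_t)(sign | half);
}

/* Repack a row-major rows x cols F32 matrix into zero-padded TILE x TILE
 * F16 tiles. Tiles are stored row-major, and so are elements within a tile.
 * dst must hold ceil(rows/TILE) * ceil(cols/TILE) * TILE * TILE elements. */
static void f32_to_f16_tiles(const float *src, size_t rows, size_t cols, uint16_t *dst) {
    assert(src != NULL && dst != NULL);
    size_t tiles_r = (rows + TILE - 1) / TILE;
    size_t tiles_c = (cols + TILE - 1) / TILE;
    for (size_t tr = 0; tr < tiles_r; tr++) {
        for (size_t tc = 0; tc < tiles_c; tc++) {
            uint16_t *tile = dst + (tr * tiles_c + tc) * TILE * TILE;
            for (size_t r = 0; r < TILE; r++) {
                for (size_t c = 0; c < TILE; c++) {
                    size_t sr = tr * TILE + r;
                    size_t sc = tc * TILE + c;
                    tile[r * TILE + c] = (sr < rows && sc < cols)
                        ? f32_to_f16(src[sr * cols + sc])
                        : 0;
                }
            }
        }
    }
}
```

Once both operands are in the same tiled F16 form, a single matmul kernel can consume them regardless of the original source type, which is presumably why reusing the existing path is attractive.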

hmx-mm: route f16 2D matmuls through the same kernel used for all other types
hmx-mm: re-introduce the pipelined vs non-pipelined modes that we used to have, but in a much more generic way

This update further improves matmul performance and at the same time removes most of the redundant logic
we had in different paths.

hmx-fa: slightly improved pipeline similar to matmul updates
hmx-mm: initial version of MUL_MAT_ID support for HMX
hmx-mm: fixed mxfp4 handling for MUL_MAT_ID
hex-gdn: optimize GATED_DELTA_NET

DMA prefetch/double-buffering, vectorize everything with HVX, in other words -- the usual :)
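The prefetch/double-buffering pattern mentioned above can be sketched in plain C as follows. This only shows the control flow under stated assumptions: `dma_start` and `dma_wait` are hypothetical stand-ins implemented with a synchronous `memcpy`, so nothing actually overlaps here, and the chunk size and the summing "kernel" are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 64 }; /* hypothetical number of floats per transfer */

/* Hypothetical stand-ins for an asynchronous DMA engine. On real hardware
 * dma_start would only queue the copy and dma_wait would block until done. */
static void dma_start(float *dst, const float *src, size_t n) {
    memcpy(dst, src, n * sizeof(float));
}
static void dma_wait(void) {}

/* Number of valid elements in chunk i of an n-element array. */
static size_t chunk_len(size_t n, size_t i) {
    size_t off = i * CHUNK;
    return (n - off < CHUNK) ? (n - off) : (size_t)CHUNK;
}

/* Sum an array while the next chunk is fetched into the other buffer. */
static float sum_double_buffered(const float *src, size_t n) {
    assert(src != NULL || n == 0);
    float  buf[2][CHUNK];
    size_t n_chunks = (n + CHUNK - 1) / CHUNK;
    float  acc = 0.0f;

    if (n_chunks == 0) {
        return acc;
    }
    dma_start(buf[0], src, chunk_len(n, 0)); /* prime the pipeline */

    for (size_t i = 0; i < n_chunks; i++) {
        dma_wait(); /* chunk i is now resident in buf[i & 1] */
        if (i + 1 < n_chunks) {
            /* Kick off the next fetch before computing on the current chunk. */
            dma_start(buf[(i + 1) & 1], src + (i + 1) * CHUNK, chunk_len(n, i + 1));
        }
        const float *cur = buf[i & 1];
        size_t len = chunk_len(n, i);
        for (size_t j = 0; j < len; j++) {
            acc += cur[j];
        }
    }
    return acc;
}
```

With a truly asynchronous engine, the compute loop on chunk `i` runs while chunk `i + 1` is in flight, which hides most of the memory latency as long as compute time per chunk is at least the transfer time.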

hmx-mm: missed one more case where we can use fastmod
hexagon: update DCVS settings for a slight perf bump
hmx-fa: use fastdiv in hmx-flash-attn
hmx-fa: precompute slope values to avoid disrupting the inner loop
hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi
hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty
hex-fa: correctly fall back to HVX if we have sinks or the dims are not quite right
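For readers unfamiliar with the fastdiv/fastmod entries above, the general idea is to replace division by a loop-invariant divisor with a multiply and a shift. The sketch below shows one common multiply-shift formulation for 32-bit unsigned values. It is an assumption that the Hexagon backend uses something of this shape; the type and function names (`fastdiv_t`, `fastdiv_init`, `fastdiv`, `fastmod`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed constants for dividing by a fixed divisor d (d >= 1). */
typedef struct {
    uint32_t mp;    /* magic multiplier */
    uint32_t shift; /* ceil(log2(d)) */
    uint32_t d;     /* the divisor itself, kept for fastmod */
} fastdiv_t;

/* One-time setup, done outside the hot loop. */
static fastdiv_t fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t shift = 0;
    while (shift < 32 && ((uint64_t)1 << shift) < d) {
        shift++;
    }
    /* mp = floor(2^32 * (2^shift - d) / d) + 1, which always fits in 32 bits */
    uint64_t mp = (((uint64_t)1 << 32) * (((uint64_t)1 << shift) - d)) / d + 1;
    fastdiv_t fd = { (uint32_t)mp, shift, d };
    return fd;
}

/* n / d using a multiply, an add and a shift instead of a divide. */
static uint32_t fastdiv(uint32_t n, fastdiv_t fd) {
    uint64_t hi = ((uint64_t)n * fd.mp) >> 32; /* high half of the product */
    return (uint32_t)((hi + n) >> fd.shift);   /* 64-bit sum avoids overflow */
}

/* n % d derived from the fast quotient. */
static uint32_t fastmod(uint32_t n, fastdiv_t fd) {
    return n - fastdiv(n, fd) * fd.d;
}
```

A typical use is splitting a flat element index into coordinates inside an inner loop: with `fastdiv_t fd = fastdiv_init(row_len);` computed once, `fastdiv(idx, fd)` gives the row and `fastmod(idx, fd)` gives the column.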

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Android:

Android arm64 (CPU)

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)
