github ggml-org/llama.cpp b7652

Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611)

  • hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

  • hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx

  • hexagon: add support for SCALE fp32

  • hexagon: replace scalar fp32 -> fp16 copy with HVX

  • hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

  • Implements double-buffered DMA prefetching for K, V, and Mask tensors (see the sketch after this list).
  • Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
  • Correctly synchronizes DMA transfers to prevent race conditions.
  • Uses FLASH_ATTN_BLOCK_SIZE of 128 for efficient chunking.

  • hexagon: use aligned mad_f16
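
A minimal sketch of the double-buffered prefetch idea, in plain C. Everything here is illustrative: the callback types stand in for the backend's user-DMA queue, and the two-slot VTCM layout is an assumption about how the K/V/Mask blocks (FLASH_ATTN_BLOCK_SIZE rows each, padded to 128 bytes) might be staged.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical callback types standing in for the real user-DMA helpers.
typedef void (*dma_start_fn)(void *vtcm_slot, int block_idx); // kick off an async K/V/Mask copy
typedef void (*dma_wait_fn) (int slot);                       // wait until that slot's transfer is done
typedef void (*process_fn)  (const void *vtcm_slot, int block_idx);

// While the HVX kernel works on block b in one VTCM slot, the DMA engine
// fills the other slot with block b+1.
static void run_double_buffered(uint8_t *vtcm, size_t slot_bytes, int n_blocks,
                                dma_start_fn dma_start, dma_wait_fn dma_wait,
                                process_fn process) {
    uint8_t *slot[2] = { vtcm, vtcm + slot_bytes };

    if (n_blocks <= 0) {
        return;
    }
    dma_start(slot[0], 0);                  // prime the pipeline with block 0

    for (int b = 0; b < n_blocks; b++) {
        const int cur = b & 1;              // slot holding block b
        const int nxt = cur ^ 1;            // slot that will receive block b+1

        if (b + 1 < n_blocks) {
            dma_start(slot[nxt], b + 1);    // overlap the next transfer with compute
        }

        dma_wait(cur);                      // block b is now resident in VTCM
        process(slot[cur], b);              // aligned HVX work on 128-byte-padded rows
    }
}
```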

  • hexagon: flash_attn more aligned ops

  • hexagon: optimize scale_f32 hvx helpers

  • hexagon: unroll fa loops

  • hexagon: remove unused set-rows log

  • hexagon: flash_attn_ext add support for DMAing Q

  • Update op_flash_attn_ext to include the Q row size in the scratchpad allocation (a sizing sketch follows this list).
  • Pad the Q row size to 128 bytes for alignment.
  • Implement a DMA transfer for the Q tensor in flash_attn_ext_f16_thread.
  • Update the dot-product computations to use the VTCM-buffered Q data.

  • hexagon: fix handling of NaNs in HVX dot products
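
A rough sketch of the scratchpad sizing, with illustrative names only: the point is that the Q row, like the K/V/Mask rows, is padded to 128 bytes and counted in the per-thread VTCM scratchpad. How the real allocator groups these buffers is an assumption here.

```c
#include <stddef.h>

#define HVX_ALIGN     128   // rows are padded to 128 bytes for aligned HVX access
#define FA_BLOCK_SIZE 128   // rows per double-buffered K/V/Mask block

static inline size_t pad128(size_t n) {
    return (n + HVX_ALIGN - 1) & ~(size_t)(HVX_ALIGN - 1);
}

// Bytes of VTCM scratchpad one flash-attention thread needs (sketch).
static size_t fa_spad_bytes(size_t q_row, size_t k_row, size_t v_row, size_t mask_row) {
    const size_t kv_mask = 2 /* double buffer */ * FA_BLOCK_SIZE *
                           (pad128(k_row) + pad128(v_row) + pad128(mask_row));
    return kv_mask + pad128(q_row);   // plus one padded Q row, DMAed in from DDR
}
```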

  • hexagon: clean up spad allocation in flash-attn

  • hexagon: improve fp16/fp32 matmul

  • Introduced vec_dot_f16_f16 and vec_dot_f16_f16_rx2 kernels using efficient HVX dot-product intrinsics (see the sketch after this list).
  • Added quantize_fp32_f16 to copy/convert weights from DDR to VTCM.
  • Updated op_matmul to use the optimized path when VTCM capacity allows and the broadcasting requirements are compatible.
  • Implemented fallback logic to the original implementation for complex broadcasting scenarios.

  • hexagon: fix HVX_ARCH check
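
A sketch of the shape of this change rather than the actual kernels: the scalar loop below only shows the contract of vec_dot_f16_f16 (the real one, and its two-row vec_dot_f16_f16_rx2 variant, use HVX dot-product intrinsics), and the gate on VTCM capacity plus simple broadcasting is an assumption about what the op_matmul check might look like. `__fp16` assumes a clang-style toolchain with half-precision storage support, as on Hexagon.

```c
#include <stddef.h>

typedef __fp16 f16_t;   // storage-only half float

// Scalar reference for the fp16 x fp16 dot product.
static float vec_dot_f16_f16(const f16_t *a, const f16_t *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += (float) a[i] * (float) b[i];   // widen to fp32 and accumulate
    }
    return sum;
}

// Copy/convert an fp32 weight row from DDR into fp16 in VTCM.
static void quantize_fp32_f16(const float *src, f16_t *dst, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = (f16_t) src[i];
    }
}

// Hypothetical gate for the optimized path: the converted weights must fit in
// the free VTCM scratchpad and the broadcast pattern must be the simple case;
// otherwise op_matmul falls back to the original implementation.
static int use_vtcm_f16_path(size_t weight_bytes, size_t vtcm_free, int simple_broadcast) {
    return simple_broadcast && weight_bytes <= vtcm_free;
}
```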

  • hexagon: matmul cleanup and fp16 fixes

Use the aligned vec_dot_f16 for 2D matmuls and the unaligned version for 4D matmuls.
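
As a small illustration of the aligned/unaligned split (the exact predicate is my assumption; the 128-byte figure matches the HVX vector size and the row padding mentioned above): 2D matmuls over VTCM-padded rows keep 128-byte alignment by construction, while 4D broadcasted cases may not.

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative check: a row can take the aligned vec_dot_f16 kernel when both
// its base pointer and its row stride are multiples of the 128-byte HVX vector.
static int row_is_hvx_aligned(const void *p, size_t stride_bytes) {
    return (((uintptr_t) p | stride_bytes) & (size_t) 127) == 0;
}
```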

  • hexagon: fix fp16 x fp16 matmuls and some minor refactoring

  • hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.
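
A minimal sketch of what that tweak might amount to (my reading, with illustrative names): cap the worker count by the row count so a tiny SET_ROWS op doesn't fan out across mostly idle threads, then give each worker a contiguous chunk.

```c
// Clamp the number of workers to the number of rows.
static int rows_n_workers(int n_rows, int n_threads) {
    const int n = n_rows < n_threads ? n_rows : n_threads;
    return n < 1 ? 1 : n;
}

// Contiguous chunk of rows [*first, *first + *count) handled by one worker.
static void rows_chunk(int n_rows, int n_workers, int worker_id,
                       int *first, int *count) {
    const int per = (n_rows + n_workers - 1) / n_workers;   // ceil(n_rows / n_workers)
    const int lo  = worker_id * per;
    const int hi  = lo + per < n_rows ? lo + per : n_rows;
    *first = lo < n_rows ? lo : n_rows;
    *count = hi > lo ? hi - lo : 0;
}
```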

  • hexagon: optimize set-rows threading

  • hexagon: update adb/run-bench.sh to properly support experimental and verbose options

  • hexagon: flash_attn use aligned vectors for dot products

Downloads: macOS/iOS, Linux, Windows, openEuler.
