github ggml-org/llama.cpp b7652

Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611)

  • hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

  • hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx

  • hexagon: add support for SCALE fp32

  • hexagon: replace scalar fp32 -> fp16 copy with HVX

  • hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

  • Implements double-buffered DMA prefetching for K, V, and Mask tensors (see the sketch after this list).
  • Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
  • Correctly synchronizes DMA transfers to prevent race conditions.
  • Uses FLASH_ATTN_BLOCK_SIZE of 128 for efficient chunking.

  • hexagon: use aligned mad_f16
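
A minimal sketch of the double-buffered prefetch idea, in plain C. Everything here is illustrative: the callback types stand in for the backend's user-DMA queue, and the two-slot VTCM layout is an assumption about how the K/V/Mask blocks (FLASH_ATTN_BLOCK_SIZE rows each, padded to 128 bytes) might be staged.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical callback types standing in for the real user-DMA helpers.
typedef void (*dma_start_fn)(void *vtcm_slot, int block_idx); // kick off an async K/V/Mask copy
typedef void (*dma_wait_fn) (int slot);                       // wait until that slot's transfer is done
typedef void (*process_fn)  (const void *vtcm_slot, int block_idx);

// While the HVX kernel works on block b in one VTCM slot, the DMA engine
// fills the other slot with block b+1.
static void run_double_buffered(uint8_t *vtcm, size_t slot_bytes, int n_blocks,
                                dma_start_fn dma_start, dma_wait_fn dma_wait,
                                process_fn process) {
    uint8_t *slot[2] = { vtcm, vtcm + slot_bytes };

    if (n_blocks <= 0) {
        return;
    }
    dma_start(slot[0], 0);                  // prime the pipeline with block 0

    for (int b = 0; b < n_blocks; b++) {
        const int cur = b & 1;              // slot holding block b
        const int nxt = cur ^ 1;            // slot that will receive block b+1

        if (b + 1 < n_blocks) {
            dma_start(slot[nxt], b + 1);    // overlap the next transfer with compute
        }

        dma_wait(cur);                      // block b is now resident in VTCM
        process(slot[cur], b);              // aligned HVX work on 128-byte-padded rows
    }
}
```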

  • hexagon: flash_attn more aligned ops

  • hexagon: optimize scale_f32 hvx helpers

  • hexagon: unroll fa loops

  • hexagon: remove unused set-rows log

  • hexagon: flash_attn_ext add support for DMAing Q

  • Update op_flash_attn_ext to include the Q row size in the scratchpad allocation (a sizing sketch follows this list).
  • Pad the Q row size to 128 bytes for alignment.
  • Implement a DMA transfer for the Q tensor in flash_attn_ext_f16_thread.
  • Update the dot-product computations to use the VTCM-buffered Q data.

  • hexagon: fix handling of NaNs in HVX dot products
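
A rough sketch of the scratchpad sizing, with illustrative names only: the point is that the Q row, like the K/V/Mask rows, is padded to 128 bytes and counted in the per-thread VTCM scratchpad. How the real allocator groups these buffers is an assumption here.

```c
#include <stddef.h>

#define HVX_ALIGN     128   // rows are padded to 128 bytes for aligned HVX access
#define FA_BLOCK_SIZE 128   // rows per double-buffered K/V/Mask block

static inline size_t pad128(size_t n) {
    return (n + HVX_ALIGN - 1) & ~(size_t)(HVX_ALIGN - 1);
}

// Bytes of VTCM scratchpad one flash-attention thread needs (sketch).
static size_t fa_spad_bytes(size_t q_row, size_t k_row, size_t v_row, size_t mask_row) {
    const size_t kv_mask = 2 /* double buffer */ * FA_BLOCK_SIZE *
                           (pad128(k_row) + pad128(v_row) + pad128(mask_row));
    return kv_mask + pad128(q_row);   // plus one padded Q row, DMAed in from DDR
}
```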

  • hexagon: clean up spad allocation in flash-attn

  • hexagon: improve fp16/fp32 matmul

  • Introduced vec_dot_f16_f16 and vec_dot_f16_f16_rx2 kernels using efficient HVX dot-product intrinsics (see the sketch after this list).
  • Added quantize_fp32_f16 to copy/convert weights from DDR to VTCM.
  • Updated op_matmul to use the optimized path when VTCM capacity allows and the broadcasting requirements are compatible.
  • Implemented fallback logic to the original implementation for complex broadcasting scenarios.

  • hexagon: fix HVX_ARCH check
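
A sketch of the shape of this change rather than the actual kernels: the scalar loop below only shows the contract of vec_dot_f16_f16 (the real one, and its two-row vec_dot_f16_f16_rx2 variant, use HVX dot-product intrinsics), and the gate on VTCM capacity plus simple broadcasting is an assumption about what the op_matmul check might look like. `__fp16` assumes a clang-style toolchain with half-precision storage support, as on Hexagon.

```c
#include <stddef.h>

typedef __fp16 f16_t;   // storage-only half float

// Scalar reference for the fp16 x fp16 dot product.
static float vec_dot_f16_f16(const f16_t *a, const f16_t *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += (float) a[i] * (float) b[i];   // widen to fp32 and accumulate
    }
    return sum;
}

// Copy/convert an fp32 weight row from DDR into fp16 in VTCM.
static void quantize_fp32_f16(const float *src, f16_t *dst, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = (f16_t) src[i];
    }
}

// Hypothetical gate for the optimized path: the converted weights must fit in
// the free VTCM scratchpad and the broadcast pattern must be the simple case;
// otherwise op_matmul falls back to the original implementation.
static int use_vtcm_f16_path(size_t weight_bytes, size_t vtcm_free, int simple_broadcast) {
    return simple_broadcast && weight_bytes <= vtcm_free;
}
```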

  • hexagon: matmul cleanup and fp16 fixes

Use the aligned vec_dot_f16 for 2D matmuls and the unaligned version for 4D matmuls.
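
As a small illustration of the aligned/unaligned split (the exact predicate is my assumption; the 128-byte figure matches the HVX vector size and the row padding mentioned above): 2D matmuls over VTCM-padded rows keep 128-byte alignment by construction, while 4D broadcasted cases may not.

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative check: a row can take the aligned vec_dot_f16 kernel when both
// its base pointer and its row stride are multiples of the 128-byte HVX vector.
static int row_is_hvx_aligned(const void *p, size_t stride_bytes) {
    return (((uintptr_t) p | stride_bytes) & (size_t) 127) == 0;
}
```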

  • hexagon: fix fp16 x fp16 matmuls and some minor refactoring

  • hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.
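
A minimal sketch of what that tweak might amount to (my reading, with illustrative names): cap the worker count by the row count so a tiny SET_ROWS op doesn't fan out across mostly idle threads, then give each worker a contiguous chunk.

```c
// Clamp the number of workers to the number of rows.
static int rows_n_workers(int n_rows, int n_threads) {
    const int n = n_rows < n_threads ? n_rows : n_threads;
    return n < 1 ? 1 : n;
}

// Contiguous chunk of rows [*first, *first + *count) handled by one worker.
static void rows_chunk(int n_rows, int n_workers, int worker_id,
                       int *first, int *count) {
    const int per = (n_rows + n_workers - 1) / n_workers;   // ceil(n_rows / n_workers)
    const int lo  = worker_id * per;
    const int hi  = lo + per < n_rows ? lo + per : n_rows;
    *first = lo < n_rows ? lo : n_rows;
    *count = hi > lo ? hi - lo : 0;
}
```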

  • hexagon: optimize set-rows threading

  • hexagon: update adb/run-bench.sh to properly support experimental and verbose options

  • hexagon: flash_attn use aligned vectors for dot products

Downloads: macOS/iOS, Linux, Windows, openEuler.
