Details
Hexagon: add support for f16/f32 flash attention, scale, set-rows, and improve f16/f32 matmul (#18611)
- hexagon: improve fp16 matmul and add fp32/fp16 flash-attention
- hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
- hexagon: add support for SCALE fp32
- hexagon: replace scalar fp32 -> fp16 copy with HVX
- hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA (see the prefetch sketch after this list)
  - Implements double-buffered DMA prefetching for the K, V, and Mask tensors.
  - Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
  - Correctly synchronizes DMA transfers to prevent race conditions.
  - Uses a `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking (see the blocked-attention sketch after this list).
- hexagon: use aligned `mad_f16`
- hexagon: flash_attn: more aligned ops
- hexagon: optimize `scale_f32` HVX helpers
- hexagon: unroll flash-attention loops
- hexagon: remove unused set-rows log
- hexagon: flash_attn_ext: add support for DMAing Q
  - Update `op_flash_attn_ext` to include the Q row size in the scratchpad allocation.
  - Pad the Q row size to 128 bytes for alignment.
  - Implement DMA transfer for the Q tensor in `flash_attn_ext_f16_thread`.
  - Update dot product computations to use the VTCM-buffered Q data.
- hexagon: fix handling of NaNs in HVX dot products
- hexagon: clean up scratchpad allocation in flash-attn
- hexagon: improve fp16/fp32 matmul (see the matmul sketch after this list)
  - Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
  - Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM.
  - Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
  - Implemented fallback logic to the original implementation for complex broadcasting scenarios.
- hexagon: fix `HVX_ARCH` check
- hexagon: matmul cleanup and fp16 fixes
  - Use the aligned `vec_dot_f16` for 2D matmuls and the unaligned version for 4D.
- hexagon: fix fp16 x fp16 matmuls and some minor refactoring
- hexagon: add support for GET_ROWS f32 -> f32
  - Also optimize SET_ROWS threading a bit when there are only a few rows to process (see the set-rows sketch after this list).
- hexagon: optimize set-rows threading
- hexagon: update `adb/run-bench.sh` to properly support the experimental and verbose options
- hexagon: flash_attn: use aligned vectors for dot products
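The double-buffered DMA prefetching mentioned above follows the usual ping-pong pattern: while one VTCM buffer is being consumed, the transfer for the next block is already in flight, and the only synchronization point is a wait on the buffer about to be read. Below is a minimal, portable sketch of that pattern; `dma_start`/`dma_wait` are hypothetical stand-ins (implemented here as a plain `memcpy`) rather than the backend's actual Hexagon user-DMA calls, and the block size and row layout are illustrative.

```cpp
// Ping-pong (double-buffered) prefetch sketch: fetch block i+1 while consuming block i.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr size_t BLOCK = 128;  // rows fetched per chunk (FLASH_ATTN_BLOCK_SIZE-like)

struct dma_xfer { const void *src; void *dst; size_t bytes; };

static void dma_start(dma_xfer *x) { std::memcpy(x->dst, x->src, x->bytes); }  // real code queues an async descriptor
static void dma_wait (dma_xfer *x) { (void) x; }                               // real code stalls until the transfer is done

static float consume_rows(const float *rows, size_t n) { float s = 0; for (size_t i = 0; i < n; i++) s += rows[i]; return s; }

int main() {
    const size_t n_rows = 1000, row_floats = 64;
    std::vector<float> ddr(n_rows * row_floats, 1.0f);   // "DDR" source
    std::vector<float> vtcm(2 * BLOCK * row_floats);     // two "VTCM" staging buffers

    const size_t n_blocks = (n_rows + BLOCK - 1) / BLOCK;
    dma_xfer xfer[2];

    // Prefetch block 0 into buffer 0 before the loop starts.
    const size_t rows0 = std::min(BLOCK, n_rows);
    xfer[0] = { ddr.data(), vtcm.data(), rows0 * row_floats * sizeof(float) };
    dma_start(&xfer[0]);

    float acc = 0.0f;
    for (size_t b = 0; b < n_blocks; b++) {
        const size_t cur  = b % 2;
        const size_t rows = std::min(BLOCK, n_rows - b * BLOCK);

        // Kick off the next block's transfer into the other buffer before we block.
        if (b + 1 < n_blocks) {
            const size_t next_rows = std::min(BLOCK, n_rows - (b + 1) * BLOCK);
            xfer[1 - cur] = { ddr.data() + (b + 1) * BLOCK * row_floats,
                              vtcm.data() + (1 - cur) * BLOCK * row_floats,
                              next_rows * row_floats * sizeof(float) };
            dma_start(&xfer[1 - cur]);
        }

        dma_wait(&xfer[cur]);  // never read a buffer that may still be in flight
        acc += consume_rows(vtcm.data() + cur * BLOCK * row_floats, rows * row_floats);
    }
    std::printf("sum = %g (expect %g)\n", acc, (double) (n_rows * row_floats));
    return 0;
}
```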
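Streaming K/V (and the mask) in fixed-size blocks works because flash attention only has to keep a running maximum, a running sum of exponentials, and a partial output row between blocks. The scalar, single-query reference below shows that blocked "online softmax" accumulation in fp32; it sketches the math, not the backend's fp16/HVX kernel, and the function and parameter names are illustrative.

```cpp
// Blocked (online-softmax) attention for a single query, scalar fp32 reference.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// q: [d], K/V: [n_kv][d] row-major, mask: [n_kv] additive (-inf drops a position).
static std::vector<float> attn_blocked(const std::vector<float> &q,
                                       const std::vector<float> &K,
                                       const std::vector<float> &V,
                                       const std::vector<float> &mask,
                                       size_t n_kv, size_t d, size_t block) {
    std::vector<float> out(d, 0.0f);
    float m = -INFINITY;                       // running max of the scores seen so far
    float s = 0.0f;                            // running sum of exp(score - m)
    const float scale = 1.0f / std::sqrt((float) d);

    for (size_t j0 = 0; j0 < n_kv; j0 += block) {      // one K/V block at a time
        const size_t j1 = std::min(j0 + block, n_kv);
        for (size_t j = j0; j < j1; j++) {
            float score = 0.0f;
            for (size_t i = 0; i < d; i++) score += q[i] * K[j*d + i];
            score = score * scale + mask[j];

            const float m_new = std::max(m, score);
            const float corr  = std::exp(m - m_new);    // rescale previous partials
            const float p     = std::exp(score - m_new);
            for (size_t i = 0; i < d; i++) out[i] = out[i] * corr + p * V[j*d + i];
            s = s * corr + p;
            m = m_new;
        }
    }
    for (size_t i = 0; i < d; i++) out[i] /= s;         // final normalization
    return out;
}

int main() {
    const size_t n_kv = 300, d = 8;
    std::vector<float> q(d, 0.1f), K(n_kv*d, 0.2f), V(n_kv*d, 0.3f), mask(n_kv, 0.0f);
    const auto o = attn_blocked(q, K, V, mask, n_kv, d, 128);
    std::printf("out[0] = %f\n", o[0]);   // all V rows equal -> out[0] == 0.3
    return 0;
}
```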
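For the matmul changes, a portable sketch of the fast path follows: when the weight matrix fits in the (here simulated) VTCM budget it is staged once, and an `_rx2`-style kernel produces two output values per pass so each activation row is read once for two weight rows; otherwise the code falls back to the plain path. `vec_dot_rx2`, `fits_in_vtcm`, and the fp32-only staging are assumptions for illustration, not the backend's real `vec_dot_f16_f16_rx2` / `quantize_fp32_f16` implementations.

```cpp
// Matmul fast-path sketch: VTCM staging, rx2 dot kernel, fallback when it doesn't fit.
#include <cstddef>
#include <cstdio>
#include <vector>

// One pass over x feeds two accumulators (two weight rows per activation load).
static void vec_dot_rx2(const float *x, const float *w0, const float *w1,
                        size_t k, float *d0, float *d1) {
    float s0 = 0.0f, s1 = 0.0f;
    for (size_t i = 0; i < k; i++) { s0 += x[i] * w0[i]; s1 += x[i] * w1[i]; }
    *d0 = s0; *d1 = s1;
}

// dst is [n][m]: dst[c*m + r] = dot(W row r, X row c); both operands row-major with k inner.
static void matmul(const std::vector<float> &W, const std::vector<float> &X,
                   std::vector<float> &dst, size_t m, size_t n, size_t k,
                   size_t vtcm_bytes) {
    const bool fits_in_vtcm = W.size() * sizeof(float) <= vtcm_bytes;
    if (!fits_in_vtcm) {
        // Fallback: plain row-by-row path (stands in for the original implementation).
        for (size_t r = 0; r < m; r++)
            for (size_t c = 0; c < n; c++) {
                float s = 0.0f;
                for (size_t i = 0; i < k; i++) s += W[r*k + i] * X[c*k + i];
                dst[c*m + r] = s;
            }
        return;
    }
    std::vector<float> vtcm(W);            // stand-in for the DDR -> VTCM copy/convert
    for (size_t c = 0; c < n; c++) {
        size_t r = 0;
        for (; r + 1 < m; r += 2)          // two weight rows per pass
            vec_dot_rx2(&X[c*k], &vtcm[r*k], &vtcm[(r+1)*k], k,
                        &dst[c*m + r], &dst[c*m + r + 1]);
        if (r < m) {                       // odd tail row
            float s = 0.0f;
            for (size_t i = 0; i < k; i++) s += X[c*k + i] * vtcm[r*k + i];
            dst[c*m + r] = s;
        }
    }
}

int main() {
    const size_t m = 3, n = 2, k = 4;
    std::vector<float> W(m*k, 1.0f), X(n*k, 2.0f), dst(m*n, 0.0f);
    matmul(W, X, dst, m, n, k, 1 << 20);
    std::printf("dst[0] = %g (expect %g)\n", dst[0], 2.0 * k);
    return 0;
}
```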
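For SET_ROWS, the semantics are a scatter of source rows into destination positions given by an i32/i64 index tensor, and the threading tweak mentioned above amounts to not spreading a handful of rows across many workers. The sketch below keeps everything in fp32 and uses illustrative names (`set_rows`, `set_rows_n_workers`); the actual op also converts fp32 to fp16 while copying.

```cpp
// SET_ROWS scatter semantics plus a row-count-capped worker heuristic (illustrative).
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

static void set_rows(const float *src, const int64_t *row_idx, size_t n_rows,
                     size_t row_len, float *dst /* [n_dst_rows][row_len] */) {
    for (size_t r = 0; r < n_rows; r++) {
        float *out = dst + (size_t) row_idx[r] * row_len;
        std::copy(src + r * row_len, src + (r + 1) * row_len, out);  // real op converts fp32 -> fp16 here
    }
}

// Never use more workers than there are rows, so tiny updates (e.g. a single-row
// KV-cache append during decode) don't pay fork/join overhead.
static size_t set_rows_n_workers(size_t n_rows, size_t n_threads_max) {
    return std::max<size_t>(1, std::min(n_rows, n_threads_max));
}

int main() {
    const size_t row_len = 4;
    std::vector<float> dst(8 * row_len, 0.0f), src(2 * row_len, 1.0f);
    const int64_t idx[2] = { 5, 2 };
    set_rows(src.data(), idx, 2, row_len, dst.data());
    std::printf("dst row 5, col 0 = %g; workers for 2 rows on 6 threads = %zu\n",
                dst[5 * row_len], set_rows_n_workers(2, 6));
    return 0;
}
```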
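Finally, the SCALE fp32 op is, in its basic form, an elementwise multiply; the tiny scalar reference below only pins down the semantics that the new HVX helpers vectorize.

```cpp
// Scalar reference of an elementwise scale: y[i] = x[i] * s.
#include <cstdio>

static void scale_f32(const float *x, float *y, int n, float s) {
    for (int i = 0; i < n; i++) y[i] = x[i] * s;
}

int main() {
    float x[4] = { 1, 2, 3, 4 }, y[4];
    scale_f32(x, y, 4, 0.5f);
    std::printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);  // 0.5 1 1.5 2
    return 0;
}
```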
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: