Details
hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and MatMul updates (#20118)
- ggml-hexagon: enhance hvx_dot_f16_f16_aa_rx4 for improved performance by expanding vector handling and optimizing accumulation
- ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx4 and enhance hvx_vec_reduce_sum_f32x4 for improved performance and reduced complexity
- ggml-hexagon: add hvx_dot_f16_f16_aa_rx32 for enhanced vector processing in flash attention
- optimize hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 by removing unused scale parameter and improving vector accumulation
- ggml-hexagon: refactor hvx_dot_f16_f16_aa_rx4 for improved readability and return HVX_Vector for better integration
- ggml-hexagon: initialize sums variable in hvx_dot_f16_f16_aa_rx32 for clarity
- ggml-hexagon: fix compile error
- fix hvx_dot_f16_f16_aa_rx4 to handle leftover elements correctly using masking
- refactor hvx_dot_f16_f16_aa_rx4 to accept vector and leftover element counts as parameters for improved clarity and flexibility
- wip
- fa: instrumentation and dma reordering
- hex-fa: use block-size 64 to improve DMA pipelining
- hex-fa: optimize vec-dot for v79 and above
- hex-fa: use block size 64
- hex-fa: avoid scalar fp32->fp16 conversions
- hex-fa: simplify dot_f16 functions using optimized vec_mpyacc
- hex-fa: rewrite mad_f32_f16 using hvx_vec_mpyacc
- hex-mm: use mpyacc in matmul dot functions
Co-authored-by: chraac <chraac@gmail.com>
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: