Details
hexagon: further optimizations and refactoring for flash attention (#19583)
- ggml-hexagon: fa improvements
  - ggml-hexagon: optimize flash attention calculations with improved variable handling
  - ggml-hexagon: streamline flash attention operations by removing redundant FP32 checks
  - ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused leftover elements (see the tail-handling sketch after this list)
  - ggml-hexagon: optimize flash attention by changing the slope vector type to F16 (see the slope sketch after this list)
- hexfa: fixed test-backend-ops failures due to leftover element handling
- hexagon: refactor and optimize fa to use a local context struct
- ggml-hexagon: optimize flash-attention using hvx_vec_expf; use HVX for the online softmax (a scalar sketch of the update step follows this list)
  Co-authored-by: chraac <chraac@gmail.com>
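
For context on the leftover-element fixes above, here is a minimal plain-C sketch of the chunked-plus-tail pattern a vectorized dot product such as hvx_dot_f16_f16_aa_rx2 follows. It is not the ggml-hexagon implementation: the function name dot_f32_with_tail and the CHUNK width are illustrative, and the chunked loop is scalar where the real code uses HVX intrinsics.

```c
#include <stddef.h>

// Illustrative chunk width; one 128-byte HVX vector holds 64 F16 lanes.
#define CHUNK 64

static float dot_f32_with_tail(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    size_t i  = 0;

    // Main loop: full CHUNK-sized blocks only (HVX intrinsics in the real code).
    for (; i + CHUNK <= n; i += CHUNK) {
        for (size_t j = 0; j < CHUNK; j++) {
            acc += x[i + j] * y[i + j];
        }
    }

    // Tail loop: the leftover n % CHUNK elements. Letting uninitialized
    // lanes leak into the accumulator here is exactly the kind of bug
    // the hexfa fix above addresses.
    for (; i < n; i++) {
        acc += x[i] * y[i];
    }
    return acc;
}
```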
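
The slope vector mentioned above presumably holds the per-head ALiBi slopes that flash attention applies when max_bias > 0; storing them as F16 lets the F16 HVX path consume them without per-element conversion. A hedged sketch, assuming the standard ggml slope schedule and using the compiler-provided _Float16 as a stand-in for ggml's half type (the function name is hypothetical):

```c
#include <math.h>

// Compute one head's ALiBi slope in F32, then round once to F16 at
// setup time instead of converting per element inside the kernel.
// m0/m1 follow the usual schedule: m0 = 2^(-max_bias / n_head_log2),
// m1 = 2^(-max_bias / 2 / n_head_log2), n_head_log2 = 2^floor(log2(n_head)).
static _Float16 alibi_slope_f16(unsigned head, unsigned n_head_log2,
                                float m0, float m1) {
    float slope = (head < n_head_log2)
        ? powf(m0, (float)(head + 1))
        : powf(m1, (float)(2*(head - n_head_log2) + 1));
    return (_Float16) slope;
}
```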
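
Finally, a minimal scalar sketch of the online-softmax update that hvx_vec_expf now accelerates. This shows the algorithm only, not the HVX code: the state and function names are made up, and a real flash-attention kernel also rescales its output accumulator by the same factor whenever the running maximum changes.

```c
#include <math.h>
#include <stdio.h>

// Running state for one attention row: the maximum score seen so far
// and the sum of exponentials expressed relative to that maximum.
typedef struct {
    float max;
    float sum;
} online_softmax_state;

// Fold one score into the state. When a larger maximum appears, the
// existing sum is rescaled by expf(old_max - new_max) so previously
// accumulated terms remain consistent with the new reference point.
static void online_softmax_update(online_softmax_state *st, float score) {
    if (score > st->max) {
        st->sum = st->sum * expf(st->max - score) + 1.0f;
        st->max = score;
    } else {
        st->sum += expf(score - st->max);
    }
}

int main(void) {
    const float scores[] = { 0.5f, 2.0f, -1.0f, 3.5f };
    online_softmax_state st = { -INFINITY, 0.0f };
    for (int i = 0; i < 4; i++) {
        online_softmax_update(&st, scores[i]);
    }
    // softmax(scores[i]) == expf(scores[i] - st.max) / st.sum
    printf("max = %f, denom = %f\n", st.max, st.sum);
    return 0;
}
```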
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: