ggml-org/llama.cpp release b8040


hexagon: further optimizations and refactoring for flash attention (#19583)

  • ggml-hexagon: fa improvements
      - optimize flash attention calculations with improved variable handling
      - streamline flash attention operations by removing redundant FP32 checks
      - optimize hvx_dot_f16_f16_aa_rx2 by simplifying the handling of unused (leftover) elements
      - optimize flash attention by changing the slope vector type to F16

  • hexfa: fixed test-backend-ops failures due to leftover element handling (see the tail-handling sketch after this list)

  • hexagon: refactor and optimize fa to use a local context struct

  • ggml-hexagon: optimize flash attention using hvx_vec_expf
      - use HVX for the online softmax (see the sketch below)
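For context on the last item: online (streaming) softmax is what lets flash attention avoid materializing a full row of scores. A running maximum and a running denominator are updated as scores arrive, and the partial output is rescaled whenever the maximum grows. The scalar sketch below shows that recurrence for a single output value; it is an illustration of the technique only, with the hypothetical helper online_softmax_step and plain expf() standing in for the vectorized exponentials (hvx_vec_expf) the Hexagon backend uses.

```c
#include <math.h>
#include <stddef.h>

/*
 * Scalar reference sketch of online (streaming) softmax as used in flash
 * attention: one pass keeps a running maximum `m` and running denominator
 * `l`, rescaling the accumulated output whenever `m` grows.
 * Illustration only; the actual kernel vectorizes the exponentials with HVX.
 */
static void online_softmax_step(const float *scores, const float *values,
                                size_t n, float *out, float *m, float *l) {
    for (size_t i = 0; i < n; ++i) {
        const float s = scores[i];
        if (s > *m) {
            // new running max: rescale what has been accumulated so far
            const float scale = expf(*m - s);
            *out *= scale;
            *l   *= scale;
            *m    = s;
        }
        const float p = expf(s - *m); // exp of the max-shifted score
        *out += p * values[i];        // weighted value accumulation
        *l   += p;                    // softmax denominator
    }
}
```

A caller would initialize m = -INFINITY, l = 0, out = 0, and divide out by l after the last block; real flash-attention kernels carry one accumulator per value dimension and tile over the KV sequence.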

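The leftover-element fixes above reflect the usual tail-handling concern in vectorized kernels: the main loop consumes full vector-width blocks, and whatever remains must be accumulated separately rather than read through stale or out-of-bounds lanes. Below is a minimal scalar sketch of that pattern; plain float stands in for the backend's f16, and VEC_ELEMS is a made-up constant standing in for the HVX vector width, so this only illustrates the structure, not the actual hvx_dot_f16_f16_aa_rx2 code.

```c
#include <stddef.h>

#define VEC_ELEMS 64  /* stand-in for the HVX vector width */

/*
 * Scalar reference sketch of a blocked dot product with explicit leftover
 * (tail) handling: the main loop consumes full vector-width blocks and the
 * remaining elements are accumulated separately, so no stale or
 * out-of-bounds lanes contribute to the result.
 */
static float dot_blocked(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    size_t i = 0;

    // main loop: whole blocks only
    for (; i + VEC_ELEMS <= n; i += VEC_ELEMS) {
        for (size_t j = 0; j < VEC_ELEMS; ++j) {
            sum += x[i + j] * y[i + j];
        }
    }

    // leftover elements past the last full block
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```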

Co-authored-by: chraac <chraac@gmail.com>

Release binaries: macOS/iOS, Linux, Windows, openEuler.