ggml-org/llama.cpp b8609


CUDA: Add Flash Attention Support for Head Dimension 512 (#20998)

  • Added flash attention support for head dimension 512

  • FA D=512: match the D=576 configs, limit ncols2, revert the vec kernel cap

  • Fixed the HIP tile kernel build for D=512

  • Fixed HIP tile kernel occupancy for D=512 on AMD

  • Applied suggestions from code review (co-authored-by: Johannes Gäßler <johannesg@5d6.de>)

  • Fixed tile FA compilation (co-authored-by: Johannes Gäßler <johannesg@5d6.de>)
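
For context, the new CUDA FA path is used automatically when flash attention is enabled on the context. Below is a minimal sketch of requesting it through the llama.cpp C API; the exact context field has changed across builds (older headers used a plain `bool flash_attn`, newer ones a `llama_flash_attn_type` enum), so verify against the llama.h shipped with this release. `model.gguf` is a placeholder path.

```cpp
// Minimal sketch: load a model and create a context with flash attention
// requested, so attention heads with D=512 can take the new CUDA FA path.
// The flash-attention field name is an assumption; check the llama.h
// bundled with this build (older builds: `cparams.flash_attn = true;`).
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload all layers so the CUDA kernels run

    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_ENABLED;

    llama_context * ctx = llama_init_from_model(model, cparams);
    // ... run inference ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

On the bundled CLI tools the equivalent is the flash-attention flag (`-fa` / `--flash-attn`; recent builds accept on/off/auto).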

Prebuilt binaries are attached for macOS/iOS, Linux, Windows, and openEuler.
