github ggml-org/llama.cpp b8724


sycl : add flash-attn support for head size 512 (#21654)


This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512.

Changes:

  • Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels.
  • Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256).
  • Removed unused/redundant AMD and RDNA-specific configuration functions in fattn-tile.hpp.
  • Refactored ggml_backend_sycl_buffer_init_tensor to use a switch statement for clearer tensor extra buffer initialization.
  • Added necessary template instances for the new 512 head size across various quantization types.
  • Removed the defunct mxfp4 reorder from the buffer-type setup.

