ggml-org/llama.cpp release b8779

vulkan: Flash Attention DP4A shader for quantized KV cache (#20797)

  • use integer dot product (DP4A) for quantized KV flash attention (see the sketch after this list)

  • small improvements

  • fix SHMEM_STAGING indexing

  • add missing KV type quants

  • fixes

  • add supported quants to FA tests

  • re-add fast paths for sub-8-bit quants

  • fix mmq gate and shmem checks
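
The first bullet refers to DP4A-style integer dot products. As a rough illustration only (not the actual GLSL shader from this PR), the following C++ sketch emulates what a DP4A instruction computes: each 32-bit word packs four signed 8-bit lanes, and the hardware multiplies the lanes pairwise and accumulates the result into a 32-bit integer in a single instruction, which is what makes Q·K and attention·V products over 8-bit-packed quantized KV data cheap. The helper name `dp4a_emulated` is hypothetical and exists only for this sketch.

```cpp
#include <cstdint>
#include <cstdio>

// CPU emulation of a DP4A-style operation: a_packed and b_packed each hold
// four signed 8-bit values; the result is the 4-lane dot product added to acc.
// GPU backends expose this as one hardware instruction.
static int32_t dp4a_emulated(uint32_t a_packed, uint32_t b_packed, int32_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        int8_t a = (int8_t)((a_packed >> (8 * lane)) & 0xFFu);
        int8_t b = (int8_t)((b_packed >> (8 * lane)) & 0xFFu);
        acc += (int32_t)a * (int32_t)b;
    }
    return acc;
}

int main() {
    // Pack the int8 vectors {1, -2, 3, 4} and {5, 6, -7, 8} into 32-bit words
    // (lane 0 is the low byte).
    uint32_t a = 0x0403FE01u; // bytes: 0x01, 0xFE (-2), 0x03, 0x04
    uint32_t b = 0x08F90605u; // bytes: 0x05, 0x06, 0xF9 (-7), 0x08
    // 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4
    printf("dp4a = %d\n", dp4a_emulated(a, b, 0));
    return 0;
}
```

In the real shader the per-block quantization scales are applied to the accumulated integer result afterwards; the sketch above only shows the packed-lane arithmetic itself.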

Prebuilt binaries: macOS/iOS, Linux, Windows, openEuler.
