github ggml-org/llama.cpp b8143


Vulkan Scalar Flash Attention Refactor (#19625)

  • vulkan: allow using fp16 in scalar flash attention shader

  • split rows inside of subgroups for faster synchronization

  • use row_split when Br >= 4, change reductions to use shared memory if row_split == 1

  • use f32 scalar FA if f16 is not supported by device

  • fix amd workgroup size issue

  • optimize masksh use

  • add medium rows FA shader Br size

  • fixes

  • add padding to mask shmem buffer

  • cache q values into registers for KQ

  • fuse lf accumulation, pf and v accumulation into a loop

  • stage K loads through shmem

  • stage V loads through shmem

  • only stage through shmem on Nvidia

  • default to Bc 32

  • also stage V through shmem when this is done for K

  • dynamic subgroups for intel

  • use vectorized stores

  • use float_type for dequantize4 functions

  • use smaller scalar rows size for smaller rows count

  • relax flash attention split_k condition to allow non-gqa use

  • use minimal subgroup size on Intel

  • fix shmem support function

  • fix rebase issues

  • fixes

  • Bc 4 for scalar FA is not a valid configuration

  • Use wave32 on AMD RDNA for scalar FA

  • add Intel shader core count lookup-table

  • fix regressions

  • device tuning

  • tmpsh size fix

  • fix editorconfig

  • refactor fa tuning logic into a single place

  • fix gqa opt logic

  • fix block_rows with small n_rows

  • amd tuning

  • fix hsk=72/80 issue

  • tuning

  • allow condition skipping for column check

  • use float16 for Of if available

  • address feedback

  • fix bad RDNA performance on head size <= 128 by limiting occupancy

  • allow printing pipeline stats

  • cleanup and fixes

  • limit occupancy for GCN for small batch FA with large HSK

  • disable f16 FA for GCN AMD GPUs on the proprietary driver
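Many of the commits above (fusing the lf/pf and V accumulation into one loop, caching Q in registers, streaming K/V in blocks of Bc) revolve around the standard online-softmax flash-attention loop. As a rough illustration only — plain Python, not the Vulkan shader; the running-max `m`, denominator `l`, and accumulator `o` names are generic, and `Bc` merely mirrors the tile size mentioned in the commits:

```python
# Illustrative sketch of one-pass (online-softmax) flash attention for a
# single query row. K/V are consumed in blocks of Bc columns, and the
# softmax max, denominator, and output are accumulated in one fused loop,
# rescaling previous partial sums whenever a new maximum appears.
import math

def flash_attention_row(q, K, V, Bc=32):
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    m = -math.inf            # running max of the attention scores
    l = 0.0                  # running softmax denominator
    o = [0.0] * len(V[0])    # running (unnormalized) output accumulator
    for start in range(0, len(K), Bc):          # stream K/V block by block
        for j in range(start, min(start + Bc, len(K))):
            s = scale * sum(qi * ki for qi, ki in zip(q, K[j]))
            m_new = max(m, s)
            corr = math.exp(m - m_new)          # rescale earlier partial sums
            p = math.exp(s - m_new)
            l = l * corr + p
            o = [oi * corr + p * vj for oi, vj in zip(o, V[j])]
            m = m_new
    return [oi / l for oi in o]                 # final normalization
```

Because `m`, `l`, and `o` are carried through the whole K/V stream, the result matches a two-pass softmax attention exactly, which is what makes the register/shared-memory staging tricks in the commits above pure performance work.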
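The relaxed split_k condition mentioned above refers to splitting the K/V sequence across workgroups, each producing a partial result that is reduced afterwards. A hedged sketch of that reduction (the `partial`/`combine_split_k` names are hypothetical, not from the codebase):

```python
# Combining split_k partials: each chunk of the K/V sequence yields
# (m, l, o) = (local score max, local softmax denominator, unnormalized
# local output). Rescaling every partial to the global max and summing
# reproduces the full-sequence attention result exactly.
import math

def partial(scores, values):
    m = max(scores)
    ps = [math.exp(s - m) for s in scores]
    l = sum(ps)
    o = [sum(p * v[i] for p, v in zip(ps, values))
         for i in range(len(values[0]))]
    return (m, l, o)

def combine_split_k(parts):
    m = max(p[0] for p in parts)                # global max across chunks
    l = 0.0
    o = [0.0] * len(parts[0][2])
    for pm, pl, po in parts:
        c = math.exp(pm - m)                    # rescale chunk to global max
        l += pl * c
        o = [oi + c * poi for oi, poi in zip(o, po)]
    return [oi / l for oi in o]                 # normalize once at the end
```

Since the reduction is exact, the only question is scheduling, which is why the condition for enabling split_k can be loosened to non-GQA cases without affecting results.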
