Details
Vulkan Scalar Flash Attention Refactor (#19625)
- vulkan: allow using fp16 in scalar flash attention shader
- split rows inside of subgroups for faster synchronization
- use row_split when Br >= 4, change reductions to use shared memory if row_split == 1
- use f32 scalar FA if f16 is not supported by device
- fix amd workgroup size issue
- optimize masksh use
- add medium rows FA shader Br size
- fixes
- add padding to mask shmem buffer
- cache q values into registers for KQ
- fuse lf accumulation, pf and v accumulation into a loop
- stage K loads through shmem
- stage V loads through shmem
- only stage through shmem on Nvidia
- default to Bc 32
- also stage V through shmem when this is done for K
- dynamic subgroups for intel
- use vectorized stores
- use float_type for dequantize4 functions
- use smaller scalar rows size for smaller rows count
- relax flash attention split_k condition to allow non-gqa use
- use minimal subgroup size on Intel
- fix shmem support function
- fix rebase issues
- fixes
- Bc 4 for scalar FA is not a valid configuration
- Use wave32 on AMD RDNA for scalar FA
- add Intel shader core count lookup-table
- fix regressions
- device tuning
- tmpsh size fix
- fix editorconfig
- refactor fa tuning logic into a single place
- fix gqa opt logic
- fix block_rows with small n_rows
- amd tuning
- fix hsk=72/80 issue
- tuning
- allow condition skipping for column check
- use float16 for Of if available
- address feedback
- fix bad RDNA performance on head size <= 128 by limiting occupancy
- allow printing pipeline stats
- cleanup and fixes
- limit occupancy for GCN for small batch FA with large HSK
- disable f16 FA for GCN AMD GPUs on the proprietary driver
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: