Vulkan Flash Attention Coopmat1 Refactor (#19075)
- vulkan: use coopmat for flash attention p*v matrix multiplication
- fix P loading issue
- fix barrier position
- remove reduction that is no longer needed
- move max thread reduction into loop
- remove osh padding
- add bounds checks and padding
- remove unused code
- fix shmem sizes, loop duration and accesses
- don't overwrite Qf, add new shared psh buffer instead
- add missing bounds checks
- use subgroup reductions
- optimize
- move bounds check, reduce barriers
- support other Bc values and other subgroup sizes
- remove D_split
- replace Of register array with shared memory Ofsh array
- parallelize HSV across the rowgroups
- go back to Of in registers, not shmem
- vectorize sfsh
- don't store entire K tile in shmem
- fixes
- load large k tiles to shmem on Nvidia
- adapt shared memory host check function to shader changes
- remove Bc 32 case
- remove unused variable
- fix missing mask reduction tmspsh barrier
- fix mask bounds check
- fix rowmax f16 under/overflow to inf
- fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
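Several of the commits above (moving the max reduction into the loop, the rowmax f16 under/overflow fix) concern the online-softmax recurrence at the heart of flash attention: keeping a running row maximum and rescaling the partial accumulators so `exp()` never sees a large positive argument. A minimal scalar Python sketch of that recurrence follows; all names are hypothetical and this is an illustration of the technique, not the shader's actual code:

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming, numerically stable softmax(scores)-weighted sum of values.

    Maintains a running row maximum `m` and rescales the partial
    accumulators whenever a larger score arrives, so exp() only ever
    sees non-positive arguments (avoiding the f16 overflow-to-inf
    problem the commit above addresses).
    """
    m = -math.inf   # running row maximum
    l = 0.0         # running sum of exp(score - m)
    acc = 0.0       # running exp-weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        # Rescale previous partial sums from base m to base m_new.
        scale = math.exp(m - m_new) if m != -math.inf else 0.0
        w = math.exp(s - m_new)     # argument is always <= 0
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l
```

In the shader, the same recurrence runs per row across K/V tiles, with the per-tile row maximum found by a subgroup/shared-memory reduction rather than a scalar loop.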
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: