ggml-org/llama.cpp b7865

Vulkan Flash Attention Coopmat1 Refactor (#19075)

  • vulkan: use coopmat for flash attention p*v matrix multiplication (see the sketch after this list)

  • fix P loading issue

  • fix barrier position

  • remove reduction that is no longer needed

  • move max thread reduction into loop

  • remove osh padding

  • add bounds checks and padding

  • remove unused code

  • fix shmem sizes, loop duration and accesses

  • don't overwrite Qf, add new shared psh buffer instead

  • add missing bounds checks

  • use subgroup reductions (see the sketch after this list)

  • optimize

  • move bounds check, reduce barriers

  • support other Bc values and other subgroup sizes

  • remove D_split

  • replace Of register array with shared memory Ofsh array

  • parallelize HSV across the rowgroups

  • go back to Of in registers, not shmem

  • vectorize sfsh

  • don't store entire K tile in shmem

  • fixes

  • load large k tiles to shmem on Nvidia

  • adapt shared memory host check function to shader changes

  • remove Bc 32 case

  • remove unused variable

  • fix missing mask reduction tmspsh barrier

  • fix mask bounds check

  • fix rowmax f16 under/overflow to inf (see the sketch after this list)

  • fix flash_attn_cm2 BLOCK_SIZE preprocessor directives

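The p*v item is the heart of the refactor: the softmax probabilities P and the V tile are pushed through the cooperative-matrix (coopmat1) path instead of a scalar loop. Below is a minimal sketch of that step, assuming 16x16x16 coopmat shapes and hypothetical shared buffers psh and vsh; the real flash_attn_cm1 shader uses its own names, tile sizes, bounds checks, and padding.

```glsl
#version 460
#extension GL_KHR_cooperative_matrix : enable
#extension GL_KHR_memory_scope_semantics : enable
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

// Hypothetical tile sizes and buffer names for illustration only.
const uint Br = 16;  // query rows in the tile
const uint Bc = 16;  // key/value rows in the current KV block
const uint D  = 16;  // head-dimension slice covered by one accumulator

shared float16_t psh[Br * Bc];  // softmax probabilities P for this tile
shared float16_t vsh[Bc * D];   // V tile staged in shared memory

void main() {
    coopmat<float16_t, gl_ScopeSubgroup, Br, Bc, gl_MatrixUseA> matP;
    coopmat<float16_t, gl_ScopeSubgroup, Bc, D,  gl_MatrixUseB> matV;
    // The output accumulator O stays in registers across KV blocks.
    coopmat<float16_t, gl_ScopeSubgroup, Br, D, gl_MatrixUseAccumulator> matO =
        coopmat<float16_t, gl_ScopeSubgroup, Br, D, gl_MatrixUseAccumulator>(float16_t(0.0));

    // Load both operands from shared memory and accumulate O += P * V
    // with a single cooperative-matrix multiply-add.
    coopMatLoad(matP, psh, 0, Bc, gl_CooperativeMatrixLayoutRowMajor);
    coopMatLoad(matV, vsh, 0, D,  gl_CooperativeMatrixLayoutRowMajor);
    matO = coopMatMulAdd(matP, matV, matO);
}
```

Keeping the O accumulator in registers rather than a shared Ofsh buffer matches the later "go back to Of in registers, not shmem" item in the list.
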
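The "use subgroup reductions" item refers to computing the online-softmax row statistics with subgroup arithmetic instead of shared-memory reductions. A rough sketch of the idea (placeholder values, not the PR's actual code):

```glsl
#version 460
#extension GL_KHR_shader_subgroup_arithmetic : enable

layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

void main() {
    // Each invocation holds a partial maximum of the attention scores for
    // its row; a placeholder value stands in for that partial here.
    float partial_max = float(gl_LocalInvocationID.x);

    // One subgroup instruction replaces a shared-memory tree reduction and
    // the barriers it needed.
    float row_max = subgroupMax(partial_max);

    // The online-softmax row sum of exp(score - row_max) folds the same way.
    float row_sum = subgroupAdd(exp(partial_max - row_max));
}
```
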
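On "fix rowmax f16 under/overflow to inf": a row maximum tracked in half precision can saturate to +/-inf, after which exp(score - rowmax) turns into inf or NaN. The sketch below shows the failure mode and one typical guard (clamping the running maximum to the finite f16 range); it is an assumption for illustration, not necessarily the exact change in the commit.

```glsl
#version 460
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

// Largest finite IEEE half-precision value.
const float16_t HALF_MAX = float16_t(65504.0);

// Hypothetical helper: keep the running row maximum inside the finite f16
// range so later exp(score - rowmax) terms cannot become inf or NaN.
float16_t safe_rowmax(float16_t rowmax, float16_t score) {
    return clamp(max(rowmax, score), -HALF_MAX, HALF_MAX);
}

void main() {
    // Placeholder usage so the sketch stands alone as a compute shader.
    float16_t m = safe_rowmax(-HALF_MAX, float16_t(gl_LocalInvocationID.x));
}
```
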
Release binaries are available for macOS/iOS, Linux, Windows, and openEuler.
