ggml-webgpu: add vectorized flash attention (#20709)
- naive vectorized version
- add vectorized flash attention
- update vec version
- remove unused path and shader
- remove unused helper functions
- add comments
- remove pad path
- ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization
- change back to vec4
- enable multi split
- enable vec path when:
  - Q->ne[1] < 20
  - Q->ne[0] % 32 == 0
  - V->ne[0] % 4 == 0
  - K->type == f16
- update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select
- enable vec path for q4 and q8
- flash-attn vec nwg=1 fast path (skip tmp/reduce staging)
- use packed f16 K loads in flash-attn vec split
- use packed f16 K loads in flash-attn vec split on host side
- tune flash-attn vec f16 VEC_NE by head dim
- cleanup
- cleanup
- keep host side clean
- cleanup host side
- change back to original host wait/submit behavior
- formatting
- reverted param-buffer pool refactor
- add helper functions
- ggml-webgpu: move flash-attn vec pipeline caching back into shader lib
- ggml-webgpu: remove duplicate functions
- ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation
- ggml-webgpu: revert unrelated change
- ggml-webgpu: revert deleted comment
- disable uniformity check
- remove unnecessary change
- Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
- Update ggml/src/ggml-webgpu/ggml-webgpu.cpp
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
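
The "enable vec path when:" conditions in the commit list above can be read as a single predicate over the Q, K and V tensors. The sketch below is illustrative only and is not the code from ggml-webgpu.cpp: `SketchType`, `SketchTensor` and `use_flash_attn_vec_path` are hypothetical names standing in for the real ggml types and dispatch logic. It covers only the four conditions listed at that point; a later commit in the same list ("enable vec path for q4 and q8") presumably widens the K type check, which is not modeled here.

```cpp
#include <cstdint>

// Hypothetical stand-ins for ggml's tensor type enum and tensor struct.
// These are assumptions for illustration, not the real ggml definitions.
enum class SketchType { F32, F16, Q4_0, Q8_0 };

struct SketchTensor {
    int64_t    ne[4];  // number of elements per dimension; ne[0] is the innermost
    SketchType type;
};

// Returns true when all four conditions from the commit message hold:
//   Q->ne[1] < 20, Q->ne[0] % 32 == 0, V->ne[0] % 4 == 0, K->type == f16
bool use_flash_attn_vec_path(const SketchTensor & q,
                             const SketchTensor & k,
                             const SketchTensor & v) {
    const bool few_query_rows = q.ne[1] < 20;        // small batch of query rows
    const bool q_dim_aligned  = q.ne[0] % 32 == 0;   // Q row length is a multiple of 32
    const bool v_dim_aligned  = v.ne[0] % 4 == 0;    // V row length fits vec4 accesses
    const bool k_is_f16       = k.type == SketchType::F16;
    return few_query_rows && q_dim_aligned && v_dim_aligned && k_is_f16;
}
```

Under these assumptions, a single-token decode with a 128-wide Q and an f16 K would take the vec path, while a prompt batch of 20 or more query rows would fall back to the non-vec kernel.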
macOS/iOS:
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: