github ggml-org/llama.cpp b8639

ggml-webgpu: add vectorized flash attention (#20709)

  • naive vectorized version

  • add vectorized flash attention

  • update vec version

  • remove unused path and shader

  • remove unused helper functions

  • add comments

  • remove pad path

  • ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization

  • change back to vec4

  • enable multi split

  • enable vec path when:

    • Q->ne[1] < 20
    • Q->ne[0] % 32 == 0
    • V->ne[0] % 4 == 0
    • K->type == f16

  • update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select

  • enable vec path for q4 and q8

  • flash-attn vec nwg=1 fast path (skip tmp/reduce staging)

  • use packed f16 K loads in flash-attn vec split

  • use packed f16 K loads in flash-attn vec split on host side

  • tune flash-attn vec f16 VEC_NE by head dim

  • cleanup

  • cleanup

  • keep host side clean

  • cleanup host side

  • change back to original host wait/submit behavior

  • formatting

  • reverted param-buffer pool refactor

  • add helper functions

  • ggml-webgpu: move flash-attn vec pipeline caching back into shader lib

  • ggml-webgpu: remove duplicate functions

  • ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation

  • ggml-webgpu: revert unrelated change

  • ggml-webgpu: revert deleted comment

  • disable uniformity check

  • remove unnecessary change

  • Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl

  • Update ggml/src/ggml-webgpu/ggml-webgpu.cpp


Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Builds: macOS/iOS, Linux, Windows, openEuler
