github ggml-org/llama.cpp b8639

ggml-webgpu: add vectorized flash attention (#20709)

  • naive vectorized version

  • add vectorized flash attention

  • update vec version

  • remove unused path and shader

  • remove unused helper functions

  • add comments

  • remove pad path

  • ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization

  • change back to vec4

  • enable multi split

  • enable vec path when:

    • Q->ne[1] < 20
    • Q->ne[0] % 32 == 0
    • V->ne[0] % 4 == 0
    • K->type == f16

  • update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select

  • enable vec path for q4 and q8

  • flash-attn vec nwg=1 fast path (skip tmp/reduce staging)

  • use packed f16 K loads in flash-attn vec split

  • use packed f16 K loads in flash-attn vec split on host side

  • tune flash-attn vec f16 VEC_NE by head dim

  • cleanup

  • cleanup

  • keep host side clean

  • cleanup host side

  • change back to original host wait/submit behavior

  • formatting

  • reverted param-buffer pool refactor

  • add helper functions

  • ggml-webgpu: move flash-attn vec pipeline caching back into shader lib

  • ggml-webgpu: remove duplicate functions

  • ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation

  • ggml-webgpu: revert unrelated change

  • ggml-webgpu: revert deleted comment

  • disable uniformity check

  • remove unnecessary change

  • Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl

  • Update ggml/src/ggml-webgpu/ggml-webgpu.cpp


Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Builds: macOS/iOS, Linux, Windows, openEuler
