Details
ggml webgpu: initial flashattention implementation (#18610)
-
FlashAttention (#13)
-
Add inplace softmax
-
Move rms_norm to split row approach
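For context, rms_norm normalizes each row by the root mean square of its elements; a minimal scalar sketch of what one row's work amounts to (the actual kernel splits the row across a workgroup and reduces in shared memory; names here are illustrative):

```cpp
#include <cmath>
#include <cstddef>

// Scalar reference for RMS norm over one row (sketch only, not the WGSL kernel).
// 'eps' guards against division by zero for all-zero rows.
static void rms_norm_row(const float * x, float * dst, size_t n, float eps) {
    float sum_sq = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum_sq += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sum_sq / (float) n + eps);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = x[i] * scale;
    }
}
```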
-
Update debug for supports_op
-
clean up debug statements
-
neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet
-
neg passes backend test
-
unary operators pass ggml tests
-
rms_norm double declaration bug atoned
-
abides by editor-config
-
removed vestigial files
-
fixed autoconfig
-
All operators (including xielu) working
-
removed unnecessary check for whether node->src[1] exists for unary operators
-
responded to and addressed PR comments
-
implemented REPL_Template support and fixed a bug in the unary operators kernel
-
formatted embed wgsl and ggml-webgpu.cpp
-
Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
-
Use map for shader replacements instead of pair of strings
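The replacement map amounts to substituting named placeholders into the WGSL source; a minimal sketch of the pattern (the placeholder syntax and function name here are illustrative, not the backend's actual ones):

```cpp
#include <map>
#include <string>

// Sketch: substitute every "{{KEY}}" placeholder in a shader template with its
// value from a map, instead of carrying individual (search, replace) string pairs.
static std::string apply_shader_replacements(std::string src, const std::map<std::string, std::string> & repls) {
    for (const auto & [key, value] : repls) {
        const std::string pattern = "{{" + key + "}}";
        size_t pos = src.find(pattern);
        while (pos != std::string::npos) {
            src.replace(pos, pattern.size(), value);
            pos = src.find(pattern, pos + value.size());
        }
    }
    return src;
}
```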
-
Wasm (#9)
-
webgpu : fix build on emscripten
-
more debugging stuff
-
test-backend-ops: force single thread on wasm
-
fix single-thread case for init_tensor_uniform
-
use jspi
-
add pthread
-
test: remember to set n_thread for cpu backend
-
Add buffer label and enable dawn-specific toggles to turn off some checks
-
Intermediate state
-
Fast working f16/f32 vec4
-
Working float fast mul mat
-
Clean up naming of mul_mat to match logical model, start work on q mul_mat
-
Setup for subgroup matrix mat mul
-
Basic working subgroup matrix
-
Working subgroup matrix tiling
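As a mental model, the subgroup-matrix shader computes one output tile per workgroup by walking the K dimension in fixed-size steps; a scalar sketch of that decomposition (tile sizes and names are illustrative, and dimensions are assumed to be multiples of the tile sizes, matching the divisibility restriction in the next entry):

```cpp
// Scalar sketch of tiled matrix multiplication: each workgroup owns one
// TM x TN tile of C and accumulates A (M x K) times B (K x N) in TK-wide
// steps of K. On the GPU the innermost block is a subgroup matrix
// multiply-add over tiles staged in shared memory.
constexpr int TM = 16, TN = 16, TK = 16;

static void matmul_tile(const float * A, const float * B, float * C,
                        int N, int K, int tile_row, int tile_col) {
    float acc[TM][TN] = {};
    for (int k0 = 0; k0 < K; k0 += TK) {
        for (int i = 0; i < TM; ++i) {
            for (int j = 0; j < TN; ++j) {
                for (int k = 0; k < TK; ++k) {
                    acc[i][j] += A[(tile_row * TM + i) * K + (k0 + k)] *
                                 B[(k0 + k) * N + (tile_col * TN + j)];
                }
            }
        }
    }
    for (int i = 0; i < TM; ++i) {
        for (int j = 0; j < TN; ++j) {
            C[(tile_row * TM + i) * N + (tile_col * TN + j)] = acc[i][j];
        }
    }
}
```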
-
Handle weirder sg matrix sizes (still a multiple of the sg matrix size)
-
Working start to gemv
-
working f16 accumulation with shared memory staging
-
Print out available subgroup matrix configurations
-
Vectorize dst stores for sg matrix shader
-
Gemv working scalar
-
Minor set_rows optimization (#4)
-
updated optimization, fixed errors
-
non vectorized version now dispatches one thread per element
-
Simplify
-
Change logic for set_rows pipelines
Co-authored-by: Neha Abbas nehaabbas@macbookpro.lan
Co-authored-by: Neha Abbas nehaabbas@ReeseLevines-MacBook-Pro.local
Co-authored-by: Reese Levine reeselevine1@gmail.com
-
Comment on dawn toggles
-
Working subgroup matrix code for (semi)generic sizes
-
Remove some comments
-
Cleanup code
-
Update dawn version and move to portable subgroup size
-
Try to fix new dawn release
-
Update subgroup size comment
-
Only check for subgroup matrix configs if they are supported
-
Add toggles for subgroup matrix/f16 support on nvidia+vulkan
-
Make row/col naming consistent
-
Refactor shared memory loading
-
Move sg matrix stores to correct file
-
Working q4_0
-
Formatting
-
Work with emscripten builds
-
Fix test-backend-ops emscripten for f16/quantized types
-
Use emscripten memory64 to support get_memory
-
Add build flags and try ci
Co-authored-by: Xuan Son Nguyen son@huggingface.co
-
Remove extra whitespace
-
Move wasm single-thread logic out of test-backend-ops for cpu backend
-
Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
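The guard for single-threaded wasm builds is essentially a compile-time clamp on the thread count; a sketch of the idea (the real change lives in ggml_graph_plan, this helper is only illustrative):

```cpp
// Sketch: clamp the requested thread count for Emscripten builds compiled
// without pthread support, where worker threads cannot be spawned.
// __EMSCRIPTEN_PTHREADS__ is defined by emcc when building with -pthread.
static int effective_n_threads(int requested) {
#if defined(__EMSCRIPTEN__) && !defined(__EMSCRIPTEN_PTHREADS__)
    (void) requested;
    return 1;
#else
    return requested;
#endif
}
```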
-
Refactored pipelines and workgroup calculations (#10)
-
refactored pipelines
-
refactored workgroup calculation
-
removed commented out block of prior maps
-
Clean up ceiling division pattern
Co-authored-by: Neha Abbas nehaabbas@eduroam-169-233-141-223.ucsc.edu
Co-authored-by: Reese Levine reeselevine1@gmail.com
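The ceiling-division pattern mentioned above is the usual way to size a dispatch: how many workgroups of a given size are needed to cover n elements. A minimal sketch (the helper's actual name in the backend may differ):

```cpp
#include <cstdint>

// Round-up integer division: number of groups of size 'd' needed to cover 'n'.
static constexpr uint32_t ceil_div(uint32_t n, uint32_t d) {
    return (n + d - 1) / d;
}

// e.g. dispatching one thread per element:
//   uint32_t n_workgroups = ceil_div(n_elements, wg_size);
```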
-
Start work on flash attention
-
Shader structure set up (many bugs still)
-
debugging
-
Working first test
-
Working with head grouping, head sizes up to 128, logit softcap, mask/sinks enabled, f32
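Logit softcapping squashes attention scores into a bounded range before softmax; a sketch of the standard tanh-based transform (the shader applies this shape of function, but the exact staging and names are its own):

```cpp
#include <cmath>

// Illustrative only: soft-cap an attention score into (-softcap, softcap)
// before softmax, as commonly done when logit softcap is enabled.
static inline float apply_logit_softcap(float score, float softcap) {
    return softcap * std::tanh(score / softcap);
}
```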
-
Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
-
Start work on integrating pre-wgsl
-
Separate structs/initial shader compilation library into separate files
-
Work on compilation choices for flashattention
-
Work on subgroup matrix/tile size portability
-
subgroup size agnostic online softmax
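Online softmax keeps a running maximum and running sum so scores can be folded in block by block without a second pass; a scalar sketch of the update step (names are illustrative, and the real shader performs this per subgroup with a cross-subgroup reduction at the end):

```cpp
#include <algorithm>
#include <cmath>

// Scalar sketch of the online-softmax update used by FlashAttention-style
// kernels. 'm' is the running maximum, 'l' the running sum of exp(score - m).
// Each new score rescales the existing state; the returned factor is what the
// caller must also apply to its value accumulator.
struct OnlineSoftmaxState {
    float m = -INFINITY;
    float l = 0.0f;
};

static float online_softmax_step(OnlineSoftmaxState & st, float score) {
    const float m_new = std::max(st.m, score);
    const float scale = std::exp(st.m - m_new);
    st.l = st.l * scale + std::exp(score - m_new);
    st.m = m_new;
    return scale;
}
```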
-
Cleanups, quantization types
-
more cleanup
-
fix wasm build
-
Refactor flashattention to increase parallelism, use direct loads for KV in some cases
-
Checkpoint
-
formatting
-
Update to account for default kv cache padding
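Accounting for the default KV-cache padding means the kernel sees a KV length rounded up to a fixed multiple; a sketch of the rounding (the multiple shown is illustrative, not necessarily the actual default):

```cpp
#include <cstdint>

// Round 'x' up to the next multiple of 'multiple' (sketch; ggml has its own
// padding macro for this kind of rounding).
static constexpr uint32_t round_up(uint32_t x, uint32_t multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}

// e.g. uint32_t n_kv_padded = round_up(n_kv, 256);  // 256 is illustrative
```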
-
formatting shader
-
Add workflow for ggml-ci webgpu
-
Try passing absolute path to dawn in ggml-ci
-
Avoid error on device destruction, add todos for proper cleanup
-
Fix unused warning
-
Forgot one parameter unused
-
Move some flashattn computation to f32 for correctness