github ggml-org/llama.cpp b8749


ggml-webgpu: address quantization precision and backend lifecycle management (#21521)

  • ggml(webgpu): fix busy-polling in waitAny under Emscripten (after #20618), and remove the busy-wait webgpu log

  • Merge with upstream

  • Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants

  • Update Unary wgsl EXP and EXPM1 for f16 stability

  • Fix GET_ROWS IQ4_XS struct for f16 NaN canonicalization

  • Fix numerical precision for unary sqrt when working with f16

  • Fix NaN canonicalization for packed integers using f16

  • Update err threshold for binary div ops when using f16

  • backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend

  • clean: uncomment existing code logs

  • clean: remove unnecessary debug info

  • Refactor and generalize dequant helpers

  • Remove deprecated quant structs

  • Refactor shader defines to reduce repetition

  • Remove error override for F16 type

  • fix: fix the accidental removal of the proper initialization of ctx

  • clean: clean legacy and format code

  • fix: do not modify test ops


Co-authored-by: Jeremy J. Hartmann jeremy@mtion.tv

macOS/iOS:

Linux:

Windows:

openEuler:
