ggml-org/llama.cpp b9510 on GitHub

ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (#22209)

ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128

Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using
WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so
non-wasm builds are completely unaffected.

Approach:

- single wasm_v128_load covers all 32 packed 4-bit weights
- nibbles unpacked via AND/SHR into two u8x16 registers
- widened to i16 before multiply (WASM SIMD has no i8*i8 instruction)
- 4x wasm_i32x4_dot_i16x8 calls accumulate all 32 element pairs
- horizontal reduce via 4x wasm_i32x4_extract_lane
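
The data flow of the steps above can be sketched in portable C. This is not the code from the PR: it is a scalar emulation of the lane structure, written so it compiles without wasm_simd128.h. The function names are hypothetical, and the nibble layout (low nibble of byte j pairs with element j, high nibble with element j + 16) is assumed to match ggml's q4_1 format. Only the integer part of the block dot product is shown; the scale and min terms are left out.

```c
#include <stdint.h>

#define QK 32  /* elements per block, matching QK8_1=32 from the benchmark note */

/* Scalar reference for the integer part of one block (hypothetical helper). */
static int32_t dot_nibbles_scalar(const uint8_t q4[QK / 2], const int8_t q8[QK]) {
    int32_t sum = 0;
    for (int j = 0; j < QK / 2; ++j) {
        sum += (int32_t)(q4[j] & 0x0F) * q8[j];           /* low nibble  */
        sum += (int32_t)(q4[j] >> 4)   * q8[j + QK / 2];  /* high nibble */
    }
    return sum;
}

/* Plain-C emulation of the SIMD steps: unpack, widen, 4 dot steps, reduce. */
static int32_t dot_nibbles_lanes(const uint8_t q4[QK / 2], const int8_t q8[QK]) {
    uint8_t lo[16], hi[16];
    for (int j = 0; j < 16; ++j) {   /* AND / SHR unpack into two "u8x16" values */
        lo[j] = q4[j] & 0x0F;
        hi[j] = q4[j] >> 4;
    }

    int32_t acc[4] = {0, 0, 0, 0};   /* stands in for one i32x4 accumulator */

    /* Each step mimics one i32x4 dot of two i16x8 inputs: 8 widened pairs,
       with adjacent products summed into 4 lanes. */
    for (int step = 0; step < 4; ++step) {
        const uint8_t *w = (step < 2) ? lo : hi;
        const int8_t  *a = q8 + ((step < 2) ? 0 : 16);
        int base = (step & 1) * 8;
        for (int lane = 0; lane < 4; ++lane) {
            int16_t w0 = (int16_t)w[base + 2 * lane];
            int16_t w1 = (int16_t)w[base + 2 * lane + 1];
            int16_t a0 = (int16_t)a[base + 2 * lane];
            int16_t a1 = (int16_t)a[base + 2 * lane + 1];
            acc[lane] += (int32_t)w0 * a0 + (int32_t)w1 * a1;
        }
    }

    /* Horizontal reduce: the four lane extracts, then a scalar sum. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Returns 1 when both paths agree on a fixed test pattern. */
static int self_check(void) {
    uint8_t q4[QK / 2];
    int8_t  q8[QK];
    for (int j = 0; j < QK / 2; ++j) q4[j] = (uint8_t)(j * 37 + 11);
    for (int i = 0; i < QK; ++i)     q8[i] = (int8_t)(i * 7 - 100);
    return dot_nibbles_lanes(q4, q8) == dot_nibbles_scalar(q4, q8);
}
```

Because every product fits comfortably in 32 bits, the lane-wise accumulation and the scalar loop give the same integer regardless of summation order.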

Benchmark (node v25, emcc -O3 -msimd128, 64 blocks x QK8_1=32,
200k iterations):

impl    ns/call  speedup
scalar  880.7    1.00x
simd    257.8    3.42x

Correctness verified against scalar reference across 10 random seeds
with exact output match.
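
A check of this kind might be structured roughly as below. This is a sketch under stated assumptions, not the harness used for the PR: the block structs are hypothetical and use float fields where ggml stores fp16, the random generator is a simple LCG, and isum_alt is only a stand-in for the optimized path (same arithmetic, different summation order). The 10 seeds and 64 blocks follow the numbers quoted above.

```c
#include <stdint.h>
#include <string.h>

#define NB 64  /* blocks per vector, as in the benchmark note */

/* Hypothetical block layouts: float scales stand in for ggml's fp16 fields. */
typedef struct { float d, m; uint8_t qs[16]; } blk_q4_1;
typedef struct { float d, s; int8_t  qs[32]; } blk_q8_1;

typedef int32_t (*isum_fn)(const uint8_t *q4, const int8_t *q8);

/* Reference order: low and high nibble handled together per byte. */
static int32_t isum_ref(const uint8_t *q4, const int8_t *q8) {
    int32_t s = 0;
    for (int j = 0; j < 16; ++j)
        s += (q4[j] & 0x0F) * q8[j] + (q4[j] >> 4) * q8[j + 16];
    return s;
}

/* Stand-in for the optimized path: all low nibbles first, then all high. */
static int32_t isum_alt(const uint8_t *q4, const int8_t *q8) {
    int32_t lo = 0, hi = 0;
    for (int j = 0; j < 16; ++j) lo += (q4[j] & 0x0F) * q8[j];
    for (int j = 0; j < 16; ++j) hi += (q4[j] >> 4) * q8[j + 16];
    return lo + hi;
}

/* Per block: (d4 * d8) * integer_sum + m4 * s8, accumulated in float. */
static float vec_dot(const blk_q4_1 *x, const blk_q8_1 *y, isum_fn isum) {
    float sumf = 0.0f;
    for (int ib = 0; ib < NB; ++ib)
        sumf += (x[ib].d * y[ib].d) * (float)isum(x[ib].qs, y[ib].qs)
              + x[ib].m * y[ib].s;
    return sumf;
}

/* Small linear congruential generator so each seed is reproducible. */
static uint32_t lcg_next(uint32_t *st) {
    *st = *st * 1664525u + 1013904223u;
    return *st >> 8;
}

/* Returns how many seeds produced a bitwise-different result. */
static int count_mismatches(int n_seeds) {
    static blk_q4_1 x[NB];
    static blk_q8_1 y[NB];
    int bad = 0;
    for (int seed = 1; seed <= n_seeds; ++seed) {
        uint32_t st = (uint32_t)seed;
        for (int ib = 0; ib < NB; ++ib) {
            int32_t qsum = 0;
            x[ib].d = (float)(lcg_next(&st) % 256) / 256.0f;
            x[ib].m = (float)(lcg_next(&st) % 256) / 256.0f;
            y[ib].d = (float)(lcg_next(&st) % 256) / 256.0f;
            for (int j = 0; j < 16; ++j) x[ib].qs[j] = (uint8_t)lcg_next(&st);
            for (int j = 0; j < 32; ++j) {
                y[ib].qs[j] = (int8_t)((int)(lcg_next(&st) % 255) - 127);
                qsum += y[ib].qs[j];
            }
            y[ib].s = y[ib].d * (float)qsum;  /* scale times sum of quants */
        }
        float a = vec_dot(x, y, isum_ref);
        float b = vec_dot(x, y, isum_alt);
        if (memcmp(&a, &b, sizeof a) != 0) ++bad;  /* exact, not approximate */
    }
    return bad;
}
```

Calling count_mismatches(10) should return 0 here: the integer sums are order-independent, and both paths feed them through the same floating-point expression, which is why an exact match is a reasonable bar for this kind of change.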

ggml: move q4_1_q8_1 WASM SIMD implementation to wasm backend

Relocate the SIMD128 implementation of ggml_vec_dot_q4_1_q8_1 to ggml/src/ggml-cpu/arch/wasm/quants.c to follow the architecture-specific layout. Restore the generic implementation in ggml/src/ggml-cpu/quants.c.
Move the for loop into the else block.

ggml: use generic q4_1_q8_1 fallback in wasm backend

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI: