github ggml-org/llama.cpp b8190


ggml webgpu: fix workgroup dispatch limit for large batch sizes (#19965)

  • ggml-webgpu: fix workgroup dispatch limit for large batch sizes

WebGPU limits workgroup dispatches to 65535 per dimension. Large MUL_MAT
operations with batch sizes exceeding this limit would fail.

  • add compute_2d_workgroups() helper to split total workgroup ID across
    X/Y dimensions

  • update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D
    dispatch

  • update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID
    from 2D dispatch

  • update mul_mat.wgsl to compute global index from 2D workgroup
    coordinates

  • refactor all three mul_mat dispatch paths to use the shared helper

  • ggml-webgpu: add bounds checking for over-dispatched workgroups

2D workgroup dispatch can over-dispatch when total workgroups don't
divide evenly into the 65535 per-dimension limit. Extra workgroups
would compute invalid batch indices, causing memory corruption.

  • add batch_idx bound check to mul_mat_reg_tile.wgsl and
    mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups
    from accessing invalid memory

  • fixes test failures with large batch sizes (e.g., bs=[128, 1024])

  • ggml-webgpu: add back TODO for splitting large sizes into batches

  • Optimize 2D workgroup provisioning

  • Set some parameters that increase speed


Co-authored-by: Reese Levine <reeselevine1@gmail.com>

