github ggml-org/llama.cpp b8190


ggml webgpu: fix workgroup dispatch limit for large batch sizes (#19965)

  • ggml-webgpu: fix workgroup dispatch limit for large batch sizes

WebGPU limits workgroup dispatches to 65535 per dimension. Large MUL_MAT
operations with batch sizes exceeding this limit would fail.

  • add compute_2d_workgroups() helper to split total workgroup ID across
    X/Y dimensions

  • update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D
    dispatch

  • update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID
    from 2D dispatch

  • update mul_mat.wgsl to compute global index from 2D workgroup
    coordinates

  • refactor all three mul_mat dispatch paths to use the shared helper

  • ggml-webgpu: add bounds checking for over-dispatched workgroups

2D workgroup dispatch can over-dispatch when total workgroups don't
divide evenly into the 65535 per-dimension limit. Extra workgroups
would compute invalid batch indices, causing memory corruption.

  • add batch_idx bound check to mul_mat_reg_tile.wgsl and
    mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups
    from accessing invalid memory

  • fixes test failures with large batch sizes (e.g., bs=[128, 1024])

  • ggml-webgpu: add back TODO for splitting large sizes into batches

  • Optimize 2D workgroup provisioning

  • Set some parameters that increase speed


Co-authored-by: Reese Levine <reeselevine1@gmail.com>

