github ggml-org/llama.cpp b9112


CUDA: handle OW > 65535 in im2col (2D and 3D) (#22944)

im2col_cuda and im2col_3d_cuda both dispatch with
block_nums.y = OW, but CUDA caps grid Y at 65535. Conv1d encoders on
raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. a SEANet
encoder on 11 s of audio lands at OW = 176000 -- and the launch fails
with "invalid configuration argument".

Clamp block_nums.y to MIN(OW, MAX_GRIDDIM_Y) and loop inside the
kernel with stride MAX_GRIDDIM_Y. The same in-kernel stride pattern is
already used for the z axis (MAX_GRIDDIM_Z). Both the 2D im2col_kernel
and the 3D im2col_3d_kernel need the same fix. Results are bit-identical
for OW <= 65535, where the new outer loop runs a single iteration.

Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s /
16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns
invalid configuration argument, post-fix runs to completion.
Existing test-backend-ops im2col cases unchanged.

