github ggml-org/llama.cpp b9275

one hour ago

metal : optimize concat kernel and fix set kernel threads (#23411)

  • metal : fix GGML_OP_SET kernel threads

  • tests : extend test_cpy to support different src/dst shapes

Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.

  • Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
  • Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
  • Tests exercise 1024 boundary, small shapes, and large dimensionality changes
  • Fixed dangling reference bug (storing & to temporary std::array)
  • Updated all existing test calls with permute/transpose args for compatibility
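The shape rule and the dangling-reference fix described above can be sketched roughly as follows. This is an illustration only, not the actual test_cpy code: `cpy_case`, `nelements`, and the "negative entry means use the source shape" default are hypothetical names and conventions chosen for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using shape4 = std::array<int64_t, 4>;

// Total number of elements described by a 4D shape.
static int64_t nelements(const shape4 & ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}

// Hypothetical stand-in for the parameters of a CPY test case.
struct cpy_case {
    // Stored by value. A `const shape4 &` member bound to a temporary
    // std::array would dangle once the constructor call ends.
    shape4 ne_src;
    shape4 ne_dst;

    // Assumption of this sketch: a negative first entry in `dst`
    // means "reuse the source shape".
    cpy_case(shape4 src, shape4 dst = {-1, -1, -1, -1})
        : ne_src(src), ne_dst(dst[0] < 0 ? src : dst) {}

    // A reshaping copy is only valid when the element counts match.
    bool shapes_compatible() const {
        return nelements(ne_src) == nelements(ne_dst);
    }
};
```

For example, a 1D source of 1024 elements is compatible with a 32x32 destination, while a 32x31 destination is rejected.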

Assisted-by: llama.cpp:local pi

  • metal : optimize concat kernel with row batching for small widths

When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.

  • Dispatch nth = min(256, ne0) threads per group
  • Calculate nrptg (rows per threadgroup) to fill up to 256 threads
  • Update kernel index calculation to handle the row batching
  • Add boundary check for i1 >= ne1
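A CPU-side sketch of the dispatch arithmetic listed above, to show why the boundary check is needed. It is not the Metal kernel or the actual host code: `plan_concat`, `rows_visited`, and the use of floor division for nrptg are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct concat_dispatch {
    int     nth;    // threads along ne0 per threadgroup
    int     nrptg;  // rows per threadgroup
    int64_t ntg;    // threadgroups needed to cover all ne1 rows
};

// Assumes ne0 >= 1 and ne1 >= 0.
static concat_dispatch plan_concat(int64_t ne0, int64_t ne1) {
    const int max_threads = 256;
    const int nth   = (int) std::min<int64_t>(max_threads, ne0);
    // Assumption: pack as many whole rows as fit into 256 threads.
    const int nrptg = std::max(1, max_threads / nth);
    const int64_t ntg = (ne1 + nrptg - 1) / nrptg;  // round up
    return { nth, nrptg, ntg };
}

// Emulates the row index computation of the batched kernel on the CPU
// and counts how often each row is processed.
static std::vector<int> rows_visited(int64_t ne0, int64_t ne1) {
    const concat_dispatch d = plan_concat(ne0, ne1);
    std::vector<int> visits((size_t) ne1, 0);
    for (int64_t tg = 0; tg < d.ntg; ++tg) {
        for (int r = 0; r < d.nrptg; ++r) {
            const int64_t i1 = tg * d.nrptg + r;
            if (i1 >= ne1) {
                continue;  // last threadgroup may be only partially filled
            }
            visits[(size_t) i1]++;
        }
    }
    return visits;
}
```

With ne0 = 32 and ne1 = 10, this plan uses 32 threads per row and 8 rows per threadgroup, so 2 threadgroups cover the 10 rows and the second one skips 6 out-of-range row slots.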

Assisted-by: llama.cpp:local pi

  • tests : clean-up

  • tests : refactor CPY shape tests to use dimension permutations

Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).
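The permutation loop described above might look roughly like this. It is a sketch under assumptions, not the code from the test suite: `shape_pair`, `make_cpy_shape_pairs`, and the `require_ne0_32` flag are hypothetical names. The flag models the Q4_0 restriction, which follows from Q4_0 quantizing in blocks of 32 values along ne0.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <vector>

using shape4 = std::array<int64_t, 4>;

struct shape_pair {
    shape4 src;
    shape4 dst;
};

// Builds (src, dst) shape pairs: every permutation of {3, 5, 7, 32} as src,
// paired with the sorted and the reversed shape as dst.
static std::vector<shape_pair> make_cpy_shape_pairs(bool require_ne0_32) {
    shape4 dims = {3, 5, 7, 32};  // sorted, so next_permutation visits all 24 orders
    const shape4 dst_sorted  = dims;
    const shape4 dst_reverse = {32, 7, 5, 3};

    std::vector<shape_pair> pairs;
    do {
        for (const shape4 & dst : {dst_sorted, dst_reverse}) {
            if (dims == dst) {
                continue;  // identical shapes are not a reshape
            }
            if (require_ne0_32 && (dims[0] != 32 || dst[0] != 32)) {
                continue;  // quantized case: both rows must be 32 wide
            }
            pairs.push_back({dims, dst});
        }
    } while (std::next_permutation(dims.begin(), dims.end()));
    return pairs;
}
```

Every generated shape has 3 * 5 * 7 * 32 = 3360 elements, so each pair satisfies the matching-element-count requirement by construction.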

Assisted-by: llama.cpp:local pi

