ggml-org/llama.cpp b9275 on GitHub

metal : optimize concat kernel and fix set kernel threads (#23411)

metal : fix GGML_OP_SET kernel threads
tests : extend test_cpy to support different src/dst shapes

Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.

- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise the 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility
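
The element-count constraint on reshaping CPY cases can be illustrated with a
minimal sketch. This is not the test_cpy code itself: the helper name
cpy_shapes_compatible is hypothetical, and the only things taken from the
commit message are the ne_src/ne_dst names and the rule that ne_dst defaults
to the src shape.

```python
from math import prod

def cpy_shapes_compatible(ne_src, ne_dst=None):
    """Hypothetical helper: True when src and dst hold the same number of elements."""
    if ne_dst is None:
        # Mirrors the described default: with no ne_dst, dst uses the src shape.
        ne_dst = ne_src
    return prod(ne_src) == prod(ne_dst)

# A 2D -> 1D reshape that keeps 4096 elements is a valid CPY case.
print(cpy_shapes_compatible((1024, 4, 1, 1), (4096, 1, 1, 1)))
# Shapes with different element counts are not.
print(cpy_shapes_compatible((3, 5, 7, 32), (3, 5, 7, 31)))
```

Shapes are written as 4-element tuples here only because ggml tensors carry
four dimensions; the check itself works for any length.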

Assisted-by: llama.cpp:local pi

metal : optimize concat kernel with row batching for small widths

When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.

- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1
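
The dispatch arithmetic above can be sketched on the host side as follows.
This is a simplified model, not the Metal kernel: it assumes nrptg is computed
as the integer quotient 256 / nth, it batches along ne1 only, and the function
names concat_dispatch and rows_covered are hypothetical.

```python
MAX_THREADS = 256  # per-threadgroup thread budget named in the commit message

def concat_dispatch(ne0, ne1):
    """Return (nth, nrptg, n_groups) for a row-batched dispatch (assumed formula)."""
    nth = min(MAX_THREADS, ne0)            # threads spent on one row
    nrptg = max(1, MAX_THREADS // nth)     # assumption: rows that fit in 256 threads
    n_groups = (ne1 + nrptg - 1) // nrptg  # ceil(ne1 / nrptg) threadgroups
    return nth, nrptg, n_groups

def rows_covered(ne0, ne1):
    """Simulate the per-threadgroup row index, including the i1 >= ne1 check."""
    _, nrptg, n_groups = concat_dispatch(ne0, ne1)
    rows = []
    for tg in range(n_groups):
        for r in range(nrptg):
            i1 = tg * nrptg + r
            if i1 >= ne1:
                continue  # the last threadgroup may extend past the final row
            rows.append(i1)
    return rows

print(concat_dispatch(32, 10))   # narrow rows: several rows share a threadgroup
print(concat_dispatch(512, 10))  # wide rows: one row per threadgroup
```

The boundary check matters whenever ne1 is not a multiple of nrptg, since the
final threadgroup then contains row slots with no corresponding tensor row.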

Assisted-by: llama.cpp:local pi

tests : clean-up
tests : refactor CPY shape tests to use dimension permutations

Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against the canonical sorted dst shape and its reverse, skipping pairs
where src and dst are identical. Covers F32, F16, and Q4_0 (the latter
only when both src and dst have ne0 == 32).
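
A rough sketch of how such a permutation loop could enumerate cases is shown
below. It is an illustration, not the test code: it assumes "canonical sorted"
means ascending order, and the function name cpy_shape_cases is hypothetical.
The Q4_0 condition reflects that format's block size of 32 elements along ne0.

```python
from itertools import permutations

DIMS = (3, 5, 7, 32)  # dimension set from the commit message (3*5*7*32 = 3360)
Q4_0_BLOCK = 32       # Q4_0 quantizes in blocks of 32 elements along ne0

def cpy_shape_cases():
    """Enumerate (src, dst, types) triples under the stated assumptions."""
    canonical = tuple(sorted(DIMS))           # assumption: ascending order
    reverse = tuple(reversed(canonical))
    cases = []
    for src in permutations(DIMS):
        for dst in (canonical, reverse):
            if src == dst:
                continue                      # identical shapes are skipped
            types = ["F32", "F16"]
            if src[0] == Q4_0_BLOCK and dst[0] == Q4_0_BLOCK:
                types.append("Q4_0")          # only when both ne0 == 32
            cases.append((src, dst, types))
    return cases

for src, dst, types in cpy_shape_cases()[:3]:
    print(src, "->", dst, types)
```

Because every shape is a permutation of the same four dimensions, each
src/dst pair has the same element count by construction, so no separate size
check is needed inside the loop.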

Assisted-by: llama.cpp:local pi
