github ggml-org/llama.cpp b8333


graph : remove redundant GDN state transposes (#20443)

  • ggml : transpose fused GDN state access for coalesced memory reads (#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that wasted GPU cache bandwidth. This produced
a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:

  • Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
  • CUDA: curr_state[i*S_v + col] -> curr_state[col*S_v + i] (coalesced)
  • CPU: restructured loops for row-wise transposed access
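The indexing change above can be sketched in plain C. This is an illustrative stand-in, not the actual kernel code: the helper names and the fixed S_V are assumptions, but the stride arithmetic matches the row-major [S_v, S_v] layout described in the notes.

```c
#include <stddef.h>

#define S_V 128  /* head dim from the notes; the real value is model-dependent */

/* Row-major [S_v, S_v] state: element (row, col) lives at state[row*S_V + col]. */

/* Before: walking the row index i for a fixed col reads elements that are
 * S_V floats (512 bytes) apart, so each access touches a new cache line. */
static inline float read_column_wise(const float *state, int i, int col) {
    return state[i * S_V + col];   /* stride S_V between consecutive i */
}

/* After transposing the indexing: consecutive i values read adjacent floats,
 * so GPU threads in a warp can coalesce into one wide memory transaction. */
static inline float read_row_wise(const float *state, int col, int i) {
    return state[col * S_V + i];   /* stride 1 between consecutive i */
}
```

Both helpers address the same buffer; only which index varies fastest changes, which is why the fix is a pure re-indexing with no extra memory traffic.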

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

  • ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags
  • Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
    dot products in the CPU fused GDN kernel (delta and attention output)
  • Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
    path lacks device support, disable both to prevent state layout mismatch
    between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
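The scalar-to-SIMD replacement above can be sketched as follows. The function name `vec_dot_f32` is a hypothetical stand-in for the real `ggml_vec_dot_f32` (whose exact signature is not shown here); the sketch only illustrates the multi-accumulator pattern that lets compilers vectorize a dot product instead of serializing it on one dependency chain.

```c
/* Hypothetical stand-in for ggml_vec_dot_f32: four independent accumulators
 * break the add dependency chain so the loop auto-vectorizes to SIMD FMAs. */
static float vec_dot_f32(int n, const float *x, const float *y) {
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        sum0 += x[i + 0] * y[i + 0];
        sum1 += x[i + 1] * y[i + 1];
        sum2 += x[i + 2] * y[i + 2];
        sum3 += x[i + 3] * y[i + 3];
    }
    float sum = sum0 + sum1 + sum2 + sum3;
    for (; i < n; ++i)          /* scalar tail for n not divisible by 4 */
        sum += x[i] * y[i];
    return sum;
}
```

A naive `for (i = 0; i < n; ++i) sum += x[i]*y[i];` loop forces each add to wait on the previous one; the unrolled accumulators remove that serialization, which is the gain the commit claims for the delta and attention-output loops.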

  • llama : revert fgdn argument changes

  • graph : remove GDN state transposes

  • vulkan : adapt

  • cuda : remove obsolete smem code


Co-authored-by: Paul Flynn paul@arkavo.com
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Co-authored-by: Oliver Simons osimons@nvidia.com

