github ggml-org/llama.cpp b8333


graph : remove redundant GDN state transposes (#20443)

  • ggml : transpose fused GDN state access for coalesced memory reads (#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that wasted GPU cache bandwidth. This produced
a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:

  • Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
  • CUDA: curr_state[i*S_v + col] -> curr_state[col*S_v + i] (coalesced)
  • CPU: restructured loops for row-wise transposed access
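The indexing change above can be sketched in plain C. This is an illustrative stand-in, not the actual kernel code: the helper names and the fixed S_V are assumptions, but the stride arithmetic matches the row-major [S_v, S_v] layout described in the notes.

```c
#include <stddef.h>

#define S_V 128  /* head dim from the notes; the real value is model-dependent */

/* Row-major [S_v, S_v] state: element (row, col) lives at state[row*S_V + col]. */

/* Before: walking the row index i for a fixed col reads elements that are
 * S_V floats (512 bytes) apart, so each access touches a new cache line. */
static inline float read_column_wise(const float *state, int i, int col) {
    return state[i * S_V + col];   /* stride S_V between consecutive i */
}

/* After transposing the indexing: consecutive i values read adjacent floats,
 * so GPU threads in a warp can coalesce into one wide memory transaction. */
static inline float read_row_wise(const float *state, int col, int i) {
    return state[col * S_V + i];   /* stride 1 between consecutive i */
}
```

Both helpers address the same buffer; only which index varies fastest changes, which is why the fix is a pure re-indexing with no extra memory traffic.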

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

  • ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags
  • Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
    dot products in the CPU fused GDN kernel (delta and attention output)
  • Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
    path lacks device support, disable both to prevent state layout mismatch
    between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
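The scalar-to-SIMD replacement above can be sketched as follows. The function name `vec_dot_f32` is a hypothetical stand-in for the real `ggml_vec_dot_f32` (whose exact signature is not shown here); the sketch only illustrates the multi-accumulator pattern that lets compilers vectorize a dot product instead of serializing it on one dependency chain.

```c
/* Hypothetical stand-in for ggml_vec_dot_f32: four independent accumulators
 * break the add dependency chain so the loop auto-vectorizes to SIMD FMAs. */
static float vec_dot_f32(int n, const float *x, const float *y) {
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        sum0 += x[i + 0] * y[i + 0];
        sum1 += x[i + 1] * y[i + 1];
        sum2 += x[i + 2] * y[i + 2];
        sum3 += x[i + 3] * y[i + 3];
    }
    float sum = sum0 + sum1 + sum2 + sum3;
    for (; i < n; ++i)          /* scalar tail for n not divisible by 4 */
        sum += x[i] * y[i];
    return sum;
}
```

A naive `for (i = 0; i < n; ++i) sum += x[i]*y[i];` loop forces each add to wait on the previous one; the unrolled accumulators remove that serialization, which is the gain the commit claims for the delta and attention-output loops.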

  • llama : revert fgdn argument changes

  • graph : remove GDN state transposes

  • vulkan : adapt

  • cuda : remove obsolete smem code


Co-authored-by: Paul Flynn paul@arkavo.com
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Co-authored-by: Oliver Simons osimons@nvidia.com

