ggml-org/llama.cpp b7922


CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (#19053)

Providing the stride_* variables as size_t (i.e., 64-bit) lets the compiler correctly unroll the two for-loops on Blackwell (BW). This improves prefill/prompt-processing (pp) performance on Blackwell without affecting other SM architectures (a minimal sketch of the pattern follows the table):

| GPU | Model | Test | t/s master | t/s osimons/fix_bw_mmq_fixup_kernel | Speedup |
| --- | ----- | ---- | ---------- | ----------------------------------- | ------- |
| NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | pp8096 | 8404.05 | 8375.79 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | pp8096 | 16148.93 | 16019.60 | 0.99 |
| NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | pp8096 | 8008.29 | 7978.80 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | pp8096 | 4263.16 | 4248.53 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | pp8096 | 5165.11 | 5157.43 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 | 12582.80 | 12758.37 | 1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | pp8096 | 16879.10 | 17619.47 | 1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | pp8096 | 10649.90 | 10982.65 | 1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | pp8096 | 7717.73 | 7716.22 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | pp8096 | 7301.90 | 7370.38 | 1.01 |
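
For illustration, here is a minimal sketch of the pattern the fix describes: a fixup-style reduction kernel that takes its strides as size_t so all index arithmetic is uniformly 64-bit. This is not the actual mul_mat_q_stream_k_fixup kernel; the kernel name, parameters, and data layout below are invented for illustration.

```cuda
// Minimal sketch (NOT the real llama.cpp kernel): with size_t strides, the
// address computation in the loop body is a single 64-bit expression, which
// the compiler can reason about when applying #pragma unroll.
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stream-k-style fixup: each thread sums nparts partial results
// laid out with the given strides.
__global__ void fixup_partials(const float * __restrict__ partials,
                               float * __restrict__ dst,
                               const size_t stride_row,   // 64-bit stride
                               const size_t stride_part,  // 64-bit stride
                               const int    nparts) {
    const size_t row = (size_t) blockIdx.x*blockDim.x + threadIdx.x;

    float sum = 0.0f;
#pragma unroll 4 // partial unroll; the trip count is only known at runtime
    for (int p = 0; p < nparts; ++p) {
        sum += partials[row*stride_row + (size_t) p*stride_part];
    }
    dst[row] = sum;
}

int main() {
    const int nrows = 256, nparts = 8;
    float * partials;
    float * dst;
    cudaMalloc(&partials, (size_t) nrows*nparts*sizeof(float));
    cudaMalloc(&dst,      (size_t) nrows*sizeof(float));
    cudaMemset(partials, 0, (size_t) nrows*nparts*sizeof(float));

    // row-major partials: stride_row = nparts, stride_part = 1
    fixup_partials<<<nrows/64, 64>>>(partials, dst,
                                     (size_t) nparts, (size_t) 1, nparts);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(partials);
    cudaFree(dst);
    return 0;
}
```

The commonly cited reason such a change helps is that uniform 64-bit indexing avoids per-iteration widening of 32-bit intermediates, making the address arithmetic easier to strength-reduce and unroll; the PR itself only states that size_t strides enable the unrolling.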

Prebuilt binaries are available for macOS/iOS, Linux, Windows, and openEuler.
