ggml-org/llama.cpp b7922


CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (#19053)

Providing the stride_* variables as size_t (i.e., 64-bit) lets the compiler correctly unroll the two for-loops on Blackwell (BW). This improves prefill/prompt-processing (pp) performance on Blackwell without affecting other SM architectures (a minimal sketch of the pattern follows the table):

| GPU | Model | Test | t/s master | t/s osimons/fix_bw_mmq_fixup_kernel | Speedup |
| --- | ----- | ---- | ---------- | ----------------------------------- | ------- |
| NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | pp8096 | 8404.05 | 8375.79 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | pp8096 | 16148.93 | 16019.60 | 0.99 |
| NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | pp8096 | 8008.29 | 7978.80 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | pp8096 | 4263.16 | 4248.53 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | pp8096 | 5165.11 | 5157.43 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 | 12582.80 | 12758.37 | 1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | pp8096 | 16879.10 | 17619.47 | 1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | pp8096 | 10649.90 | 10982.65 | 1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | pp8096 | 7717.73 | 7716.22 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | pp8096 | 7301.90 | 7370.38 | 1.01 |
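
For illustration, here is a minimal sketch of the pattern the fix describes: a fixup-style reduction kernel that takes its strides as size_t so all index arithmetic is uniformly 64-bit. This is not the actual mul_mat_q_stream_k_fixup kernel; the kernel name, parameters, and data layout below are invented for illustration.

```cuda
// Minimal sketch (NOT the real llama.cpp kernel): with size_t strides, the
// address computation in the loop body is a single 64-bit expression, which
// the compiler can reason about when applying #pragma unroll.
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stream-k-style fixup: each thread sums nparts partial results
// laid out with the given strides.
__global__ void fixup_partials(const float * __restrict__ partials,
                               float * __restrict__ dst,
                               const size_t stride_row,   // 64-bit stride
                               const size_t stride_part,  // 64-bit stride
                               const int    nparts) {
    const size_t row = (size_t) blockIdx.x*blockDim.x + threadIdx.x;

    float sum = 0.0f;
#pragma unroll 4 // partial unroll; the trip count is only known at runtime
    for (int p = 0; p < nparts; ++p) {
        sum += partials[row*stride_row + (size_t) p*stride_part];
    }
    dst[row] = sum;
}

int main() {
    const int nrows = 256, nparts = 8;
    float * partials;
    float * dst;
    cudaMalloc(&partials, (size_t) nrows*nparts*sizeof(float));
    cudaMalloc(&dst,      (size_t) nrows*sizeof(float));
    cudaMemset(partials, 0, (size_t) nrows*nparts*sizeof(float));

    // row-major partials: stride_row = nparts, stride_part = 1
    fixup_partials<<<nrows/64, 64>>>(partials, dst,
                                     (size_t) nparts, (size_t) 1, nparts);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(partials);
    cudaFree(dst);
    return 0;
}
```

The commonly cited reason such a change helps is that uniform 64-bit indexing avoids per-iteration widening of 32-bit intermediates, making the address arithmetic easier to strength-reduce and unroll; the PR itself only states that size_t strides enable the unrolling.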

Prebuilt binaries are available for macOS/iOS, Linux, Windows, and openEuler.
