CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227)
- CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware
The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.
This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):
    Q3_K, Q4_K, Q5_K : MMVQ <= 3  (MMQ wins from batch=4: +5% .. +76%)
    Q2_K, Q6_K       : MMVQ <= 5  (MMQ wins from batch=6: +8% .. +35%)
    others           : MMVQ <= 8  (legacy & IQ regress under MMQ; unchanged)
Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.
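
The per-quant gating described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the PR: `QuantType`, `mmvq_max_batch_amd_mfma`, and `should_use_mmvq_sketch` are hypothetical names, the boolean `has_amd_mfma` stands in for `amd_mfma_available(cc)`, and the follow-up review commit reworked the table inline, so the merged code likely looks different.

```cpp
#include <cstdint>

// Hypothetical stand-in for the ggml quant type enum (illustration only).
enum class QuantType { Q4_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, IQ4_XS };

// Global threshold named in the commit message.
const int64_t MMVQ_MAX_BATCH_SIZE = 8;

// Largest batch size (ne11) still routed to MMVQ on AMD MFMA hardware,
// following the per-quant table in the commit message.
inline int64_t mmvq_max_batch_amd_mfma(QuantType type) {
    switch (type) {
        case QuantType::Q3_K:
        case QuantType::Q4_K:
        case QuantType::Q5_K:
            return 3; // MMQ takes over from batch=4
        case QuantType::Q2_K:
        case QuantType::Q6_K:
            return 5; // MMQ takes over from batch=6
        default:
            return MMVQ_MAX_BATCH_SIZE; // legacy and IQ quants: unchanged
    }
}

// Assumption: has_amd_mfma plays the role of amd_mfma_available(cc).
// Returns true when the batch should go to MMVQ, false when it should go to MMQ.
inline bool should_use_mmvq_sketch(QuantType type, bool has_amd_mfma, int64_t ne11) {
    const int64_t limit = has_amd_mfma ? mmvq_max_batch_amd_mfma(type)
                                       : MMVQ_MAX_BATCH_SIZE;
    return ne11 <= limit;
}
```

Under these assumptions, `should_use_mmvq_sketch(QuantType::Q4_K, true, 4)` is false (the batch goes to MMQ), while the same call with `has_amd_mfma = false` is true, matching the claim that non-MFMA paths keep the global threshold.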
Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.
Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S: 559 -> 940 (+68%)
    Q5_K_S: 503 -> 884 (+76%)
    Q3_K_S: 629 -> 879 (+40%)
    Q2_K  : 615 -> 809 (+32%)
    Q6_K  : 582 -> 776 (+33%)
Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S: 444 -> 480 (+ 8%)
    Q4_0  : 682 -> 685 (+ 0%) (no regression, retains MMVQ)
    IQ4_XS: 706 -> 698 (- 1%) (no regression, retains MMVQ)
- CUDA: address review: inline MMVQ batch table, drop env hatch & doc block
- tune kernel selection logic for CDNA1
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32) DISABLED
Android:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.3 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL) DISABLED
- Windows x64 (HIP)
openEuler:
- DISABLED
- openEuler x86 (310p)
- openEuler x86 (910b, ACL Graph)
- openEuler aarch64 (310p)
- openEuler aarch64 (910b, ACL Graph)
UI: