Details
CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806)
- CUDA: add CDNA3 MFMA support for flash attention MMA kernel
Add MI300X (gfx942) MFMA tensor core flash attention using
v_mfma_f32_16x16x16_f16 (FP16 in, FP32 accumulate).
- Add FATTN_WARP_SIZE=64 for CDNA wavefront64
- Add CDNA config for head sizes 64, 80, 96, 112, 128
- Add FP16 MFMA intrinsic path in mma.cuh
- Add manual V transpose load for MFMA register layout
- Route CDNA to MMA for prompt processing, VEC for token generation
- Fix Q loading and combine stride granularity for non-power-of-2 heads
Benchmarks (Qwen2.5-1.5B Q4_K_M, MI300X):
pp512 +7%, pp1024 +13%, pp2048 +23%, pp4096 +39%
tg128 -10% (FA overhead, VEC used for both)
All 2480 flash attention tests pass.
Ref: #17917
- address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch
- Replace #define FATTN_WARP_SIZE with constexpr int warp_size =
ggml_cuda_get_physical_warp_size() in each device function - Use ne[1]*gqa_ratio threshold for MMA vs tile dispatch. Benchmarked
crossover on MI300X @ d32768 with power-of-2 GQA models:
hsk=64 (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%)
hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%)
Unified threshold: eff_nq >= 128 for all head sizes. - Remove VEC fallback; small batches fall through to tile kernel
-
Update ggml/src/ggml-cuda/fattn.cu
-
use ggml_cuda_info().devices warp_size instead of hardcoded check
Co-authored-by: Johannes Gäßler johannesg@5d6.de
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: