GitHub: ggml-org/llama.cpp, release b8739


HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X (#21570)

Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support:

  • vendors/hip.h: Add CDNA4 preprocessor define for gfx950
  • common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros
  • mma.cuh: Route CDNA4 to compatible MFMA instructions:
    • f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950)
    • bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3)
    • int8 matmul: mfma_i32_16x16x32_i8 / mfma_i32_32x32x16_i8 (same as CDNA3)
  • mmq.cuh: Include CDNA4 in stream-k kernel dispatch

CDNA4 is largely compatible with CDNA3 except:

  • No xf32 MFMA (mfma_f32_16x16x8_xf32) — routes to f32 path
  • Different FP8 format (e4m3fn vs e4m3_fnuz) — not changed here

Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1:

  • Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950
  • llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU):
    • f16+FA: 40,013 tok/s prefill, 254 tok/s decode
    • q8_0+FA: functional
  • Flash attention: works correctly
  • MMQ: works correctly with stream-k dispatch
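A build-and-bench sequence matching the setup above might look like the following. This is a hedged sketch: `-DGGML_HIP=ON`, `-DAMDGPU_TARGETS=gfx950`, and `llama-bench -fa 1` are standard llama.cpp options, but the model path is a placeholder and the exact flags the author used are not recorded in the PR.

```shell
# Configure a HIP build targeting gfx950 (MI350X/MI355X), then build.
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx950
cmake --build build -j

# Benchmark with flash attention enabled; model path is hypothetical.
./build/bin/llama-bench -m Qwen2.5-1.5B-Q4_K_M.gguf -fa 1
```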

Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>
