github ROCm/aotriton 0.11b
AOTriton 0.11 Beta

9 months ago

CAVEAT FOR RADEON RX 7000 SERIES USERS

Upgrading to 0.11b is not recommended at the moment because of accuracy
problems and numerical errors on gfx1100. The gfx1100 architecture has been
moved back to experimental status.

Major Changes from Release 0.10 Beta

  • AITER Assembly Kernel integration on gfx942/gfx950
    • Training performance improved from ~300 TFLOPS to 500+ TFLOPS.
    • See Appendix for details.
  • V3 API becomes the default API.
    • V2 API is considered "frozen" for backward compatibility.
    • Lazy Tensor is added to defer the allocation of certain input tensors.
      This breaks the ABI, so 0.11 is incompatible with 0.10 in both source
      and binary form.
  • Adds Windows support
    • Currently only the library runtime can be built on Windows.
  • Updates the Triton compiler to the version of Aug 05 2025.
  • Adds experimental support for gfx1102 and fixes the GPU classifier for gfx1200/1150/1151.
  • Returns the natural-base logsumexp tensor from flash::attn_fwd, matching the behavior on CUDA.
  • Fixes a kernel bug introduced when implementing sliding window attention (SWA).
  • Re-enables head dimensions 48 and 80 for gfx950.
  • Removes the forced PADDED_HEAD=True workaround added in 0.10.
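
To make the logsumexp item concrete, here is a minimal sketch in plain Python of a natural-base logsumexp over one row of attention scores, next to a base-2 variant and the conversion between the two. This is an illustration only, not AOTriton code: the function names are hypothetical, and the idea that a kernel might work in base 2 internally (as exp2-based kernels commonly do) is an assumption, not something these notes state.

```python
import math

def logsumexp_natural(scores):
    """Natural-base logsumexp, log(sum(exp(s))), computed stably."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def logsumexp_base2(scores):
    """Base-2 variant, log2(sum(2**(s * log2(e)))). Hypothetical internal form."""
    scaled = [s * math.log2(math.e) for s in scores]
    m = max(scaled)
    return m + math.log2(sum(2.0 ** (s - m) for s in scaled))

def base2_to_natural(lse2):
    """Convert a base-2 logsumexp to natural base by multiplying with ln(2)."""
    return lse2 * math.log(2.0)

row = [0.5, -1.25, 3.0, 2.0]          # made-up scores for one query row
lse_e = logsumexp_natural(row)
lse_2 = logsumexp_base2(row)
print(lse_e, base2_to_natural(lse_2))  # the two values agree up to rounding
```

Code that consumes the logsumexp output and was written against a different base would need a conversion of this kind; whether that applies to a given caller depends on which base it previously assumed.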

Known Problems

  • Certain AITER Assembly kernels are not enabled.
    • dropout: the PRNG implementation must match between forward and backward.
    • SWA/GQA: more testing is needed to confirm the correctness of the integration.
  • Massive accuracy problems on gfx1100. See test/adiffs/gfx1100.txt for details.
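
The dropout item can be illustrated with a small sketch of why forward and backward need the same PRNG: the backward pass has to regenerate exactly the keep-mask that the forward pass used. The code below uses Python's standard `random` module and hypothetical helper names; it is not the PRNG used by AOTriton or AITER, and the way seed and offset are combined here is an assumption made only for the illustration.

```python
import random

def dropout_mask(seed, offset, n, p):
    """Regenerate a keep-mask deterministically from (seed, offset)."""
    rng = random.Random((seed << 32) | offset)  # hypothetical seed/offset packing
    return [rng.random() >= p for _ in range(n)]

def forward(x, seed, offset, p):
    """Apply dropout: zero dropped elements, rescale kept ones by 1/(1-p)."""
    mask = dropout_mask(seed, offset, len(x), p)
    scale = 1.0 / (1.0 - p)
    return [xi * scale if keep else 0.0 for xi, keep in zip(x, mask)]

def backward(grad_out, seed, offset, p):
    """Propagate gradients through dropout using the regenerated mask."""
    mask = dropout_mask(seed, offset, len(grad_out), p)
    scale = 1.0 / (1.0 - p)
    return [g * scale if keep else 0.0 for g, keep in zip(grad_out, mask)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = forward(x, seed=1234, offset=0, p=0.5)
dx = backward([1.0] * len(x), seed=1234, offset=0, p=0.5)

# Gradients are nonzero exactly where the forward pass kept the element.
consistent = all((yi != 0.0) == (gi != 0.0) for yi, gi in zip(y, dx))
print(consistent)
```

If the two passes used different generators, or the same generator with a different seed or offset, the regenerated mask would differ and the gradients would be silently wrong.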

Appendix

Training performance was measured with test/performance_backward.py.
Note: V3_API=1 must be set for the AITER Assembly kernels to be selected automatically.

N_CTX=13,14 V3_API=1 PYTHONPATH=build-0.11-test-gfx942/install_dir/lib/ python test/performance_backward.py
fused-attention-batch4-head48-d64-V3-causal=False:
     N_CTX  Triton(TFLOPS)
0   8192.0      458.925228
1  16384.0      478.647878
fused-attention-batch4-head48-d128-V3-causal=False:
     N_CTX  Triton(TFLOPS)
0   8192.0      529.480355
1  16384.0      584.827173

N_CTX=13,14 V3_API=1 PYTHONPATH=build-0.11-test-gfx950/install_dir/lib/ python test/performance_backward.py
fused-attention-batch4-head48-d64-V3-causal=False:
     N_CTX  Triton(TFLOPS)
0   8192.0      551.771459
1  16384.0      568.231547
fused-attention-batch4-head48-d128-V3-causal=False:
     N_CTX  Triton(TFLOPS)
0   8192.0      571.629813
1  16384.0      586.564681
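
For readers who want to relate these TFLOPS figures to wall-clock time, the sketch below shows one plausible way such numbers can be derived. It rests on two assumptions that the notes do not confirm: that N_CTX=13,14 is read as exponents of two (which matches the 8192 and 16384 rows above), and that FLOPs are counted with the convention of the Triton fused-attention tutorial (two matmuls in forward, two FLOPs per multiply-add, backward counted as 2.5x forward). The script test/performance_backward.py may count differently, and all function names here are hypothetical.

```python
def context_lengths(n_ctx_env):
    """Interpret a value like '13,14' as exponents of two (assumed)."""
    return [2 ** int(e) for e in n_ctx_env.split(",")]

def attention_backward_flops(batch, heads, n_ctx, head_dim, causal=False):
    """Assumed FLOP count for the backward pass of attention."""
    flops_per_matmul = 2.0 * batch * heads * n_ctx * n_ctx * head_dim
    forward = 2 * flops_per_matmul      # QK^T and PV
    if causal:
        forward *= 0.5                  # roughly half the score matrix is masked out
    return 2.5 * forward                # assumed backward/forward ratio

def implied_milliseconds(flops, tflops):
    """Time in milliseconds implied by a FLOP count and a TFLOPS figure."""
    return flops / (tflops * 1e12) * 1e3

lengths = context_lengths("13,14")
flops = attention_backward_flops(batch=4, heads=48, n_ctx=lengths[0], head_dim=64)
ms = implied_milliseconds(flops, 458.925228)  # gfx942, d64, N_CTX=8192 row
print(lengths, flops, ms)
```

Under these assumptions the first gfx942 row corresponds to a backward pass of roughly 18 ms; with a different counting convention the implied time would change accordingly.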

Notes Generated by GitHub

Full Changelog: 0.10b...0.11b
