CAVEAT FOR RADEON RX 7000 SERIES USERS

Upgrading to 0.11b is not recommended at the moment because of accuracy
problems and numerical errors. The gfx1100 architecture has been moved back
to experimental status.

Major Changes from Release 0.10 Beta

AITER Assembly Kernel integration on gfx942/gfx950
- Training performance improved from ~300 TFLOPS to 500+ TFLOPS.
- See Appendix for details.
V3 API becomes the default API.
- V2 API is considered "frozen" for backward compatibility.
- Lazy Tensor is added to defer the allocation of certain input Tensors.
  This breaks the ABI, so 0.11 is incompatible with 0.10 in both source
  and binary form.
Adds Windows support.
- Currently only the library runtime can be built on Windows.
Update Triton Compiler to Aug 05 2025.
Experimental support of gfx1102. Fixes GPU classifier for gfx1200/1150/1151.
Return the natural-base (base e) logsumexp tensor in flash::attn_fwd, matching the behavior on CUDA.
Fix a kernel bug introduced when implementing SWA
Re-enable head dimension 48/80 for gfx950
Remove the workaround added in 0.10 that forced PADDED_HEAD=True.
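
The logsumexp change listed above matters to callers that consume the tensor returned by flash::attn_fwd directly. The following is a minimal sketch in plain Python, not AOTriton code. It assumes the value returned by earlier releases was base-2, which is common for kernels built around exp2 but is not stated in these notes. Under that assumption the two conventions differ only by a factor of ln 2. All function names are hypothetical.

```python
import math

def logsumexp_natural(scores):
    """Numerically stable natural-base logsumexp: ln(sum(exp(s)))."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def logsumexp_base2(scores):
    """Base-2 variant: log2(sum(exp(s))), computed with powers of two.

    ASSUMPTION: this models a kernel that rescales scores by log2(e)
    and uses exp2 internally; it is illustrative only.
    """
    scaled = [s * math.log2(math.e) for s in scores]
    m = max(scaled)
    return m + math.log2(sum(2.0 ** (t - m) for t in scaled))

def base2_to_natural(lse2):
    """Convert a base-2 logsumexp value to the natural-base value."""
    return lse2 * math.log(2.0)

scores = [0.5, -1.0, 2.0]
# The converted base-2 value agrees with the natural-base value
# up to floating-point rounding.
difference = base2_to_natural(logsumexp_base2(scores)) - logsumexp_natural(scores)
```

Code that saved a logsumexp tensor from an older release and feeds it to a newer backward pass (or the reverse) would need such a conversion, if the base-2 assumption holds for that release.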

Known Problems

Certain AITER Assembly kernels are not enabled.
- dropout: the PRNG implementation must match between forward and backward.
- SWA/GQA: more testing is needed to confirm the correctness of the integration.
Massive accuracy problems on gfx1100. See test/adiffs/gfx1100.txt for details.

Appendix

Training performance was measured with test/performance_backward.py.
Note: V3_API=1 must be set to auto-select the AITER Assembly kernels.

N_CTX=13,14 V3_API=1 PYTHONPATH=build-0.11-test-gfx942/install_dir/lib/ python test/performance_backward.py
fused-attention-batch4-head48-d64-V3-causal=False:
	 N_CTX  Triton(TFLOPS)
0   8192.0      458.925228
1  16384.0      478.647878
fused-attention-batch4-head48-d128-V3-causal=False:
	 N_CTX  Triton(TFLOPS)
0   8192.0      529.480355
1  16384.0      584.827173
N_CTX=13,14 V3_API=1 PYTHONPATH=build-0.11-test-gfx950/install_dir/lib/ python test/performance_backward.py
fused-attention-batch4-head48-d64-V3-causal=False:
	 N_CTX  Triton(TFLOPS)
0   8192.0      551.771459
1  16384.0      568.231547
fused-attention-batch4-head48-d128-V3-causal=False:
	 N_CTX  Triton(TFLOPS)
0   8192.0      571.629813
1  16384.0      586.564681
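
To read the tables above, it helps to know how a TFLOPS figure for an attention backward pass is typically derived from a measured time. The sketch below is not taken from test/performance_backward.py. It assumes the FLOP-counting convention of the upstream Triton fused-attention tutorial: two matrix multiplications in the forward pass, and a backward pass counted as 2.5 times the forward cost. It also assumes that N_CTX=13,14 lists base-2 exponents, which is inferred from the 8192 and 16384 rows in the output. Both function names are hypothetical.

```python
def n_ctx_from_exponents(spec):
    """Turn a string such as "13,14" into sequence lengths [8192, 16384].

    ASSUMPTION: the N_CTX variable holds base-2 exponents, inferred from
    the output tables rather than from the benchmark script.
    """
    return [2 ** int(e) for e in spec.split(",")]

def attention_bwd_tflops(batch, heads, n_ctx, head_dim, elapsed_ms, causal=False):
    """Estimate backward-pass TFLOPS from a measured time in milliseconds.

    ASSUMPTION: FLOP count follows the Triton tutorial convention, which
    may differ from what the benchmark script actually uses.
    """
    flops_per_matmul = 2.0 * batch * heads * n_ctx * n_ctx * head_dim
    total_flops = 2 * flops_per_matmul      # forward: QK^T and PV
    if causal:
        total_flops *= 0.5                  # roughly half the tiles are masked
    total_flops *= 2.5                      # backward counted as 2.5x forward
    return total_flops / (elapsed_ms * 1e-3) / 1e12

seq_lens = n_ctx_from_exponents("13,14")
# Configuration from the tables: batch 4, 48 heads, head dimension 64.
# The 1000 ms elapsed time is a placeholder, not a measured value.
example = attention_bwd_tflops(4, 48, seq_lens[0], 64, elapsed_ms=1000.0)
```

Under this convention, doubling N_CTX quadruples the FLOP count, so a flat or rising TFLOPS figure between the 8192 and 16384 rows means the run time grew by a factor of four or less.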

Notes Generated by GitHub

What's Changed

Start 0.11 development by bumping the version number. by @xinyazhang in #104
test: use l1/l2 norm as softmax_scale by @xinyazhang in #105
Return natural based logsumexp and set TRITON_STORE_BINARY_ONLY=1 by @xinyazhang in #108
Initial AITER ASM kernel Integration by @xinyazhang in #100
Include gfx1200/1150/1151 in GPU string classifier by @jammm in #110
AITER ASM kernel Integration Phase 2: Operator Integration by @xinyazhang in #109
Update README.md by @xinyazhang in #106
Support Windows with noimage mode by @jammm in #112
Disable ccache by default on Windows by @ScottTodd in #115
Add gfx1101 target by @sstamenk in #117
CMakeLists.txt: AOTRITON_INHERIT_SYSTEM_SITE_TRITON flag by @LunNova in #116
AITER ASM kernel Integration Phase 3: Tuning by @xinyazhang in #120
Finalize 0.11b Development by @xinyazhang in #123

New Contributors

@jammm made their first contribution in #110
@ScottTodd made their first contribution in #115
@sstamenk made their first contribution in #117
@LunNova made their first contribution in #116

Full Changelog: 0.10b...0.11b

ROCm/aotriton 0.11b AOTriton 0.11 Beta on GitHub