CAVEAT FOR RADEON RX 7000 SERIES USERS
It is not recommended to upgrade to 0.11b at the moment due to accuracy
problems and numerical errors. Architecture gfx1100 is moved back to
experimental.
Major Changes from Release 0.10 Beta
- AITER Assembly Kernel integration on gfx942/gfx950
  - Training performance improved from ~300 TFLOPS to 500+ TFLOPS.
  - See Appendix for details.
- V3 API becomes the default API.
  - V2 API is considered "frozen" for backward compatibility.
- Lazy Tensor is added to defer the allocation of certain input Tensors.
This breaks the ABI and thus 0.11 is incompatible with 0.10 in both source
and binary form.
- Adds Windows support
  - Currently only the library runtime can be built.
- Update Triton Compiler to Aug 05 2025.
- Experimental support of gfx1102. Fixes GPU classifier for gfx1200/1150/1151.
- Return natural based logsumexp tensor in flash::attn_fwd, matching the behavior on CUDA.
- Fix a kernel bug introduced when implementing SWA
- Re-enable head dimension 48/80 for gfx950
- Removes the workaround added in 0.10 that forced PADDED_HEAD=True.
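To illustrate the "natural based logsumexp" item above: a natural-based value is the natural logarithm of the sum of exponentials, while kernels built around exp2 sometimes keep a base-2 value instead. The sketch below is not AOTriton code; the helper names `logsumexp_natural` and `logsumexp_base2` are hypothetical, and the base-2 variant is only an assumption about what the alternative convention might look like. It shows that the two conventions differ by a constant factor of ln(2).

```python
import math

def logsumexp_natural(scores):
    """Numerically stable natural-log logsumexp: ln(sum(exp(s)))."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def logsumexp_base2(scores):
    """Hypothetical base-2 convention: log2(sum(2 ** (s * log2(e))))."""
    log2e = math.log2(math.e)
    scaled = [s * log2e for s in scores]
    m = max(scaled)
    return m + math.log2(sum(2.0 ** (t - m) for t in scaled))

scores = [0.5, -1.25, 3.0, 2.0]
lse_e = logsumexp_natural(scores)
lse_2 = logsumexp_base2(scores)

# Multiplying the base-2 value by ln(2) recovers the natural-based value.
assert math.isclose(lse_e, lse_2 * math.log(2.0), rel_tol=1e-12)
```

Code that consumes the returned logsumexp tensor and previously applied its own base conversion may need to drop that step; check against your own call sites.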
Known Problems
- Certain AITER Assembly kernels are not enabled.
  - dropout: need to match PRNG implementation between forward and backward.
  - SWA/GQA: need more testing to confirm the correctness of integration.
- Massive accuracy problems on gfx1100. See test/adiffs/gfx1100.txt for details.
Appendix
Training performance measured through test/performance_backward.py.
Note: V3_API=1 must be set to auto-select the AITER Assembly kernel.
N_CTX=13,14 V3_API=1 PYTHONPATH=build-0.11-test-gfx942/install_dir/lib/ python test/performance_backward.py
fused-attention-batch4-head48-d64-V3-causal=False:
     N_CTX  Triton(TFLOPS)
0   8192.0      458.925228
1  16384.0      478.647878
fused-attention-batch4-head48-d128-V3-causal=False:
     N_CTX  Triton(TFLOPS)
0   8192.0      529.480355
1  16384.0      584.827173
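The command above passes N_CTX=13,14 while the reported rows are 8192 and 16384, which suggests the values are read as base-2 exponents. This is an inference from the output, not something the notes state, and `parse_n_ctx` below is a hypothetical helper, not a function from test/performance_backward.py. A minimal sketch of that mapping:

```python
def parse_n_ctx(env_value):
    """Turn a string like "13,14" into sequence lengths, assuming base-2 exponents."""
    return [2 ** int(tok) for tok in env_value.split(",")]

seq_lens = parse_n_ctx("13,14")
print(seq_lens)  # [8192, 16384]
```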
N_CTX=13,14 V3_API=1 PYTHONPATH=build-0.11-test-gfx950/install_dir/lib/ python test/performance_backward.py
fused-attention-batch4-head48-d64-V3-causal=False:
     N_CTX  Triton(TFLOPS)
0   8192.0      551.771459
1  16384.0      568.231547
fused-attention-batch4-head48-d128-V3-causal=False:
     N_CTX  Triton(TFLOPS)
0   8192.0      571.629813
1  16384.0      586.564681
Notes Generated by GitHub
What's Changed
- Start 0.11 development by bumping the version number. by @xinyazhang in #104
- test: use l1/l2 norm as softmax_scale by @xinyazhang in #105
- Return natural based logsumexp and set TRITON_STORE_BINARY_ONLY=1 by @xinyazhang in #108
- Initial AITER ASM kernel Integration by @xinyazhang in #100
- Include gfx1200/1150/1151 in GPU string classifier by @jammm in #110
- AITER ASM kernel Integration Phase 2: Operator Integration by @xinyazhang in #109
- Update README.md by @xinyazhang in #106
- Support Windows with noimage mode by @jammm in #112
- Disable ccache by default on Windows by @ScottTodd in #115
- Add gfx1101 target by @sstamenk in #117
- CMakeLists.txt: AOTRITON_INHERIT_SYSTEM_SITE_TRITON flag by @LunNova in #116
- AITER ASM kernel Integration Phase 3: Tuning by @xinyazhang in #120
- Finalize 0.11b Development by @xinyazhang in #123
New Contributors
- @jammm made their first contribution in #110
- @ScottTodd made their first contribution in #115
- @sstamenk made their first contribution in #117
- @LunNova made their first contribution in #116
Full Changelog: 0.10b...0.11b