Major Changes from Release 0.11 Beta

API Changes

BREAKING: Varlen LSE tensor shape changes to (H, Total_seqlen) (#149).
Two new varlen layouts, PaddedVarlen and StridedVarlen, added for
TransformerEngine compatibility; V3 API only (#150).
BREAKING: LazyTensor::acquire now receives self instead of cookie
(#164).
Support hdim_qk != hdim_vo; dispatcher infers both from Q and V (#135).
attn_options::deterministic = true forces the deterministic split-kernel
backward path (#134).
All V2 launch functions are marked [[deprecated]]; removal is planned
for the next feature release (#164). Exception: check_gpu and
debug_simulate_encoded_softmax in aotriton/flash.h are un-deprecated
(#173).
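
To make the compact LSE shape concrete, here is a minimal sketch of how a (H, Total_seqlen) tensor could be sized and indexed. It assumes the common varlen convention of a cumulative sequence-length array with a leading zero (named cu_seqlens here), and that each sequence occupies a contiguous run along the second axis. Neither detail is spelled out in these notes, and the helper names are hypothetical.

```python
from typing import List, Tuple


def compact_lse_shape(num_heads: int, cu_seqlens: List[int]) -> Tuple[int, int]:
    """Shape of a compact varlen LSE tensor: (H, Total_seqlen)."""
    total_seqlen = cu_seqlens[-1]  # cumulative lengths end at the token total
    return (num_heads, total_seqlen)


def lse_slice(cu_seqlens: List[int], batch_index: int) -> Tuple[int, int]:
    """[start, end) columns assumed to hold one sequence's LSE values."""
    return (cu_seqlens[batch_index], cu_seqlens[batch_index + 1])


# Three sequences of lengths 5, 3, and 8 packed back to back.
cu_seqlens = [0, 5, 8, 16]
shape = compact_lse_shape(num_heads=4, cu_seqlens=cu_seqlens)
second = lse_slice(cu_seqlens, 1)
print(shape, second)  # (4, 16) (5, 8)
```

Under these assumptions, the second axis scales with the total token count rather than with batch size times the longest sequence.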

GPU Targets

Experimental: gfx1103, gfx1152, gfx1153 iGPU support added (#138, #142).
gfx1100 and gfx1151 promoted out of experimental (#173).
gfx11xx split into gfx110x (RDNA 3) and gfx115x (RDNA 3.5) build
packs for independent release tarballs (#173).
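
The build-pack split can be pictured as a prefix match on the architecture name. The sketch below is an illustration only: the function name is hypothetical, and the rule that pack membership follows the name prefix (for example, that gfx1103 lands in gfx110x) is an assumption inferred from the pack names, not something these notes state.

```python
def build_pack_for(arch: str) -> str:
    """Map a gfx11 arch name to a build pack by name prefix (assumed rule)."""
    if arch.startswith("gfx110"):
        return "gfx110x"  # RDNA 3
    if arch.startswith("gfx115"):
        return "gfx115x"  # RDNA 3.5
    raise ValueError(f"{arch!r} is not covered by the gfx110x/gfx115x packs")


archs = ["gfx1100", "gfx1103", "gfx1151", "gfx1152", "gfx1153"]
packs = {arch: build_pack_for(arch) for arch in archs}
print(packs)
```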

Performance

gfx950: pipelining and XCD remapping enabled; forward throughput improves
from ~753 to ~904 TFLOPS on MI355X with Triton mainline (hdim=128,
non-causal) (#162).
Triton compiler bumped to 9c446b40 (ROCm/triton, Apr 2026).
AITER ASM kernels updated to v0.1.11 for gfx942 and gfx950 (#158).
Updated tuning databases for gfx942, gfx950, gfx1100, and gfx1201 (#172).
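
For scale, the gfx950 forward numbers above work out to roughly a 1.2x speedup. The arithmetic below uses only the two approximate figures quoted in these notes, so the result is approximate as well.

```python
before_tflops = 753.0  # approximate figure before pipelining + XCD remapping
after_tflops = 904.0   # approximate figure after

speedup = after_tflops / before_tflops
gain_percent = (speedup - 1.0) * 100.0
print(f"{speedup:.2f}x, +{gain_percent:.0f}%")  # 1.20x, +20%
```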

Bug Fixes

GQA + attention bias backward produced wrong gradients; fixed by moving
bias pointer init inside the Q-head loop (#170).
Unsupported AOTRITON_TARGET_ARCH now fails loudly with a diagnostic
instead of a cryptic downstream argparse error (#171).
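
The fail-loud behaviour for unsupported targets can be sketched as a filter that raises as soon as every requested architecture is dropped. This is not the project's actual code: the function, exception class, and supported list below are hypothetical and only illustrate the idea of reporting the problem at the point where it occurs.

```python
from typing import Iterable, List


class UnsupportedTargetError(ValueError):
    """Raised when no requested architecture survives the support filter."""


def filter_targets(requested: Iterable[str], supported: Iterable[str]) -> List[str]:
    requested = list(requested)
    supported_set = set(supported)
    kept = [arch for arch in requested if arch in supported_set]
    if requested and not kept:
        # Report the problem here instead of passing an empty list downstream.
        raise UnsupportedTargetError(
            f"none of the requested targets {requested} are supported; "
            f"supported targets: {sorted(supported_set)}"
        )
    return kept


SUPPORTED = ["gfx942", "gfx950", "gfx1100"]  # illustrative subset, not the real list
kept = filter_targets(["gfx942", "gfx000"], SUPPORTED)
try:
    filter_targets(["gfx000"], SUPPORTED)
    message = ""
except UnsupportedTargetError as err:
    message = str(err)
print(kept)
print(message)
```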

Build and Dependency Changes

Tuning database sharded into per-arch files under
v3python/database/<vendor>/<arch>/ (#133).
pybind11 and incbin submodules removed; pybind11 now comes from pip
(#152).
Alternative Triton wheel YAML config mechanism added; pyyaml is a new
build dependency (#132, #153).
__signature__ now includes AOTRITON_GIT_TREESHA1 (root tree SHA1,
injectable via env var) (#173).
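
As an illustration of the sharded database layout, the directory for one architecture can be composed from the vendor and arch components. The helper name and the vendor value "amd" are hypothetical, and the file names inside each directory are not described in these notes, so only the directory is built.

```python
from pathlib import PurePosixPath

DATABASE_ROOT = PurePosixPath("v3python/database")


def shard_dir(vendor: str, arch: str) -> PurePosixPath:
    """Directory for one arch's shard: v3python/database/<vendor>/<arch>."""
    return DATABASE_ROOT / vendor / arch


example = shard_dir("amd", "gfx942")  # "amd" is an assumed vendor name
print(example)
```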

Minor Changes from Release 0.11 Beta

Tests default to V3 API; FWD_IMPL env var selects attn_fwd backend
(#158).
pkg-config added as a build dependency (#159).
Windows build fixed for gfx942 affine kernels (wide pstring_view) (#156).

Known Problems

gfx1100: a small number of unit tests fail due to compiler accuracy issues.
gfx1201: a small number of unit tests fail due to a hipblasLt GPU segfault.
gfx950: hdim=48/80 backward kernels disabled pending a compiler fix.
gfx950: attn_fwd with hdim=16 silently rounds up to hdim=32 at
runtime until the upstream compiler bug is resolved.
AITER ASM kernels are not used for dropout, SWA/GQA, or MQA/GQA; these
cases fall back to Triton.

Notes Generated by GitHub

What's Changed

Use MOLD linker in CI scripts by @xinyazhang in #125
Update README.md to include PyTorch 2.9 by @xinyazhang in #126
Add CI Script to Test Triton Upstream by @xinyazhang in #127
Fix warning 'missing .note.GNU-stack section implies executable stack' by @xinyazhang in #129
Tuner V3 Part 1: Celery Execution Framework by @xinyazhang in #131
Alternative Wheels by @xinyazhang in #132
Sharding the Database by @xinyazhang in #133
Misc. build changes by @xinyazhang in #136
Add deterministic algorithm support to v3::flash::attn_options by @xinyazhang in #134
add gfx1103 iGPU support by @lamikr in #138
Add support of hdim_qk != hdim_vo by @xinyazhang in #135
Add gfx1152 and gfx1153 iGPU support by @roberteg16 in #142
Fix window size in v3 by @alugorey in #144
Port Changes and Fixes from release/0.11 to main by @xinyazhang in #147
API Change: Use Compact LSE Tensor for Varlen Inputs by @xinyazhang in #149
API Change: Support Additional Varlen Memory Layouts by @xinyazhang in #150
[ISSUE-98]: Alternative Wheel Mechanism by @aoguntayo in #153
Fix Windows build for gfx942 affine kernels: use wide string for pstring_view by @jammm in #156
Sunset Git Submodules Other Than Triton by @xinyazhang in #152
Update README to include pkg-config dependency by @xinyazhang in #159
Remove stray line in CMakeLists.txt by @sstamenk in #160
Bump to AITER asm v0.1.11 by @xinyazhang in #158
Enable Pipelining and XCD Remapping on gfx950 GPUs by @xinyazhang in #162
Migrate the Tuner to V3 API by @xinyazhang in #164
fix(gpu_targets): fail loud when arch filter drops every requested target (#169) by @1fanwang in #171
Tuner V3.5 and Tuning Database. by @xinyazhang in #170
Remove unnecessary Windows ccache guard by @astrelsky in #167
0.12b Operator Tuning Database and related Tuner Changes. by @xinyazhang in #172
Misc. Changes for 0.12b Release by @xinyazhang in #173

New Contributors

@lamikr made their first contribution in #138
@alugorey made their first contribution in #144
@aoguntayo made their first contribution in #153
@1fanwang made their first contribution in #171
@astrelsky made their first contribution in #167

Full Changelog: 0.11b...0.12b

ROCm/aotriton 0.12b AOTriton 0.12 Beta on GitHub
