Major Changes from Release 0.11 Beta
API Changes
- BREAKING: Varlen LSE tensor shape changes to (H, Total_seqlen) (#149).
- Two new varlen layouts, PaddedVarlen and StridedVarlen, for TransformerEngine compatibility (V3 API only) (#150).
- BREAKING: LazyTensor::acquire now receives self instead of cookie (#164).
- Support hdim_qk != hdim_vo; dispatcher infers both from Q and V (#135).
- attn_options::deterministic = true forces the deterministic split-kernel backward path (#134).
- All V2 launch functions marked [[deprecated]]; removal planned for next feature release (#164). Exception: check_gpu and debug_simulate_encoded_softmax are un-deprecated in aotriton/flash.h (#173).
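To make the compact (H, Total_seqlen) LSE shape concrete, here is a minimal Python sketch of how a caller might locate one sequence's values in such a tensor. It assumes the usual packed-varlen convention in which sequences are concatenated along the last axis and addressed through cumulative sequence lengths; the helper names cumulative_seqlens and lse_index are hypothetical and not part of the AOTriton API.

```python
from itertools import accumulate

def cumulative_seqlens(seqlens):
    """Return prefix sums [0, s0, s0+s1, ...] for a list of sequence lengths."""
    return [0, *accumulate(seqlens)]

def lse_index(cu_seqlens, batch, head, pos):
    """Map (batch, head, pos) to an index into an LSE tensor of shape (H, Total_seqlen).

    Assumption: sequences are packed back to back along the last axis.
    """
    start, end = cu_seqlens[batch], cu_seqlens[batch + 1]
    if not 0 <= pos < end - start:
        raise IndexError(f"pos {pos} out of range for sequence {batch}")
    return head, start + pos

seqlens = [3, 5, 2]                 # three sequences of different lengths
cu = cumulative_seqlens(seqlens)    # prefix sums over the batch
num_heads = 4
lse_shape = (num_heads, cu[-1])     # compact shape: (H, Total_seqlen)
```

Under this assumption the tensor holds exactly one entry per head and token, with no padding out to a maximum sequence length.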
GPU Targets
- Experimental: gfx1103, gfx1152, gfx1153 iGPU support added (#138, #142).
- gfx1100 and gfx1151 promoted out of experimental (#173).
- gfx11xx split into gfx110x (RDNA 3) and gfx115x (RDNA 3.5) build packs for independent release tarballs (#173).
Performance
- gfx950: pipelining + XCD remapping; forward improves from ~753 to ~904 TFLOPS on MI355X with Triton mainline (hdim=128, non-causal) (#162).
- Triton compiler bumped to 9c446b40 (ROCm/triton, Apr 2026).
- AITER ASM kernels updated to v0.1.11 for gfx942 and gfx950 (#158).
- Updated tuning databases for gfx942, gfx950, gfx1100, and gfx1201 (#172).
Bug Fixes
- GQA + attention bias backward produced wrong gradients; fixed by moving bias pointer init inside the Q-head loop (#170).
- Unsupported AOTRITON_TARGET_ARCH now fails loudly with a diagnostic instead of a cryptic downstream argparse error (#171).
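The GQA fix is easier to see with a small model of the offset arithmetic. The sketch below is not the kernel code; it assumes Q heads map to K/V heads in contiguous groups and that bias is laid out per Q head with a fixed stride, and the function names are hypothetical. It only illustrates why a bias pointer initialised once outside the Q-head loop reads the wrong head's bias.

```python
def bias_offsets_buggy(kv_head, group_size, stride_bias_head):
    """Bias offset computed once, before iterating over the Q heads of a group."""
    first_q_head = kv_head * group_size
    bias_offset = first_q_head * stride_bias_head  # initialised outside the loop
    offsets = []
    for _ in range(group_size):
        offsets.append(bias_offset)  # never advanced: every Q head sees the same bias
    return offsets

def bias_offsets_fixed(kv_head, group_size, stride_bias_head):
    """Bias offset recomputed inside the loop, once per Q head."""
    offsets = []
    for g in range(group_size):
        q_head = kv_head * group_size + g
        offsets.append(q_head * stride_bias_head)  # initialised per Q head
    return offsets

print(bias_offsets_buggy(1, 4, 100))  # [400, 400, 400, 400]
print(bias_offsets_fixed(1, 4, 100))  # [400, 500, 600, 700]
```

With a group size of 1 (plain multi-head attention) the two versions agree, which is consistent with the bug surfacing only when GQA and attention bias are combined.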
Build and Dependency Changes
- Tuning database sharded into per-arch files under v3python/database/<vendor>/<arch>/ (#133).
- pybind11 and incbin submodules removed; pybind11 now comes from pip (#152).
- Alternative Triton wheel YAML config mechanism added; pyyaml is a new build dependency (#132, #153).
- __signature__ now includes AOTRITON_GIT_TREESHA1 (root tree SHA1, injectable via env var) (#173).
Minor Changes from Release 0.11 Beta
- Tests default to V3 API; FWD_IMPL env var selects attn_fwd backend (#158).
- pkg-config added as a build dependency (#159).
- Windows build fixed for gfx942 affine kernels (wide pstring_view) (#156).
Known Problems
- gfx1100: a small number of unit tests fail due to compiler accuracy issues.
- gfx1201: a small number of unit tests fail due to a hipblasLt GPU segfault.
- gfx950: hdim=48/80 backward kernels disabled pending a compiler fix.
- gfx950: attn_fwd with hdim=16 silently rounds up to hdim=32 at runtime until the upstream compiler bug is resolved.
- AITER ASM kernels for dropout, SWA/GQA, and MQA/GQA fall back to Triton.
Notes Generated by GitHub
What's Changed
- Use MOLD linker in CI scripts by @xinyazhang in #125
- Update README.md to include PyTorch 2.9 by @xinyazhang in #126
- Add CI Script to Test Triton Upstream by @xinyazhang in #127
- Fix warning 'missing .note.GNU-stack section implies executable stack' by @xinyazhang in #129
- Tuner V3 Part 1: Celery Execution Framework by @xinyazhang in #131
- Alternative Wheels by @xinyazhang in #132
- Sharding the Database by @xinyazhang in #133
- Misc. build changes by @xinyazhang in #136
- Add deterministic algorithm support to v3::flash::attn_options by @xinyazhang in #134
- add gfx1103 iGPU support by @lamikr in #138
- Add support of hdim_qk != hdim_vo by @xinyazhang in #135
- Add gfx1152 and gfx1153 iGPU support by @roberteg16 in #142
- Fix window size in v3 by @alugorey in #144
- Port Changes and Fixes from release/0.11 to main by @xinyazhang in #147
- API Change: Use Compact LSE Tensor for Varlen Inputs by @xinyazhang in #149
- API Change: Support Additional Varlen Memory Layouts by @xinyazhang in #150
- [ISSUE-98]: Alternative Wheel Mechanism by @aoguntayo in #153
- Fix Windows build for gfx942 affine kernels: use wide string for pstring_view by @jammm in #156
- Sunset Git Submodules Other Than Triton by @xinyazhang in #152
- Update README to include pkg-config dependency by @xinyazhang in #159
- Remove stray line in CMakeLists.txt by @sstamenk in #160
- Bump to AITER asm v0.1.11 by @xinyazhang in #158
- Enable Pipelining and XCD Remapping on gfx950 GPUs by @xinyazhang in #162
- Migrate the Tuner to V3 API by @xinyazhang in #164
- fix(gpu_targets): fail loud when arch filter drops every requested target (#169) by @1fanwang in #171
- Tuner V3.5 and Tuning Database. by @xinyazhang in #170
- Remove unnecessary Windows ccache guard by @astrelsky in #167
- 0.12b Operator Tuning Database and related Tuner Changes. by @xinyazhang in #172
- Misc. Changes for 0.12b Release by @xinyazhang in #173
New Contributors
- @lamikr made their first contribution in #138
- @alugorey made their first contribution in #144
- @aoguntayo made their first contribution in #153
- @1fanwang made their first contribution in #171
- @astrelsky made their first contribution in #167
Full Changelog: 0.11b...0.12b