github ROCm/aotriton 0.12b
AOTriton 0.12 Beta

4 hours ago

Major Changes from Release 0.11 Beta

API Changes

  • BREAKING: Varlen LSE tensor shape changes to (H, Total_seqlen) (#149).
  • Two new varlen layouts: PaddedVarlen and StridedVarlen for
    TransformerEngine compatibility — V3 API only (#150).
  • BREAKING: LazyTensor::acquire now receives self instead of cookie
    (#164).
  • Support hdim_qk != hdim_vo; dispatcher infers both from Q and V (#135).
  • attn_options::deterministic = true forces the deterministic split-kernel
    backward path (#134).
  • All V2 launch functions marked [[deprecated]]; removal planned for next
    feature release (#164). Exception: check_gpu and
    debug_simulate_encoded_softmax are un-deprecated in aotriton/flash.h
    (#173).
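
To make the new varlen LSE shape (#149) and the head-dimension inference (#135) concrete, here is a minimal sketch in plain Python. It is not AOTriton code: varlen_lse_shape and infer_head_dims are hypothetical helpers, and the sketch assumes the common varlen convention of a cumulative sequence-length array (called cu_seqlens here) and Q/V shapes whose last axis is the head dimension.

```python
from typing import Sequence, Tuple


def varlen_lse_shape(num_heads: int, cu_seqlens: Sequence[int]) -> Tuple[int, int]:
    """Return (H, Total_seqlen) for a packed variable-length batch.

    Hypothetical helper: cu_seqlens is assumed to hold batch + 1
    cumulative offsets starting at 0.
    """
    if len(cu_seqlens) < 2 or cu_seqlens[0] != 0:
        raise ValueError("cu_seqlens must start at 0 and hold batch + 1 entries")
    total_seqlen = cu_seqlens[-1]
    return (num_heads, total_seqlen)


def infer_head_dims(q_shape: Sequence[int], v_shape: Sequence[int]) -> Tuple[int, int]:
    """Take hdim_qk from Q's last axis and hdim_vo from V's last axis.

    Assumption: the head dimension is the last axis of both tensors.
    """
    return (q_shape[-1], v_shape[-1])


# Three sequences of length 5, 3 and 8 packed back to back.
cu_seqlens = [0, 5, 8, 16]
lse_shape = varlen_lse_shape(num_heads=4, cu_seqlens=cu_seqlens)

# Q/K use a 192-wide head, V/O use a 128-wide head (illustrative sizes).
hdim_qk, hdim_vo = infer_head_dims((16, 4, 192), (16, 4, 128))
print(lse_shape, hdim_qk, hdim_vo)
```

Code that allocates the LSE buffer itself is the part most likely to need updating for the breaking shape change.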

GPU Targets

  • Experimental: gfx1103, gfx1152, gfx1153 iGPU support added (#138, #142).
  • gfx1100 and gfx1151 promoted out of experimental (#173).
  • gfx11xx split into gfx110x (RDNA 3) and gfx115x (RDNA 3.5) build
    packs for independent release tarballs (#173).
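
As an illustration of the gfx110x / gfx115x split, the sketch below maps an architecture name to a build pack by reading the pack names literally as "prefix plus one trailing character". build_pack_for is a hypothetical helper; the real pack membership is defined by the AOTriton build system and may differ.

```python
def build_pack_for(arch: str) -> str:
    """Map a gfx11 architecture name to its build pack (illustrative only)."""
    # Assumption: a pack name like "gfx110x" covers every arch that is the
    # prefix "gfx110" followed by exactly one more character.
    packs = {"gfx110": "gfx110x", "gfx115": "gfx115x"}
    for prefix, pack in packs.items():
        if arch.startswith(prefix) and len(arch) == len(prefix) + 1:
            return pack
    raise ValueError(f"{arch} does not belong to a gfx11 build pack")


for arch in ("gfx1100", "gfx1103", "gfx1151", "gfx1152", "gfx1153"):
    print(arch, "->", build_pack_for(arch))
```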

Performance

  • gfx950: pipelining + XCD remapping; forward improves from ~753 to
    ~904 TFLOPS on MI355X with Triton mainline (hdim=128, non-causal) (#162).
  • Triton compiler bumped to 9c446b40 (ROCm/triton, Apr 2026).
  • AITER ASM kernels updated to v0.1.11 for gfx942 and gfx950 (#158).
  • Updated tuning databases for gfx942, gfx950, gfx1100, and gfx1201 (#172).
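
For a sense of scale, the gfx950 forward figures quoted above work out to roughly a 1.2x speedup. Both inputs are approximate in the notes, so the result is approximate too.

```python
# Approximate forward throughput on MI355X from the notes above
# (hdim=128, non-causal, Triton mainline).
before_tflops = 753.0
after_tflops = 904.0

speedup = after_tflops / before_tflops
gain_percent = (speedup - 1.0) * 100.0

print(f"speedup: {speedup:.2f}x, gain: {gain_percent:.1f}%")
```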

Bug Fixes

  • GQA + attention bias backward produced wrong gradients; fixed by moving
    bias pointer init inside the Q-head loop (#170).
  • Unsupported AOTRITON_TARGET_ARCH now fails loudly with a diagnostic
    instead of a cryptic downstream argparse error (#171).
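
The notes do not show the kernel, so the following only illustrates the class of bug behind the GQA + bias fix; it is not the actual AOTriton code. It assumes a bias tensor indexed per Q head and a backward pass that iterates over the Q heads sharing one K/V head. All names are hypothetical.

```python
from typing import List


def bias_offsets_hoisted(kv_head: int, group_size: int, stride_head: int) -> List[int]:
    """Buggy pattern: the bias offset is computed once, before the loop."""
    first_q_head = kv_head * group_size
    offset = first_q_head * stride_head  # stale for every later Q head
    offsets = []
    for _q_head in range(first_q_head, first_q_head + group_size):
        offsets.append(offset)
    return offsets


def bias_offsets_per_head(kv_head: int, group_size: int, stride_head: int) -> List[int]:
    """Fixed pattern: the bias offset is recomputed inside the Q-head loop."""
    first_q_head = kv_head * group_size
    offsets = []
    for q_head in range(first_q_head, first_q_head + group_size):
        offset = q_head * stride_head  # each Q head reads its own bias slice
        offsets.append(offset)
    return offsets


# K/V head 1 serves Q heads 4..7 when four Q heads share each K/V head.
print(bias_offsets_hoisted(kv_head=1, group_size=4, stride_head=100))
print(bias_offsets_per_head(kv_head=1, group_size=4, stride_head=100))
```

With group_size equal to 1 (no GQA) the two variants agree, which is consistent with the bug only appearing when GQA and bias are combined.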

Build and Dependency Changes

  • Tuning database sharded into per-arch files under
    v3python/database/<vendor>/<arch>/ (#133).
  • pybind11 and incbin submodules removed; pybind11 now comes from pip
    (#152).
  • Alternative Triton wheel YAML config mechanism added; pyyaml is a new
    build dependency (#132, #153).
  • __signature__ now includes AOTRITON_GIT_TREESHA1 (root tree SHA1,
    injectable via env var) (#173).
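
The sharded database layout can be pictured with a small path helper. This is a sketch, not part of the AOTriton build scripts: shard_dir is a hypothetical function, and the vendor value "amd" is an assumption used only for illustration.

```python
from pathlib import PurePosixPath

# Root of the per-arch tuning database, as given in the notes above.
DATABASE_ROOT = PurePosixPath("v3python/database")


def shard_dir(vendor: str, arch: str) -> PurePosixPath:
    """Return the directory expected to hold one architecture's shard."""
    if not vendor or not arch:
        raise ValueError("vendor and arch must be non-empty")
    return DATABASE_ROOT / vendor / arch


# "amd" is an assumed vendor name; the arch names come from the notes.
for arch in ("gfx942", "gfx950", "gfx1100", "gfx1201"):
    print(shard_dir("amd", arch))
```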

Minor Changes from Release 0.11 Beta

  • Tests default to V3 API; FWD_IMPL env var selects attn_fwd backend
    (#158).
  • pkg-config added as a build dependency (#159).
  • Windows build fixed for gfx942 affine kernels (wide pstring_view) (#156).

Known Problems

  • gfx1100: a small number of unit tests fail due to compiler accuracy issues.
  • gfx1201: a small number of unit tests fail due to a hipBLASLt GPU segfault.
  • gfx950: hdim=48/80 backward kernels disabled pending a compiler fix.
  • gfx950: attn_fwd with hdim=16 silently rounds up to hdim=32 at
    runtime until the upstream compiler bug is resolved.
  • AITER ASM kernels for dropout, SWA/GQA, and MQA/GQA fall back to Triton.
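
Callers who want to guard against the gfx950 head-dimension issues listed above could encode them in a small lookup. The helpers below are hypothetical and derived only from these notes; they do not query AOTriton, and the conditions should be dropped once the compiler fixes land.

```python
def effective_fwd_hdim(arch: str, hdim: int) -> int:
    """Head dimension attn_fwd effectively runs with, per the notes above."""
    if arch == "gfx950" and hdim == 16:
        return 32  # silently rounded up at runtime
    return hdim


def backward_disabled(arch: str, hdim: int) -> bool:
    """True where the notes list the backward kernels as disabled."""
    return arch == "gfx950" and hdim in (48, 80)


print(effective_fwd_hdim("gfx950", 16))
print(backward_disabled("gfx950", 80))
print(backward_disabled("gfx942", 80))
```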

Full Changelog: 0.11b...0.12b