github ROCm/aotriton 0.10b
AOTriton 0.10 Beta

11 months ago

Major Changes from Release 0.9 Beta

  • Official support of gfx950/gfx1201
  • Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
  • Reduce libaotriton.so binary size by over 80%
  • Separate the concept of "GPU" and "Architecture"
    • New naming scheme: gfxNNN to select architecture, gfxAAA_modBBB to
      select concrete GPUs. For example, gfx942 selects MI300X/MI300A/MI325X,
      gfx942_mod0 selects MI300X only.
    • New cmake options: AOTRITON_TARGET_ARCH and
      AOTRITON_OVERRIDE_TARGET_GPUS
    • The old CMake option TARGET_GPUS is obsolete. As of this release, the
      CMake build script rejects it.
  • Revised dispatcher and code generator.
    • Unused Triton kernel arguments will be compiled as tl.constexpr(1)
  • Causal masks are implemented with Generalized sliding window attention (SWA)
    • We call it "Generalized" because negative values are allowed for both
      window_left and window_right, and are used to simulate bottom-right
      aligned causal masks.
  • Update Triton Compiler to May 21 2025.
    • This compiler version also changes tl.randint4x from Philox64x4 to
      Philox32x4. The Triton kernels are updated accordingly.
  • [Technology Preview] V3 API is added to the library, which looks like:
    hipError_t AOTRITON_API
    AOTRITON_NS::v3::flash::attn_fwd(const attn_fwd_params& params,
                                     int32_t params_version,
                                     AOTRITON_NS::Stream stream,
                                     const attn_options* options = nullptr);
    • This will be the main API going forward. It allows multiple backends to
      be selected automatically according to their measured performance,
      e.g., split vs. fused backward, or Triton kernel vs. AITER ASM kernel.
    • V2 API is considered "frozen" for backward compatibility. New features
      will only be exposed through V3 API, e.g., Generalized SWA.
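
The "Generalized" SWA item above can be illustrated with a small sketch. It assumes a top-left aligned window in which query row i may attend key column j when i - window_left <= j <= i + window_right. That convention and the helper names in_window, causal_bottom_right and windows_match_causal are assumptions made for illustration; they are not taken from the AOTriton sources. Under that assumption, a bottom-right aligned causal mask corresponds to window_right = seqlen_k - seqlen_q, which is negative whenever seqlen_q > seqlen_k.

```cpp
#include <cstdint>

// Hypothetical helper (not an AOTriton API). Assumed window convention:
// query row i may attend key column j iff
//   i - window_left <= j <= i + window_right.
// Both window sizes are allowed to be negative.
constexpr bool in_window(std::int64_t i, std::int64_t j,
                         std::int64_t window_left, std::int64_t window_right) {
  return j >= i - window_left && j <= i + window_right;
}

// Reference bottom-right aligned causal mask: the last query row sees
// every key column.
constexpr bool causal_bottom_right(std::int64_t i, std::int64_t j,
                                   std::int64_t seqlen_q, std::int64_t seqlen_k) {
  return j <= i + (seqlen_k - seqlen_q);
}

// Checks that the window parameters below reproduce the causal mask for
// every (query row, key column) pair.
constexpr bool windows_match_causal(std::int64_t seqlen_q, std::int64_t seqlen_k) {
  const std::int64_t window_left = seqlen_q;              // wide enough to never clip on the left
  const std::int64_t window_right = seqlen_k - seqlen_q;  // negative when seqlen_q > seqlen_k
  for (std::int64_t i = 0; i < seqlen_q; ++i) {
    for (std::int64_t j = 0; j < seqlen_k; ++j) {
      if (in_window(i, j, window_left, window_right) !=
          causal_bottom_right(i, j, seqlen_q, seqlen_k)) {
        return false;
      }
    }
  }
  return true;
}

// Compile-time checks (C++14 or later).
static_assert(windows_match_causal(4, 4), "square case, window_right = 0");
static_assert(windows_match_causal(3, 6), "more keys than queries, window_right = 3");
static_assert(windows_match_causal(6, 3), "more queries than keys, window_right = -3");
```

The last check is the case that needs a negative window_right: with seqlen_q = 6 and seqlen_k = 3, the first three query rows see no keys at all, which a non-negative window cannot express under this convention.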

Known Problems

  • The NPOT (non-power-of-two) optimization for head dimensions 48 and 80 is
    disabled for the fused backward kernel on gfx950, because no compiler
    options generate a correct GPU kernel.
  • For attn_fwd with bias (a.k.a. attn_mask) tensor inputs,
    PADDED_HEAD=True must be enforced to ensure correctness. This has a
    negative impact on forward performance when bias is enabled.
  • See the "Known Problems" section of #97 for other problems that do not
    affect users.

What's Changed (Automatically Generated)

  • build: Fix the linker script by @xinyazhang in #77
  • Start the development of version 0.10.0, and update the README.md by @xinyazhang in #79
  • Cut final binary size by @xinyazhang in #84
  • [BUG]Fix fuse bwd kernel compiling issue by @binding7012 in #83
  • Add pytest-xdist support to test/test_backward.py by @xinyazhang in #85
  • Dispatcher V3 Part 1 - Separate Concept "Arch(itecture)" and "GPU" by @xinyazhang in #87
  • CMakeLists.txt: don't build Triton's Proton profiler by @scottt in #90
  • v2src/CMakeLists.txt: stop if v2python.gpu_targets fails by @scottt in #94
  • template/autotune_table_entry.cc: include iostream for cerr by @scottt in #93
  • Support the gfx1151 by @scottt in #91
  • [Bug]fix table tool bug of different column name of causal for backward kernel by @binding7012 in #88
  • Dispatcher V3 Part 2 - Kernel Overloading by @xinyazhang in #89
  • Dispatcher V3 Part 3: Operators by @xinyazhang in #95
  • Re-implement Causal Masks with Windowed Attention by @xinyazhang in #96
  • Misc Changes and Triton Kernel Performance Tuning for 0.10b release by @xinyazhang in #97
  • Finalize 0.10b Development by @xinyazhang in #102
  • 0.10b Document Updates by @xinyazhang in #103

Full Changelog: 0.9.2b...0.10b
