ROCm/aotriton 0.10b (AOTriton 0.10 Beta) on GitHub

Major Changes from Release 0.9 Beta

Official support for gfx950/gfx1201
Experimental support for gfx1101/gfx1151/gfx1150/gfx1200
Reduce libaotriton.so binary size by over 80%
Separate the concept of "GPU" and "Architecture"
- New naming scheme: gfxNNN to select architecture, gfxAAA_modBBB to
  select concrete GPUs. For example, gfx942 selects MI300X/MI300A/MI325X,
  gfx942_mod0 selects MI300X only.
- New cmake options: AOTRITON_TARGET_ARCH and
  AOTRITON_OVERRIDE_TARGET_GPUS
- The old cmake option TARGET_GPUS is OBSOLETE. As of this release, the
  CMake build script rejects it.
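To illustrate the naming scheme, here is a minimal C++ sketch that splits a target string into its architecture part and an optional GPU index. `TargetName` and `parse_target` are hypothetical names made up for this example and are not part of the AOTriton API; the sketch assumes only the two forms described above (gfxNNN and gfxAAA_modBBB, with BBB a short decimal number).

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper type, not part of AOTriton.
struct TargetName {
  std::string arch;        // e.g. "gfx942": selects every GPU of that architecture
  std::optional<int> mod;  // e.g. 0 for "gfx942_mod0"; empty when only the arch is given
};

// Parses "gfxNNN" or "gfxAAA_modBBB". Returns std::nullopt for anything else.
std::optional<TargetName> parse_target(const std::string& name) {
  const std::string sep = "_mod";
  const auto pos = name.find(sep);
  // When no "_mod" is present, pos is npos and substr returns the whole string.
  const std::string arch = name.substr(0, pos);
  if (arch.size() <= 3 || arch.compare(0, 3, "gfx") != 0) {
    return std::nullopt;
  }
  if (pos == std::string::npos) {
    return TargetName{arch, std::nullopt};
  }
  const std::string digits = name.substr(pos + sep.size());
  // Assumption of this sketch: the index is 1 to 4 decimal digits.
  if (digits.empty() || digits.size() > 4) {
    return std::nullopt;
  }
  for (unsigned char c : digits) {
    if (!std::isdigit(c)) {
      return std::nullopt;
    }
  }
  return TargetName{arch, std::stoi(digits)};
}
```

Under this sketch, `parse_target("gfx942")` yields an architecture with no GPU index, while `parse_target("gfx942_mod0")` yields the same architecture with index 0.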
Revised dispatcher and code generator.
- Unused Triton kernel arguments will be compiled as tl.constexpr(1)
Causal masks are implemented with Generalized Sliding Window Attention (SWA)
- We call it "Generalized" because negative values are allowed for both
  window_left and window_right. Negative values are used to simulate
  bottom-right aligned causal masks.
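The following C++ sketch shows why a negative window value can be needed. It assumes a top-left aligned window convention in which query row i may attend to key column j when i - window_left <= j <= i + window_right; this convention, and every name in the code, is an assumption made for illustration and is not taken from the AOTriton sources. Under that assumption, a bottom-right aligned causal mask corresponds to window_right = seqlen_k - seqlen_q, which is negative whenever seqlen_q > seqlen_k.

```cpp
#include <cassert>
#include <cstdint>

// Assumed window convention (illustrative only): row i sees column j iff
//   i - window_left <= j <= i + window_right
bool swa_allows(int64_t i, int64_t j, int64_t window_left, int64_t window_right) {
  return j >= i - window_left && j <= i + window_right;
}

// Reference definition of a bottom-right aligned causal mask: the last query
// row sees every key, and each earlier row sees one key fewer.
bool causal_bottom_right(int64_t i, int64_t j, int64_t seqlen_q, int64_t seqlen_k) {
  return j <= i + (seqlen_k - seqlen_q);
}

struct Window {
  int64_t left;
  int64_t right;
};

// Window values that reproduce the bottom-right causal mask under the
// convention above. left = seqlen_q removes the lower bound for every row.
Window window_for_bottom_right_causal(int64_t seqlen_q, int64_t seqlen_k) {
  assert(seqlen_q > 0 && seqlen_k > 0);
  return Window{seqlen_q, seqlen_k - seqlen_q};
}

// Exhaustively compares the two mask definitions on a seqlen_q x seqlen_k grid.
bool masks_agree(int64_t seqlen_q, int64_t seqlen_k) {
  const Window w = window_for_bottom_right_causal(seqlen_q, seqlen_k);
  for (int64_t i = 0; i < seqlen_q; ++i) {
    for (int64_t j = 0; j < seqlen_k; ++j) {
      if (swa_allows(i, j, w.left, w.right) !=
          causal_bottom_right(i, j, seqlen_q, seqlen_k)) {
        return false;
      }
    }
  }
  return true;
}
```

For example, with seqlen_q = 4 and seqlen_k = 2 the sketch picks window_right = -2, so the first two query rows attend to no keys at all, which matches the bottom-right aligned definition.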
Update the Triton compiler to the May 21, 2025 version.
- This compiler also changes tl.randint4x from Philox64x4 to
  Philox32x4. The Triton kernels are updated accordingly.
[Technology Preview] A V3 API is added to the library. It looks like:
```
hipError_t AOTRITON_API
AOTRITON_NS::v3::flash::attn_fwd(const attn_fwd_params& params,
                                 int32_t params_version,
                                 AOTRITON_NS::Stream stream,
                                 const attn_options* options = nullptr);
```
- This will be the main API going forward. It allows multiple backends to
  be selected automatically according to their measured performance,
  e.g., split vs. fused backward, or Triton kernel vs. AITER ASM
  kernel.
- The V2 API is considered "frozen" for backward compatibility. New features,
  e.g., Generalized SWA, will only be exposed through the V3 API.

Known Problems

NPOT (non-power-of-two) optimization for head dimensions 48 and 80 is
disabled for the fused backward kernel on gfx950. No compiler option
generates a correct GPU kernel.
For attn_fwd with bias (a.k.a. attn_mask) tensor inputs,
PADDED_HEAD=True must be enforced to ensure correctness. This has a
negative impact on forward performance when bias is enabled.
See the "Known Problems" section of #97
for other problems that do not affect users.

What's Changed (Automatically Generated)

build: Fix the linker script by @xinyazhang in #77
Start the development of version 0.10.0, and update the README.md by @xinyazhang in #79
Cut final binary size by @xinyazhang in #84
[BUG]Fix fuse bwd kernel compiling issue by @binding7012 in #83
Add pytest-xdist support to test/test_backward.py by @xinyazhang in #85
Dispatcher V3 Part 1 - Separate Concept "Arch(itecture)" and "GPU" by @xinyazhang in #87
CMakeLists.txt: don't build Triton's Proton profiler by @scottt in #90
v2src/CMakeLists.txt: stop if v2python.gpu_targets fails by @scottt in #94
template/autotune_table_entry.cc: include iostream for cerr by @scottt in #93
Support the gfx1151 by @scottt in #91
[Bug]fix table tool bug of different column name of causal for backward kernel by @binding7012 in #88
Dispatcher V3 Part 2 - Kernel Overloading by @xinyazhang in #89
Dispatcher V3 Part 3: Operators by @xinyazhang in #95
Re-implement Causal Masks with Windowed Attention by @xinyazhang in #96
Misc Changes and Triton Kernel Performance Tuning for 0.10b release by @xinyazhang in #97
Finalize 0.10b Development by @xinyazhang in #102
0.10b Document Updates by @xinyazhang in #103

New Contributors

@scottt made their first contribution in #90

Full Changelog: 0.9.2b...0.10b
