Major Changes from Release 0.9 Beta
- Official support of gfx950/gfx1201
- Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
- Reduce `libaotriton.so` binary size by over 80%
- Separate the concept of "GPU" and "Architecture"
  - New naming scheme: `gfxNNN` to select architecture, `gfxAAA_modBBB` to
    select concrete GPUs. For example, `gfx942` selects MI300X/MI300A/MI325X,
    `gfx942_mod0` selects MI300X only.
  - New cmake options: `AOTRITON_TARGET_ARCH` and
    `AOTRITON_OVERRIDE_TARGET_GPUS`
  - Old cmake option `TARGET_GPUS` is OBSOLETE. Using it is prevented by the
    CMake build script since this release.
- Revised dispatcher and code generator.
  - Unused Triton kernel arguments will be compiled as `tl.constexpr(1)`
- Causal masks are implemented with Generalized sliding window attention (SWA)
  - We call it "Generalized" because negative values are allowed for both
    `window_left` and `window_right`, and used to simulate bottom-right
    aligned causal masks.
- Update Triton Compiler to May 21 2025.
  - This compiler also changes the `tl.randint4x` from Philox64x4 to
    Philox32x4. The Triton kernel also changes accordingly.
- [Technology Preview] V3 API is added to the library, which looks like:
hipError_t AOTRITON_API AOTRITON_NS::v3::flash::attn_fwd(const attn_fwd_params& params, int32_t params_version, AOTRITON_NS::Stream stream, const attn_options* options = nullptr);
  - This will be the main API to support in the future. It allows multiple
    backends to be selected automatically according to their measured
    performance, e.g., split/fused backward, Triton kernel vs AITER ASM
    kernel.
  - V2 API is considered "frozen" for backward compatibility. New features
    will only be exposed through V3 API, e.g., Generalized SWA.
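The `gfxNNN` / `gfxAAA_modBBB` naming scheme described above can be illustrated with a small parser. This is only a sketch: `split_target` and the regular expression are hypothetical helpers written for this note, not part of AOTriton, and the exact set of accepted names is an assumption based on the examples in the list.

```python
import re

# Assumed grammar, inferred from the examples in these notes:
#   gfx<arch digits/letters>            -> a whole architecture, e.g. gfx942
#   gfx<arch digits/letters>_mod<int>   -> one concrete GPU, e.g. gfx942_mod0
_TARGET_RE = re.compile(r"(gfx[0-9a-f]+)(?:_mod(\d+))?")


def split_target(name):
    """Split a target name into (architecture, mod index or None).

    Hypothetical helper for illustration; not an AOTriton API.
    """
    match = _TARGET_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a gfxNNN or gfxAAA_modBBB name: {name!r}")
    arch, mod = match.group(1), match.group(2)
    return arch, (int(mod) if mod is not None else None)


# An architecture name carries no mod index; a GPU name carries one.
print(split_target("gfx942"))       # architecture only
print(split_target("gfx942_mod0"))  # one concrete GPU of that architecture
```

Under this reading, a name without a `_mod` suffix selects every GPU of the architecture, while a name with the suffix narrows the selection to a single GPU.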
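To see why negative window values are useful for the Generalized SWA item above, the following sketch builds boolean masks in plain Python. The convention used here is an assumption made for illustration, not a statement of AOTriton's actual definition: the window is taken relative to the top-left aligned diagonal, so query `i` may attend key `j` when `i - window_left <= j <= i + window_right`. The function names are hypothetical.

```python
def swa_mask(seqlen_q, seqlen_k, window_left, window_right):
    """Boolean mask under the ASSUMED convention:
    query i attends key j iff i - window_left <= j <= i + window_right.
    Both window values may be negative."""
    return [
        [(i - window_left) <= j <= (i + window_right) for j in range(seqlen_k)]
        for i in range(seqlen_q)
    ]


def bottom_right_causal_mask(seqlen_q, seqlen_k):
    """Reference causal mask aligned to the bottom-right corner:
    the last query row sees every key."""
    offset = seqlen_k - seqlen_q
    return [
        [j <= i + offset for j in range(seqlen_k)]
        for i in range(seqlen_q)
    ]


# With more queries than keys the required right window is negative.
seqlen_q, seqlen_k = 4, 2
mask = swa_mask(
    seqlen_q,
    seqlen_k,
    window_left=seqlen_q,              # wide enough to never clip on the left
    window_right=seqlen_k - seqlen_q,  # -2 here
)
assert mask == bottom_right_causal_mask(seqlen_q, seqlen_k)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed convention, a bottom-right aligned causal mask needs `window_right = seqlen_k - seqlen_q`, which is negative whenever there are more queries than keys. A window restricted to non-negative values could not express that case.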
Known Problems
- NPOT optimization for head dimension 48/80 is disabled for the fused backward
  kernel on gfx950. No compiler option generates a correct GPU kernel.
- For `attn_fwd` with `bias` (a.k.a. `attn_mask`) tensor inputs,
  `PADDED_HEAD=True` must be enforced to ensure correctness. This has a
  negative impact on forward performance when bias is enabled.
- See the "Known Problems" section of #97 for other problems that will not
  affect users.
What's Changed (Automatically Generated)
- build: Fix the linker script by @xinyazhang in #77
- Start the development of version 0.10.0, and update the README.md by @xinyazhang in #79
- Cut final binary size by @xinyazhang in #84
- [BUG]Fix fuse bwd kernel compiling issue by @binding7012 in #83
- Add pytest-xdist support to test/test_backward.py by @xinyazhang in #85
- Dispatcher V3 Part 1 - Separate Concept "Arch(itecture)" and "GPU" by @xinyazhang in #87
- CMakeLists.txt: don't build Triton's Proton profiler by @scottt in #90
- v2src/CMakeLists.txt: stop if v2python.gpu_targets fails by @scottt in #94
- template/autotune_table_entry.cc: include iostream for cerr by @scottt in #93
- Support the gfx1151 by @scottt in #91
- [Bug]fix table tool bug of different column name of causal for backward kernel by @binding7012 in #88
- Dispatcher V3 Part 2 - Kernel Overloading by @xinyazhang in #89
- Dispatcher V3 Part 3: Operators by @xinyazhang in #95
- Re-implement Causal Masks with Windowed Attention by @xinyazhang in #96
- Misc Changes and Triton Kernel Performance Tuning for 0.10b release by @xinyazhang in #97
- Finalize 0.10b Development by @xinyazhang in #102
- 0.10b Document Updates by @xinyazhang in #103
New Contributors
Full Changelog: 0.9.2b...0.10b