ROCm/aotriton 0.9.2b on GitHub

What's Changed from Release 0.9 Beta

Fix a linker script problem in 0.9b, which did not import the default linker script.
Fix the version number problem in 0.9.1b, which still reported 0.9.0 and could cause confusion.

What's Changed from Release 0.8 Beta

Initial support for gfx950 by @xinyazhang in #64
- Note: gfx950 support is fully experimental, not built by default, and not shipped in the release binary packages unless explicitly stated in the package name
Non Power of Two (NPOT) head dimension Optimization by @BinDinAMD and @xinyazhang in #66
- Newly added optimized NPOT head dimensions: 48, 80, 96, 160, 192, 224
- Note: older AOTriton versions do support these head dimensions, but inputs with these head dimensions were previously padded to power-of-two in-register tensors when loaded/accumulated, which caused performance issues.
Port 20250128 main perf kernel by @xinyazhang in #70
Use Philox64x4 PRNG, and remove RETURN_ENCODED_SOFTMAX=True variant from compiled forward kernel by @xinyazhang in #71
- Internally, the Philox64x4 PRNG no longer converts its output to fp32. Instead, it views the i64x4 outputs as i32x8 and compares them with idropout_p, which is converted from the fp32 dropout_p to i32.
- debug_fill_dropout_rng and debug_fill_dropout_rng_tensor are deprecated and will be removed in 0.10, since they still use the Philox32x1 PRNG and their outputs do not match the actual PRNG values used in the dropout process.
- debug_simulate_encoded_softmax is the new API to get the outputs of the Philox64x4 PRNG. It writes 0.5 if the PRNG output is greater than dropout_p, and -0.5 otherwise.
Add "AOTriton .." string to .comment section of libaotriton_v2.so by @xinyazhang in #74
- Now you can verify the precise AOTriton version with readelf -p .comment libaotriton_v2.so. An example output is:

[     0]  GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[    2c]  AOTriton 0.9.0
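
The check above can also be scripted. The following is a minimal Python sketch, assuming readelf is on PATH and that the .comment entry has the "AOTriton <version>" form shown in the example output. The helper names parse_aotriton_version and aotriton_version_of are hypothetical and not part of AOTriton.

```python
import re
import subprocess


def parse_aotriton_version(readelf_output):
    """Return the token following 'AOTriton ' in readelf -p output, or None.

    Hypothetical helper: it only assumes the 'AOTriton <version>' form
    shown in the example output of the release notes.
    """
    match = re.search(r"AOTriton\s+(\S+)", readelf_output)
    return match.group(1) if match else None


def aotriton_version_of(library_path):
    """Run `readelf -p .comment <library_path>` and extract the version."""
    result = subprocess.run(
        ["readelf", "-p", ".comment", library_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_aotriton_version(result.stdout)


# Parse the example output quoted in the release notes.
sample = """\
  [     0]  GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  [    2c]  AOTriton 0.9.0
"""
print(parse_aotriton_version(sample))
```

Calling aotriton_version_of("libaotriton_v2.so") on an installed library would run the same readelf command and return the version string, or None if no AOTriton entry is present.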

Enable Persistent Dynamic for Causal if input is not varlen by @xinyazhang in #73
- For the v2::flash::attn_fwd API, atomic_for_causal must be a one-element GPU tensor with zero value if is_causal is true.
Fused BWD Kernel by @BinDinAMD in #69
- No support for hdim > 256 due to register pressure.
- Empirically, this kernel outperforms the split kernel when hdim*seqlen <= 64 * 512.
- When using the fused bwd kernel, the delta tensor is not needed.
Misc changes and performance tuning for 0.9b release by @xinyazhang in #76
- Initial RX 9070XT support
- The softmax_lse tensor is now optional in the forward kernel. Inference-only workloads do not need this tensor.
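
The two constraints listed for the fused BWD kernel above (no support for hdim > 256, and the empirical crossover at hdim*seqlen <= 64 * 512) can be combined into a caller-side check. This is only a sketch of the stated rule of thumb in Python: prefer_fused_bwd is a hypothetical helper, not AOTriton's actual dispatch logic, and the real crossover point may depend on the GPU and data type.

```python
# Limits taken from the release notes; the helper itself is hypothetical.
FUSED_BWD_MAX_HDIM = 256         # fused kernel does not support hdim > 256
FUSED_BWD_WORK_LIMIT = 64 * 512  # empirical hdim * seqlen crossover


def prefer_fused_bwd(hdim, seqlen):
    """Return True when the notes suggest the fused kernel should win."""
    if hdim > FUSED_BWD_MAX_HDIM:
        # Unsupported due to register pressure, regardless of seqlen.
        return False
    return hdim * seqlen <= FUSED_BWD_WORK_LIMIT


for hdim, seqlen in [(64, 512), (128, 512), (64, 256), (512, 16)]:
    kind = "fused" if prefer_fused_bwd(hdim, seqlen) else "split"
    print(f"hdim={hdim:4d} seqlen={seqlen:4d} -> {kind}")
```

Note that (512, 16) falls under the hdim*seqlen limit but is still routed to the split kernel, because the hdim limit is a hard constraint while the product is only a performance heuristic.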

Known Problems

There are unidentified memory alignment requirements for input/output tensors. If possible, please pad the input tensor shapes to multiples of 8 for safety (except for the batch dimension).
FA kernels built for 9070XT and gfx950 may cause GPU segfaults under certain unidentified conditions.
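
As an illustration of the padding advice above, the following Python sketch computes the padded shape for a tensor. It assumes the batch dimension is the first one, which the notes do not state, and padded_shape is a hypothetical helper rather than an AOTriton API. It only computes target sizes; actually padding the data and masking out the padded region is left to the caller.

```python
def pad_to_multiple(value, multiple=8):
    """Round value up to the nearest multiple (ceiling division)."""
    return -(-value // multiple) * multiple


def padded_shape(shape, multiple=8):
    """Pad every dimension except the first to a multiple of `multiple`.

    Assumption: shape[0] is the batch dimension and is left unchanged.
    """
    batch, *rest = shape
    return (batch, *(pad_to_multiple(dim, multiple) for dim in rest))


print(padded_shape((3, 12, 1001, 72)))
```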

Full Changelog: 0.8b...0.9.2b
