What's Changed from Release 0.9 Beta
- Fix a linker script problem in 0.9b which did not import the default linker script.
- Fix the version number problem in 0.9.1b, which still reported 0.9.0 and could cause confusion.
What's Changed from Release 0.8 Beta
- Initial gfx950 support by @xinyazhang in #64
- Note: gfx950 support is fully experimental, not built by default, and not shipped in the release binary packages unless explicitly stated in the package name
- Non Power of Two (NPOT) head dimension Optimization by @BinDinAMD and @xinyazhang in #66
- Newly added optimized NPOT head dimensions: 48, 80, 96, 160, 192, 224
- Note: older AOTriton versions do support these head dimensions, but inputs with these head dimensions were previously padded to power-of-two in-register tensors when loaded/accumulated, which caused performance issues.
- Port 20250128 main perf kernel by @xinyazhang in #70
- Use Philox64x4 PRNG, and remove RETURN_ENCODED_SOFTMAX=True variant from compiled forward kernel by @xinyazhang in #71
- Internally the Philox64x4 PRNG does not convert the output to fp32 anymore. Instead it views the i64x4 outputs as i32x8 and compares them with `idropout_p`, which is converted from the fp32 `dropout_p` to i32.
- `debug_fill_dropout_rng` and `debug_fill_dropout_rng_tensor` are deprecated and will be removed in 0.10, since they still use the Philox32x1 PRNG and their outputs do not match the actual PRNG values used in the dropout process.
- `debug_simulate_encoded_softmax` is the new API to get the outputs of the Philox64x4 PRNG. It will write 0.5 if the PRNG output is greater than `dropout_p`, and -0.5 otherwise.
- Add "AOTriton .." string to .comment section of libaotriton_v2.so by @xinyazhang in #74
- Now you can verify the precise AOTriton version with `readelf -p .comment libaotriton_v2.so`. An example output is:
[ 0] GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[ 2c] AOTriton 0.9.0
- Enable Persistent Dynamic for Causal if input is not varlen by @xinyazhang in #73
- For the `v2::flash::attn_fwd` API, `atomic_for_causal` must be a one-element GPU tensor with zero value if `is_causal` is `true`.
- Fused BWD Kernel by @BinDinAMD in #69
- No support for hdim > 256 due to register pressure.
- Empirically, this kernel outperforms the split kernel when hdim*seqlen <= 64 * 512.
- When using the fused bwd kernel, the `delta` tensor is not needed.
- Misc changes and performance tuning for 0.9b release by @xinyazhang in #76
- Initial RX 9070XT support
- The `softmax_lse` tensor is now optional in the forward kernel. For inference-only workloads this tensor is not needed.
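
The Fused BWD Kernel notes above give two conditions: no support for hdim > 256, and an empirical advantage over the split kernel when hdim*seqlen <= 64 * 512. A caller that wants to choose between the two kernels could encode that as a small helper. This is only a sketch: `prefer_fused_bwd` and the two constants are hypothetical names, not part of the AOTriton API, and the threshold is the empirical figure quoted above rather than a guarantee.

```python
# Sketch only: the names below are hypothetical and not part of AOTriton.
# The numbers are taken from the Fused BWD Kernel notes in this release.
MAX_FUSED_HDIM = 256          # fused kernel is not supported above this hdim
FUSED_WORK_LIMIT = 64 * 512   # empirical hdim*seqlen crossover point


def prefer_fused_bwd(hdim: int, seqlen: int) -> bool:
    """Return True when the release notes suggest the fused backward kernel."""
    if hdim > MAX_FUSED_HDIM:
        # Not supported due to register pressure.
        return False
    return hdim * seqlen <= FUSED_WORK_LIMIT


if __name__ == "__main__":
    for hdim, seqlen in [(64, 512), (128, 512), (512, 16)]:
        print(hdim, seqlen, prefer_fused_bwd(hdim, seqlen))
```

Since the crossover is empirical, treat the result as a starting point and benchmark on the target GPU.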
Known Problems
- There are unidentified memory alignment requirements on input/output tensors. If possible, please pad the input tensor shapes to a multiple of 8 for safety (except for the batch dimension).
- FA kernels built for 9070XT and gfx950 may cause GPU segfaults under certain unidentified conditions.
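
As an illustration of the padding advice above, the following sketch rounds every dimension except the batch dimension up to a multiple of 8. It only computes the padded shape and does not touch any tensor. The helper names are hypothetical, and it assumes the batch dimension comes first, which may not match your tensor layout.

```python
# Sketch only: helper names are hypothetical, and the batch dimension
# is ASSUMED to be the first entry of the shape tuple.
def pad_to_multiple(n: int, multiple: int = 8) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return -(-n // multiple) * multiple  # ceiling division, then scale back


def padded_shape(shape, multiple: int = 8):
    """Pad every dimension except the leading (batch) one."""
    batch, *rest = shape
    return (batch, *(pad_to_multiple(d, multiple) for d in rest))


if __name__ == "__main__":
    print(padded_shape((2, 12, 1021, 80)))
```

How the padded region is filled or masked is up to the caller and is not covered here.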
Full Changelog: 0.8b...0.9.2b