github ROCm/aotriton 0.9.2b
AOTriton 0.9.2 Beta

14 months ago

What's Changed from Release 0.9 Beta

  • Fix a linker script problem in 0.9b, where the default linker script was not imported.
  • Fix the version number in 0.9.1b, which still reported 0.9.0 and could cause confusion.

What's Changed from Release 0.8 Beta

  • Initial support for gfx950 by @xinyazhang in #64
    • Note: gfx950 support is fully experimental, not built by default, and not shipped in the release binary packages unless explicitly stated in the package name
  • Non Power of Two (NPOT) head dimension Optimization by @BinDinAMD and @xinyazhang in #66
    • Newly added optimized NPOT head dimensions: 48, 80, 96, 160, 192, 224
    • Note: older AOTriton releases do support these head dimensions, but inputs with these head dimensions were previously padded to power-of-two in-register tensors when loaded and accumulated, which caused performance issues.
  • Port 20250128 main perf kernel by @xinyazhang in #70
  • Use Philox64x4 PRNG, and remove RETURN_ENCODED_SOFTMAX=True variant from compiled forward kernel by @xinyazhang in #71
    • Internally, the Philox64x4 PRNG no longer converts its output to fp32. Instead, it views the i64x4 outputs as i32x8 and compares them with idropout_p, which is dropout_p converted from fp32 to i32.
    • debug_fill_dropout_rng and debug_fill_dropout_rng_tensor are deprecated and will be removed in 0.10, since they still use the Philox32x1 PRNG and their outputs do not match the actual PRNG values used in the dropout process.
    • debug_simulate_encoded_softmax is the new API to get the outputs of the Philox64x4 PRNG. It writes 0.5 if the PRNG output is greater than dropout_p, and -0.5 otherwise.
  • Add "AOTriton .." string to .comment section of libaotriton_v2.so by @xinyazhang in #74
    • Now you can verify the precise AOTriton version with readelf -p .comment libaotriton_v2.so. An example output is:
[     0]  GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[    2c]  AOTriton 0.9.0
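For scripted checks, the version line can be pulled out of the readelf output. The following is a minimal sketch, not part of AOTriton: the helper name find_aotriton_version is hypothetical, and it assumes the .comment string always has the "AOTriton <version>" shape shown in the example output above.

```python
import re


def find_aotriton_version(readelf_output: str):
    """Return the text after 'AOTriton' in `readelf -p .comment` output, or None.

    Assumption: string-dump lines look like '[    2c]  AOTriton 0.9.0',
    as in the example output from the release notes.
    """
    pattern = re.compile(r"^\s*\[\s*[0-9a-fA-F]+\]\s+AOTriton\s+(.+?)\s*$")
    for line in readelf_output.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None


# Sample text copied from the example output in the release notes.
sample = """\
[     0]  GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[    2c]  AOTriton 0.9.0
"""
print(find_aotriton_version(sample))
```

In practice the input would come from the real command, for example subprocess.run(["readelf", "-p", ".comment", "libaotriton_v2.so"], capture_output=True, text=True).stdout.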
  • Enable Persistent Dynamic for Causal if input is not varlen by @xinyazhang in #73
    • For the v2::flash::attn_fwd API, atomic_for_causal must be a one-element GPU tensor with a value of zero if is_causal is true.
  • Fused BWD Kernel by @BinDinAMD in #69
    • No support for hdim > 256 due to register pressure.
    • Empirically, this kernel outperforms the split kernel when hdim*seqlen <= 64 * 512.
    • When using the fused bwd kernel, the delta tensor is not needed.
  • Misc changes and performance tuning for 0.9b release by @xinyazhang in #76
    • Initial RX 9070XT support
    • The softmax_lse tensor is now optional in the forward kernel. This tensor is not needed for inference-only workloads.
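
The guidance in the Fused BWD Kernel item above (no support for hdim > 256, and an empirical advantage when hdim*seqlen <= 64 * 512) can be written down as a small selection rule. This is only an illustrative sketch of those two stated conditions. It is not AOTriton's actual dispatch logic, and the names prefer_fused_bwd, FUSED_MAX_HDIM and FUSED_WORK_LIMIT are hypothetical.

```python
FUSED_MAX_HDIM = 256         # release notes: no fused support for hdim > 256
FUSED_WORK_LIMIT = 64 * 512  # release notes: empirical crossover for hdim * seqlen


def prefer_fused_bwd(hdim: int, seqlen: int) -> bool:
    """Return True when the notes suggest the fused backward kernel is the better pick."""
    if hdim > FUSED_MAX_HDIM:
        # The fused kernel is unavailable here because of register pressure.
        return False
    return hdim * seqlen <= FUSED_WORK_LIMIT


for hdim, seqlen in [(64, 512), (128, 256), (128, 1024), (512, 16)]:
    kernel = "fused" if prefer_fused_bwd(hdim, seqlen) else "split"
    print(f"hdim={hdim:4d} seqlen={seqlen:5d} -> {kernel}")
```

Since the threshold is described as empirical, treat it as a starting point for benchmarking rather than a fixed rule.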

Known Problems

  • There are unidentified memory alignment requirements for input/output tensors. If possible, please pad the input tensor shapes to multiples of 8 for safety (except for the batch dimension).
  • FA kernels built for 9070XT and gfx950 may cause GPU segfaults under certain unidentified conditions.
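
The padding advice in the first item can be sketched as a shape computation. This is a minimal sketch under assumptions that the notes do not state: the batch dimension is taken to be index 0, and the helper name pad_shape_to_multiple is hypothetical. It only computes the target shape. Padding the data itself, and masking or slicing away the padded positions, is left to the caller.

```python
def pad_shape_to_multiple(shape, multiple=8, batch_dim=0):
    """Round every dimension except batch_dim up to the next multiple.

    Assumption: batch_dim=0, i.e. the batch dimension comes first.
    """
    return tuple(
        dim if i == batch_dim else -(-dim // multiple) * multiple  # ceil division
        for i, dim in enumerate(shape)
    )


# Hypothetical (batch, heads, seqlen, hdim) shapes, for illustration only.
print(pad_shape_to_multiple((3, 5, 1000, 80)))
print(pad_shape_to_multiple((2, 12, 1021, 72)))
```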

Full Changelog: 0.8b...0.9.2b
