This release contains many improvements to `memory_efficient_attention`, along with pip wheels now available on Windows and Linux!
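If your platform and PyTorch/CUDA combination is covered by the prebuilt wheels, installation should be a single command (assuming the package is published on PyPI under its usual `xformers` name):

```bash
pip install -U xformers
```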
New Features
- Added support for pip wheels [#588, #573, #534, #523, ...] - big thanks to @AbdBarho!
- fMHA: Added a Triton operator for the forward pass from Flash-Attention, authored by @TriDao; it is automatically used on A100 when compatible
- fMHA: Added `xformers.ops.memory_efficient_attention_forward`, `xformers.ops.memory_efficient_attention_forward_requires_grad`, and `xformers.ops.memory_efficient_attention_backward` for power users who write custom autograd functions [#560] (see the sketch after this list)
- fMHA: Support for custom scaling for the CUTLASS-based kernel [#530] - contribution from @comaniac
- fMHA: Separated each operator into a forward and a backward operator. It's now possible to use any combination of forward+backward (for instance Triton forward and Flash-Attention backward) [#560]
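To make the new power-user entry points and the forward/backward split more concrete, here is a minimal sketch. It assumes the `(output, lse)` return value of `memory_efficient_attention_forward_requires_grad`, the `(grad, output, lse, query, key, value)` argument order of `memory_efficient_attention_backward`, and the `fmha.triton.FwOp` / `fmha.flash.BwOp` operator names; check the `xformers.ops` docstrings of your installed version before relying on it.

```python
# Minimal sketch (not an official example): wiring the new power-user
# entry points into a custom torch.autograd.Function. The exact
# signatures and the fmha.triton.FwOp / fmha.flash.BwOp names are
# assumptions - verify against your installed xformers.ops.
import torch
import xformers.ops as xops


class MyMemEffAttention(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        # Forward pass that also returns the log-sum-exp ("lse") needed
        # later for the backward computation (assumed return value).
        out, lse = xops.memory_efficient_attention_forward_requires_grad(q, k, v)
        ctx.save_for_backward(q, k, v, out, lse)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, k, v, out, lse = ctx.saved_tensors
        # Assumed argument order: (grad, output, lse, query, key, value).
        grad_q, grad_k, grad_v = xops.memory_efficient_attention_backward(
            grad_out, out, lse, q, k, v
        )
        return grad_q, grad_k, grad_v


# Usage: out = MyMemEffAttention.apply(q, k, v)
#
# The forward/backward split also lets you mix operators through the
# regular API, e.g. Triton forward + Flash-Attention backward
# (op names assumed; availability depends on your GPU):
#   from xformers.ops import fmha
#   out = xops.memory_efficient_attention(q, k, v, op=(fmha.triton.FwOp, fmha.flash.BwOp))
```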
Improvements
- Stripped lineinfo from binaries, reducing the binary size [#549]
- fMHA: Stricter inputs validation to avoid CUDA errors for unsupported inputs [#592]
- fMHA/Flash-Attention: Updated to Dao-AILab/flash-attention@a1f49a2 with multiple changes from @TriDao that make the operator up to 20% faster
- Updated the Triton dependency [#418]
Bug fixes
- Fixed compatibility with Python 3.7 [#541] - thanks to @susumuota
- fMHA: Fixed strides for QKV gradients for CUTLASS attention [#535]
- fMHA/Flash-Attention: Fixed backward pass wrapper, where non-contiguous gradients could give the wrong result [#548]