[0.0.21] - 2023-08-18
Improved
- fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available
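For context, a minimal sketch of what this looks like from the caller's side, assuming the public `xformers.ops.memory_efficient_attention` API and the `xformers.ops.fmha.flash` backend module; the exact backend actually chosen depends on your hardware, dtypes, and build:

```python
import torch
import xformers.ops as xops

# Inputs are laid out as (batch, seqlen, heads, head_dim).
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Default dispatch: a backend is picked automatically, which is
# Flash-Attention v2 whenever the inputs are eligible for it.
out = xops.memory_efficient_attention(q, k, v)

# Optionally pin the Flash backend explicitly; this raises if the inputs
# (dtype, head_dim, device, ...) are not supported by that kernel.
out = xops.memory_efficient_attention(
    q, k, v, op=(xops.fmha.flash.FwOp, xops.fmha.flash.BwOp)
)
```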
Bug fixes
- fMHA/cutlass: Fix potential race condition in the FW/BW passes
- fMHA/cutlass: Fix `attn_bias` stride overflow for very long sequences (>32k)
- `LowerTriangularMask` is now backward compatible with older xformers versions
Breaking changes
- `memory_efficient_attention` now expects the `attn_bias` argument to have a head dimension
- `memory_efficient_attention` no longer broadcasts the batch/head dimensions of `attn_bias`. Please use `.expand` if you need to broadcast the bias (see the sketch after this list)
- Remove `causal_diagonal` argument from `BlockDiagonalCausalWithOffsetPaddedKeysMask`
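To make the broadcasting change concrete, here is a hedged sketch of adapting a 3-D bias to the new contract; the shapes are illustrative, and only `memory_efficient_attention`, its `attn_bias` argument, and the `.expand` recommendation come from the notes above:

```python
import torch
import xformers.ops as xops

B, H, M, K = 2, 8, 1024, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# Before: a (B, M, M) bias was broadcast over heads implicitly.
bias_3d = torch.randn(B, M, M, device="cuda", dtype=torch.float16)

# Now: the bias must carry an explicit head dimension, and any batch/head
# broadcasting has to be done by the caller via .expand (which adds
# stride-0 dimensions without copying memory).
bias_4d = bias_3d[:, None, :, :].expand(B, H, M, M)

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias_4d)
```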
Added
- Binary wheels on pypi/conda now contain H100 kernels
- fMHA: Added backend specialized for decoding that does not use TensorCores - useful when not using multiquery
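As a rough illustration of where that backend matters, the shapes below mimic a single-token decoding step over a KV cache; whether the decoding-specialized kernel is actually selected depends on the dispatcher and your setup, so treat this as a sketch rather than a guarantee:

```python
import torch
import xformers.ops as xops

B, H, K = 4, 8, 128          # batch, heads, head_dim
T = 2048                     # length of the cached keys/values

# Decoding: one new query token attends over the whole KV cache.
q = torch.randn(B, 1, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, T, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, T, H, K, device="cuda", dtype=torch.float16)

# With no explicit op=, the dispatcher picks a backend; for query length 1
# it can route to the decoding-specialized kernel added in this release.
out = xops.memory_efficient_attention(q, k, v)
```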
NOTE: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.