Pre-built binary wheels require PyTorch 2.5.1
Improved:
- [fMHA] Creating a `LowerTriangularMask` no longer creates a CUDA tensor (see the first sketch after this list)
- [fMHA] Updated Flash-Attention to `v2.7.2.post1`
- [fMHA] Flash-Attention v3 will now be used by `memory_efficient_attention` by default when available, unless the operator is enforced with the `op` keyword-argument (see the second sketch after this list). Switching from Flash2 to Flash3 can make transformer trainings ~10% faster end-to-end on H100s
- [fMHA] Fixed a performance regression with the `cutlass` backend for the backward pass (#1176) - mostly used on older GPUs (e.g. V100)
- Fixed `swiglu` operator compatibility with `torch.compile` with PyTorch 2.6
- Fixed activation checkpointing of SwiGLU when AMP is enabled (#1152; see the SwiGLU sketch after this list)
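
A minimal sketch of the causal-mask path, assuming a CUDA device and the public `xformers.ops` API; tensor shapes and dtypes are illustrative. `LowerTriangularMask` is now a lightweight marker object, so constructing it no longer allocates anything on the GPU:

```python
import torch

from xformers.ops import LowerTriangularMask, memory_efficient_attention

# Inputs are (batch, sequence, heads, head_dim), fp16 on a CUDA device.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Constructing the mask is now cheap: no CUDA tensor is materialized,
# the backend applies the causal structure analytically.
out = memory_efficient_attention(q, k, v, attn_bias=LowerTriangularMask())
```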
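
To opt out of the new Flash3-by-default dispatch, the operator can be pinned via the `op` keyword-argument. A hedged sketch, assuming `MemoryEfficientAttentionFlashAttentionOp` (the Flash-Attention forward/backward op pair exported by `xformers.ops`) is supported for the given inputs:

```python
import torch

from xformers.ops import (
    MemoryEfficientAttentionFlashAttentionOp,
    memory_efficient_attention,
)

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Enforcing the operator disables automatic dispatch, including the
# Flash3-when-available default introduced in this release.
out = memory_efficient_attention(
    q, k, v, op=MemoryEfficientAttentionFlashAttentionOp
)
```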
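
The two SwiGLU fixes cover the combination sketched below: xFormers' `SwiGLU` module run under AMP autocast with activation checkpointing, and wrapped in `torch.compile`. This is a sketch under those assumptions, with illustrative dimensions, not a definitive reproduction of the fixed cases:

```python
import torch
from torch.utils.checkpoint import checkpoint

from xformers.ops import SwiGLU

# Fused SwiGLU MLP block; dimensions are illustrative.
swiglu = SwiGLU(in_features=512, hidden_features=1024).cuda()
x = torch.randn(4, 512, device="cuda", requires_grad=True)

with torch.autocast("cuda", dtype=torch.float16):
    # Checkpointing recomputes the SwiGLU forward during backward
    # instead of storing intermediate activations (the #1152 case).
    y = checkpoint(swiglu, x, use_reentrant=False)

y.float().sum().backward()

# With PyTorch 2.6, the swiglu operator also composes with torch.compile:
y2 = torch.compile(swiglu)(x.detach())
```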
Removed:
- Following PyTorch, xFormers no longer builds binaries for conda. Pip is now the only recommended way to get xFormers
- Removed unmaintained/deprecated components in `xformers.components.*` (see #848)