facebookresearch/xformers v0.0.29
Enabling FAv3 by default, removed deprecated components

Pre-built binary wheels require PyTorch 2.5.1

Improved:

  • [fMHA] Creating a LowerTriangularMask no longer creates a CUDA tensor
  • [fMHA] Updated Flash-Attention to v2.7.2.post1
  • [fMHA] Flash-Attention v3 is now used by memory_efficient_attention by default when available, unless an operator is enforced with the op keyword-argument (see the first sketch after this list). Switching from Flash2 to Flash3 can make end-to-end transformer training ~10% faster on H100s
  • [fMHA] Fixed a performance regression with the cutlass backend for the backward pass (#1176) - mostly used on older GPUs (e.g. V100)
  • Fixed swiglu operator compatibility with torch.compile on PyTorch 2.6
  • Fixed activation checkpointing of SwiGLU when AMP is enabled (#1152); both SwiGLU fixes are exercised in the second sketch after this list
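
The new default dispatch and the op escape hatch look like this in practice. A minimal sketch, assuming a recent CUDA GPU; the tensor shapes and the `xformers.ops.fmha.flash` aliases for the Flash-Attention v2 operators are illustrative, not prescriptive:

```python
import torch
import xformers.ops as xops

# [batch, seqlen, heads, head_dim] layout expected by memory_efficient_attention
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Default dispatch: Flash-Attention v3 is picked automatically when available.
# LowerTriangularMask (causal attention) no longer allocates a CUDA tensor.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# Enforcing an operator with the `op` keyword-argument opts out of the
# automatic dispatch (here: the Flash-Attention v2 forward/backward pair).
out_flash2 = xops.memory_efficient_attention(
    q, k, v, op=(xops.fmha.flash.FwOp, xops.fmha.flash.BwOp)
)
```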

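The two SwiGLU fixes can be exercised as below. A minimal sketch, assuming xformers exports the `SwiGLU` module from `xformers.ops` and a CUDA device is present; the layer sizes are illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint
from xformers.ops import SwiGLU  # assumed export; layer sizes below are arbitrary

mlp = SwiGLU(in_features=512, hidden_features=1365).cuda().half()
x = torch.randn(16, 512, device="cuda", dtype=torch.float16)

# Fix 1: the fused swiglu operator now traces under torch.compile on PyTorch 2.6.
compiled_mlp = torch.compile(mlp)
y = compiled_mlp(x)

# Fix 2: activation checkpointing works when AMP/autocast is enabled (#1152).
with torch.autocast("cuda", dtype=torch.float16):
    y_ckpt = checkpoint(mlp, x, use_reentrant=False)
```
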
Removed:

  • Following PyTorch, xFormers no longer builds binaries for conda. Pip is now the only recommended way to get xFormers
  • Removed unmaintained/deprecated components in xformers.components.* (see #848)
