### Fixed
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
### Added
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`; see the documentation for more details, and the sketch after this list
- fMHA: Added experimental support for Local Attention biases to `memory_efficient_attention`
- Added an example of efficient LLaMa decoding using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient RoPE implementation in Triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
- `xformers.info` now indicates the Flash-Attention version used
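The sketch below illustrates the 5-dimensional `[B, M, G, H, K]` input convention for Grouped-Query Attention mentioned above. It is a minimal, hedged example rather than the library's reference code: the shapes, names, and dtype are illustrative assumptions, it requires a CUDA device, and it relies on expanding key/value with a stride-0 broadcast so every query head in a group shares one key/value head.

```python
# Sketch only: shapes and names are illustrative assumptions, not the official example.
import torch
import xformers.ops as xops

B, M, K = 2, 1024, 128   # batch, sequence length, head dimension
G, H = 2, 4              # key/value groups, query heads per group (G=1 corresponds to Multi-Query Attention)

q = torch.randn(B, M, G, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16)

# Expand key/value across the per-group head dimension (stride-0 broadcast, no copy),
# so all H query heads in a group attend to the same key/value head.
out = xops.memory_efficient_attention(
    q,
    k.expand(B, M, G, H, K),
    v.expand(B, M, G, H, K),
)
assert out.shape == (B, M, G, H, K)
```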
### Removed
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- DEPRECATION: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue #848)