### Fixed
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
### Added
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`; see the documentation for more details, and the sketch after this list
- fMHA: Added experimental support for Local Attention biases to `memory_efficient_attention`
- Added an example of efficient LLaMa decoding using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient RoPE implementation in Triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
- `xformers.info` now indicates the Flash-Attention version used
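The sketch below illustrates the 5-dimensional `[B, M, G, H, K]` input convention for Grouped-Query Attention mentioned above. It is a minimal, hedged example rather than the library's reference code: the shapes, names, and dtype are illustrative assumptions, it requires a CUDA device, and it relies on expanding key/value with a stride-0 broadcast so every query head in a group shares one key/value head.

```python
# Sketch only: shapes and names are illustrative assumptions, not the official example.
import torch
import xformers.ops as xops

B, M, K = 2, 1024, 128   # batch, sequence length, head dimension
G, H = 2, 4              # key/value groups, query heads per group (G=1 corresponds to Multi-Query Attention)

q = torch.randn(B, M, G, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16)

# Expand key/value across the per-group head dimension (stride-0 broadcast, no copy),
# so all H query heads in a group attend to the same key/value head.
out = xops.memory_efficient_attention(
    q,
    k.expand(B, M, G, H, K),
    v.expand(B, M, G, H, K),
)
assert out.shape == (B, M, G, H, K)
```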
### Removed
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- DEPRECATION: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue #848)