facebookresearch/xformers v0.0.22
Faster LLM inference with Flash-Decoding, Local attention

Fixed

  • fMHA: Backward pass now works in PyTorch deterministic mode (although slower)

Added

  • fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to memory_efficient_attention; see the documentation for more details, and the first sketch after this list for the input layout
  • fMHA: Added experimental support for Local Attention biases to memory_efficient_attention (see the second sketch after this list)
  • Added an example of efficient LLaMa decoding using xformers operators
  • Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding: attention is up to 50x faster for long sequences, making end-to-end token decoding up to 8x faster
  • Added an efficient RoPE (rotary position embedding) implementation in Triton, to be used in LLM decoding (a reference sketch of the computation follows this list)
  • Added selective activation checkpointing, which gives fine-grained control over which activations to keep and which to recompute (see the sketch after this list)
  • xformers.info (run python -m xformers.info) now indicates the Flash-Attention version used
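
A minimal sketch of the 5-dimensional MQA/GQA input layout, based on the memory_efficient_attention documentation; the [batch, seq_len, groups, heads_per_group, head_dim] convention and the stride-0 expand trick come from there, so verify against the docs for your installed version:

```python
import torch
import xformers.ops as xops

B, M, K = 2, 1024, 128  # batch, sequence length, head dim
G, H = 4, 8             # 4 key/value head groups, 8 query heads per group

# Queries have one head per (group, head-in-group) pair: [B, M, G, H, K]
q = torch.randn(B, M, G, H, K, device="cuda", dtype=torch.float16)

# Keys/values have a single head per group; expand() broadcasts them across
# the H dimension with stride 0, so no memory is duplicated.
k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)
v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)

out = xops.memory_efficient_attention(q, k, v)  # -> [B, M, G, H, K]
```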
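
Local attention restricts each query to a sliding window of keys. A minimal sketch, assuming the LocalAttentionFromBottomRightMask bias class and its window_left/window_right parameters; check xformers.ops.fmha.attn_bias for the exact name in your version:

```python
import torch
import xformers.ops as xops
from xformers.ops.fmha.attn_bias import LocalAttentionFromBottomRightMask

B, M, H, K = 2, 1024, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# Each query attends to at most 256 keys to its left and none to its right.
bias = LocalAttentionFromBottomRightMask(window_left=256, window_right=0)
out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```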
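
The Triton kernel itself ships with xformers; as a reference for what it computes, here is a plain-PyTorch rotary position embedding sketch (this is not the Triton op, and the function name is illustrative):

```python
import torch

def rotary_embedding(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Reference RoPE for x of shape [B, M, H, K], with K even."""
    B, M, H, K = x.shape
    # Per-pair rotation frequencies and per-position angles.
    freqs = 1.0 / (theta ** (torch.arange(0, K, 2, device=x.device).float() / K))
    angles = torch.arange(M, device=x.device).float()[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # [1, M, 1, K/2]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (even, odd) channel pair by its position-dependent angle.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```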
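
A sketch of what fine-grained checkpointing looks like. The import path and the policy_fn signature below are assumptions for illustration, not confirmed API; consult this release's documentation for the real interface:

```python
import torch
import torch.nn as nn
# Assumed import path; check the xformers docs for the actual module.
from xformers.checkpoint import checkpoint

def policy_fn(func, *args, **kwargs):
    # Hypothetical policy: return True to keep (store) an activation,
    # False to recompute it in the backward pass. Here we keep matmuls.
    return func in (torch.ops.aten.mm.default, torch.ops.aten.bmm.default)

block = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).cuda()
x = torch.randn(8, 512, device="cuda", requires_grad=True)

# Only activations selected by policy_fn are kept; the rest are recomputed.
y = checkpoint(block, x, policy_fn=policy_fn)
y.sum().backward()
```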

Removed

  • fMHA: Removed smallK backend support for CPU. memory_efficient_attention now works only on CUDA/GPU tensors
  • DEPRECATION: Many classes in xformers.factory, xformers.triton and xformers.components have been or will be deprecated soon (see tracking issue #848)
