github huggingface/transformers untagged-96ceb499938398068174
v4.16.0


What's Changed

New models

Nyströmformer

The Nyströmformer model was proposed in Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh.

The Nyströmformer model overcomes the quadratic complexity of self-attention on the input sequence length by adapting the Nyström method to approximate standard self-attention, enabling longer sequences with thousands of tokens as input.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=nystromformer
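
As a rough illustration of the approximation described above, the following NumPy sketch replaces the full attention matrix with three smaller softmax matrices built from a few landmark queries and keys. It is not the library's implementation. The function names are hypothetical, landmarks are taken as plain segment means, the sequence length is assumed to be divisible by the number of landmarks, and `np.linalg.pinv` stands in for whatever pseudo-inverse scheme a real model would use.

```python
import numpy as np


def softmax(x):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def exact_attention(q, k, v):
    # Standard self-attention: builds the full (n, n) attention matrix.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def nystrom_attention(q, k, v, num_landmarks):
    # Hypothetical sketch: assumes n is divisible by num_landmarks.
    n, d = q.shape
    if n % num_landmarks != 0:
        raise ValueError("sequence length must be divisible by num_landmarks")
    segment = n // num_landmarks
    # Landmarks are the means of consecutive segments of queries and keys.
    q_l = q.reshape(num_landmarks, segment, d).mean(axis=1)
    k_l = k.reshape(num_landmarks, segment, d).mean(axis=1)
    scale = np.sqrt(d)
    f = softmax(q @ k_l.T / scale)    # shape (n, m)
    a = softmax(q_l @ k_l.T / scale)  # shape (m, m)
    b = softmax(q_l @ k.T / scale)    # shape (m, n)
    # The (n, n) matrix is never formed; cost grows with n * m.
    return f @ np.linalg.pinv(a) @ (b @ v)
```

When `num_landmarks` is much smaller than the sequence length, the memory and compute of this sketch grow roughly linearly with the sequence length rather than quadratically.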

REALM

The REALM model was proposed in REALM: Retrieval-Augmented Language Model Pre-Training by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.

It is a retrieval-augmented language model that first retrieves documents from a textual knowledge corpus and then uses the retrieved documents to answer questions.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=realm
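
The retrieve-then-read flow described above can be sketched with a toy retriever that ranks documents by inner product against a query vector. This is only an illustration under stated assumptions: the `retrieve` helper and the hand-written vectors are hypothetical, and a real model would produce the embeddings with learned encoders and pass the retrieved text to a reader.

```python
def retrieve(query_vec, doc_vecs, top_k=2):
    # Score each document by its inner product with the query vector.
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    # Rank document indices from highest to lowest score.
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical 2-dimensional embeddings for three documents.
docs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
best = retrieve([1.0, 0.0], docs, top_k=2)  # indices of the two best matches
```

The returned indices would then select the passages handed to the question-answering step.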

ViTMAE

The ViTMAE model was proposed in Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.

The paper shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after fine-tuning that outperform supervised pre-training.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=vit_mae
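
To make the masking step concrete, here is a minimal sketch of how a random subset of patches might be hidden before reconstruction. The helper name, the 0.75 mask ratio, and the 196-patch example are assumptions chosen for illustration, not values taken from these release notes or from the library's code.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    # Hypothetical helper: pick which patch indices are hidden from the encoder.
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    # Visible patches keep their original order.
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)


# Example: a 14 x 14 grid of patches (196 in total).
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
```

Only the visible patches would be encoded; the reconstruction target is the pixel content of the masked ones.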

ViLT

The ViLT model was proposed in ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Wonjae Kim, Bokyung Son, Ildoo Kim.

ViLT incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for Vision-and-Language Pre-training (VLP).

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=vilt

SwinTransformer

The Swin Transformer was proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

The Swin Transformer serves as a general-purpose backbone for computer vision. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=swin
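
The linear-complexity claim above can be checked with simple counting. The sketch below compares the number of query-key pairs scored by global self-attention with the number scored when attention is restricted to non-overlapping windows. The function names and the example grid and window sizes are illustrative assumptions.

```python
def global_attention_pairs(height, width):
    # Every token attends to every token: (h * w) ** 2 pairs.
    tokens = height * width
    return tokens * tokens


def windowed_attention_pairs(height, width, window):
    # Tokens attend only within their own window of window x window tokens.
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


small = windowed_attention_pairs(56, 56, 7)
large = windowed_attention_pairs(112, 112, 7)
ratio = large // small  # 4: the pair count scales with the number of tokens
```

Doubling both sides of the token grid quadruples the token count. The windowed pair count grows by the same factor of 4, while the global count grows by a factor of 16.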

Add model like

To help contributors add new models to Transformers more easily, there is a new command that clones an existing model and sets up the various hooks in the library, so that you only have to write the changes needed in the modeling file. Just run transformers-cli add-new-model-like and fill in the questionnaire!

Training scripts

New training scripts were introduced for speech seq2seq models, along with an image pre-training script that leverages the ViTMAE models.
An image captioning example in Flax was also added to the library.

Pipelines

Pipeline handling and arguments are now more consistent across pipelines.
New functionality was also added specifically for automatic speech recognition (ASR):

  • Large audio chunking for the existing ASR pipeline by @anton-l in #14896
  • Enabling TF on image-classification pipeline. by @Narsil in #15030
  • Pipeline ASR with LM. by @Narsil in #15071
  • ChunkPipeline: batch_size enabled on zero-cls and qa pipelines. by @Narsil in #14225
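
To illustrate what chunking a long audio input involves, the sketch below splits a sample range into overlapping chunks so that each chunk carries some context on both sides. It is a simplified illustration, not the pipeline's actual code; the function name and its parameters are hypothetical.

```python
def chunk_bounds(num_samples, chunk_len, stride):
    # Return (start, end) sample indices of overlapping chunks.
    # `stride` is the overlap kept on each side of a chunk as context.
    if chunk_len <= 2 * stride:
        raise ValueError("chunk_len must be larger than twice the stride")
    step = chunk_len - 2 * stride
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds


bounds = chunk_bounds(10, 4, 1)
# [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Each chunk can be transcribed independently, and the overlapping regions give a way to stitch neighboring outputs together without cutting words at chunk edges.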

PyTorch improvements

The ELECTRA model can now be used as a decoder, enabling an ELECTRA encoder-decoder model.

  • Add ElectraForCausalLM -> Enable Electra encoder-decoder model by @stancld in #14729

TensorFlow improvements

The vision encoder decoder model can now be used in TensorFlow.

CLIP gets ported to TensorFlow.

Flax improvements

RoFormer gets ported to Flax.

Deprecations

Documentation

The documentation has been fully migrated to Markdown. If you are contributing, make sure to read the updated guide on how to write good docstrings.

Bugfixes and improvements

Impressive community contributors

The community contributors below have significantly contributed to the v4.16.0 release. Thank you!

  • @novice03, for contributing Nyströmformer and Swin Transformer
  • @qqaatw, for contributing REALM
  • @stancld, for adding support for ELECTRA as a decoder, and porting RoFormer to Flax
  • @ydshieh, for a myriad of documentation fixes, the port of CLIP to TensorFlow, the addition of the TensorFlow vision encoder-decoder model, and the contribution of an image captioning example in Flax.

New Contributors

Full Changelog: v4.15.0...v4.16.0
