pypi transformers 4.25.1
PyTorch 2.0 support, Audio Spectogram Transformer, Jukebox, Switch Transformers and more

latest releases: 4.44.2, 4.44.1, 4.44.0...
21 months ago

PyTorch 2.0 stack support

We are very excited by the newly announced PyTorch 2.0 stack. You can enable torch.compile on any of our models, and get support with the Trainer (and in all our PyTorch examples) by using the torchdynamo training argument. For instance, just add --torchdynamo inductor when launching those examples from the command line.

This API is still experimental and may be subject to changes as the PyTorch 2.0 stack matures.

Note that to get the best performance, we recommend:

  • using an Ampere GPU (or more recent)
  • sticking to fixed shaped for now (so use --pad_to_max_length in our examples)
  • Repurpose torchdynamo training args towards torch._dynamo by @sgugger in #20498

Audio Spectrogram Transformer

The Audio Spectrogram Transformer model was proposed in AST: Audio Spectrogram Transformer by Yuan Gong, Yu-An Chung, James Glass. The Audio Spectrogram Transformer applies a Vision Transformer to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results for audio classification.

Jukebox

The Jukebox model was proposed in Jukebox: A generative model for music by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. It introduces a generative music model which can produce minute long samples that can be conditionned on an artist, genres and lyrics.

Switch Transformers

The SwitchTransformers model was proposed in Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity by William Fedus, Barret Zoph, Noam Shazeer.

It is the first MoE model supported in transformers, with the largest checkpoint currently available currently containing 1T parameters.

RocBert

The RoCBert model was proposed in RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou. It’s a pretrained Chinese language model that is robust under various forms of adversarial attacks.

CLIPSeg

The CLIPSeg model was proposed in Image Segmentation Using Text and Image Prompts by Timo Lüddecke and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen CLIP model for zero- and one-shot image segmentation.

NAT and DiNAT

NAT

NAT was proposed in Neighborhood Attention Transformer by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.

It is a hierarchical vision transformer based on Neighborhood Attention, a sliding-window self attention pattern.

DiNAT

DiNAT was proposed in Dilated Neighborhood Attention Transformer by Ali Hassani and Humphrey Shi.

It extends NAT by adding a Dilated Neighborhood Attention pattern to capture global context, and shows significant performance improvements over it.

  • Add Neighborhood Attention Transformer (NAT) and Dilated NAT (DiNAT) models by @alihassanijr in #20219

MobileNetV2

The MobileNet model was proposed in MobileNetV2: Inverted Residuals and Linear Bottlenecks by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen.

MobileNetV1

The MobileNet model was proposed in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.

Image processors

Image processors replace feature extractors as the processing class for computer vision models.

Important changes:

  • size parameter is now a dictionary of {"height": h, "width": w}, {"shortest_edge": s}, {"shortest_egde": s, "longest_edge": l} instead of int or tuple.
  • Addition of data_format flag. You can now specify if you want your images to be returned in "channels_first" - NCHW - or "channels_last" - NHWC - format.
  • Processing flags e.g. do_resize can be passed directly to the preprocess method instead of modifying the class attribute: image_processor([image_1, image_2], do_resize=False, return_tensors="pt", data_format="channels_last")
  • Leaving return_tensors unset will return a list of numpy arrays.

The classes are backwards compatible and can be created using existing feature extractor configurations - with the size parameter converted.

Backbone for computer vision models

We're adding support for a general AutoBackbone class, which turns any vision model (like ConvNeXt, Swin Transformer) into a backbone to be used with frameworks like DETR and Mask R-CNN. The design is in early stages and we welcome feedback.

Support for safetensors offloading

If the model you are using has a safetensors checkpoint and you have the library installed, offload to disk will take advantage of this to be more memory efficient and roughly 33% faster.

Contrastive search in the generate method

  • Generate: TF contrastive search with XLA support by @gante in #20050
  • Generate: contrastive search with full optional outputs by @gante in #19963

Breaking changes

  • 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string by @beneyal in #15775

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

Don't miss a new transformers release

NewReleases is sending notifications on new releases.