v4.11.0: GPT-J, Speech2Text2, FNet, Pipeline GPU utilization, dynamic model code loading

GPT-J

Three new models are released as part of the GPT-J implementation: GPTJModel, GPTJForCausalLM, GPTJForSequenceClassification, in PyTorch.

The GPT-J model was released in the kingoflolz/mesh-transformer-jax repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like causal language model trained on the Pile dataset.

It was contributed by @StellaAthena, @kurumuz, @EricHallahan, and @leogao2.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=gptj
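
For a quick start, here is a minimal generation sketch, assuming the EleutherAI/gpt-j-6B checkpoint (any checkpoint from the Hub filter above should work) and an arbitrary prompt:

from transformers import AutoTokenizer, GPTJForCausalLM

# Assumed checkpoint name; substitute any GPT-J checkpoint from the Hub filter above
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("The Pile is a large, diverse dataset used to", return_tensors="pt")
output_ids = model.generate(**inputs, do_sample=True, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))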

SpeechEncoderDecoder & Speech2Text2

One new model is released as part of the Speech2Text2 implementation: Speech2Text2ForCausalLM, in PyTorch.

The Speech2Text2 model is used together with Wav2Vec2 for Speech Translation models proposed in Large-Scale Self- and Semi-Supervised Learning for Speech Translation by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.

Speech2Text2 is a decoder-only transformer model that can be used with any speech encoder-only, such as Wav2Vec2 or HuBERT for Speech-to-Text tasks. Please refer to the SpeechEncoderDecoder class on how to combine Speech2Text2 with any speech encoder-only model.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=speech2text2
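
As a minimal sketch of that combination, assuming the facebook/s2t-wav2vec2-large-en-de checkpoint (not named in these notes) and a silent dummy waveform in place of real audio:

import numpy as np
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

# One second of silent 16 kHz audio as a stand-in; replace with a real waveform (e.g. loaded via datasets)
speech = np.zeros(16_000, dtype=np.float32)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
print(processor.batch_decode(generated_ids, skip_special_tokens=True))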

FNet

Eight new models are released as part of the FNet implementation: FNetModel, FNetForPreTraining, FNetForMaskedLM, FNetForNextSentencePrediction, FNetForSequenceClassification, FNetForMultipleChoice, FNetForTokenClassification, FNetForQuestionAnswering, in PyTorch.

The FNet model was proposed in FNet: Mixing Tokens with Fourier Transforms by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT model with a Fourier transform, keeping only the real parts of the transform. It is significantly faster than BERT because it has fewer parameters and is more memory-efficient. The model reaches about 92-97% of the accuracy of its BERT counterparts on the GLUE benchmark, and trains much faster.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=fnet
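
As a minimal masked-language-modeling sketch, assuming the google/fnet-base checkpoint (any checkpoint from the Hub filter above should work):

import torch
from transformers import AutoTokenizer, FNetForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google/fnet-base")
model = FNetForMaskedLM.from_pretrained("google/fnet-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token at the [MASK] position
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_position].argmax(dim=-1)
print(tokenizer.decode(predicted_id))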

TensorFlow improvements

Several bug fixes and UX improvements for TensorFlow:

  • Users should notice far fewer unnecessary warnings and less 'console spam' in general while using Transformers with TensorFlow.
  • TensorFlow models should be less picky about the specific integer dtypes (int32/int64) that are passed as input.

Changes to compile() and train_step()

  • You can now compile our TensorFlow models without passing a loss argument! If you do so, the model will compute loss internally during the forward pass and use that value in fit(). This makes it much more convenient to get the right loss, particularly since many models have task-specific losses that are easy to overlook and annoying to reimplement. Remember to pass your labels as the "labels" key of your input dict when doing this, so they're accessible to the model during the forward pass; see the sketch below. There is no change in behavior if you do pass a loss argument, so all old code remains unaffected by this change.
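
Here is a minimal sketch of the new workflow; the bert-base-cased checkpoint and the toy two-example batch are illustrative choices, not taken from these notes:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Labels go into the input dict under the "labels" key so the model can see them
# during the forward pass and compute its own loss.
batch = dict(tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="np"))
batch["labels"] = [1, 0]
dataset = tf.data.Dataset.from_tensor_slices(batch).batch(2)

# No `loss` argument: the internally computed loss is used by fit()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
model.fit(dataset, epochs=1)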

Pipelines

Pipeline refactor

The pipelines underwent a large refactor that should make contributing pipelines much simpler, and much less error-prone. As part of this refactor, PyTorch-based pipelines are now optimized for GPU performance based on PyTorch's Datasets and DataLoaders.

See below for an example leveraging the superb dataset.

import datasets
import tqdm

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")

# KeyDataset (only `pt`) will simply return the item in the dict returned by the dataset item,
# as we're not interested in the `target` part of the dataset.
for out in tqdm.tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....

Audio classification pipeline

In addition, a new pipeline is available for audio classification; a short usage sketch follows the associated PRs below.

  • Add the AudioClassificationPipeline #13342 (@anton-l)
  • Enabling automatic loading of tokenizer with pipeline for audio-classification. #13376 (@Narsil)
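
Here is a minimal sketch of the new pipeline; the superb/wav2vec2-base-superb-ks checkpoint and the local file name are assumptions for illustration:

from transformers import pipeline

classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")

# Classify a local audio file (hypothetical path); returns a list of {"label": ..., "score": ...} dicts
predictions = classifier("speech_command.wav")
print(predictions)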

Setters for common properties

Version v4.11.0 introduces setters for common configuration properties. Different configurations expose different property names, inherited from their original implementations.

One such example is BertConfig, which has a hidden_size attribute, while GPT2Config has an n_embd attribute; the two are essentially the same.

The newly introduced setters allow setting such properties through a standardized naming scheme, even on configuration objects that do not have them by default.

See the following code sample for an example:

from transformers import GPT2Config

config = GPT2Config()

config.hidden_size = 4  # Failed previously
config = GPT2Config(hidden_size=4)  # Failed previously

config.n_embd  # returns 4
config.hidden_size  # returns 4

  • Update model configs - Allow setters for common properties #13026 (@nreimers)

Dynamic model code loading

This release adds an experimental feature that supports loading model code files hosted on the Hub. A walkthrough is available in the PR description.

⚠️ This means that code files will be fetched from the Hub and executed locally. An additional argument, trust_remote_code, is required when instantiating the model from the Hub. We strongly encourage you to also specify a revision if using code from another user's or organization's repository.
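
As a minimal sketch of what instantiation looks like (the repository id and revision below are placeholders, not real checkpoints):

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "some-user/model-with-custom-code",  # hypothetical Hub repository containing custom modeling code
    trust_remote_code=True,              # opt in to executing code fetched from the Hub
    revision="main",                     # pin a specific revision (ideally a commit sha) of that code
)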

Trainer

The Trainer has received several new features, the main one being that models are now uploaded to the Hub each time you save them locally (other push strategies can be selected); see the sketch below. The push is asynchronous, so training continues normally without interruption.
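
A minimal sketch of opting in through TrainingArguments; the output directory is a placeholder and the hub_strategy value is one plausible choice (see the associated PR for the exact options):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="my-finetuned-model",  # placeholder; also used to name the Hub repository
    save_strategy="epoch",            # save (and therefore push) once per epoch
    push_to_hub=True,                 # upload to the Hub every time a checkpoint is saved locally
    hub_strategy="every_save",        # controls what gets pushed and when
)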

Also:

  • The SigOpt optimization framework is now integrated in the Trainer API as an opt-in component.
  • The Trainer API now supports fine-tuning on distributed CPUs.

Associated PRs:

  • Push to hub when saving checkpoints #13503 (@sgugger)
  • Add SigOpt HPO to transformers trainer api #13572 (@kding1)
  • Add cpu distributed fine-tuning support for transformers Trainer API #13574 (@kding1)

Model size CPU memory usage reduction

Loading a model with PyTorch's torch.load temporarily requires twice the model size in CPU memory. Version v4.11.0 ships an experimental feature that loads a model while using only as much CPU memory as the model itself.

It can be enabled by passing the low_cpu_mem_usage=True argument to from_pretrained for PyTorch models, as in the sketch after the PR link below.

  • 1x model size CPU memory usage for from_pretrained #13466 (@stas00)
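
A minimal sketch, using gpt2 purely as an illustrative checkpoint:

from transformers import AutoModelForCausalLM

# Experimental: avoid materializing a second full copy of the weights in CPU memory while loading
model = AutoModelForCausalLM.from_pretrained("gpt2", low_cpu_mem_usage=True)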

GPT-Neo: simplified local attention

The GPT-Neo local attention was greatly simplified with no loss of performance.

Breaking changes

We strive for no breaking changes between releases; however, some bugs go undiscovered for long periods of time, and users may come to rely on them. We document here such changes that may affect users when updating to a recent version.

Order of overflowing tokens

The overflowing tokens returned by the slow tokenizers were previously returned in the wrong order. This is fixed in this release.

Non-prefixed tokens for token classification pipeline

This release updates the behavior of aggregation_strategy to more closely mimic the deprecated grouped_entities pipeline argument.

  • Fixing backward compatiblity for non prefixed tokens (B-, I-). #13493 (@Narsil)

Inputs normalization for Wav2Vec2 feature extractor

The changes in v4.10 (#12804) introduced a bug in inputs normalization for non-padded tensors that affected Wav2Vec2 fine-tuning.
This is fixed in this release.

General bug fixes and improvements
