v4.11.0: GPT-J, Speech2Text2, FNet, Pipeline GPU utilization, dynamic model code loading

GPT-J

Three new models are released as part of the GPT-J implementation: GPTJModel, GPTJForCausalLM, GPTJForSequenceClassification, in PyTorch.

The GPT-J model was released in the kingoflolz/mesh-transformer-jax repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like causal language model trained on the Pile dataset.

It was contributed by @StellaAthena, @kurumuz, @EricHallahan, and @leogao2.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=gptj
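
For a quick start, here is a minimal generation sketch, assuming the EleutherAI/gpt-j-6B checkpoint (any checkpoint from the Hub filter above should work) and an arbitrary prompt:

from transformers import AutoTokenizer, GPTJForCausalLM

# Assumed checkpoint name; substitute any GPT-J checkpoint from the Hub filter above
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("The Pile is a large, diverse dataset used to", return_tensors="pt")
output_ids = model.generate(**inputs, do_sample=True, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))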

SpeechEncoderDecoder & Speech2Text2

One new model is released as part of the Speech2Text2 implementation: Speech2Text2ForCausalLM, in PyTorch.

The Speech2Text2 model is used together with Wav2Vec2 for Speech Translation models proposed in Large-Scale Self- and Semi-Supervised Learning for Speech Translation by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.

Speech2Text2 is a decoder-only transformer model that can be used with any speech encoder-only, such as Wav2Vec2 or HuBERT for Speech-to-Text tasks. Please refer to the SpeechEncoderDecoder class on how to combine Speech2Text2 with any speech encoder-only model.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=speech2text2
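
As a minimal sketch of that combination, assuming the facebook/s2t-wav2vec2-large-en-de checkpoint (not named in these notes) and a silent dummy waveform in place of real audio:

import numpy as np
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

# One second of silent 16 kHz audio as a stand-in; replace with a real waveform (e.g. loaded via datasets)
speech = np.zeros(16_000, dtype=np.float32)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
print(processor.batch_decode(generated_ids, skip_special_tokens=True))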

FNet

Eight new models are released as part of the FNet implementation: FNetModel, FNetForPreTraining, FNetForMaskedLM, FNetForNextSentencePrediction, FNetForSequenceClassification, FNetForMultipleChoice, FNetForTokenClassification, FNetForQuestionAnswering, in PyTorch.

The FNet model was proposed in FNet: Mixing Tokens with Fourier Transforms by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT model with a Fourier transform, keeping only the real parts of the transform. It is significantly faster than BERT because it has fewer parameters and is more memory-efficient. The model reaches about 92-97% of the accuracy of its BERT counterparts on the GLUE benchmark, and trains much faster.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=fnet
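
As a minimal masked-language-modeling sketch, assuming the google/fnet-base checkpoint (any checkpoint from the Hub filter above should work):

import torch
from transformers import AutoTokenizer, FNetForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google/fnet-base")
model = FNetForMaskedLM.from_pretrained("google/fnet-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token at the [MASK] position
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_position].argmax(dim=-1)
print(tokenizer.decode(predicted_id))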

TensorFlow improvements

Several bug fixes and UX improvements for TensorFlow:

  • Users should notice far fewer unnecessary warnings and less 'console spam' in general while using Transformers with TensorFlow.
  • TensorFlow models should be less picky about the specific integer dtypes (int32/int64) that are passed as input.

Changes to compile() and train_step()

  • You can now compile our TensorFlow models without passing a loss argument! If you do so, the model will compute loss internally during the forward pass and use that value in fit(). This makes it much more convenient to get the right loss, particularly since many models have task-specific losses that are easy to overlook and annoying to reimplement. Remember to pass your labels as the "labels" key of your input dict when doing this, so they're accessible to the model during the forward pass; see the sketch below. There is no change in behavior if you do pass a loss argument, so all old code remains unaffected by this change.
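
Here is a minimal sketch of the new workflow; the bert-base-cased checkpoint and the toy two-example batch are illustrative choices, not taken from these notes:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Labels go into the input dict under the "labels" key so the model can see them
# during the forward pass and compute its own loss.
batch = dict(tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="np"))
batch["labels"] = [1, 0]
dataset = tf.data.Dataset.from_tensor_slices(batch).batch(2)

# No `loss` argument: the internally computed loss is used by fit()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
model.fit(dataset, epochs=1)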

Pipelines

Pipeline refactor

The pipelines underwent a large refactor that should make contributing pipelines much simpler, and much less error-prone. As part of this refactor, PyTorch-based pipelines are now optimized for GPU performance based on PyTorch's Datasets and DataLoaders.

See below for an example leveraging the superb dataset.

import datasets
import tqdm

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")

# KeyDataset (only `pt`) will simply return the item in the dict returned by the dataset item,
# as we're not interested in the `target` part of the dataset.
for out in tqdm.tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....

Audio classification pipeline

In addition, a new pipeline is available for audio classification; a short usage sketch follows the associated PRs below.

  • Add the AudioClassificationPipeline #13342 (@anton-l)
  • Enabling automatic loading of tokenizer with pipeline for audio-classification. #13376 (@Narsil)
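
Here is a minimal sketch of the new pipeline; the superb/wav2vec2-base-superb-ks checkpoint and the local file name are assumptions for illustration:

from transformers import pipeline

classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")

# Classify a local audio file (hypothetical path); returns a list of {"label": ..., "score": ...} dicts
predictions = classifier("speech_command.wav")
print(predictions)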

Setters for common properties

Version v4.11.0 introduces setters for common configuration properties. Different configurations expose different property names, inherited from their original implementations.

One such example is BertConfig, which has a hidden_size attribute, while GPT2Config has an n_embd attribute; the two are essentially the same.

The newly introduced setters allow setting such properties through a standardized naming scheme, even on configuration objects that do not have them by default.

See the following code sample for an example:

from transformers import GPT2Config

config = GPT2Config()

config.hidden_size = 4  # Failed previously
config = GPT2Config(hidden_size=4)  # Failed previously

config.n_embd  # returns 4
config.hidden_size  # returns 4

  • Update model configs - Allow setters for common properties #13026 (@nreimers)

Dynamic model code loading

This release adds an experimental feature that supports loading model code files hosted on the Hub. A walkthrough is available in the PR description.

⚠️ This means that code files will be fetched from the Hub and executed locally. An additional argument, trust_remote_code, is required when instantiating the model from the Hub. We strongly encourage you to also specify a revision if using code from another user's or organization's repository.
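
As a minimal sketch of what instantiation looks like (the repository id and revision below are placeholders, not real checkpoints):

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "some-user/model-with-custom-code",  # hypothetical Hub repository containing custom modeling code
    trust_remote_code=True,              # opt in to executing code fetched from the Hub
    revision="main",                     # pin a specific revision (ideally a commit sha) of that code
)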

Trainer

The Trainer has received several new features, the main one being that models are now uploaded to the Hub each time you save them locally (other push strategies can be selected); see the sketch below. The push is asynchronous, so training continues normally without interruption.
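
A minimal sketch of opting in through TrainingArguments; the output directory is a placeholder and the hub_strategy value is one plausible choice (see the associated PR for the exact options):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="my-finetuned-model",  # placeholder; also used to name the Hub repository
    save_strategy="epoch",            # save (and therefore push) once per epoch
    push_to_hub=True,                 # upload to the Hub every time a checkpoint is saved locally
    hub_strategy="every_save",        # controls what gets pushed and when
)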

Also:

  • The SigOpt optimization framework is now integrated in the Trainer API as an opt-in component.
  • The Trainer API now supports fine-tuning on distributed CPUs.

Associated PRs:

  • Push to hub when saving checkpoints #13503 (@sgugger)
  • Add SigOpt HPO to transformers trainer api #13572 (@kding1)
  • Add cpu distributed fine-tuning support for transformers Trainer API #13574 (@kding1)

Model size CPU memory usage reduction

Loading a model with PyTorch's torch.load temporarily requires twice the model size in CPU memory. Version v4.11.0 ships an experimental feature that loads a model while using only as much CPU memory as the model itself.

It can be enabled by passing the low_cpu_mem_usage=True argument to from_pretrained for PyTorch models, as in the sketch after the PR link below.

  • 1x model size CPU memory usage for from_pretrained #13466 (@stas00)
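
A minimal sketch, using gpt2 purely as an illustrative checkpoint:

from transformers import AutoModelForCausalLM

# Experimental: avoid materializing a second full copy of the weights in CPU memory while loading
model = AutoModelForCausalLM.from_pretrained("gpt2", low_cpu_mem_usage=True)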

GPT-Neo: simplified local attention

The GPT-Neo local attention was greatly simplified with no loss of performance.

Breaking changes

We strive for no breaking changes between releases; however, some bugs go undiscovered for long periods of time, and users may come to rely on them. We document here such changes that may affect users when updating to a recent version.

Order of overflowing tokens

The overflowing tokens returned by the slow tokenizers were previously returned in the wrong order. This is fixed in this release.

Non-prefixed tokens for token classification pipeline

This release updates the behavior of aggregation_strategy to more closely mimic the deprecated grouped_entities pipeline argument.

  • Fixing backward compatiblity for non prefixed tokens (B-, I-). #13493 (@Narsil)

Inputs normalization for Wav2Vec2 feature extractor

The changes in v4.10 (#12804) introduced a bug in inputs normalization for non-padded tensors that affected Wav2Vec2 fine-tuning.
This is fixed in this release.

General bug fixes and improvements
