v4.2.0: LED from AllenAI, encoder-decoder templates, fast imports

LED from AllenAI (@patrickvonplaten)

Four new models are released as part of the LED implementation: LEDModel, LEDForConditionalGeneration, LEDForSequenceClassification, LEDForQuestionAnswering, in PyTorch. The first two models have a TensorFlow version.

LED is the encoder-decoder variant of the Longformer model by allenai.

The LED model was proposed in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=led

Available notebooks:

Contributions:

LED #9278 (@patrickvonplaten)
[LED Test] fix common inputs pt for flaky pt-tf led test #9459 (@SBrandeis, @patrickvonplaten)
[TF Led] Fix flaky TF Led test #9513 (@patrickvonplaten)

Generation Scores & other outputs (@patrickvonplaten)

The PyTorch generation function now allows to return:

scores - the logits generated at each step
attentions - all attention weights at each generation step
hidden_states - all hidden states at each generation step

by simply adding return_dict_in_generate to the config or as an input to .generate()

Tweet:

https://twitter.com/SimonBrandeis/status/1346858472000937984

Notebooks for a better explanation:

PR:

Add flags to return scores, hidden states and / or attention weights in GenerationMixin #9150 (@SBrandeis)

TensorFlow improvements

TensorFlow BERT-like model improvements (@jplu)

The TensorFlow version of the BERT-like models have been updated and are now twice as fast as the previous versions.

Improve BERT-like models performance with better self attention #9124 (@jplu)

Better integration in TensorFlow Serving (@jplu)

This version introduces a new API for TensorFlow saved models, which can now be exported with model.save_pretrained("path", saved_model=True) and easily loaded into a TensorFlow Serving environment.

New serving #9419 (@jplu)

DeepSpeed integration (@stas00)

Initial support for DeepSpeed to accelerate distributed training on several GPUs. This is an experimental feature that hasn't been fully tested yet, but early results are very encouraging (see this comment). Stay tuned for more details in the coming weeks!

[trainer] deepspeed integration #9211 (@stas00)

Model templates (@patrickvonplaten)

The encoder-decoder version of the templates is now part of Transformers! Adding an encoder-decoder model is made very easy with this addition. More information can be found in the README.

Model Templates for Seq2Seq #9251 (@patrickvonplaten)
[Seq2Seq Templates] Add embedding scale to templates #9342 (@patrickvonplaten)
[Seq2Seq Templates] Add forgotten imports to templates #9346 (@patrickvonplaten)

Faster import (@sgugger)

The initialization process has been changed to only import what is required. Therefore, when using only PyTorch models, TensorFlow will not be imported and vice-versa. In the best situations the import of a transformers model now takes only a few hundreds of milliseconds (~200ms) compared to more than a few seconds (~3s) in previous versions.

Fast transformers import part 1 #9441 (@sgugger)
Transformers fast import part 2 #9446 (@sgugger)
Fast imports part 3 #9474 (@sgugger)

Documentation highlights (@Qbiwan, @NielsRogge)

Some models now have improved documentation. The LayoutLM model has seen a general overhaul in its documentation thanks to @NielsRogge.

The tokenizer-only models Bertweet, Herbert and Phobert now have their own documentation pages thanks to @Qbiwan.

Improve LayoutLM #9476 (@NielsRogge)
Improve documentation coverage for Bertweet #9379 (@Qbiwan)
Improve documentation coverage for Herbert #9428 (@Qbiwan)
Improve documentation coverage for Phobert #9427 (@Qbiwan)

Breaking changes

There are no breaking changes between the previous version and this one.
This will be the first version to require TensorFlow >= 2.3.

General improvements and bugfixes

add tests for the new sharded ddp fairscale integration #9177 (@stas00)
Added TF CTRL Sequence Classification #9151 (@spatil6)
[trainer] apex fixes and tests #9180 (@stas00)
Fix link to old NER fine-tuning script #9182 (@mrm8488)
fixed not JSON serializable error in run_qa.py with fp16 #9186 (@WissamAntoun)
[setup] correct transformers version format #9176 (@stas00)
Fix link to old SQUAD fine-tuning script #9181 (@mrm8488)
Add new run_swag example #9175 (@sgugger)
Add timing inside Trainer #9196 (@sgugger)
GPT-model attention heads pruning example #9189 (@altsoph)
[t5 doc] typos #9199 (@stas00)
[run_glue] add speed metrics #9198 (@stas00)
Added TF TransfoXL Sequence Classification #9169 (@spatil6)
[finetune trainer] better logging and help #9203 (@stas00)
[RAG] Add Ray implementation for distributed retrieval #9197 (@amogkam)
[T5] Fix warning for changed EncDec Attention Bias weight #9231 (@patrickvonplaten)
Improve BERT-like models performance with better self attention #9124 (@jplu)
Fix TF template #9234 (@jplu)
Fix beam search generation for GPT2 and T5 on model parallelism #9219 (@TobiasNorlund)
add base model classes to bart subclassed models #9230 (@patil-suraj)
[MPNet] Add slow to fast tokenizer converter #9233 (@patrickvonplaten)
Adding performer fine-tuning research exampke #9239 (@TevenLeScao)
Update the README of the text classification example #9237 (@sgugger)
[EncoderDecoder] Make tests more aggressive #9256 (@patrickvonplaten)
Fix script that check objects are documented #9259 (@sgugger)
Seq2seq trainer #9241 (@sgugger)
Fix link to old language modeling script #9254 (@mrm8488)
Fix link to bertabs/README.md #9255 (@mrm8488)
Fix TF BART for saved model creation #9252 (@jplu)
Add speed metrics to all example scripts + template #9260 (@sgugger)
Revert renaming in finetune_trainer #9262 (@sgugger)
Fix gpt2 document #9272 (@xu-song)
Fix param error #9273 (@xu-song)
[Seq2Seq Templates] Fix check_repo.py templates file #9277 (@patrickvonplaten)
Minor documentation revisions from copyediting #9266 (@connorbrinton)
Adapt to new name of label_smoothing_factor training arg #9282 (@sgugger)
Add caching mechanism to BERT, RoBERTa #9183 (@patil-suraj)
[Templates] Adapt Bert #9284 (@patrickvonplaten)
allow integer device for BatchEncoding #9271 (@jethrokuan)
Fix typo in file_utils.py #9289 (@jungwhank)
[bert_generation] enable cache by default #9296 (@patil-suraj)
Proposed Fix : [RagSequenceForGeneration] generate "without" input_ids #9220 (@ratthachat)
fix typo in modeling_encoder_decoder.py #9297 (@daniele-sartiano)
Update tokenization_utils_base.py #9293 (@BramVanroy)
[Bart doc] Fix outdated statement #9299 (@patrickvonplaten)
add translation example #9303 (@vasudevgupta7)
[GPT2] Correct gradient checkpointing #9308 (@patrickvonplaten)
[Seq2SeqTrainer] Fix Typo #9320 (@patrickvonplaten)
[Seq2Seq Templates] Correct some TF-serving errors and add gradient checkpointing to PT by default. #9334 (@patrickvonplaten)
Fix TF T5 #9301 (@jplu)
Fix TF TransfoXL #9302 (@jplu)
[prophetnet] wrong import #9349 (@stas00)
[apex.normalizations.FusedLayerNorm] torch.cuda.is_available() is redundant as apex handles that internally #9350 (@stas00)
Make sure to use return dict for the encoder call inside RagTokenForGeneration #9363 (@dblakely)
[Docs] past_key_values return a tuple of tuple as a default #9381 (@patrickvonplaten)
[docs] Fix TF base model examples: outputs.last_hidden_states -> state #9382 (@ck37)
Fix typos in README and bugs in RAG example code for end-to-end evaluation and finetuning #9355 (@yoshitomo-matsubara)
Simplify marian distillation script #9394 (@sshleifer)
Add utility function for retrieving locally cached models #8836 (@cdpierse)
Fix TF CTRL #9291 (@jplu)
Put back LXMert example #9401 (@sgugger)
Bump notebook from 6.1.4 to 6.1.5 in /examples/research_projects/lxmert #9402 (@dependabot[bot])
Fix TF Flaubert #9292 (@jplu)
[trainer] parametrize default output_dir #9352 (@stas00)
Fix utils on Windows #9368 (@jplu)
Fix TF DPR #9283 (@jplu)
[Docs] Tokenizer Squad 2.0 example #9378 (@patrickvonplaten)
replace apex.normalization.FusedLayerNorm with torch.nn.LayerNorm #9386 (@stas00)
[test_model_parallelization] multiple fixes #9354 (@stas00)
Fix TF Longformer #9348 (@jplu)
[logging] autoflush #9385 (@stas00)
TF >= 2.3 cleaning #9369 (@jplu)
[trainer] --model_parallel hasn't been implemented for most models #9347 (@stas00)
Fix TF Funnel #9300 (@jplu)
Fix documentation links always pointing to master. #9217 (@sugeeth14)
[examples/text-classification] Fix a bug for using own regression dataset #9411 (@forest1988)
[trainer] group fp16 args together #9409 (@stas00)
[model parallel] add experimental warning #9412 (@stas00)
improve readme text to private models/versioning/api #9424 (@clmnt)
[PyTorch Bart] Split Bart into different models #9343 (@patrickvonplaten)
[docs] outline sharded ddp doc #9208 (@stas00)
[Refactor] Splitting pipelines.py into its own module. #9279 (@Narsil)
Fix link to Evaluate TAPAS Notebook #9414 (@mrm8488)
Fix link to Notebook to fine-tune TAPAS #9413 (@mrm8488)
Allow example to use a revision and work with private models #9407 (@sgugger)
[trainer] self.model_wrapped + _model_unwrap #9390 (@stas00)
Fix URLs to TAPAS notebooks #9435 (@NielsRogge)
Upgrade styler to better handle lists #9423 (@sgugger)
[Docs] Add useful links to model sharing #9431 (@patrickvonplaten)
Store transformers version info when saving the model #9421 (@JetRunner)
[GenerationOutputs] Fix GenerationOutputs Tests #9443 (@patrickvonplaten)
Remove nested lxmert #9440 (@sgugger)
[make fixup] a more reliable version of branching point discovery #9449 (@stas00)
Prophetnet optimization #9453 (@guillaume-be)
New serving #9419 (@jplu)
[Docs] Improve model sharing doc #9454 (@patrickvonplaten)
[TFGPT2] - Fix flaky past_key_values test #9460 (@patrickvonplaten)
Removing duplicated code for Translation,Summarization and Text2TextGeneration pipelines #9433 (@Narsil)
[README] Add new models #9465 (@patrickvonplaten)
[Generation] Fix bug for manual decoder_input_ids + warning message #9472 (@patrickvonplaten)
Makes HfArgumentParser compatible with Python 3.9 #9479 (@Tpt)
Fix TF input for np.ndarray #9294 (@jplu)
Making Conversation possible to create directly a full conversation #9434 (@Narsil)
fix(wandb): fix config #9489 (@borisdayma)
Fixing tests. It seems master changed something in the warnings. #9483 (@Narsil)
Reformat the TF serving outputs #9482 (@jplu)
[ray] add maintainers for Ray / Tune #9499 (@richardliaw)
Fix template #9504 (@jplu)
Full rework of the TF input/output embeddings and bias resizing #9193 (@jplu)
Remove tolerance + drop_rows_to_fit by default #9507 (@LysandreJik)
Fix template #9512 (@jplu)
New Updated DistilGPT-2 Finetuning and Generation #9494 (@tripathiaakash)
Make doc styler detect lists on rst and better support for Windows #9488 (@sgugger)
Enable TruncationStrategy override for pipelines #9432 (@Narsil)
[doc] How To Request Support document stab #9288 (@stas00)
[trainer] remove --model_parallel #9451 (@stas00)
Fix cardinality #9505 (@jplu)
Make doc styler behave properly on Windows #9516 (@sgugger)
[trainer] round numbers in trainer state #9491 (@stas00)
[make docs] parallel build #9522 (@stas00)
[TFBart] Split TF-Bart #9497 (@patrickvonplaten)
[ProphetNet] Fix naming and wrong config #9514 (@patrickvonplaten)
Update 'Develop on Windows' guidelines #9519 (@SBrandeis)
Shouldn't stale issues/PRs with feature request label #9511 (@LysandreJik)
[Blenderbot] Fix Links #9532 (@patrickvonplaten)
[T5] enable T5 fp16 #9487 (@patil-suraj)
LayoutLM Config #9539 (@LysandreJik)
Fix fill mask pipeline slow test using deprecated argument #9541 (@LysandreJik)
Refactor prepare_seq2seq_batch #9524 (@sgugger)
Use the right version of tokenizers #9550 (@sgugger)
fix BlenderbotSmallTokenizer #9538 (@patil-suraj)
Doc: Update pretrained_models wording #9545 (@julien-c)
Fix barthez tokenizer #9562 (@LysandreJik)
Fix classification script: enable dynamic padding with truncation #9554 (@pashok3d)
Speed up TopKLogitsWarper and TopPLogitsWarper (pytorch) #9557 (@LSinev)
Update run_glue for do_predict with local test data (#9442) #9486 (@forest1988)
[CI] use correct deps for torchhub #9552 (@stas00)
Fix data parallelism in Trainer #9566 (@sgugger)
Fix slow tests v4.2.0 #9561 (@LysandreJik)

transformers 4.2.0 v4.2.0: LED from AllenAI, Generation Scores, TensorFlow 2x speedup, faster import on Python PyPI