v4.4.0: S2T, M2M100, I-BERT, mBART-50, DeBERTa-v2, XLSR-Wav2Vec2

SpeechToText

Two new models are released as part of the S2T implementation: Speech2TextModel and Speech2TextForConditionalGeneration, in PyTorch.

Speech2Text is a speech model that accepts a float tensor of log-mel filter-bank features extracted from the speech signal. It’s a transformer-based seq2seq model, so the transcripts/translations are generated autoregressively.

The Speech2Text model was proposed in fairseq S2T: Fast Speech-to-Text Modeling with fairseq by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=speech_to_text

Speech2TextTransformer #10175 (@patil-suraj)

M2M100

Two new models are released as part of the M2M100 implementation: M2M100Model and M2M100ForConditionalGeneration, in PyTorch.

M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks.

The M2M100 model was proposed in Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=m2m_100

Add m2m100 #10236 (@patil-suraj)

I-BERT

Six new models are released as part of the I-BERT implementation: IBertModel, IBertForMaskedLM, IBertForSequenceClassification, IBertForMultipleChoice, IBertForTokenClassification and IBertForQuestionAnswering, in PyTorch.

I-BERT is a quantized version of RoBERTa running inference up to four times faster.

The I-BERT framework in PyTorch allows to identify the best parameters for quantization. Once the model is exported in a framework that supports int8 execution (such as TensorRT), a speedup of up to 4x is visible, with no loss in performance thanks to the parameter search.

The I-BERT model was proposed in I-BERT: Integer-only BERT Quantization by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=ibert

I-BERT model support #10153 (@kssteven418)
[IBert] Correct link to paper #10445 (@patrickvonplaten)
Add I-BERT to README #10462 (@LysandreJik)

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=speech_to_text

mBART-50

MBart-50 is created using the original mbart-large-cc25 checkpoint by extending its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50 languages.

The MBart model was presented in Multilingual Translation with Extensible Multilingual Pretraining and Finetuning by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=mbart-50

Add mBART-50 #10154 (@patil-suraj)

DeBERTa-v2

Fixe new models are released as part of the DeBERTa-v2 implementation: DebertaV2Model, DebertaV2ForMaskedLM, DebertaV2ForSequenceClassification, DeberaV2ForTokenClassification and DebertaV2ForQuestionAnswering, in PyTorch.

The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.

It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=deberta-v2

Integrate DeBERTa v2(the 1.5B model surpassed human performance on Su… #10018 (@BigBird01)
DeBERTa-v2 fixes #10328 (@LysandreJik)

Wav2Vec2

XLSR-Wav2Vec2

The XLSR-Wav2Vec2 model was proposed in Unsupervised Cross-Lingual Representation Learning For Speech Recognition by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.

The checkpoint corresponding to that model is added to the model hub: facebook/
wav2vec2-large-xlsr-53

[XLSR-Wav2Vec2] Add multi-lingual Wav2Vec2 models #10648 (@patrickvonplaten)

Training script

A fine-tuning script showcasing how the Wav2Vec2 model can be trained has been added.

Add Fine-Tuning for Wav2Vec2 #10145 (@patrickvonplaten)

Further improvements

The Wav2Vec2 architecture becomes more stable as several changes are done to its architecture. This introduces feature extractors and feature processors as the pre-processing aspect of multi-modal speech models.

Deprecate Wav2Vec2ForMaskedLM and add Wav2Vec2ForCTC #10089 (@patrickvonplaten)
Fix example in Wav2Vec2 documentation #10096 (@abhishekkrthakur)
[Wav2Vec2] Remove unused config #10457 (@patrickvonplaten)
[Wav2Vec2FeatureExtractor] smal fixes #10455 (@patil-suraj)
[Wav2Vec2] Improve Tokenizer & Model for batched inference #10117 (@patrickvonplaten)
[PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer #10324 (@patrickvonplaten)
[Wav2Vec2 Example Script] Typo #10547 (@patrickvonplaten)
[Wav2Vec2] Make wav2vec2 test deterministic #10714 (@patrickvonplaten)
[Wav2Vec2] Fix documentation inaccuracy #10694 (@MikeG112)

AMP & XLA Support for TensorFlow models

Most of the TensorFlow models are now compatible with automatic mixed precision and have XLA support.

Add AMP for TF Albert #10141 (@jplu)
Unlock XLA test for TF ConvBert #10207 (@jplu)
Making TF BART-like models XLA and AMP compliant #10191 (@jplu)
Making TF XLM-like models XLA and AMP compliant #10211 (@jplu)
Make TF CTRL compliant with XLA and AMP #10209 (@jplu)
Making TF GPT2 compliant with XLA and AMP #10230 (@jplu)
Making TF Funnel compliant with AMP #10216 (@jplu)
Making TF Lxmert model compliant with AMP #10257 (@jplu)
Making TF MobileBert model compliant with AMP #10259 (@jplu)
Making TF MPNet model compliant with XLA #10260 (@jplu)
Making TF T5 model compliant with AMP and XLA #10262 (@jplu)
Making TF TransfoXL model compliant with AMP #10264 (@jplu)
Making TF OpenAI GPT model compliant with AMP and XLA #10261 (@jplu)
Rework the AMP for TF XLNet #10274 (@jplu)
Making TF Longformer-like models compliant with AMP #10233 (@jplu)

SageMaker Trainer for model parallelism

We are rolling out experimental support for model parallelism on SageMaker with a new SageMakerTrainer that can be used in place of the regular Trainer. This is a temporary class that will be removed in a future version, the end goal is to have Trainer support this feature out of the box.

Add SageMakerTrainer for model paralellism #10122 (@sgugger)
Extend trainer logging for sm #10633 (@philschmid)
Sagemaker Model Parallel tensoboard writing fix #10403 (@mansimane)
Multiple fixes in SageMakerTrainer #10687 (@sgugger)
Add DistributedSamplerWithLoop #10746 (@sgugger)

General improvements and bugfixes

[trainer] deepspeed bug fixes and tests #10039 (@stas00)
Removing run_pl_glue.py from text classification docs, include run_xnli.py & run_tf_text_classification.py #10066 (@cbjuan)
remove token_type_ids from TokenizerBertGeneration output #10070 (@sadakmed)
[deepspeed tests] transition to new tests dir #10080 (@stas00)
Added integration tests for Pytorch implementation of the ELECTRA model #10073 (@spatil6)
Fix naming in TF MobileBERT #10095 (@jplu)
[examples/s2s] add test set predictions #10085 (@patil-suraj)
Logging propagation #10092 (@LysandreJik)
Fix some edge cases in report_to and add deprecation warnings #10100 (@sgugger)
Add head_mask and decoder_head_mask to TF LED #9988 (@stancld)
Replace strided slice with tf.expand_dims #10078 (@jplu)
Fix Faiss Import #10103 (@patrickvonplaten)
[RAG] fix generate #10094 (@patil-suraj)
Fix TFConvBertModelIntegrationTest::test_inference_masked_lm Test #10104 (@abhishekkrthakur)
doc: update W&B related doc #10086 (@borisdayma)
Remove speed metrics from default compute objective [WIP] #10107 (@shiva-z)
Fix tokenizers training in notebooks #10110 (@n1t0)
[DeepSpeed docs] new information #9610 (@stas00)
[CI] build docs faster #10115 (@stas00)
[scheduled github CI] add deepspeed fairscale deps #10116 (@stas00)
Line endings should be LF across repo and not CRLF #10119 (@LysandreJik)
Fix TF LED/Longformer attentions computation #10007 (@jplu)
remove adjust_logits_during_generation method #10087 (@patil-suraj)
[DeepSpeed] restore memory for evaluation #10114 (@stas00)
Update run_xnli.py to use Datasets library #9829 (@Qbiwan)
Add new community notebook - Blenderbot #10126 (@lordtt13)
[DeepSpeed in notebooks] Jupyter + Colab #10130 (@stas00)
[examples/run_s2s] remove task_specific_params and update rouge computation #10133 (@patil-suraj)
Fix typo in GPT2DoubleHeadsModel docs #10148 (@M-Salti)
[hf_api] delete deprecated methods and tests #10159 (@julien-c)
Revert propagation #10171 (@LysandreJik)
Conversion from slow to fast for BPE spm vocabs contained an error. #10120 (@Narsil)
Fix typo in comments #10157 (@mrm8488)
Fix typo in comment #10156 (@mrm8488)
[Doc] Fix version control in internal pages #10124 (@sgugger)
[t5 tokenizer] add info logs #9897 (@stas00)
Fix v2 model loading issue #10129 (@BigBird01)
Fix datasets set_format #10178 (@sgugger)
Fixing NER pipeline for list inputs. #10184 (@Narsil)
Add new model to labels that should not stale #10187 (@LysandreJik)
Check TF ops for ONNX compliance #10025 (@jplu)
[RAG] fix tokenizer #10167 (@patil-suraj)
Fix TF template #10189 (@jplu)
fix run_seq2seq.py; porting trainer tests to it #10162 (@stas00)
Specify dataset dtype #10195 (@LysandreJik)
[CI] make the examples sub-group of tests run always #10196 (@stas00)
[WIP][examples/seq2seq] move old s2s scripts to legacy #10136 (@patil-suraj)
set tgt_lang of MBart Tokenizer for summarization #10205 (@HeroadZ)
Store FLOS as floats to avoid overflow. #10213 (@sgugger)
Fix add_token_positions in custom datasets tutorial #10217 (@joeddav)
[trainer] fix ignored columns logger #10219 (@stas00)
Factor out methods #10215 (@LysandreJik)
Fix head masking for TFT5 models #9877 (@stancld)
[CI] 2 fixes #10248 (@stas00)
[trainer] refactor place_model_on_device logic, add deepspeed #10243 (@stas00)
[Trainer] doc update #10241 (@stas00)
Reduce the time spent for the TF slow tests #10152 (@jplu)
Introduce warmup_ratio training argument #10229 (@tanmay17061)
[Trainer] memory tracker metrics #10225 (@stas00)
Script for distilling zero-shot classifier to more efficient student #10244 (@joeddav)
[test] fix func signature #10271 (@stas00)
[trainer] implement support for full fp16 in evaluation/predict #10268 (@stas00)
[ISSUES.md] propose using google colab to reproduce problems #10270 (@stas00)
Introduce logging_strategy training argument #10267 (@tanmay17061)
[CI] Kill any run-away pytest processes #10281 (@stas00)
Patch zero shot distillation script cuda issue #10284 (@joeddav)
Move the TF NER example #10276 (@jplu)
Fix example links in the task summary #10291 (@sgugger)
fixes #10303 #10304 (@cronoik)
[ci] don't fail when there are no zombies #10308 (@stas00)
fix typo in conversion script #10316 (@tagucci)
Add note to resize token embeddings matrix when adding new tokens to voc #10331 (@LysandreJik)
Deprecate prepare_seq2seq_batch #10287 (@sgugger)
[examples/seq2seq] defensive programming + expand/correct README #10295 (@stas00)
[Trainer] implement gradient_accumulation_steps support in DeepSpeed integration #10310 (@stas00)
Loading from last checkpoint functionality in Trainer.train #10334 (@tanmay17061)
[trainer] add Trainer methods for metrics logging and saving #10266 (@stas00)
Fix evaluation with label smoothing in Trainer #10338 (@sgugger)
Fix broken examples/seq2seq/README.md markdown #10344 (@Wikidepia)
[bert-base-german-cased] use model repo, not external bucket #10353 (@julien-c)
[Trainer/Deepspeed] handle get_last_lr() before first step() #10362 (@stas00)
ConvBERT fix torch <> tf weights conversion #10314 (@abhishekkrthakur)
fix deprecated reference tokenizer.max_len in glue.py #10220 (@poedator)
[trainer] move secondary methods into a separate file #10363 (@stas00)
Run GA on every push even on forks #10383 (@LysandreJik)
GA: only run model templates once #10388 (@LysandreJik)
Bugfix: Removal of padding_idx in BartLearnedPositionalEmbedding #10200 (@mingruimingrui)
Remove unused variable in example for Q&A #10392 (@abhishekkrthakur)
Ignore unexpected weights from PT conversion #10397 (@LysandreJik)
Add support for ZeRO-2/3 and ZeRO-offload in fairscale #10354 (@sgugger)
Fix None in add_token_positions - issue #10210 #10374 (@andreabac3)
Make Barthez tokenizer tests a bit faster #10399 (@sgugger)
Fix run_glue evaluation when model has a label correspondence #10401 (@sgugger)
[ci, flax] non-existing models are unlikely to pass tests #10409 (@julien-c)
[LED] Correct Docs #10419 (@patrickvonplaten)
Add Ray Tune hyperparameter search integration test #10414 (@krfricke)
Ray Tune Integration Bug Fixes #10406 (@amogkam)
[examples] better model example #10427 (@stas00)
Fix conda-build #10431 (@LysandreJik)
[run_seq2seq.py] restore functionality: saving to test_generations.txt #10428 (@stas00)
updated logging and saving metrics #10436 (@bhadreshpsavani)
Introduce save_strategy training argument #10286 (@tanmay17061)
Adds terms to Glossary #10443 (@darigovresearch)
Fixes compatibility bug when using grouped beam search and constrained decoding together #10475 (@mnschmit)
Generate can return cross-attention weights too #10493 (@Mehrad0711)
Fix typos #10489 (@WybeKoper)
[T5] Fix speed degradation bug t5 #10496 (@patrickvonplaten)
feat(docs): navigate with left/right arrow keys #10481 (@ydcjeff)
Refactor checkpoint name in BERT and MobileBERT #10424 (@sgugger)
remap MODEL_FOR_QUESTION_ANSWERING_MAPPING classes to names auto-generated file #10487 (@stas00)
Fix the bug in constructing the all_hidden_states of DeBERTa v2 #10466 (@felixgwu)
Smp grad accum #10488 (@sgugger)
Remove unsupported methods from ModelOutput doc #10505 (@sgugger)
Not always consider a local model a checkpoint in run_glue #10517 (@sgugger)
Removes overwrites for output_dir #10521 (@philschmid)
Rework TPU checkpointing in Trainer #10504 (@sgugger)
[ProphetNet] Bart-like Refactor #10501 (@patrickvonplaten)
Fix example of custom Trainer to reflect signature of compute_loss #10537 (@lewtun)
Fixing conversation test for torch 1.8 #10545 (@Narsil)
Fix torch 1.8.0 segmentation fault #10546 (@LysandreJik)
Fixed dead link in Trainer documentation #10554 (@jwa018)
Typo correction. #10531 (@cliang1453)
Fix embeddings for PyTorch 1.8 #10549 (@sgugger)
Stale Bot #10509 (@LysandreJik)
Refactoring checkpoint names for multiple models #10527 (@danielpatrickhug)
offline mode for firewalled envs #10407 (@stas00)
fix tf doc bug #10570 (@Sniper970119)
[run_seq2seq] fix nltk lookup #10585 (@stas00)
Fix typo in docstring for pipeline #10591 (@silvershine157)
wrong model used for BART Summarization example #10582 (@orena1)
[M2M100] fix positional embeddings #10590 (@patil-suraj)
Enable torch 1.8.0 on GPU CI #10593 (@LysandreJik)
tokenization_marian.py: use current_spm for decoding #10357 (@Mehrad0711)
[trainer] fix double wrapping + test #10583 (@stas00)
Fix version control with anchors #10595 (@sgugger)
offline mode for firewalled envs (part 2) #10569 (@stas00)
[examples tests] various fixes #10584 (@stas00)
Added max_sample_ arguments #10551 (@bhadreshpsavani)
[examples tests on multigpu] resolving require_torch_non_multi_gpu_but_fix_me #10561 (@stas00)
Check layer types for Optimizer construction #10598 (@sgugger)
Speedup tf tests #10601 (@LysandreJik)
[docs] How to solve "Title level inconsistent" sphinx error #10600 (@stas00)
[FeatureExtractorSavingUtils] Refactor PretrainedFeatureExtractor #10594 (@patrickvonplaten)
fix flaky m2m100 test #10604 (@patil-suraj)
[examples template] added max_sample args and metrics changes #10602 (@bhadreshpsavani)
Fairscale FSDP fix model save #10596 (@sgugger)
Fix tests of TrainerCallback #10615 (@sgugger)
Fixes an issue in text-classification where MNLI eval/test datasets are not being preprocessed. #10621 (@allenwang28)
[M2M100] remove final_logits_bias #10606 (@patil-suraj)
Add new GLUE example with no Trainer. #10555 (@sgugger)
Copy tokenizer files in each of their repo #10624 (@sgugger)
Document Trainer limitation on custom models #10635 (@sgugger)
Fix Longformer tokenizer filename #10653 (@LysandreJik)
Update README.md #10647 (@Arvid-pku)
Ensure metric results are JSON-serializable #10632 (@sgugger)
S2S + M2M100 should be available in tokenization_auto #10657 (@LysandreJik)
Remove special treatment for custom vocab files #10637 (@sgugger)
[S2T] fix example in docs #10667 (@patil-suraj)
W2v2 test require torch #10665 (@LysandreJik)
Fix Marian/TFMarian tokenization tests #10661 (@LysandreJik)
Fixes Pegasus tokenization tests #10671 (@LysandreJik)
Onnx fix test #10663 (@mfuntowicz)
Fix integration slow tests #10670 (@sgugger)
Specify minimum version for sacrebleu #10662 (@LysandreJik)
Add DeBERTa to MODEL_FOR_PRETRAINING_MAPPING #10668 (@jeswan)
Fix broken link #10656 (@WybeKoper)
fix typing error for HfArgumentParser for Optional[bool] #10672 (@bfineran)
MT5 integration test: adjust loss difference #10669 (@LysandreJik)
Adding new parameter to generate: max_time. #9846 (@Narsil)
TensorFlow tests: having from_pt set to True requires torch to be installed. #10664 (@LysandreJik)
Add auto_wrap option in fairscale integration #10673 (@sgugger)
fix: #10628 expanduser path in TrainingArguments #10660 (@PaulLerner)
Pass encoder outputs into GenerationMixin #10599 (@ymfa)
[wip] [deepspeed] AdamW is now supported by default #9624 (@stas00)
[Tests] RAG #10679 (@patrickvonplaten)
enable loading Mbart50Tokenizer with AutoTokenizer #10690 (@patil-suraj)
Wrong link to super class #10709 (@cronoik)
Distributed barrier before loading model #10685 (@sgugger)
GPT2DoubleHeadsModel made parallelizable #10658 (@ishalyminov)
split seq2seq script into summarization & translation #10611 (@theo-m)
Adding required flags to non-default arguments in hf_argparser #10688 (@Craigacp)
Fix backward compatibility with EvaluationStrategy #10718 (@sgugger)
Tests run on Docker #10681 (@LysandreJik)
Rename zero-shot pipeline multi_class argument #10727 (@joeddav)
Add minimum version check in examples #10724 (@sgugger)
independent training / eval with local files #10710 (@riklopfer)
Flax testing should not run the full torch test suite #10725 (@patrickvonplaten)

huggingface/transformers v4.4.0 v4.4.0: S2T, M2M100, I-BERT, mBART-50, DeBERTa-v2, XLSR-Wav2Vec2 on GitHub