v4.9.0: TensorFlow examples, CANINE, tokenizer training, ONNX rework

ONNX rework

This version introduces a new package, transformers.onnx, which can be used to export models to ONNX. Contrary to the previous implementation, this approach is meant as an easily extendable package where users may define their own ONNX configurations and export the models they wish to export.

python -m transformers.onnx --model=bert-base-cased onnx/bert-base-cased/

Validating ONNX model...
        -[✓] ONNX model outputs' name match reference model ({'pooler_output', 'last_hidden_state'}
        - Validating ONNX Model output "last_hidden_state":
                -[✓] (2, 8, 768) matchs (2, 8, 768)
                -[✓] all values close (atol: 0.0001)
        - Validating ONNX Model output "pooler_output":
                -[✓] (2, 768) matchs (2, 768)
                -[✓] all values close (atol: 0.0001)
All good, model saved at: onnx/bert-base-cased/model.onnx

[RFC] Laying down building stone for more flexible ONNX export capabilities #11786 (@mfuntowicz)

CANINE model

Four new models are released as part of the CANINE implementation: CanineForSequenceClassification, CanineForMultipleChoice, CanineForTokenClassification and CanineForQuestionAnswering, in PyTorch.

The CANINE model was proposed in CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It’s among the first papers that train a Transformer without using an explicit tokenization step (such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece). Instead, the model is trained directly at a Unicode character level. Training at a character level inevitably comes with a longer sequence length, which CANINE solves with an efficient downsampling strategy, before applying a deep Transformer encoder.

Add CANINE #12024 (@NielsRogge)

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=canine

Tokenizer training

This version introduces a new method to train a tokenizer from scratch based off of an existing tokenizer configuration.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
# We train on batch of texts, 1000 at a time here.
batch_size = 1000
corpus = (dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=20000)

Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) #12420 (@SaulLu)
Easily train a new fast tokenizer from a given one #12361 (@sgugger)

TensorFlow examples

The TFTrainer is now entering deprecation - and it is replaced by Keras. With version v4.9.0 comes the end of a long rework of the TensorFlow examples, for them to be more Keras-idiomatic, clearer, and more robust.

NER example for Tensorflow #12469 (@Rocketknight1)
TF summarization example #12617 (@Rocketknight1)
Adding TF translation example #12667 (@Rocketknight1)
Deprecate TFTrainer #12706 (@Rocketknight1)

TensorFlow implementations

HuBERT is now implemented in TensorFlow:

Add TFHubertModel #12206 (@will-rice)

Breaking changes

When load_best_model_at_end was set to True in the TrainingArguments, having a different save_strategy and eval_strategy was accepted but the save_strategy was overwritten by the eval_strategy (the option to keep track of the best model needs to make sure there is an evaluation each time there is a save). This led to a lot of confusion with users not understanding why the script was not doing what it was told, so this situation will now raise an error indicating to set save_strategy and eval_strategy to the same values, and in the case that value is "steps", save_steps must be a round multiple of eval_steps.

General improvements and bugfixes

UpdateDescription of TrainingArgs param save_strategy #12328 (@sam-qordoba)
[Deepspeed] new docs #12077 (@stas00)
[ray] try fixing import error #12338 (@richardliaw)
[examples/Flax] move the examples table up #12341 (@patil-suraj)
Fix torchscript tests #12336 (@LysandreJik)
Add flax/jax quickstart #12342 (@marcvanzee)
Fixed a typo in readme #12356 (@MichalPitr)
Fix exception in prediction loop occurring for certain batch sizes #12350 (@jglaser)
Add FlaxBigBird QuestionAnswering script #12233 (@vasudevgupta7)
Replace NotebookProgressReporter by ProgressReporter in Ray Tune run #12357 (@krfricke)
[examples] remove extra white space from log format #12360 (@stas00)
fixed multiplechoice tokenization #12362 (@cronoik)
[trainer] add main_process_first context manager #12351 (@stas00)
[Examples] Replicates the new --log_level feature to all trainer-based pytorch #12359 (@bhadreshpsavani)
[Examples] Update Example Template for --log_level feature #12365 (@bhadreshpsavani)
[Examples] Replace print statement with logger.info in QA example utils #12368 (@bhadreshpsavani)
Onnx export v2 fixes #12388 (@LysandreJik)
[Documentation] Warn that DataCollatorForWholeWordMask is limited to BertTokenizer-like tokenizers #12371 (@ionicsolutions)
Update run_mlm.py #12344 (@TahaAslani)
Add possibility to maintain full copies of files #12312 (@sgugger)
[CI] add dependency table sync verification #12364 (@stas00)
[Examples] Added context manager to datasets map #12367 (@bhadreshpsavani)
[Flax community event] Add more description to readme #12398 (@patrickvonplaten)
Remove the need for einsum in Albert's attention computation #12394 (@mfuntowicz)
[Flax] Adapt flax examples to include push_to_hub #12391 (@patrickvonplaten)
Tensorflow LM examples #12358 (@Rocketknight1)
[Deepspeed] match the trainer log level #12401 (@stas00)
[Flax] Add T5 pretraining script #12355 (@patrickvonplaten)
[models] respect dtype of the model when instantiating it #12316 (@stas00)
Rename detr targets to labels #12280 (@NielsRogge)
Add out of vocabulary error to ASR models #12288 (@will-rice)
Fix TFWav2Vec2 SpecAugment #12289 (@will-rice)
[example/flax] add summarization readme #12393 (@patil-suraj)
[Flax] Example scripts - correct weight decay #12409 (@patrickvonplaten)
fix ids_to_tokens naming error in tokenizer of deberta v2 #12412 (@hjptriplebee)
Minor fixes in original RAG training script #12395 (@shamanez)
Added talks #12415 (@suzana-ilic)
[modelcard] fix #12422 (@stas00)
Add option to save on each training node #12421 (@sgugger)
Added to talks section #12433 (@suzana-ilic)
Fix default bool in argparser #12424 (@sgugger)
Add default bos_token and eos_token for tokenizer of deberta_v2 #12429 (@hjptriplebee)
fix typo in mt5 configuration docstring #12432 (@fcakyon)
Add to talks section #12442 (@suzana-ilic)
[JAX/Flax readme] add philosophy doc #12419 (@patil-suraj)
[Flax] Add wav2vec2 #12271 (@patrickvonplaten)
Add test for a WordLevel tokenizer model #12437 (@SaulLu)
[Flax community event] How to use hub during training #12447 (@patrickvonplaten)
[Wav2Vec2, Hubert] Fix ctc loss test #12458 (@patrickvonplaten)
Comment fast GPU TF tests #12452 (@LysandreJik)
Fix training_args.py barrier for torch_xla #12464 (@jysohn23)
Added talk details #12465 (@suzana-ilic)
Add TPU README #12463 (@patrickvonplaten)
Import check_inits handling of duplicate definitions. #12467 (@Iwontbecreative)
Validation split added: custom data files @sgugger, @patil-suraj #12407 (@Souvic)
Fixing bug with param count without embeddings #12461 (@TevenLeScao)
[roberta] fix lm_head.decoder.weight ignore_key handling #12446 (@stas00)
Rework notebooks and move them to the Notebooks repo #12471 (@sgugger)
fixed typo in flax-projects readme #12466 (@mplemay)
Fix TAPAS test uncovered by #12446 #12480 (@LysandreJik)
Add guide on how to build demos for the Flax sprint #12468 (@osanseviero)
Add Repository import to the FLAX example script #12501 (@LysandreJik)
[examples/flax] clip style image-text training example #12491 (@patil-suraj)
[Flax] Fix wav2vec2 pretrain arguments #12498 (@Wikidepia)
[Flax] ViT training example #12300 (@patil-suraj)
Fix order of state and input in Flax Quickstart README #12510 (@navjotts)
[Flax] Dataset streaming example #12470 (@patrickvonplaten)
[Flax] Correct flax training scripts #12514 (@patrickvonplaten)
[Flax] Correct logging steps flax #12515 (@patrickvonplaten)
[Flax] Fix another bug in logging steps #12516 (@patrickvonplaten)
[Wav2Vec2] Flax - Adapt wav2vec2 script #12520 (@patrickvonplaten)
[Flax] Fix hybrid clip #12519 (@patil-suraj)
[RoFormer] Fix some issues #12397 (@JunnYu)
FlaxGPTNeo #12493 (@patil-suraj)
Updated README #12540 (@suzana-ilic)
Edit readme #12541 (@SaulLu)
implementing tflxmertmodel integration test #12497 (@sadakmed)
[Flax] Adapt examples to be able to use eval_steps and save_steps #12543 (@patrickvonplaten)
[examples/flax] add adafactor optimizer #12544 (@patil-suraj)
[Flax] Add FlaxMBart #12236 (@stancld)
Add a warning for broken ProphetNet fine-tuning #12511 (@JetRunner)
[trainer] add option to ignore keys for the train function too (#11719) #12551 (@shabie)
MLM training fails with no validation file(same as #12406 for pytorch now) #12517 (@Souvic)
[Flax] Allow retraining from save checkpoint #12559 (@patrickvonplaten)
Adding prepare_decoder_input_ids_from_labels methods to all TF ConditionalGeneration models #12560 (@Rocketknight1)
Remove tf.roll wherever not needed #12512 (@szutenberg)
Double check for attribute num_examples #12562 (@sgugger)
[examples/hybrid_clip] fix loading clip vision model #12566 (@patil-suraj)
Remove logging of GPU count etc from run_t5_mlm_flax.py #12569 (@ibraheem-moosa)
raise exception when arguments to pipeline are incomplete #12548 (@hwijeen)
Init pickle #12567 (@sgugger)
Fix group_lengths for short datasets #12558 (@sgugger)
Don't stop at num_epochs when using IterableDataset #12561 (@sgugger)
Fixing the pipeline optimization by reindexing targets (V2) #12330 (@Narsil)
Fix MT5 init #12591 (@sgugger)
[model.from_pretrained] raise exception early on failed load #12574 (@stas00)
[doc] fix broken ref #12597 (@stas00)
Add Flax sprint project evaluation section #12592 (@osanseviero)
This will reduce "Already borrowed error": #12550 (@Narsil)
[Flax] Add flax marian #12595 (@patrickvonplaten)
[Flax] Fix cur step flax examples #12608 (@patrickvonplaten)
Simplify unk token #12582 (@sgugger)
Fix arg count for partial functions #12609 (@sgugger)
Pass model_kwargs when loading a model in pipeline() #12449 (@aphedges)
[Flax] Fix mt5 auto #12612 (@patrickvonplaten)
[Flax Marian] Add marian flax example #12614 (@patrickvonplaten)
[FLax] Fix marian docs 2 #12615 (@patrickvonplaten)
[debugging utils] minor doc improvements #12525 (@stas00)
[doc] DP/PP/TP/etc parallelism #12524 (@stas00)
[doc] fix anchor #12620 (@stas00)
[Examples][Flax] added test file in summarization example #12630 (@bhadreshpsavani)
Point to the right file for hybrid CLIP #12599 (@edugp)
[flax]fix jax array type check #12638 (@patil-suraj)
Add tokenizer_file parameter to PreTrainedTokenizerFast docstring #12624 (@lewisbails)
Skip TestMarian_MT_EN #12649 (@LysandreJik)
The extended trainer tests should require torch #12650 (@LysandreJik)
Pickle auto models #12654 (@sgugger)
Pipeline should be agnostic #12656 (@LysandreJik)
Fix transfo xl integration test #12652 (@LysandreJik)
Remove SageMaker documentation #12657 (@philschmid)
Fixed docs #12646 (@KickItLikeShika)
fix typo in modeling_t5.py docstring #12640 (@PhilipMay)
Translate README.md to Simplified Chinese #12596 (@JetRunner)
Fix typo in README_zh-hans.md #12663 (@JetRunner)
Updates timeline for project evaluation #12660 (@osanseviero)
[WIP] Patch BigBird tokenization test #12653 (@LysandreJik)
**encode_plus() shouldn't run for W2V2CTC #12655 (@LysandreJik)
Add ByT5 option to example run_t5_mlm_flax.py #12634 (@mapmeld)
Wrong model is used in example, should be character instead of subword model #12676 (@jsteggink)
[Blenderbot] Fix docs #12227 (@patrickvonplaten)
Add option to load a pretrained model with mismatched shapes #12664 (@sgugger)
Fix minor docstring typos. #12682 (@qqaatw)
[tokenizer.prepare_seq2seq_batch] change deprecation to be easily actionable #12669 (@stas00)
[Flax Generation] Correct inconsistencies PyTorch/Flax #12662 (@patrickvonplaten)
[Deepspeed] adapt multiple models, add zero_to_fp32 tests #12477 (@stas00)
Add timeout to CI. #12684 (@LysandreJik)
Fix Tensorflow Bart-like positional encoding #11897 (@JunnYu)
[Deepspeed] non-native optimizers are mostly ok with zero-offload #12690 (@stas00)
Fix multiple choice doc examples #12679 (@sgugger)
Provide mask_time_indices to _mask_hidden_states to avoid double masking #12692 (@mfuntowicz)
Update TF examples README #12703 (@Rocketknight1)
Fix uninitialized variables when config.mask_feature_prob > 0 #12705 (@mfuntowicz)
Only test the files impacted by changes in the diff #12644 (@sgugger)
flax model parallel training #12590 (@patil-suraj)
[test] split test into 4 sub-tests to avoid timeout #12710 (@stas00)
[trainer] release tmp memory in checkpoint load #12718 (@stas00)
[Flax] Correct shift labels for seq2seq models in Flax #12720 (@patrickvonplaten)
Fix typo in Speech2TextForConditionalGeneration example #12716 (@will-rice)
Init adds its own files as impacted #12709 (@sgugger)
LXMERT integration test typo #12736 (@LysandreJik)
Fix AutoModel tests #12733 (@LysandreJik)
Skip test while the model is not available #12739 (@LysandreJik)
Skip test while the model is not available #12740 (@LysandreJik)
Translate README.md to Traditional Chinese #12701 (@qqaatw)
Fix MBart failing test #12737 (@LysandreJik)
Patch T5 device test #12742 (@LysandreJik)
Fix DETR integration test #12734 (@LysandreJik)
Fix led torchscript #12735 (@LysandreJik)
Remove framework mention #12731 (@LysandreJik)
[doc] parallelism: Which Strategy To Use When #12712 (@stas00)
[doc] performance: batch sizes #12725 (@stas00)
Replace specific tokenizer in log message by AutoTokenizer #12745 (@SaulLu)
[Wav2Vec2] Correctly pad mask indices for PreTraining #12748 (@patrickvonplaten)
[doc] testing: how to trigger a self-push workflow #12724 (@stas00)
add intel-tensorflow-avx512 to the candidates #12751 (@zzhou612)
[flax/model_parallel] fix typos #12757 (@patil-suraj)
Turn on eval mode when exporting to ONNX #12758 (@mfuntowicz)
Preserve list type of additional_special_tokens in special_token_map #12759 (@SaulLu)
[Wav2Vec2] Padded vectors should not allowed to be sampled #12764 (@patrickvonplaten)
Add tokenizers class mismatch detection between cls and checkpoint #12619 (@europeanplaice)
Fix push_to_hub docstring and make it appear in doc #12770 (@sgugger)
[ray] Fix datasets_modules ImportError with Ray Tune #12749 (@Yard1)
Longer timeout for slow tests #12779 (@LysandreJik)
Enforce eval and save strategies are compatible when --load_best_model_at_end #12786 (@sgugger)
[CIs] add troubleshooting docs #12791 (@stas00)
Fix Padded Batch Error 12282 #12487 (@will-rice)
Flax MLM: Allow validation split when loading dataset from local file #12689 (@fgaim)
[Longformer] Correct longformer docs #12809 (@patrickvonplaten)
[CLIP/docs] add and fix examples #12810 (@patil-suraj)
[trainer] sanity checks for save_steps=0|None and logging_steps=0 #12796 (@stas00)
Expose get_config() on ModelTesters #12812 (@LysandreJik)
Refactor slow sentencepiece tokenizers. #11716 (@PhilipMay)
Refer warmup_ratio when setting warmup_num_steps. #12818 (@tsuchm)
Add versioning system to fast tokenizer files #12713 (@sgugger)
Add _CHECKPOINT_FOR_DOC to all models #12811 (@LysandreJik)

huggingface/transformers v4.9.0 v4.9.0: TensorFlow examples, CANINE, tokenizer training, ONNX rework on GitHub