Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the generate method
Model versioning
We host more and more of the community's models which is awesome ❤️. To scale this sharing, we needed to change the infra to both support more models and unlock powerful new features.
To that effect, we have rebuilt the storage backend that we use for models, moving from S3 to our own git repos (with S3 as a git-lfs endpoint for large files), with one model = one repo.
The benefits of this switch are:
- built-in versioning (I mean… it’s git. It’s pretty much what you use for versioning. Versioning in S3 has a ton of limitations)
- access control (will unlock private models, private datasets, etc)
- scalability (our usage of S3 to maintain lists of models was starting to bottleneck)
Let's dive into the actual changes:
I. On the website
You'll now see a "Browse files and versions" tab or button on each model page. (The design is not final; we'll make it more prominent/streamlined in the near future.)
This is what this page looks like:
The UX should look familiar and self-explanatory, but we'll add more ML-specific features in the future.
You can:
- see commit histories and diffs of changes made to any text file, like config.json:
- changes made by the HuggingFace team will be way clearer – we can perform updates to the models to ensure they work well with the library(ies) (you'll be able to opt out of those changes)
- Large binary files are stored using https://git-lfs.github.com/, which is pretty standard now and interoperable out of the box with git
- Ability to update your text files, like your README.md model card, directly on the website!
- with instant preview 🔥
II. In the transformers library
The PR to enable this new storage mode in the transformers library is available here: #8324
This PR has two parts:
1. changes to the file downloading code used in from_pretrained() methods to use the new file URLs.
Large files are stored in an S3 bucket and served by CloudFront, so downloads should be as fast as they are right now.
In addition, you now have a way to pin a specific version of a model, to a commit hash, tag or branch.
For instance:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "julien-c/EsperBERTo-small",
    revision="v2.0.1"  # tag name, or branch name, or commit hash
)
```
Finally, the networking code is more robust and doesn't gobble up errors anymore, so if you have trouble downloading a specific file, you'll know exactly why.
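For instance, requesting a revision that doesn't exist now fails loudly with the underlying error. A minimal sketch (the revision name below is made up to force the failure, and the exact exception type may vary):

```python
from transformers import AutoTokenizer

try:
    tokenizer = AutoTokenizer.from_pretrained(
        "julien-c/EsperBERTo-small",
        revision="nonexistent-tag",  # made-up revision, used here to trigger an error
    )
except OSError as err:
    # The underlying networking/HTTP error is surfaced instead of being swallowed
    print(f"Download failed: {err}")
```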
2. changes to the model upload CLI to create a model repo, which you can then git clone and git push to.
We are intentionally not wrapping git too much because we expect most model authors to be familiar with git (and possibly git-lfs); let us know if that's not the case.
To create a repo:
```bash
transformers-cli repo create your-model-name
```
Then you'll get a repo url that you'll be able to clone:
```bash
git clone https://huggingface.co/username/your-model-name

# Then commit as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
```
A nice side effect of the new system on the upload side is that file uploading should be more robust for very large files (hello T5!) as git-lfs handles the networking code.
By the way, again, every model is its own repo. So you can git clone any public model if you'd like:
```bash
git clone https://huggingface.co/gpt2
```
But you won't be able to push unless it's one of your models (or one of your orgs').
III. Backward compatibility
- Backward compatibility on model downloads is expected, because even though the new models will be stored in huggingface.co-hosted git repos, we will backport all file changes to S3 automatically.
- ⚠️ Model uploads using the current system won't work anymore: you'll need to upgrade your transformers installation to the next release, v3.5.0, or to build from master.
Alternatively, in the next week or so we'll add the ability to create a repo from the website directly so you'll be able to push even without the transformers library.
TFMarian, TFMbart, TFPegasus, TFBlenderbot
- Add tensorflow 2.0 functionality for SOTA seq2seq transformers #7987 (@sshleifer)
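This makes the Marian, mBART, Pegasus and Blenderbot seq2seq architectures usable from TensorFlow 2.0. A minimal sketch with TFMarianMTModel (the model name is just an example; from_pt=True converts the PyTorch weights on the fly in case the repo doesn't ship TF ones):

```python
from transformers import AutoTokenizer, TFMarianMTModel

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFMarianMTModel.from_pretrained(model_name, from_pt=True)

# Translate a batch of English sentences to German
batch = tokenizer(["I love translation."], return_tensors="tf")
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```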
New and updated scripts
We're working on examples showing how to leverage the 🤗 Datasets library and the Trainer API. These scripts are meant as easy-to-customize examples, with lots of comments explaining the various steps. The following tasks are now covered (a sketch of the common pattern follows the list):
- Text classification: New run glue script #7917 (@sgugger)
- Causal Language Modeling: New run_clm script #8105 (@sgugger)
- Masked Language Modeling: Add line by line option to mlm/plm scripts #8240 (@sgugger)
- Token classification: Add new token classification example #8340 (@sgugger)
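As a rough illustration of the pattern these scripts follow, here is a minimal sketch for text classification (illustrative only, not the scripts' exact code; the model and dataset names are just examples):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# 1. Load a dataset with 🤗 Datasets
dataset = load_dataset("glue", "sst2")

# 2. Preprocess it with a tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

encoded = dataset.map(tokenize, batched=True)

# 3. Hand everything to the Trainer API
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output"),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```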
Seq2Seq Trainer
A child of Trainer specialized for training seq2seq models, from @patil-suraj, @stas00 and @sshleifer. It is accessible through examples/seq2seq/finetune_trainer.py. The API is similar to examples/seq2seq/finetune.py, but with better API support. Example scripts are in examples/seq2seq/builtin_trainer.
- [seq2seq testing] multigpu test run via subprocess #7281 (@stas00)
- [s2s trainer] tests to use distributed on multi-gpu machine #7965 (@stas00)
- [Seq2Seq] Allow EncoderDecoderModels to be trained with Seq2Seq #7809 (@patrickvonplaten)
- [Seq2Seq Trainer] Make sure padding is implemented for models without pad_token #8043 (@patrickvonplaten)
- [Seq2SeqTrainer] Move import to init to make file self-contained #8194 (@patrickvonplaten)
- [s2s test] cleanup #8131 (@stas00)
- [Seq2Seq] Correct import in Seq2Seq Trainer #8254 (@patrickvonplaten)
- [Seq2Seq] Make Seq2SeqArguments an independent file #8267 (@patrickvonplaten)
- [Seq2SeqDataCollator] don't pass add_prefix_space=False to all tokenizers #8329 (@sshleifer)
Seq2Seq Testing and Documentation Improvements
- [s2s] create doc for pegasus/fsmt replication #7934 (@stas00)
- [s2s] test_distributed_eval #8315 (@stas00)
- [s2s] test_bash_script.py - actually learn something #8318 (@stas00)
- [s2s examples test] fix data path #8398 (@stas00)
- [s2s test_finetune_trainer] failing multigpu test #8400 (@stas00)
- [s2s/distill] remove run_distiller.sh, fix xsum script #8412 (@sshleifer)
Docs for DistillBART Paper Replication
Re-run experiments from the paper here
- [s2s] distillBART docs for paper replication #8150 (@sshleifer)
Refactoring the generate() function
The generate() method now has a new design so that the user can directly call sample(), greedy_search(), beam_search() and beam_sample(). The code was made more readable, and beam search was sped up by ca. 5-10%.
- Refactoring the generate() function #6949 (@patrickvonplaten)
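A minimal sketch of how a single generate() call now routes to each sub-method depending on its arguments (the model is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("The refactored method", return_tensors="pt").input_ids

greedy = model.generate(input_ids, max_length=20)                   # -> greedy_search()
sampled = model.generate(input_ids, do_sample=True, max_length=20)  # -> sample()
beamed = model.generate(input_ids, num_beams=4, max_length=20)      # -> beam_search()

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```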
Notebooks
- added qg evaluation notebook #7958 (@zolekode)
- adding beginner-friendly notebook on text classification with DistilBERT/TF #7964 (@peterbayerle)
- [Notebooks] Add new encoder-decoder notebooks #8246 (@patrickvonplaten)
General improvements and bugfixes
- Respect the 119 line chars #7928 (@LysandreJik)
- PPL guide code snippet minor fix #7938 (@joeddav)
- [ProphetNet] Add Question Generation Model + Test #7942 (@patrickvonplaten)
- [multiple models] skip saving/loading deterministic state_dict keys #7878 (@stas00)
- Add missing comma #7870 (@mrm8488)
- TensorBoard/Wandb/optuna/raytune integration improvements. #7935 (@madlag)
- [ProphetNet] Correct Doc string example #7944 (@patrickvonplaten)
- [GPT2 batch generation] Make test clearer. do_sample=True is not deterministic. #7947 (@patrickvonplaten)
- fix 'encode_plus' docstring for 'special_tokens_mask' (0s and 1s were reversed) #7949 (@epwalsh)
- Herbert tokenizer auto load #7968 (@rmroczkowski)
- [testing] slow tests should be marked as slow #7895 (@stas00)
- support relative path for best_model_checkpoint #7973 (@HaebinShin)
- Disable inference API for t5-11b #7978 (@julien-c)
- [fsmt test] basic config test with online model + super tiny model #7860 (@stas00)
- Add whole word mask support for lm fine-tune #7925 (@wlhgtc)
- [PretrainedConfig] Fix save pretrained config for edge case #7943 (@patrickvonplaten)
- GPT2 - Remove else branch adding 0 to the hidden state if token_type_embeds is None. #7977 (@mfuntowicz)
- Fixing the "translation", "translation_XX_to_YY" pipelines. #7975 (@Narsil)
- FillMaskPipeline: support passing top_k on call #7971 (@julien-c)
- Only log total_flos at the end of training #7981 (@sgugger)
- add zero shot pipeline tags & examples #7983 (@joeddav)
- Reload checkpoint #7984 (@sgugger)
- [gh ci] less output (--durations=50) #7989 (@sshleifer)
- Move NoLayerEmbedTokens #7945 (@sshleifer)
- update zero shot default widget example #7992 (@joeddav)
- [RAG] Handle the case when title is None while loading own datasets #7941 (@lalitpagaria)
- [tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups #7970 (@thomwolf)
- [Reformer] remove reformer pad_token_id #7991 (@patrickvonplaten)
- Fix BatchEncoding.word_to_tokens for removed tokens #7939 (@n1t0)
- Handling longformer model_type #7990 (@ethanjperez)
- [doc prepare_seq2seq_batch] fix docs #8013 (@patil-suraj)
- [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization #8006 (@thomwolf)
- Add mixed precision evaluation #8036 (@luyug)
- [docs] [testing] distributed training #7993 (@stas00)
- [fix] FSMT slow test uses lists instead of torch tensors #8031 (@sshleifer)
- update version for scipy #7998 (@suliuzh)
- [cleanup] pegasus,marian,mbart pytorch tests #8033 (@sshleifer)
- Fix label name in DataCollatorForNextSentencePrediction test #8048 (@sgugger)
- Tiny TF Bart fixes #8023 (@LysandreJik)
- Mlflow integration callback #8016 (@noise-field)
- Minor error fix of 'bart-large-cnn' details in the pretrained_models doc #8053 (@forest1988)
- invalid argument wwm passed to the run_language_modeling.py file #8050 (@mohammadreza-Banaei73)
- Fix + Test #8049 (@LysandreJik)
- [testing] fixing crash in deberta #8057 (@stas00)
- [TF] from_pt should respect authorized_unexpected_keys #8056 (@sshleifer)
- Fix TF training arguments instantiation #8063 (@LysandreJik)
- Doc fixes in preparation for the docstyle PR #8061 (@sgugger)
- Doc styling #8067 (@sgugger)
- Fix doc examples #8082 (@mymusise)
- Fix comet_ml import and add ensure availability #7933 (@dsblank)
- Doc styling fixes #8074 (@sgugger)
- Fix DeBERTa docs #8092 (@LysandreJik)
- [CI] generate separate report files as artifacts #7995 (@stas00)
- Move style_doc to extra_quality_checks #8081 (@sshleifer)
- Fix IterableDataset with len in Trainer #8095 (@cccntu)
- Fix assertion error message for MLflowCallback #8091 (@harupy)
- Fix a bug for CallbackHandler.callback_list #8052 (@harupy)
- [setup] update/add setup targets #8076 (@stas00)
- DEP: pinned sentencepiece to 0.1.91 in setup.py to fix build issues with newer versions #8069 (@jmwoloso)
- infer entailment label id on zero shot pipeline #8059 (@joeddav)
- Fully remove codecov #8093 (@LysandreJik)
- Add AzureML in integrations via dedicated callback #8062 (@davidefiocco)
- Adjust setup so that all extras run on Windows #8102 (@sgugger)
- Move installation instructions to the top #8106 (@sgugger)
- [gh actions] run artifacts job always #8110 (@stas00)
- [testing] port test_trainer_distributed to distributed pytest + TestCasePlus enhancements #8107 (@stas00)
- [DOC] Improve pipeline() docstrings for config and tokenizer #8123 (@BramVanroy)
- Document the various LM Auto models #8118 (@sgugger)
- Rename add_start_docstrings_to_callable #8120 (@sgugger)
- Update CI cache #8126 (@LysandreJik)
- Upgrade PyTorch Lightning to 1.0.2 #7852 (@SeanNaren)
- Document tokenizer_class in configurations #8152 (@sgugger)
- Smarter prediction loop and no- -> no_ in console args #8151 (@sgugger)
- Add a template for examples and apply it for mlm and plm examples #8153 (@sgugger)
- [testing] distributed: correct subprocess output checking #8157 (@stas00)
- Fix eval ref miss in Chinese WWM. #8115 (@wlhgtc)
- [CI] Better reports #2 #8163 (@stas00)
- Fixing some warnings in DeBerta #8176 (@Narsil)
- Ci test tf super slow #8007 (@LysandreJik)
- Doc fixes and filter warning in wandb #8189 (@sgugger)
- Finalize lm examples #8188 (@sgugger)
- Replace swish with silu #8166 (@TFUsers)
- Remove deprecated arguments from new run_clm #8197 (@sgugger)
- Minor style improvements for the Flax BERT and RoBERTa examples #8178 (@avital)
- Fix two bugs with --logging_first_step #8193 (@abisee)
- [Bug fix] Fixed value for BlenderBot pad token #8205 (@guillaume-be)
- Fix the behaviour of DefaultArgumentHandler (removing it). #8180 (@Narsil)
- Fix ignore files behavior in doctests #8213 (@bryant1410)
- Patch reports #8238 (@LysandreJik)
- Fix bad import with PyTorch <= 1.4.1 #8237 (@sgugger)
- Fix TensorBoardCallback for older versions of PyTorch #8239 (@sgugger)
- Add XLMProphetNetTokenizer to tokenization auto #8245 (@LysandreJik)
- [EncoderDecoder] fix encoder decoder config model type bug #8243 (@patrickvonplaten)
- [bart] 2 SinusoidalPositionalEmbedding fixes #8226 (@stas00)
- [fix] Skip tatoeba tests if Tatoeba-Challenge not cloned #8260 (@sshleifer)
- [FIX] TextGenerationPipeline is currently broken. #8256 (@Narsil)
- Updated ConversationalPipeline to work with encoder-decoder models #8207 (@guillaume-be)
- [distributed testing] forward the worker stderr to the parent process #8262 (@stas00)
- [examples] minimal version requirement run-time check in PL #8133 (@stas00)
- Clean Trainer tests and datasets dep #8268 (@sgugger)
- improve documentation of training_args.py #8270 (@PhilipMay)
- Data collator for token classification #8274 (@sgugger)
- [CIs] Better reports everywhere #8275 (@stas00)
- [blenderbot] regex fix #8282 (@stas00)
- [Generate Test] fix greedy generate test #8293 (@patrickvonplaten)
- Fix validation file loading in scripts #8298 (@sgugger)
- Improve QA pipeline error handling #8286 (@Narsil)
- Speedup doc build #8301 (@sgugger)
- Fix path to old run_language_modeling.py script #8302 (@mrm8488)
- Clean up data collators and datasets #8308 (@sgugger)
- change TokenClassificationTask class methods to static methods #7902 (@donchev7)
- Output global_attentions in Longformer models #7562 (@gui11aume)
- Make Trainer evaluation handle dynamic seq_length #8336 (@sgugger)
- Docs bart training ref #8330 (@lvwerra)
- Some added tests for TokenClassificationArgumentHandler #8366 (@Narsil)
- [All Seq2Seq model + CLM models that can be used with EncoderDecoder] Add cross-attention weights to outputs #8071 (@ysgit)
- [TF generate] Cut encoder outputs to just last hidden states for now #8368 (@patrickvonplaten)
- [make] rewrite modified_py_files in python to be cross-platform #8371 (@stas00)
- Fix DataCollatorForWholeWordMask #8379 (@cccntu)
- Fix DataCollatorForWholeWordMask again #8397 (@cccntu)
- comet_ml init weirdness #8410 (@stas00)
- updating tag for exbert viz #8408 (@smanjil)
- Fix some tooling for windows #8359 (@jplu)
- examples/docs: caveat that PL examples don't work on TPU #8309 (@sshleifer)
- add evaluate doc - trainer.evaluate returns 'epoch' from training #8273 (@PhilipMay)
- Bug fix for permutation language modelling #8409 (@shngt)
- [fsmt tokenizer] support lowercase tokenizer #8389 (@stas00)
- Bump tokenizers #8419 (@sgugger)
- [fsmt convert script] fairseq broke chkpt data - fixing that #8377 (@stas00)
- Deprecate old data/metrics functions #8420 (@sgugger)
- [Tests] Add Common Test for Training + Fix a couple of bugs #8415 (@patrickvonplaten)
- [docs] remove sshleifer from issue-template :( #8418 (@sshleifer)
- Fix bart shape comment #8423 (@sshleifer)
- [docs] [testing] gpu decorators table #8422 (@stas00)
- Check all models are in an auto class #8425 (@sgugger)
- [github CI] add a multi-gpu job for all example tests #8341 (@stas00)
- Changing XLNet default from not using memories to 512 context size following paper #8417 (@TevenLeScao)
- Patch token classification pipeline #8364 (@LysandreJik)