v3.5.0: Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the `generate` method



Model versioning

We host more and more of the community's models, which is awesome ❤️. To scale this sharing, we needed to change the infra to both support more models and unlock powerful new features.

To that effect, we have rebuilt the storage backend that we use for models, moving from S3 to our own git repos (with S3 as a git-lfs endpoint for large files), with one model = one repo.

The benefits of this switch are:

  • built-in versioning (I mean… it’s git. It’s pretty much what you use for versioning. Versioning in S3 has a ton of limitations)
  • access control (will unlock private models, private datasets, etc)
  • scalability (our usage of S3 to maintain lists of models was starting to bottleneck)

Let's dive into the actual changes:

I. On the website


You'll now see a "Browse files and versions" tab or button on each model page. (The design is not final; we'll make it more prominent and streamlined in the near future.)

This is what this page looks like:

The UX should look familiar and self-explanatory, but we'll add more ML-specific features in the future.

You can:

  • see commit histories and diffs of changes made to any text file, like config.json:
    • changes made by the HuggingFace team will be way clearer – we can push updates to models to make sure they keep working well with the library(ies), and you'll be able to opt out of those changes
  • store large binary files using https://git-lfs.github.com/, which is pretty standard now and interoperable with git out of the box
  • update your text files, like your README.md model card, directly on the website!
    • with instant preview 🔥

II. In the transformers library


The PR to enable this new storage mode in the transformers library is available here: #8324

This PR has two parts:

1. changes to the file downloading code used in from_pretrained() methods to use the new file URLs.
Large files are stored in an S3 bucket and served by Cloudfront, so downloads should be as fast as they are right now.
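As a minimal sketch of what this looks like, assuming the per-repo resolve URL scheme (https://huggingface.co/<model_id>/resolve/<revision>/<filename>), you can fetch any file in a model repo directly; here with requests, purely for illustration:

import requests

# Fetch gpt2's config through a per-repo file URL
url = "https://huggingface.co/gpt2/resolve/main/config.json"
config = requests.get(url).json()
print(config["model_type"])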

In addition, you now have a way to pin a model to a specific version: a commit hash, tag, or branch name.

For instance:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "julien-c/EsperBERTo-small",
    revision="v2.0.1",  # tag name, or branch name, or commit hash
)
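The same revision argument is accepted by the other from_pretrained() methods, so you can pin configs and model weights too. A minimal sketch, reusing the illustrative model id and tag from above:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "julien-c/EsperBERTo-small",
    revision="v2.0.1",  # files are resolved from this exact point in the repo's history
)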

Finally, the networking code is more robust and doesn't gobble up errors anymore, so if you have trouble downloading a specific file, you'll know exactly why.

2. changes to the model upload CLI to create a model repo and then git clone and git push to it.
We are intentionally not wrapping git too much, because we expect most model authors to be familiar with git (and possibly git-lfs); let us know if that's not the case.

To create a repo:

transformers-cli repo create your-model-name

Then you'll get a repo url that you'll be able to clone:

git clone https://huggingface.co/username/your-model-name
cd your-model-name

# Make sure git-lfs is set up, since large files (e.g. model weights) go through it
git lfs install

# Then commit and push as usual
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
git push

A nice side effect of the new system on the upload side is that file uploading should be more robust for very large files (hello T5!) as git-lfs handles the networking code.

By the way, again, every model is its own repo. So you can git clone any public model if you'd like:

git clone https://huggingface.co/gpt2

But you won't be able to push unless it's one of your models (or one of your orgs').

III. Backward compatibility


  • Backward compatibility on model downloads is expected, because even though the new models will be stored in huggingface.co-hosted git repos, we will backport all file changes to S3 automatically.
  • ⚠️ Model uploads using the current system won't work anymore: you'll need to upgrade your transformers installation to this release, v3.5.0, or build from master.
    Alternatively, in the next week or so we'll add the ability to create a repo from the website directly so you'll be able to push even without the transformers library.

TFMarian, TFMbart, TFPegasus, TFBlenderbot

  • Add tensorflow 2.0 functionality for SOTA seq2seq transformers #7987 (@sshleifer)
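As a quick illustration of the new TF ports, here is a minimal translation sketch with TFMarianMTModel (the checkpoint is illustrative; from_pt=True converts the PyTorch weights on the fly if no TF weights are hosted):

from transformers import AutoTokenizer, TFMarianMTModel

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = TFMarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de", from_pt=True)

# Tokenize an English sentence and generate its German translation
batch = tokenizer(["I love Transformers!"], return_tensors="tf")
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))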

New and updated scripts

We're working on examples that show how to leverage the 🤗 Datasets library and the Trainer API. These scripts are meant as easy-to-customize examples, with lots of comments explaining the various steps. The following tasks are now covered (a minimal sketch of the pattern they share follows the list):

  • Text classification: New run glue script #7917 (@sgugger)
  • Causal Language Modeling: New run_clm script #8105 (@sgugger)
  • Masked Language Modeling: Add line by line option to mlm/plm scripts #8240 (@sgugger)
  • Token classification: Add new token classification example #8340 (@sgugger)
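As a rough sketch of the pattern these scripts share (load a dataset with 🤗 Datasets, tokenize it, and hand it to Trainer), here is a minimal text classification example; the checkpoint, dataset, and hyperparameters are illustrative:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Load and tokenize the dataset with 🤗 Datasets
dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, padding="max_length"),
    batched=True,
)

# Train with the Trainer API
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output", num_train_epochs=1),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()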

Seq2Seq Trainer

A child of Trainer specialized for training seq2seq models, from @patil-suraj, @stas00 and @sshleifer. Accessible through examples/seq2seq/finetune_trainer.py. Its API is similar to that of examples/seq2seq/finetune.py, but it is better supported. Example scripts are in examples/seq2seq/builtin_trainer.

Seq2Seq Testing and Documentation Improvements

Docs for DistillBART Paper Replication

Re-run experiments from the paper here

Refactoring the generate() function

The generate() method has been redesigned so that users can directly call the methods
sample(), greedy_search(), beam_search() and beam_sample(). The code was made more readable, and beam search was sped up by ca. 5-10%.
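As a rough sketch of how generate() maps its arguments onto these sub-methods (the prompt and parameter values are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids

# Roughly: greedy_search() runs when do_sample=False and num_beams=1,
# sample() when do_sample=True, beam_search()/beam_sample() when num_beams > 1
greedy = model.generate(input_ids, max_length=20)
sampled = model.generate(input_ids, max_length=20, do_sample=True, top_k=50)
beamed = model.generate(input_ids, max_length=20, num_beams=5)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))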

Refactoring the generate() function #6949 (@patrickvonplaten)

Notebooks

General improvements and bugfixes
