FlauBERT, MMBT, UmBERTo
- MMBT was added to the list of available models, as the first multi-modal model to make it into the library. It can accept a transformer model as well as a computer vision model in order to classify images and text. The MMBT model is from Supervised Multimodal Bitransformers for Classifying Images and Text by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine (https://github.com/facebookresearch/mmbt/). Added by @suvrat96.
- A new Dutch BERT model was added under the wietsedv/bert-base-dutch-cased identifier (a loading sketch follows this list). Added by @wietsedv. Model page
- UmBERTo, a RoBERTa-based language model trained on large Italian corpora, was added. Model page
- A new French model was added, FlauBERT, based on XLM. The FlauBERT model is from FlauBERT: Unsupervised Language Model Pre-training for French (https://github.com/getalp/Flaubert). Four checkpoints are added: small, base uncased, base cased and large. Model page
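Each of these checkpoints can be loaded through the Auto classes. A minimal sketch, assuming a PyTorch install and using the Dutch identifier quoted above (the FlauBERT and UmBERTo identifiers are listed on their respective model pages):

```python
from transformers import AutoModel, AutoTokenizer

# Identifier quoted in the notes above; the other new checkpoints load the same way.
model_name = "wietsedv/bert-base-dutch-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a short Dutch sentence and run a forward pass.
input_ids = tokenizer.encode("Hallo, wereld!", return_tensors="pt")
outputs = model(input_ids)  # tuple whose first element is the last hidden states
```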
New TF architectures (@jplu)
Python best practices (@aaugustin)
- Greatly improved the quality of the source code by leveraging black, isort and flake8. A test was added, check_code_quality, which checks that contributions respect the contribution guidelines related to those tools.
- Similarly, optional imports are better handled and raise more precise errors.
- Cleaned up several requirements files, updated the contribution guidelines and now rely on setup.py for the necessary dev dependencies.
- You can clean up your code for a PR with (more details in CONTRIBUTING.md):
make style
make quality
Documentation (@LysandreJik)
The documentation was made uniform and better guidelines have been defined. This work is part of an ongoing effort to make transformers accessible to a larger audience. A glossary has been added, with definitions for the most frequently used inputs.
Furthermore, tips are given for each model on its documentation page.
The code samples are now tested on a weekly basis alongside other slow tests.
Improved repository structure (@aaugustin)
The source code was moved from ./transformers to ./src/transformers. Since this changes the location of the source code, contributors must update their local development environment by uninstalling and re-installing the library.
Python 2 is not supported anymore (@aaugustin)
Version 2.3.0 was the last version to support Python 2. As we begin the year 2020, official Python 2 support has been dropped.
Parallel testing (@aaugustin)
Tests can now be run in parallel.
Sampling sequence generator (@rlouf, @thomwolf)
An abstract method was added to PreTrainedModel, which is implemented in all models trained with CLM. This abstract method is generate, which offers an API for text generation (a short sketch follows the list below):
- with/without a prompt
- with/without beam search
- with/without greedy decoding/sampling
- with any (and any combination) of top-k sampling, top-p (nucleus) sampling and repetition penalties
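A minimal sketch of that API with GPT-2, assuming a PyTorch install; the argument names below (max_length, do_sample, top_k, top_p, repetition_penalty) follow the generate signature of this release, but check the documentation of your installed version:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Optional prompt; generate() can also be called without one.
input_ids = tokenizer.encode("The Transformers library", return_tensors="pt")

output = model.generate(
    input_ids=input_ids,
    max_length=40,
    do_sample=True,          # sampling instead of greedy decoding
    top_k=50,                # top-k filtering
    top_p=0.95,              # nucleus (top-p) filtering
    repetition_penalty=1.2,  # penalize repeated tokens
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```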
Resuming training when interrupted (@bkkaggle)
Previously, when a training run was stopped, the only saved values were the model weights and configuration. Now the different scripts also save several other values: the global step, the current epoch, and the number of steps trained in the current epoch. When resuming a training run, all those values are leveraged to correctly resume training (a simplified sketch of this bookkeeping follows the list of scripts below).
This applies to the following scripts: run_glue, run_squad, run_ner, run_xnli.
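The bookkeeping amounts to something like the following sketch (simplified and illustrative: the file names and dictionary keys are assumptions, not the exact ones used by the scripts):

```python
import os
import torch

def save_training_state(output_dir, model, optimizer, scheduler, global_step, epoch, steps_in_epoch):
    # Save the usual weights plus the counters needed to resume exactly where training stopped.
    model.save_pretrained(output_dir)
    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
    torch.save(
        {"global_step": global_step, "epoch": epoch, "steps_in_epoch": steps_in_epoch},
        os.path.join(output_dir, "training_state.pt"),
    )

def load_training_state(output_dir, optimizer, scheduler):
    # Restore optimizer/scheduler state and the counters; the training loop then
    # skips the batches already seen in the interrupted epoch.
    optimizer.load_state_dict(torch.load(os.path.join(output_dir, "optimizer.pt")))
    scheduler.load_state_dict(torch.load(os.path.join(output_dir, "scheduler.pt")))
    state = torch.load(os.path.join(output_dir, "training_state.pt"))
    return state["global_step"], state["epoch"], state["steps_in_epoch"]
```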
CLI (@julien-c, @mfuntowicz)
Model upload
- The CLI now has better documentation.
- Files can now be removed.
Pipelines
- Expose the number of underlying FastAPI workers
- Async forward methods
- Fixed the USE_TF and USE_TORCH environment variables so that they don't fight each other anymore (see the sketch below)
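These variables are read when transformers is imported, so they have to be set beforehand. A minimal sketch of forcing the PyTorch backend (the accepted values and exact semantics are assumptions and may vary between versions):

```python
import os

# Must be set before transformers is imported.
os.environ["USE_TORCH"] = "1"  # prefer the PyTorch backend
os.environ["USE_TF"] = "0"     # skip the TensorFlow import (value "0" assumed to be accepted)

import transformers

print(transformers.is_torch_available(), transformers.is_tf_available())
```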
Training from scratch (@julien-c)
The run_lm_finetuning.py script now handles training from scratch.
Changes in the configuration (@julien-c)
The configuration files now contain the architecture they refer to. There is no longer any need to encode the architecture in the file name, as was necessary before. This should ease the naming of community models.
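For instance, a configuration loaded from a pretrained checkpoint now records the architecture(s) it was saved from; a minimal sketch (the exact value printed is an assumption for illustration):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
# Expected to print something like ["BertForMaskedLM"] for this checkpoint.
print(config.architectures)
```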
New Auto models (@thomwolf)
A new type of AutoModel was added: AutoModelForPreTraining. This auto class returns the model that was used during pre-training. For most models it is the base model alongside a language modeling head, whereas for others it is a different model, e.g. BertForPreTraining for BERT.
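A minimal sketch, assuming a PyTorch install:

```python
from transformers import AutoModelForPreTraining

model = AutoModelForPreTraining.from_pretrained("bert-base-uncased")
# For BERT this resolves to the pre-training architecture rather than the base model,
# i.e. BertForPreTraining (masked-LM head + next-sentence-prediction head).
print(type(model).__name__)  # expected: BertForPreTraining
```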
HANS dataset (@ns-moosavi)
The HANS dataset was added to the examples. It allows for testing a model with adversarial evaluation of natural language inference.
[BREAKING CHANGES]
Ignored indices in PyTorch loss computing (@LysandreJik)
When using PyTorch, certain values can be ignored when computing the loss. In order for the loss function to understand which indices must be ignored, those have to be set to a certain value. Most of our models required those indices to be set to -1. We decided to set this value to -100 instead, as it is PyTorch's default value. This removes the discrepancy between user-implemented losses and the losses integrated in the models.
Further help from @r0mainK.
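Concretely, positions that should not contribute to the loss must now be labeled with -100, which matches the default ignore_index of PyTorch's cross-entropy loss; a minimal sketch:

```python
import torch
from torch.nn import CrossEntropyLoss

logits = torch.randn(5, 10)                    # 5 token positions, 10-class vocabulary
labels = torch.tensor([1, 3, -100, 7, -100])   # -100 marks positions to ignore

loss_fct = CrossEntropyLoss()                  # ignore_index defaults to -100
loss = loss_fct(logits, labels)                # only the non-ignored positions contribute
```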
Community additions/bug-fixes/improvements
- Can now save and load PreTrainedEncoderDecoder objects (@TheEdoardo93)
- RoBERTa now bears more similarity to the FairSeq implementation (@DomHudson, @thomwolf)
- Examples now better reflect the defaults of the encoding methods (@enzoampil)
- TFXLNet now has a correct input mask (@thomwolf)
- run_squad was fixed to allow better training for XLNet (@importpandas)
- Tokenization performance improvement (3-8x) (@mandubian)
- RoBERTa was added to the run_squad script (@erenup)
- Fixed the special and added tokens tokenization (@vitaliyradchenko)
- Fixed an issue with language generation for XLM when the batch size is greater than 1 (@patrickvonplaten)
- Fixed an issue with the generate method which did not correctly handle the repetition penalty (@patrickvonplaten)
- Completed the documentation for repeating_words_penalty_for_language_generation (@patrickvonplaten)
- run_generation now leverages the cached past input for models that have access to it (@patrickvonplaten)
- Finally managed to patch a rarely occurring bug with DistilBERT, eventually named DistilHeisenBug or HeisenDistilBug (@LysandreJik, with the help of @julien-c and @thomwolf).
- Fixed an import error in run_tf_ner (@karajan1001).
- Feature conversion for GLUE now has improved logging messages (@simonepri)
- Patched an issue with GPUs and run_generation (@alberduris)
- Added support for ALBERT and XLMRoBERTa to run_glue
- Fixed an issue with the DistilBERT tokenizer not loading correct configurations (@LysandreJik)
- Updated the SQuAD for distillation script to leverage the new SQuAD API (@LysandreJik)
- Fixed an issue with T5 related to its rp_bucket (@mschrimpf)
- PPLM now supports repetition penalties (@IWillPull)
- Modified the QA pipeline to consider all features for each example (@Perseus14)
- Patched an issue with a file lock (@dimagalat @aaugustin)
- The bias should be resized with the weights when resizing a vocabulary projection layer with a new vocabulary size (@LysandreJik)
- Fixed misleading token type IDs for RoBERTa: it doesn't leverage token type IDs, and this has been clarified in the documentation (@LysandreJik). The same applies to XLM-R (@maksym-del).
- Fixed the prepare_for_model method when tensorizing and returning token type IDs (@LysandreJik).
- Fixed the XLNet model which wouldn't work with torch 1.4 (@julien-c)
- Fetch all possible files remotely (@julien-c)
- BERT's BasicTokenizer now respects the never_split parameter (@DeNeutoy)
- Added a lower bound to the tqdm dependency (@brendan-ai2)
- Fixed GLUE processors failing on TensorFlow datasets (@neonbjb)
- XLMRobertaTokenizer can now be serialized (@brandenchan)
- A classifier dropout was added to ALBERT (@peteriz)
- The ALBERT configuration for v2 models was fixed to be identical to those output by Google (@LysandreJik)