New model architectures: CTRL, DistilGPT-2
Two new models have been added since release 2.0.
- CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation, by Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher. This model has been added to the library by @keskarnitish with the help of @thomwolf.
- DistilGPT-2 (from HuggingFace), the second distilled model after DistilBERT in version 1.2.0. Released alongside the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
Distillation
Several updates have been made to the distillation script, including the possibility to distill GPT-2 and to distill on the SQuAD task. By @VictorSanh.
PyTorch TPU support
The run_glue.py example script can now run on a PyTorch TPU.
Updates to example scripts
Several example scripts have been improved and refactored to use the full potential of the new tokenizer functions:
- run_multiple_choice.py has been refactored to include encode_plus by @julien-c and @erenup
- run_lm_finetuning.py has been improved with the help of @dennymarcels, @jinoobaek-qz and @LysandreJik
- run_glue.py has been improved with the help of @brian41005
QOL enhancements on the tokenizer
Enhancements have been made to the tokenizers. Two new methods have been added: get_special_tokens_mask and truncate_sequences.
The former returns a mask indicating which tokens in a token list are special tokens and which come from the initial sequences. The latter truncates sequences according to a chosen strategy.
Both of these methods are called by the encode_plus method, which itself is called by the encode method. encode_plus now returns a larger dictionary which holds information about the special tokens, as well as the overflowing tokens.
Thanks to @julien-c, @thomwolf, and @LysandreJik for these additions.
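As a rough illustration of how these pieces fit together, the sketch below mimics the shape of an encode_plus-style output for a single sequence with a BERT-style [CLS]/[SEP] layout. All helper names and token ids here are hypothetical, for illustration only; they are not the library's actual implementation.

```python
# Conceptual sketch only (hypothetical helpers, not the library code).
# A BERT-style single-sequence layout [CLS] tokens [SEP] is assumed.

CLS, SEP = 101, 102  # BERT's conventional special token ids, for illustration


def build_inputs(ids):
    """Add the special tokens around a single sequence."""
    return [CLS] + ids + [SEP]


def get_special_tokens_mask(ids_with_special):
    """1 marks a special token, 0 a token from the initial sequence."""
    return [1 if t in (CLS, SEP) else 0 for t in ids_with_special]


def encode_plus_like(ids, max_length=None):
    """Return a dict holding the ids, the special-tokens mask,
    and any tokens that overflowed max_length."""
    overflowing = []
    if max_length is not None and len(ids) + 2 > max_length:
        # Reserve two positions for [CLS] and [SEP].
        ids, overflowing = ids[: max_length - 2], ids[max_length - 2 :]
    input_ids = build_inputs(ids)
    return {
        "input_ids": input_ids,
        "special_tokens_mask": get_special_tokens_mask(input_ids),
        "overflowing_tokens": overflowing,
    }
```

For example, encoding four tokens with max_length=5 keeps three of them, reports the fourth as overflowing, and flags the added [CLS]/[SEP] positions in the mask.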
New German BERT models
- Support for new German BERT models (cased and uncased) from @stefan-it and @dbmdz
Breaking changes
- The two methods add_special_tokens_single_sequence and add_special_tokens_sequence_pair have been removed. They have been replaced by the single method build_inputs_with_special_tokens, which has a more comprehensible name and manages both single sequences and sequence pairs.
- The boolean parameter truncate_first_sequence has been removed from the tokenizers' encode and encode_plus methods. It is replaced by a truncation strategy passed as a string: 'longest_first', 'only_second', 'only_first' or 'do_not_truncate' are accepted strategies.
- When the encode or encode_plus methods are called with a specified max_length, the sequences will now always be truncated, or an error will be raised if they overflow.
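To make the string-based truncation strategies concrete, here is a toy, hypothetical implementation of the four accepted values. It is only a sketch of the behavior described above, not the library's code.

```python
# Toy sketch of the four string-based truncation strategies (hypothetical,
# not the library implementation). num_to_remove is how many tokens the
# combined sequences exceed max_length by.


def truncate_sequences(ids, pair_ids, num_to_remove, strategy="longest_first"):
    ids, pair_ids = list(ids), list(pair_ids or [])
    if strategy == "do_not_truncate":
        if num_to_remove > 0:
            # Mirrors the new behavior: overflowing input is an error
            # when truncation is disabled.
            raise ValueError("Input too long and truncation is disabled")
        return ids, pair_ids
    for _ in range(num_to_remove):
        if strategy == "only_first" or (
            strategy == "longest_first" and len(ids) >= len(pair_ids)
        ):
            ids.pop()  # drop from the end of the first sequence
        else:
            # 'only_second', or 'longest_first' when the pair is longer
            pair_ids.pop()
    return ids, pair_ids
```

With 'longest_first', tokens are removed one at a time from whichever sequence is currently longer, so a long first sequence shrinks before the pair is touched.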
Guidelines and requirements
New contributing guidelines have been added, alongside library development requirements by @rlouf, the newest member of the HuggingFace team.
Community additions/bug-fixes/improvements
- GLUE processors have been refactored to handle inputs for all tasks coming from tensorflow_datasets. This work has been done by @agrinh and @philipp-eisen.
- The padding_idx is now correctly initialized to 1 in randomly initialized RoBERTa models. @ikuyamada
- The documentation CSS has been adapted to work on older browsers. @TimYagan
- An addition concerning the management of hidden states has been added to the README by @BramVanroy.
- Integration of TF 2.0 models with other Keras modules @thomwolf
- Past values can be opted out @thomwolf