New model architectures: CTRL, DistilGPT-2
Two new models have been added since release 2.0.
- CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation, by Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher. This model has been added to the library by @keskarnitish with the help of @thomwolf.
- DistilGPT-2 (from HuggingFace), the second distilled model after DistilBERT in version 1.2.0. Released alongside the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
Distillation
Several updates have been made to the distillation script, including the possibility to distill GPT-2 and to distill on the SQuAD task. By @VictorSanh.
PyTorch TPU support
The run_glue.py example script can now run on a PyTorch TPU.
Updates to example scripts
Several example scripts have been improved and refactored to use the full potential of the new tokenizer functions:
- run_multiple_choice.py has been refactored to include encode_plus by @julien-c and @erenup
- run_lm_finetuning.py has been improved with the help of @dennymarcels, @jinoobaek-qz and @LysandreJik
- run_glue.py has been improved with the help of @brian41005
QOL enhancements on the tokenizer
Enhancements have been made to the tokenizers. Two new methods have been added: get_special_tokens_mask and truncate_sequences.
The former returns a mask indicating which tokens in a token list are special tokens and which come from the initial sequences. The latter truncates sequences according to a chosen strategy.
Both of these methods are called by the encode_plus method, which itself is called by the encode method. encode_plus now returns a larger dictionary which holds information about the special tokens, as well as the overflowing tokens.
Thanks to @julien-c, @thomwolf, and @LysandreJik for these additions.
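As a rough illustration of how these pieces fit together, the sketch below mimics the shape of an encode_plus-style output for a single sequence with a BERT-style [CLS]/[SEP] layout. All helper names and token ids here are hypothetical, for illustration only; they are not the library's actual implementation.

```python
# Conceptual sketch only (hypothetical helpers, not the library code).
# A BERT-style single-sequence layout [CLS] tokens [SEP] is assumed.

CLS, SEP = 101, 102  # BERT's conventional special token ids, for illustration


def build_inputs(ids):
    """Add the special tokens around a single sequence."""
    return [CLS] + ids + [SEP]


def get_special_tokens_mask(ids_with_special):
    """1 marks a special token, 0 a token from the initial sequence."""
    return [1 if t in (CLS, SEP) else 0 for t in ids_with_special]


def encode_plus_like(ids, max_length=None):
    """Return a dict holding the ids, the special-tokens mask,
    and any tokens that overflowed max_length."""
    overflowing = []
    if max_length is not None and len(ids) + 2 > max_length:
        # Reserve two positions for [CLS] and [SEP].
        ids, overflowing = ids[: max_length - 2], ids[max_length - 2 :]
    input_ids = build_inputs(ids)
    return {
        "input_ids": input_ids,
        "special_tokens_mask": get_special_tokens_mask(input_ids),
        "overflowing_tokens": overflowing,
    }
```

For example, encoding four tokens with max_length=5 keeps three of them, reports the fourth as overflowing, and flags the added [CLS]/[SEP] positions in the mask.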
New German BERT models
- Support for new German BERT models (cased and uncased) from @stefan-it and @dbmdz
Breaking changes
- The two methods add_special_tokens_single_sequence and add_special_tokens_sequence_pair have been removed. They have been replaced by the single method build_inputs_with_special_tokens, which has a more comprehensible name and manages both single sequences and sequence pairs.
- The boolean parameter truncate_first_sequence has been removed from the tokenizers' encode and encode_plus methods. It is replaced by a truncation strategy passed as a string: 'longest_first', 'only_second', 'only_first' or 'do_not_truncate' are accepted strategies.
- When the encode or encode_plus methods are called with a specified max_length, the sequences will now always be truncated, or an error will be raised if they overflow.
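To make the string-based truncation strategies concrete, here is a toy, hypothetical implementation of the four accepted values. It is only a sketch of the behavior described above, not the library's code.

```python
# Toy sketch of the four string-based truncation strategies (hypothetical,
# not the library implementation). num_to_remove is how many tokens the
# combined sequences exceed max_length by.


def truncate_sequences(ids, pair_ids, num_to_remove, strategy="longest_first"):
    ids, pair_ids = list(ids), list(pair_ids or [])
    if strategy == "do_not_truncate":
        if num_to_remove > 0:
            # Mirrors the new behavior: overflowing input is an error
            # when truncation is disabled.
            raise ValueError("Input too long and truncation is disabled")
        return ids, pair_ids
    for _ in range(num_to_remove):
        if strategy == "only_first" or (
            strategy == "longest_first" and len(ids) >= len(pair_ids)
        ):
            ids.pop()  # drop from the end of the first sequence
        else:
            # 'only_second', or 'longest_first' when the pair is longer
            pair_ids.pop()
    return ids, pair_ids
```

With 'longest_first', tokens are removed one at a time from whichever sequence is currently longer, so a long first sequence shrinks before the pair is touched.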
Guidelines and requirements
New contributing guidelines have been added, alongside library development requirements by @rlouf, the newest member of the HuggingFace team.
Community additions/bug-fixes/improvements
- GLUE processors have been refactored to handle inputs for all tasks coming from tensorflow_datasets. This work has been done by @agrinh and @philipp-eisen.
- The padding_idx is now correctly initialized to 1 in randomly initialized RoBERTa models. @ikuyamada
- The documentation CSS has been adapted to work on older browsers. @TimYagan
- An addition concerning the management of hidden states has been added to the README by @BramVanroy.
- Integration of TF 2.0 models with other Keras modules @thomwolf
- Past values can be opted out @thomwolf