transformers 2.5.0
Rust Tokenizers, DistilBERT base cased, Model cards


Rust tokenizers (@mfuntowicz, @n1t0)

  • Tokenizers for BERT, RoBERTa, OpenAI GPT, OpenAI GPT-2 and Transformer-XL now leverage the tokenizers library for fast tokenization 🚀
  • AutoTokenizer now defaults to the fast tokenizer implementation when one is available
  • Calling batch_encode_plus on the fast tokenizers makes better use of CPU cores
  • Tokenizers backed by the native implementation use all CPU cores by default when calling batch_encode_plus; you can change this behavior by setting the environment variable RAYON_NUM_THREADS=N (see the sketch after this list)
  • An exception is now raised when tokenizing an input with pad_to_max_length=True but no padding token defined
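
A minimal usage sketch, not taken from the release notes: it loads a fast tokenizer and batch-encodes a couple of sentences. The bert-base-uncased checkpoint and the explicit use_fast flag are illustrative choices; RAYON_NUM_THREADS is the environment variable mentioned above.

```python
import os

# Optional: cap the Rust (rayon) thread pool before the tokenizer is loaded.
os.environ["RAYON_NUM_THREADS"] = "4"

from transformers import AutoTokenizer

# use_fast=True is spelled out for clarity; per the notes, AutoTokenizer
# already picks the fast implementation when one is available.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

batch = tokenizer.batch_encode_plus(
    ["First sentence.", "A second, slightly longer sentence."],
    pad_to_max_length=True,  # raises if the tokenizer has no padding token
)
print(batch["input_ids"])
```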

Known Issues:

  • The RoBERTa fast tokenizer implementation produces slightly different output compared to the original Python tokenizer (< 1%).
  • The SQuAD example is not yet compatible with the new fast tokenizers, so it defaults to the plain Python implementation.

DistilBERT base cased (@VictorSanh)

A distilled version of the bert-base-cased checkpoint has been released.
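
A minimal loading sketch, not taken from the release notes: the checkpoint identifier "distilbert-base-cased" is assumed here, since the notes only say a cased DistilBERT distilled from bert-base-cased was released.

```python
from transformers import DistilBertModel, DistilBertTokenizer

# "distilbert-base-cased" is the assumed checkpoint identifier.
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased")
model = DistilBertModel.from_pretrained("distilbert-base-cased")

inputs = tokenizer.encode_plus("Hello, cased world!", return_tensors="pt")
outputs = model(**inputs)        # in 2.x the model returns a plain tuple
last_hidden_state = outputs[0]
print(last_hidden_state.shape)
```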

Model cards (@julien-c)

Model cards are now stored directly in the repository.

CLI script for environment information (@BramVanroy)

We now host a CLI script that gathers all the environment information when reporting an issue. The issue templates have been updated accordingly.
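
A minimal sketch of running the script from Python, not taken from the release notes: the command name "transformers-cli env" is an assumption, as the notes only mention a CLI script for gathering environment information.

```python
import subprocess

# "transformers-cli env" is assumed to be the command exposed by the package;
# it gathers the environment details to paste into a bug report.
subprocess.run(["transformers-cli", "env"], check=True)
```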

Contributors visible on repository (@clmnt)

The main contributors as identified by Sourcerer are now visible directly on the repository.

From fine-tuning to pre-training (@julien-c)

The language model fine-tuning script has been renamed from run_lm_finetuning to run_language_modeling, as it can now also train language models from scratch.

Extracting archives now available from cached_path (@thomwolf)

Slight modification to cached_path so that zip and tar archives can be automatically extracted.

  • Archives are extracted in the same directory as the (possibly downloaded) archive, inside a dedicated extraction directory named after the archive.
  • Automatic extraction is activated by setting extract_compressed_file=True when calling cached_path (see the sketch after this list).
  • The extraction directory is reused to avoid extracting the archive again, unless force_extract=True is set, in which case the cached extraction directory is removed and the archive is extracted again.
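
A minimal sketch, not taken from the release notes: the keyword arguments are the ones described above, while the import path transformers.file_utils and the archive URL are assumptions for illustration.

```python
from transformers.file_utils import cached_path  # assumed import path

# Downloads the archive if needed, extracts it next to the cached file,
# and returns the path to the extraction directory.
extracted_dir = cached_path(
    "https://example.com/my-dataset.tar.gz",  # hypothetical archive URL
    extract_compressed_file=True,             # turn on automatic extraction
    force_extract=False,                      # reuse a previous extraction if present
)
print(extracted_dir)
```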

New activations file (@sshleifer)

Several activation functions (relu, swish, gelu, tanh and gelu_new) can now be accessed from the activations.py file and used across the different PyTorch models.
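
A minimal sketch, not taken from the release notes: the helper name get_activation is an assumption about what activations.py exposes; the activation names are the ones listed above.

```python
import torch
from transformers.activations import get_activation  # helper name is an assumption

gelu_new = get_activation("gelu_new")  # look up one of the listed activations by name
x = torch.randn(2, 3)
print(gelu_new(x))                     # apply it like any other torch function
```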

Community additions/bug-fixes/improvements

  • Remove redundant hidden states that broke encoder-decoder architectures (@LysandreJik)
  • Cleaner and more readable code in test_attention_weights (@sshleifer)
  • XLM can be trained on SQuAD in different languages (@yuvalpinter)
  • Improve test coverage on several models that were ill-tested (@LysandreJik)
  • Fix issue where TFGPT2 could not be saved (@neonbjb)
  • Multi-GPU evaluation on run_glue now behaves correctly (@peteriz)
  • Fix issue with the TransfoXL tokenizer that couldn't be saved (@dchurchwell)
  • More robust conversion from original ALBERT/BERT checkpoints to huggingface/transformers models (@monologg)
  • FlauBERT bug fix: only add langs embeddings when the model handles more than one language (@LysandreJik)
  • Fix CircleCI error with TensorFlow 2.1.0 (@mfuntowicz)
  • More specific testing advice in the contributing guidelines (@sshleifer)
  • BERT decoder: fix failure with the default attention mask (@asivokon)
  • Fix a few issues regarding data preprocessing in run_language_modeling (@LysandreJik)
  • Fix an issue with leading spaces and the RobertaTokenizer (@joeddav)
  • Added pipeline: TokenClassificationPipeline, an alias of NerPipeline (@julien-c)
