Rust tokenizers (@mfuntowicz, @n1t0)
- Tokenizers for Bert, Roberta, OpenAI GPT, OpenAI GPT2 and TransformerXL now leverage the tokenizers library for fast tokenization 🚀
- AutoTokenizer now defaults to the fast tokenizer implementation when one is available.
- Calling batch_encode_plus on the fast version of a tokenizer makes better use of the CPU cores (see the sketch below this list).
- Tokenizers leveraging the native implementation use all CPU cores by default when calling batch_encode_plus. You can change this behavior by setting the environment variable RAYON_NUM_THREADS=N.
- An exception is raised when tokenizing an input with pad_to_max_length=True but no padding token is defined.
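A minimal sketch of the fast-tokenizer path described above (the checkpoint name, input sentences and max_length value are only illustrative):

```python
from transformers import AutoTokenizer

# AutoTokenizer now returns the fast (Rust-backed) tokenizer when one exists.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# batch_encode_plus on a fast tokenizer parallelizes across CPU cores;
# cap the number of threads with the RAYON_NUM_THREADS environment variable.
batch = tokenizer.batch_encode_plus(
    ["Hello world!", "Fast tokenizers are backed by Rust."],
    max_length=16,
    pad_to_max_length=True,  # requires a padding token, otherwise an exception is raised
)
print(batch["input_ids"])
```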
Known Issues:
- The RoBERTa fast tokenizer implementation produces slightly different output than the original Python tokenizer (< 1%).
- The SQuAD example is not currently compatible with the new fast tokenizers; it therefore defaults to the plain Python tokenizer.
DistilBERT base cased (@VictorSanh)
The distilled version of the bert-base-cased BERT checkpoint has been released.
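Loading it works like any other checkpoint; a minimal sketch, assuming the identifier distilbert-base-cased:

```python
from transformers import DistilBertModel, DistilBertTokenizer

# The distilled, cased checkpoint (assumed identifier: distilbert-base-cased).
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased")
model = DistilBertModel.from_pretrained("distilbert-base-cased")
```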
Model cards (@julien-c)
Model cards are now stored directly in the repository
CLI script for environment information (@BramVanroy)
We now host a CLI script that gathers the relevant environment information to include when reporting an issue. The issue templates have been updated accordingly.
Contributors visible on repository (@clmnt)
The main contributors as identified by Sourcerer are now visible directly on the repository.
From fine-tuning to pre-training (@julien-c)
The language modeling fine-tuning script has been renamed from run_lm_finetuning to run_language_modeling, as it can now also train language models from scratch.
Extracting archives now available from cached_path (@thomwolf)
Slight modification to cached_path so that zip and tar archives can be automatically extracted.
- Archives are extracted in the same directory as the (possibly downloaded) archive, in a newly created extraction directory named after the archive.
- Automatic extraction is activated by setting extract_compressed_file=True when calling cached_path.
- The extraction directory is re-used to avoid extracting the archive again, unless force_extract=True is set, in which case the cached extraction directory is removed and the archive is extracted again.
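A minimal sketch of the new flags, assuming cached_path is imported from transformers.file_utils (the archive URL below is only a placeholder):

```python
from transformers.file_utils import cached_path

# Download (or resolve) the archive and extract it into a directory
# created next to the cached file.
extracted_dir = cached_path(
    "https://example.com/my_dataset.tar.gz",  # placeholder URL
    extract_compressed_file=True,
)

# Force re-extraction: the cached extraction directory is removed
# and the archive is extracted again.
extracted_dir = cached_path(
    "https://example.com/my_dataset.tar.gz",  # placeholder URL
    extract_compressed_file=True,
    force_extract=True,
)
```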
New activations file (@sshleifer)
Several activation functions (relu, swish, gelu, tanh and gelu_new) can now be accessed from the activations.py file and used in the different PyTorch models.
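A minimal sketch of looking one up by name, assuming the file exposes a get_activation helper:

```python
import torch
from transformers.activations import get_activation

# Retrieve one of the shared activation functions by name
# (relu, swish, gelu, tanh or gelu_new).
gelu_new = get_activation("gelu_new")

x = torch.randn(2, 3)
print(gelu_new(x))
```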
Community additions/bug-fixes/improvements
- Remove redundant hidden states that broke encoder-decoder architectures (@LysandreJik)
- Cleaner and more readable code in test_attention_weights (@sshleifer)
- XLM can be trained on SQuAD in different languages (@yuvalpinter)
- Improve test coverage on several models that were ill-tested (@LysandreJik)
- Fix issue where TFGPT2 could not be saved (@neonbjb)
- Multi-GPU evaluation on run_glue now behaves correctly (@peteriz)
- Fix an issue where the TransfoXL tokenizer could not be saved (@dchurchwell)
- More robust conversion from ALBERT/BERT original checkpoints to huggingface/transformers models (@monologg)
- FlauBERT bug fix: only add language embeddings when the model handles more than one language (@LysandreJik)
- Fix CircleCI error with TensorFlow 2.1.0 (@mfuntowicz)
- More specific testing advice in the contributing guide (@sshleifer)
- BERT decoder: fix failure with the default attention mask (@asivokon)
- Fix a few issues regarding data preprocessing in run_language_modeling (@LysandreJik)
- Fix an issue with leading spaces and the RobertaTokenizer (@joeddav)
- Added pipeline: TokenClassificationPipeline, which is an alias of NerPipeline (@julien-c)
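A minimal sketch of that pipeline in use, assuming the existing "ner" task and its default model (the input sentence is only illustrative):

```python
from transformers import pipeline

# The "ner" task builds the token-classification pipeline
# (TokenClassificationPipeline, aliasing NerPipeline).
ner = pipeline("ner")
print(ner("Hugging Face is based in New York City."))
```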