huggingface/transformers v2.5.0
Rust Tokenizers, DistilBERT base cased, Model cards

Rust tokenizers (@mfuntowicz, @n1t0)

  • Tokenizers for BERT, RoBERTa, OpenAI GPT, OpenAI GPT-2 and Transformer-XL now leverage the tokenizers library for fast tokenization 🚀
  • AutoTokenizer now defaults to the fast tokenizer implementation when one is available.
  • Calling batch_encode_plus on the fast version of a tokenizer makes better use of the CPU cores.
  • Tokenizers backed by the native implementation use all CPU cores by default when calling batch_encode_plus. You can change this behavior by setting the environment variable RAYON_NUM_THREADS=N (see the sketch after this list).
  • An exception is raised when tokenizing an input with pad_to_max_length=True but no padding token is defined.
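
Putting the points above together, here is a minimal sketch assuming transformers v2.5.0 with the tokenizers package installed; the checkpoint name, thread count and sample sentences are arbitrary choices for illustration:

```python
import os

# Cap the Rayon thread pool used by the native tokenizer; set it before any
# batch encoding happens (4 is an arbitrary example value).
os.environ["RAYON_NUM_THREADS"] = "4"

from transformers import AutoTokenizer

# AutoTokenizer returns the fast, Rust-backed implementation when one is
# available for the checkpoint (BERT is covered).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

batch = [
    "Rust tokenizers make batch encoding fast.",
    "batch_encode_plus spreads the work across CPU cores.",
]

# Pads the batch with the tokenizer's padding token ([PAD] for BERT); a
# tokenizer without a padding token would raise an exception here.
encoded = tokenizer.batch_encode_plus(batch, pad_to_max_length=True)
print(encoded["input_ids"])
```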

Known Issues:

  • The RoBERTa fast tokenizer implementation produces slightly different output than the original Python tokenizer (< 1%).
  • The SQuAD example is not yet compatible with the new fast tokenizers and therefore falls back to the plain Python implementation.

DistilBERT base cased (@VictorSanh)

The distilled version of the bert-base-cased BERT checkpoint has been released.
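
A minimal usage sketch, assuming the checkpoint is published under the name distilbert-base-cased:

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased")
model = DistilBertModel.from_pretrained("distilbert-base-cased")
model.eval()

# encode returns a (1, sequence_length) tensor of token ids.
input_ids = tokenizer.encode("DistilBERT keeps the casing of its inputs.", return_tensors="pt")

with torch.no_grad():
    last_hidden_state = model(input_ids)[0]  # (batch_size, seq_len, hidden_size)

print(last_hidden_state.shape)
```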

Model cards (@julien-c)

Model cards are now stored directly in the repository.

CLI script for environment information (@BramVanroy)

We now host a CLI script that gathers all the relevant environment information to include when reporting an issue. The issue templates have been updated accordingly.

Contributors visible on repository (@clmnt)

The main contributors as identified by Sourcerer are now visible directly on the repository.

From fine-tuning to pre-training (@julien-c)

The language model fine-tuning script has been renamed from run_lm_finetuning to run_language_modeling, as it can now also train language models from scratch.

Extracting archives now available from cached_path (@thomwolf)

Slight modification to cached_path so that zip and tar archives can be automatically extracted.

  • Archives are extracted in the same directory as the (possibly downloaded) archive, inside an extraction directory named after the archive.
  • Automatic extraction is activated by setting extract_compressed_file=True when calling cached_path.
  • The extraction directory is reused to avoid extracting the archive again, unless force_extract=True is set, in which case the cached extraction directory is removed and the archive is extracted again (see the sketch below).
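
A short sketch of the new options; the archive URL is a placeholder, and cached_path is imported from transformers.file_utils:

```python
from transformers.file_utils import cached_path

# Hypothetical archive URL, used purely for illustration.
archive_url = "https://example.com/datasets/my-dataset.tar.gz"

# Download (or reuse the cached copy of) the archive and extract it next to
# the cached file; returns the path to the extraction directory.
extracted_dir = cached_path(archive_url, extract_compressed_file=True)

# Remove the previously extracted directory and extract the archive again.
extracted_dir = cached_path(
    archive_url, extract_compressed_file=True, force_extract=True
)
```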

New activations file (@sshleifer)

Several activation functions (relu, swish, gelu, tanh and gelu_new) can now be accessed from the activations.py file and used in the different PyTorch models.
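
A short sketch, assuming the module is importable as transformers.activations and exposes a get_activation lookup helper keyed by the names above:

```python
import torch
from transformers.activations import get_activation  # helper assumed to be exposed here

# Look up one of the newly centralised activation functions by name.
gelu_new = get_activation("gelu_new")

x = torch.linspace(-2.0, 2.0, steps=5)
print(gelu_new(x))
```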

Community additions/bug-fixes/improvements

  • Remove redundant hidden states that broke encoder-decoder architectures (@LysandreJik)
  • Cleaner and more readable code in test_attention_weights (@sshleifer)
  • XLM can be trained on SQuAD in different languages (@yuvalpinter)
  • Improve test coverage on several models that were ill-tested (@LysandreJik)
  • Fix issue where TFGPT2 could not be saved (@neonbjb)
  • Multi-GPU evaluation on run_glue now behaves correctly (@peteriz)
  • Fix issue with the TransfoXL tokenizer that could not be saved (@dchurchwell)
  • More robust conversion from ALBERT/BERT original checkpoints to huggingface/transformers models (@monologg)
  • FlauBERT bug fix: only add language embeddings when the model handles more than one language (@LysandreJik)
  • Fix CircleCI error with TensorFlow 2.1.0 (@mfuntowicz)
  • More specific testing advice in the contributing guidelines (@sshleifer)
  • BERT decoder: fix failure with the default attention mask (@asivokon)
  • Fix a few issues regarding the data preprocessing in run_language_modeling (@LysandreJik)
  • Fix an issue with leading spaces and the RobertaTokenizer (@joeddav)
  • Added pipeline: TokenClassificationPipeline, an alias over NerPipeline (@julien-c); a short sketch follows this list
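
A short sketch of the pipeline addition mentioned in the last bullet; the default NER model is downloaded on first use:

```python
from transformers import pipeline

# The "ner" task builds the token-classification / NER pipeline; the class is
# now also importable as TokenClassificationPipeline (an alias over NerPipeline).
ner = pipeline("ner")
print(ner("Hugging Face is based in New York City."))
```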
