github huggingface/tokenizers python-v0.10.0
Python v0.10.0

4 years ago

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Add the ability to train from memory. This also improves the integration with datasets.
  • [#590]: Add getters/setters for components on BaseTokenizer
  • [#574]: Add fuse_unk option to SentencePieceBPETokenizer
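
As a rough illustration of how several of these additions fit together, the sketch below trains a WordLevel model from an in-memory list of strings, using a Split pre-tokenizer and a WordLevelTrainer. The corpus and the "[UNK]" token are made up for the example, and the calls follow the current Python bindings as far as I know them; exact signatures in v0.10.0 may differ (for instance, WordLevel may require an explicit vocab there).

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import WordLevelTrainer

# Hypothetical in-memory corpus; any iterator of strings should work.
corpus = ["hello world", "hello tokenizers"]

# Start from an empty WordLevel model with an unknown-token fallback.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))

# Split on single spaces and drop the delimiter itself.
tokenizer.pre_tokenizer = Split(" ", behavior="removed")

# The trainer builds the word-level vocabulary; "[UNK]" is reserved up front.
trainer = WordLevelTrainer(special_tokens=["[UNK]"], show_progress=False)

# Train directly from memory instead of from files on disk.
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("hello unseen world")
print(encoding.tokens)
```

Words that never appeared in the corpus fall back to the unknown token, so this is also a quick way to check that the trained vocabulary is the one you expect.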

Changed

  • [#509]: The .pyi stub files are now generated automatically
  • [#519]: Each Model can return its associated Trainer with get_trainer()
  • [#530]: The various attributes on each component can now be read and set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.
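
A minimal sketch of the attribute access and get_trainer() changes, assuming a BPE model. The "[UNK]" token is only an illustrative choice, and the calls reflect the current Python bindings rather than anything guaranteed by this release.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Attributes on a component can be set after construction...
tokenizer.model.dropout = 0.1

# ...and read back (stored as a 32-bit float, so compare approximately).
dropout = tokenizer.model.dropout
print(dropout)

# Each Model can hand back the Trainer type associated with it.
trainer = tokenizer.model.get_trainer()
print(type(trainer).__name__)
```

Because the value round-trips through a 32-bit float, an exact equality check against 0.1 is best avoided.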

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs that
    forced users to reload the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring
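
To illustrate what in-place training means in practice, the sketch below trains a BPE tokenizer and then uses the same object straight away, without saving and reloading it. The corpus, vocabulary size, and "[UNK]" token are assumptions made for the example.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Hypothetical toy corpus, repeated so that merges have some frequency to work with.
corpus = ["low lower lowest"] * 10

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
size_before = tokenizer.get_vocab_size()

trainer = BpeTrainer(vocab_size=50, special_tokens=["[UNK]"], show_progress=False)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# The model held by this tokenizer was updated in place: no reload is needed.
size_after = tokenizer.get_vocab_size()
tokens = tokenizer.encode("lower").tokens
print(size_before, size_after, tokens)
```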
