huggingface/tokenizers: Python v0.9.0

4 years ago

Fixed

  • [#362]: Fix training deadlock with Python components.
  • [#363]: Fix a crash when calling .train with some non-existent files
  • [#355]: Remove many potential crashes
  • [#389]: Improve truncation (crash and consistency)

Added

  • [#379]: Add the ability to call encode/encode_batch with numpy arrays
  • [#292]: Support for the Unigram algorithm
  • [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
  • [#403]: Add TemplateProcessing PostProcessor.
  • [#420]: Ability to fuse the "unk" token in BPE.
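As a rough illustration of two of the additions above (the TemplateProcessing PostProcessor and "unk" fusing in BPE), here is a minimal sketch. It is written against recent releases of the tokenizers Python package, so constructor signatures in v0.9.0 itself may differ. The toy vocabularies, merges, and the helper name bpe_tokens are hypothetical and chosen only for the example.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# --- TemplateProcessing (#403) ---
# Hypothetical toy vocabulary; a real tokenizer would be trained or loaded.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# $A and $B stand for the first and second sequence; ":1" sets the type id.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

single = tokenizer.encode("hello world")
pair = tokenizer.encode("hello", "world")
print(single.tokens)
print(pair.tokens, pair.type_ids)

# --- Fusing the "unk" token in BPE (#420) ---
# Hypothetical toy BPE model: only "a", "b" and their merge "ab" are known.
bpe_vocab = {"[UNK]": 0, "a": 1, "b": 2, "ab": 3}
bpe_merges = [("a", "b")]


def bpe_tokens(text, fuse_unk):
    """Encode text with the toy BPE model, with or without unk fusing."""
    model = BPE(
        vocab=bpe_vocab,
        merges=bpe_merges,
        unk_token="[UNK]",
        fuse_unk=fuse_unk,
    )
    return Tokenizer(model).encode(text).tokens


unfused = bpe_tokens("abxyz", fuse_unk=False)
fused = bpe_tokens("abxyz", fuse_unk=True)
print(unfused)
print(fused)
```

With fusing enabled, the run of unknown characters "xyz" should collapse into a single unknown token instead of one per character.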

Changed

  • [#360]: Lots of improvements related to words/alignment tracking
  • [#426]: Improvements on error messages thanks to PyO3 0.12
