Python v0.3.0


Changes:

  • BPETokenizer has been renamed to CharBPETokenizer for clarity.
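Existing code only needs the new import name. A minimal sketch (the corpus path is a placeholder):
from tokenizers import CharBPETokenizer

# Same tokenizer as before, under its new, clearer name.
tokenizer = CharBPETokenizer()
tokenizer.train(["path/to/corpus.txt"])  # placeholder path, any plain-text file works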
  • Added CharDelimiterSplit: a new PreTokenizer that splits sequences on a given delimiter (works like .split(delimiter)), as sketched below.
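A minimal sketch of the new pre-tokenizer; pre_tokenize_str is the inspection method in current releases and is assumed here:
from tokenizers import pre_tokenizers

# Split on "-" the way "foo-bar-baz".split("-") would.
pre_tok = pre_tokenizers.CharDelimiterSplit("-")
print(pre_tok.pre_tokenize_str("foo-bar-baz"))
# [('foo', (0, 3)), ('bar', (4, 7)), ('baz', (8, 11))]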
  • Added WordLevel: a new model that simply maps tokens to their ids.
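A sketch of a word-level tokenizer built on it; the constructor arguments follow today's API and are an assumption for this release:
from tokenizers import Tokenizer, models, pre_tokenizers

# Tiny hand-built vocabulary, purely for illustration.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

print(tokenizer.encode("hello world").ids)  # [1, 2]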
  • Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now provide a list of overflowing Encodings that are ready to be processed by a language model, just like the main Encoding (see the sketch below).
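A sketch of reading the overflowing pieces; enable_truncation, stride, and the overflowing attribute are taken from the current Python API, so treat them as assumptions here:
from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer("vocab.json", "merges.txt")  # placeholder files
tokenizer.enable_truncation(max_length=8, stride=2)

output = tokenizer.encode("a long sentence that will not fit into eight tokens")
print(output.ids)                 # the truncated main Encoding
for piece in output.overflowing:  # each overflowing piece is a full Encoding too
    print(piece.ids)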
  • Provided a mapping back to the original string offsets:
output = tokenizer.encode(...)
# Map the offsets of the token at index 3 back to the original, pre-normalization string.
print(output.original_str.offsets(output.offsets[3]))
  • Exposed the vocabulary size on all tokenizers (#99 by @kdexd).
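For example (get_vocab_size is the method name in current releases and is assumed here):
from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer("vocab.json", "merges.txt")  # placeholder files
print(tokenizer.get_vocab_size())  # number of entries in the vocabulary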

Bug fixes:

  • Fixed a bug with IndexableString
  • Fixed a bug with truncation
