Changes:
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Added `CharDelimiterSplit`: a new `PreTokenizer` that splits sequences on a given delimiter (works like `.split(delimiter)`).
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.
- Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now provide a list of overflowing `Encoding`s that are ready to be processed by a language model, just like the main `Encoding`.
- Provide a mapping back to the original string offsets using:

```python
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
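To illustrate what `CharDelimiterSplit` does, here is a minimal pure-Python sketch of the splitting behavior described above. The function name and offset bookkeeping are our own for illustration; the actual pre-tokenizer is implemented in the Rust core and also tracks offsets for you.

```python
def char_delimiter_split(sequence, delimiter):
    """Sketch of CharDelimiterSplit: split on a single delimiter,
    like str.split(delimiter), keeping (start, end) offsets."""
    pieces = []
    start = 0
    for token in sequence.split(delimiter):
        pieces.append((token, (start, start + len(token))))
        start += len(token) + len(delimiter)
    return pieces

print(char_delimiter_split("hello world again", " "))
# [('hello', (0, 5)), ('world', (6, 11)), ('again', (12, 17))]
```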
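The `WordLevel` model is, conceptually, a plain vocabulary lookup. A minimal sketch of that mapping (the class name, vocabulary, and unknown-token handling below are illustrative assumptions, not the library's API):

```python
class WordLevelSketch:
    """Sketch of a WordLevel-style model: map tokens to their ids
    via a vocabulary dict, falling back to an unknown token."""
    def __init__(self, vocab, unk_token="[UNK]"):
        self.vocab = vocab
        self.unk_id = vocab[unk_token]

    def tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.unk_id) for t in tokens]

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
model = WordLevelSketch(vocab)
print(model.tokens_to_ids(["hello", "world", "pineapple"]))  # [1, 2, 0]
```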
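The new overflowing-token handling can be sketched as follows: the truncated sequence becomes the main encoding, and the remainder is cut into chunks of the same maximum length so each chunk can be fed to the model like the main one. This is only an illustrative sketch (function name and `stride` overlap parameter are our assumptions); the real logic runs in the Rust core and returns full `Encoding` objects.

```python
def truncate_with_overflow(ids, max_length, stride=0):
    """Sketch of truncation with overflowing tokens: return the main
    chunk plus overflowing chunks, where `stride` re-includes some
    context from the end of the previous chunk."""
    main = ids[:max_length]
    overflowing = []
    start = max_length - stride
    while start < len(ids):
        overflowing.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break
        start += max_length - stride
    return main, overflowing

main, rest = truncate_with_overflow(list(range(10)), max_length=4)
print(main)  # [0, 1, 2, 3]
print(rest)  # [[4, 5, 6, 7], [8, 9]]
```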
Bug fixes:
- Fix a bug with `IndexableString`
- Fix a bug with truncation