Changes:
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Added `CharDelimiterSplit`: a new `PreTokenizer` that splits sequences on a given delimiter (works like `.split(delimiter)`).
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.
- Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now provide a list of overflowing `Encoding`s that are ready to be processed by a language model, just like the main `Encoding`.
- Provide a mapping back to the original string offsets using:

```python
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
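To illustrate what `CharDelimiterSplit` does, here is a minimal pure-Python sketch of the splitting behavior described above. The function name and offset bookkeeping are our own for illustration; the actual pre-tokenizer is implemented in the Rust core and also tracks offsets for you.

```python
def char_delimiter_split(sequence, delimiter):
    """Sketch of CharDelimiterSplit: split on a single delimiter,
    like str.split(delimiter), keeping (start, end) offsets."""
    pieces = []
    start = 0
    for token in sequence.split(delimiter):
        pieces.append((token, (start, start + len(token))))
        start += len(token) + len(delimiter)
    return pieces

print(char_delimiter_split("hello world again", " "))
# [('hello', (0, 5)), ('world', (6, 11)), ('again', (12, 17))]
```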
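The `WordLevel` model is, conceptually, a plain vocabulary lookup. A minimal sketch of that mapping (the class name, vocabulary, and unknown-token handling below are illustrative assumptions, not the library's API):

```python
class WordLevelSketch:
    """Sketch of a WordLevel-style model: map tokens to their ids
    via a vocabulary dict, falling back to an unknown token."""
    def __init__(self, vocab, unk_token="[UNK]"):
        self.vocab = vocab
        self.unk_id = vocab[unk_token]

    def tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.unk_id) for t in tokens]

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
model = WordLevelSketch(vocab)
print(model.tokens_to_ids(["hello", "world", "pineapple"]))  # [1, 2, 0]
```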
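The new overflowing-token handling can be sketched as follows: the truncated sequence becomes the main encoding, and the remainder is cut into chunks of the same maximum length so each chunk can be fed to the model like the main one. This is only an illustrative sketch (function name and `stride` overlap parameter are our assumptions); the real logic runs in the Rust core and returns full `Encoding` objects.

```python
def truncate_with_overflow(ids, max_length, stride=0):
    """Sketch of truncation with overflowing tokens: return the main
    chunk plus overflowing chunks, where `stride` re-includes some
    context from the end of the previous chunk."""
    main = ids[:max_length]
    overflowing = []
    start = max_length - stride
    while start < len(ids):
        overflowing.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break
        start += max_length - stride
    return main, overflowing

main, rest = truncate_with_overflow(list(range(10)), max_length=4)
print(main)  # [0, 1, 2, 3]
print(rest)  # [[4, 5, 6, 7], [8, 9]]
```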
Bug fixes:
- Fix a bug with `IndexableString`
- Fix a bug with truncation