BREAKING CHANGES
- Many improvements on the Trainer (#519).
The files must now be provided first when calling tokenizer.train(files, trainer).
Features
- Adding the TemplateProcessing
- Add WordLevel and Unigram models (#490)
- Add nmtNormalizer and precompiledNormalizer normalizers (#490)
- Add templateProcessing post-processor (#490)
- Add digitsPreTokenizer pre-tokenizer (#490)
- Add support for mapping to sequences (#506)
- Add splitPreTokenizer pre-tokenizer (#542)
- Add behavior option to the punctuationPreTokenizer (#657)
- Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)