Fixed
- [#585] Conda version should now work on old CentOS
- [#844] Fixing interaction between is_pretokenized and trim_offsets.
- [#851] Doc links
Added
- [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
- [#845]: Documentation for Decoders.
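
To illustrate the customization added in [#657], here is a minimal sketch that assumes the Python bindings and the string form of the behavior argument. The sample sentence and variable names are made up for illustration.

```python
from tokenizers.pre_tokenizers import Punctuation

text = "Hello, world!"  # illustrative input, not taken from the changelog

# "isolated" keeps each punctuation mark as its own piece.
isolated = Punctuation(behavior="isolated").pre_tokenize_str(text)

# "removed" drops the punctuation marks from the output.
removed = Punctuation(behavior="removed").pre_tokenize_str(text)

# pre_tokenize_str returns (piece, (start, end)) tuples; keep only the pieces.
isolated_pieces = [piece for piece, _offsets in isolated]
removed_pieces = [piece for piece, _offsets in removed]

print(isolated_pieces)
print(removed_pieces)
```

Other accepted values include "merged_with_previous", "merged_with_next" and "contiguous".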
Changed
- [#850]: Added a feature gate to enable disabling http features
- [#718]: Fix WordLevel tokenizer determinism during training
- [#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer
- [#770]: Improved documentation for UnigramTrainer
- [#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub
- [#793]: Saving a pretty JSON file by default when saving a tokenizer
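
The saving change in [#793] can be seen with a small sketch, assuming the Python bindings from a release that includes it. The three-word vocabulary and the file names are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Hypothetical toy vocabulary, just enough to build a tokenizer offline.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))

with tempfile.TemporaryDirectory() as tmp:
    pretty_path = os.path.join(tmp, "pretty.json")
    compact_path = os.path.join(tmp, "compact.json")

    # No `pretty` argument: relies on the new default (indented JSON).
    tokenizer.save(pretty_path)
    # Opting out gives the compact form.
    tokenizer.save(compact_path, pretty=False)

    with open(pretty_path, encoding="utf-8") as f:
        pretty_text = f.read()
    with open(compact_path, encoding="utf-8") as f:
        compact_text = f.read()

# Only the layout differs; the serialized tokenizer is the same.
same_content = json.loads(pretty_text) == json.loads(compact_text)
print(same_content, len(pretty_text), len(compact_text))
```

Loading from the Hub ([#780]) follows the same pattern with `Tokenizer.from_pretrained("bert-base-uncased")`. It is left out of the sketch because it needs network access.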