Changes:
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
- Added a new `Strip` normalizer
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers (especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- Expose `__len__` on `Encoding` (cf #139)
- Improved padding performance.
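
For readers unfamiliar with the `dropout` option, here is a minimal pure-Python sketch of the idea, assuming it refers to BPE dropout: while segmenting a word, each candidate merge is randomly skipped with probability `dropout`, so the same word can come out as different subword sequences. This is an illustration only, not the library's implementation; `bpe_encode` and `MERGES` are hypothetical names, and the merge table is made up for the example.

```python
import random

# Hypothetical, hand-written merge table, ordered by priority (earlier = higher).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with ranked BPE merges.

    On every round, each applicable merge is skipped with probability
    `dropout`. Illustrative sketch; not the tokenizers library's code.
    """
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters

    while len(symbols) > 1:
        # Adjacent pairs that are known merges and survived dropout this round.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge (or everything was dropped)
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

    return symbols


# dropout=0.0 applies every merge; dropout=1.0 applies none.
full = bpe_encode("lower", MERGES, dropout=0.0)
none = bpe_encode("lower", MERGES, dropout=1.0)
# A value in between gives a random segmentation that still spells the word.
sampled = bpe_encode("lower", MERGES, dropout=0.5, rng=random.Random(0))
```

With `dropout=0.0` the segmentation is deterministic, which is the usual inference-time behaviour; non-zero values are typically used only during training as a regularizer.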