Changes:
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
- Added a new `Strip` normalizer
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers (especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- Expose `__len__` on `Encoding` (cf #139)
- Improved padding performance