Changes:

- Big improvements in speed for BPE, in both training and tokenization (#165)

Fixes:

- Some default tokens were missing from BertWordPieceTokenizer (cf #160)
- A bug in the ByteLevel PreTokenizer caused offsets to be wrong when a character was split into multiple bytes (cf #156)
- The longest_first truncation strategy had a bug (#174)
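To see why the ByteLevel offset bug (#156) could occur, note that a single character may span several UTF-8 bytes, so byte positions must be mapped back to character offsets. The following is a minimal standalone sketch of such a byte-to-character mapping, not the library's actual implementation:

```python
# Illustration of the byte-level offset problem: 'é' occupies two
# UTF-8 bytes, so byte indices and character indices diverge.
text = "café!"
encoded = text.encode("utf-8")

# Build a byte-index -> character-index map.
byte_to_char = []
for char_idx, ch in enumerate(text):
    byte_to_char.extend([char_idx] * len(ch.encode("utf-8")))

print(len(text), len(encoded))  # 5 6
print(byte_to_char)             # [0, 1, 2, 3, 3, 4]
```

Bytes 3 and 4 both map back to character 3 ('é'); an offset computation that treats each byte as one character would be off by one for everything after it.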