Changed
- Only one progress bar while reading files during training. This is better for use-cases with
  a high number of files, as it avoids having too many progress bars on screen. It also avoids
  reading the size of each file before actually starting to read them, which could take a very
  long time.
- [#190]: Improved BPE and WordPiece builders
- [#193]: `encode` and `encode_batch` now take a new argument, specifying whether we should add the
  special tokens (see the first sketch after this list)
- [#197]: The `NormalizedString` has been removed from the `Encoding`. It is now possible to
  retrieve it by calling `normalize` on the `Tokenizer`. This brings a 70% reduction of the memory
  footprint
- [#197]: The `NormalizedString` API has been improved. It is now possible to retrieve parts of both
  strings using either "normalized" or "original" offsets
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore
- `AddedToken` is now used for both `add_special_tokens` and `add_tokens`. These `AddedToken` also
  come with more options to allow various behaviors (see the second sketch after this list).
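As a rough illustration of the new `encode` / `encode_batch` argument, here is a minimal Rust sketch. It is written against the crate's current API shape, so treat it as an assumption rather than the exact signatures of this release; in particular, `Tokenizer::from_file` and the `tokenizer.json` path are only placeholders for however you obtain your tokenizer.

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Placeholder path: load whatever tokenizer you already have on disk.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // The new boolean argument controls whether special tokens are added.
    let with_special = tokenizer.encode("Hello world", true)?;
    let without_special = tokenizer.encode("Hello world", false)?;
    println!("{:?}", with_special.get_tokens());
    println!("{:?}", without_special.get_tokens());

    // encode_batch takes the same flag for a whole batch of inputs.
    let batch = tokenizer.encode_batch(vec!["Hello world", "How are you?"], true)?;
    println!("{} encodings", batch.len());

    Ok(())
}
```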
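And a second sketch for the unified `AddedToken` usage mentioned in the last item. The `AddedToken::from(content, special)` constructor and the `single_word` / `lstrip` builder methods are the names used by recent releases and are assumptions here; the exact options available in this version may differ.

```rust
use tokenizers::{AddedToken, Tokenizer};

fn register_tokens(tokenizer: &mut Tokenizer) {
    // Regular added token, with some of the extra matching options.
    tokenizer.add_tokens(&[AddedToken::from("<custom>", false)
        .single_word(true) // only match as a standalone word
        .lstrip(true)]);   // also consume the whitespace on its left

    // Special tokens now go through the same AddedToken type.
    tokenizer.add_special_tokens(&[AddedToken::from("<pad>", true)]);
}
```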
Added
- [#188]: `impl PostProcessor for ByteLevel`: Handles trimming the offsets if activated. This avoids
  the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are
  part of the actual token
- More alignment mappings on the `Encoding`
- `post_process` can be called on the `Tokenizer` (see the sketch after this list)
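As a quick illustration of the last item, here is a hedged sketch of driving the post-processing step by hand, for example to combine two independently encoded sequences. The `post_process(encoding, pair, add_special_tokens)` signature shown is the one exposed by recent releases and is an assumption for this version.

```rust
use tokenizers::Tokenizer;

// Combine two pre-computed encodings through the tokenizer's post-processor.
fn encode_pair(tokenizer: &Tokenizer) -> tokenizers::Result<()> {
    // Encode both sequences without special tokens first.
    let question = tokenizer.encode("What is a tokenizer?", false)?;
    let context = tokenizer.encode("A tokenizer splits text into tokens.", false)?;

    // Let the post-processor glue them together and add the special tokens.
    let pair = tokenizer.post_process(question, Some(context), true)?;
    println!("{:?}", pair.get_tokens());
    Ok(())
}
```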
Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space` is activated
  - [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if
  adding that many is not advised, but that's not the question)
How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant (a minimal
  sketch follows below).
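A minimal sketch of that migration, assuming the `with_post_processor` setter and the `trim_offsets` builder method found in recent releases (the exact setter signature has varied between versions):

```rust
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::Tokenizer;

// Attach the ByteLevel post-processor to an existing byte-level BPE tokenizer
// so that reported offsets no longer include each token's leading whitespace.
fn add_byte_level_post_processing(tokenizer: &mut Tokenizer) {
    tokenizer.with_post_processor(ByteLevel::default().trim_offsets(true));
}
```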