Rust v0.9.0


Changed

  • Only one progress bar is displayed while reading files during training. This works better for use cases with a large number of files, as it avoids cluttering the screen with progress bars. It also skips reading the size of each file before actually reading them, a step that could take a very long time.
  • [#190]: Improved BPE and WordPiece builders
  • [#193]: encode and encode_batch now take a new argument specifying whether the special tokens should be added (see the sketch after this list)
  • [#197]: The NormalizedString has been removed from the Encoding. It is now possible to retrieve it by calling normalize on the Tokenizer. This reduces the memory footprint by 70%
  • [#197]: The NormalizedString API has been improved. It is now possible to retrieve parts of both strings using either "normalized" or "original" offsets
  • [#197]: The offsets provided on Encoding are now relative to the original string, and no longer to the normalized one
  • AddedToken is now used for both add_special_tokens and add_tokens, and exposes more options to allow various behaviors.
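
As an illustration, here is a minimal Rust sketch of the updated calls. The BPE construction, the EncodeInput::Single wrapper, the constructor shape, and the AddedToken option shown (single_word) follow the crate's general API and are assumptions; exact names and signatures may differ slightly in this release.

```rust
use tokenizers::models::bpe::BPE;
use tokenizers::tokenizer::{AddedToken, EncodeInput, Result, Tokenizer};

fn main() -> Result<()> {
    // Placeholder vocab/merges paths; any Model would work here.
    let model = BPE::from_files("vocab.json", "merges.txt").build()?;
    // Constructor shape assumed for this release.
    let mut tokenizer = Tokenizer::new(Box::new(model));

    // AddedToken is now used for both regular and special tokens,
    // with extra options such as `single_word` (option name assumed).
    tokenizer.add_tokens(&[AddedToken::from("<custom>").single_word(true)]);
    tokenizer.add_special_tokens(&[AddedToken::from("<pad>")]);

    // `encode` (and `encode_batch`) now take an explicit flag that controls
    // whether the special tokens are added.
    let with_special = tokenizer.encode(EncodeInput::Single("Hello world".into()), true)?;
    let without_special = tokenizer.encode(EncodeInput::Single("Hello world".into()), false)?;
    println!("{:?}", with_special.get_tokens());
    println!("{:?}", without_special.get_tokens());

    // The NormalizedString is no longer stored on the Encoding; it can be
    // recomputed on demand (method name from the changelog, signature assumed).
    let normalized = tokenizer.normalize("Hello world")?;
    println!("{}", normalized.get());

    Ok(())
}
```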

Added

  • [#188]: impl PostProcessor for ByteLevel: handles trimming the offsets if activated. This avoids the unintuitive inclusion of whitespace in the produced offsets, even when that whitespace is part of the actual token (see the migration sketch below)
  • More alignment mappings on the Encoding.
  • post_process can be called on the Tokenizer (see the sketch after this list)
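
As a sketch of this new surface, the snippet below runs the post-processing step explicitly and queries one of the alignment mappings. It assumes a tokenizer built as in the earlier sketch; the post_process argument order and the char_to_token helper are assumptions for this release.

```rust
use tokenizers::tokenizer::{EncodeInput, Result, Tokenizer};

fn inspect(tokenizer: &Tokenizer) -> Result<()> {
    // Encode without special tokens, then apply post-processing manually
    // (argument order assumed: encoding, optional pair, add_special_tokens).
    let encoding = tokenizer.encode(EncodeInput::Single("Hello world".into()), false)?;
    let processed = tokenizer.post_process(encoding, None, true)?;

    // Offsets now refer to the original string; alignment helpers map between
    // tokens and character positions (helper name assumed).
    println!("offsets: {:?}", processed.get_offsets());
    if let Some(token_index) = processed.char_to_token(1) {
        println!("character 1 falls in token {}", token_index);
    }
    Ok(())
}
```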

Fixed

  • [#193]: Fix some issues with the offsets being wrong with the ByteLevel BPE:
    • when add_prefix_space is activated
    • [#156]: when a Unicode character gets split up into multiple byte-level characters
  • Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
  • [#175]: Fix a bug that prevented the addition of more than a certain number of tokens (adding that many tokens is not advised, but it should still be possible)

How to migrate

  • Add the ByteLevel PostProcessor to your byte-level BPE tokenizers if relevant, as in the sketch below.
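
A minimal migration sketch, reusing the ByteLevel component as a post-processor so that offsets get trimmed. The trim_offsets option name and the with_post_processor setter are assumptions based on the crate's usual API and may differ slightly in this release.

```rust
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::tokenizer::Tokenizer;

fn add_byte_level_post_processing(tokenizer: &mut Tokenizer) {
    // Keep whatever ByteLevel pre-tokenizer settings you already use; the same
    // component now also implements PostProcessor and can trim the offsets
    // (`trim_offsets` option name assumed).
    let byte_level = ByteLevel::default().trim_offsets(true);
    tokenizer.with_post_processor(Box::new(byte_level));
}
```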
