Python v0.7.0


Changed

  • Only one progress bar is shown while reading files during training. This is better for use cases with a large number of files, as it avoids cluttering the screen with progress bars. It also avoids reading the size of each file before actually reading its content, a step that could take a very long time.
  • [#193]: encode and encode_batch now take a new optional argument specifying whether special tokens should be added. It is enabled by default (see the sketch after this list).
  • [#197]: original_str and normalized_str have been removed from the Encoding returned by encode and encode_batch. This reduces the memory footprint by 70%.
  • [#197]: The offsets provided on Encoding are now relative to the original string instead of the normalized one.
  • The added tokens given to add_special_tokens or add_tokens on a Tokenizer, or via train(special_tokens=...), can now be instances of AddedToken to provide more control over these tokens (also shown in the sketch after this list).
  • [#136]: Updated PyO3 version.
  • [#136]: Static methods Model.from_files and Model.empty are removed in favor of using
    constructors.
  • [#239]: CharBPETokenizer now corresponds to the OpenAI GPT BPE implementation by default.
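
As an illustration of the encode arguments and AddedToken changes above, here is a minimal sketch (not taken from the release notes; the vocabulary path is a placeholder and the exact AddedToken options are assumptions to check against your installed version):

```python
from tokenizers import AddedToken, BertWordPieceTokenizer

# "vocab.txt" is a placeholder path to an existing WordPiece vocabulary.
tokenizer = BertWordPieceTokenizer("vocab.txt")

# Special tokens ([CLS]/[SEP]) are added by default; opt out per call.
print(tokenizer.encode("Hello world").tokens)
print(tokenizer.encode("Hello world", add_special_tokens=False).tokens)

# Added tokens may now be AddedToken instances instead of plain strings,
# exposing extra options (single_word here; available options may vary by version).
tokenizer.add_tokens([AddedToken("<new_token>", single_word=True)])
tokenizer.add_special_tokens(["<extra_special>"])
```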

Added

  • [#188]: ByteLevel is now also a PostProcessor and handles trimming the offsets when activated. This avoids the unintuitive inclusion of whitespace in the produced offsets, even when that whitespace is part of the actual token. It has been added to ByteLevelBPETokenizer but is off by default (trim_offsets=False); see the sketch after this list.
  • [#236]: RobertaProcessing also handles trimming the offsets.
  • [#234]: New alignment mappings on the Encoding, providing methods to easily convert between char or word (input space) and token (output space) positions.
  • post_process can be called on the Tokenizer
  • [#208]: Ability to retrieve the vocabulary from the Tokenizer with
    get_vocab(with_added_tokens: bool)
  • [#136]: Models can now be instantiated through object constructors.
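
A short sketch (not from the release notes) combining these additions; the file paths, the alignment method names (char_to_token, token_to_chars) and the availability of get_vocab on the high-level wrapper are assumptions to verify against your installed version:

```python
from tokenizers import ByteLevelBPETokenizer

# trim_offsets is off by default; enabling it strips the leading whitespace that
# byte-level BPE folds into its tokens from the reported offsets.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt", trim_offsets=True)

encoding = tokenizer.encode("Hello there, world")

# Alignment mappings between input space (chars/words) and output space (tokens).
token_index = encoding.char_to_token(6)           # which token covers character 6?
char_span = encoding.token_to_chars(token_index)  # (start, end) in the original string
print(token_index, char_span)

# Retrieve the vocabulary, with or without the added tokens.
vocab = tokenizer.get_vocab(with_added_tokens=True)
print(len(vocab))
```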

Fixed

  • [#193]: Fix some issues with the offsets being wrong with the ByteLevel BPE:
    • when add_prefix_space=True
    • [#156]: when a Unicode character gets split up into multiple byte-level characters
  • Fix a bug where offsets were wrong when there were added tokens in the sequence being encoded.
  • [#175]: Fix a bug that prevented adding more than a certain number of tokens (not advisable, but it should still work).
  • [#205]: Trim the decoded string in BPEDecoder used by CharBPETokenizer

How to migrate

  • Add the ByteLevel PostProcessor to your byte-level BPE tokenizers if relevant. If you are
    using ByteLevelBPETokenizer, this option is disabled by default (trim_offsets=False).
  • The add_special_tokens option of BertWordPieceTokenizer must now be given to encode or encode_batch.
  • Access to the original_str on the Encoding has been removed. The original string is the input of encode, so it did not make sense to keep it there.
  • There is no need to call original_str.offsets(offsets[N]) to convert offsets to the original string anymore: they are now relative to the original string by default.
  • Access to the normalized_str on the Encoding has been removed. It can be retrieved by calling normalize(sequence) on the Tokenizer.
  • Replace Model.from_files and Model.empty with the constructor. The model constructor takes the same arguments as the old methods (i.e. BPE(vocab, merges) or BPE()); see the sketch after this list.
  • If you were using the CharBPETokenizer and want to keep the same behavior as before, set
    bert_normalizer=False and split_on_whitespace_only=True.
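
To make the constructor and normalized_str changes concrete, here is a hedged before/after sketch (file paths are placeholders; as of this release the BPE constructor takes the vocabulary and merges file paths, mirroring the removed static methods):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Before: model = BPE.from_files("vocab.json", "merges.txt")
# After:  the constructor takes the same arguments as the removed static method.
model = BPE("vocab.json", "merges.txt")
tokenizer = Tokenizer(model)

encoding = tokenizer.encode("Hello world")

# normalized_str is no longer stored on the Encoding; recompute it when needed.
normalized = tokenizer.normalize("Hello world")

# Offsets are now relative to the original string, so they index into it directly.
start, end = encoding.offsets[0]
print("Hello world"[start:end], normalized)

# To keep the pre-0.7.0 CharBPETokenizer behaviour:
# CharBPETokenizer(..., bert_normalizer=False, split_on_whitespace_only=True)
```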
