github huggingface/tokenizers node-v0.8.0
Node v0.8.0

latest releases: v0.21.0rc0, v0.20.4, v0.20.4rc0...
3 years ago

BREACKING CHANGES

  • Many improvements on the Trainer (#519).
    The files must now be provided first when calling tokenizer.train(files, trainer).

Features

  • Adding the TemplateProcessing
  • Add WordLevel and Unigram models (#490)
  • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
  • Add templateProcessing post-processor (#490)
  • Add digitsPreTokenizer pre-tokenizer (#490)
  • Add support for mapping to sequences (#506)
  • Add splitPreTokenizer pre-tokenizer (#542)
  • Add behavior option to the punctuationPreTokenizer (#657)
  • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

Fixes

  • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
  • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)

Don't miss a new tokenizers release

NewReleases is sending notifications on new releases.