Release v0.20.0
This release is focused on performance and user experience.
Performance
First off, we did a bit of benchmarking and found some room for improvement. With a few minor changes (mostly #1587), here is what we get on Llama 3 running on a g6 instance on AWS (benchmark script: https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py).
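For a rough idea of what that benchmark measures, here is a minimal throughput sketch, not the actual script linked above: the model id and the corpus are placeholders (the Llama 3 repo is gated on the Hub), and the tiktoken comparison is omitted.

```python
import time

from tokenizers import Tokenizer

# Placeholder tokenizer; the linked benchmark uses the Llama 3 tokenizer.
tokenizer = Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Synthetic corpus, purely for illustration.
texts = ["The quick brown fox jumps over the lazy dog. " * 20] * 10_000

start = time.perf_counter()
encodings = tokenizer.encode_batch(texts)
elapsed = time.perf_counter() - start

total_mb = sum(len(t.encode("utf-8")) for t in texts) / 1e6
print(f"encoded {len(encodings)} documents at {total_mb / elapsed:.1f} MB/s")
```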
Python API
We shipped better deserialization errors in general, and support for `__str__` and `__repr__` for all objects. This makes debugging a lot easier:
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
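The improved deserialization errors are easiest to see when loading a broken tokenizer file. A minimal sketch, with deliberately malformed JSON that is illustrative rather than taken from the release:

```python
from tokenizers import Tokenizer

# Deliberately broken tokenizer JSON: the model type does not exist.
bad_json = '{"version": "1.0", "model": {"type": "NotAModel"}}'

try:
    Tokenizer.from_str(bad_json)
except Exception as err:
    # With this release the message should point at the offending field
    # instead of failing with a generic parse error.
    print(f"Deserialization failed: {err}")
```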
The `pre_tokenizer.Sequence` and `normalizer.Sequence` are also more accessible now:
```python
from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]
norm[1].lowercase = False
```
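Related to this, #1590 (listed below) also lets you reset a component by assigning `None`. A short self-contained sketch that rebuilds the sequence above on a pretrained tokenizer:

```python
from tokenizers import Tokenizer, normalizers

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
print(norm[0])             # inspect a component by index
norm[1].lowercase = False  # tweak a component in place
tokenizer.normalizer = norm

# Assigning None now resets a component entirely.
tokenizer.normalizer = None
tokenizer.pre_tokenizer = None
```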
What's Changed
- remove enforcement of non special when adding tokens by @ArthurZucker in #1521
- [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder by @Narsil in #1513
- Make `USED_PARALLELISM` atomic by @nathaniel-daniel in #1532
- Fixing for clippy 1.78 by @Narsil in #1548
- feat(ci): add trufflehog secrets detection by @McPatate in #1551
- Switch from `cached_download` to `hf_hub_download` in tests by @Wauplin in #1547
- Fix "dictionnary" typo by @nprisbrey in #1511
- make sure we don't warn on empty tokens by @ArthurZucker in #1554
- Enable `dropout = 0.0` as an equivalent to `None` in BPE by @mcognetta in #1550 (see the sketch after this list)
- Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … by @ArthurZucker in #1569
- Add bytelevel normalizer to fix decode when adding tokens to BPE by @ArthurZucker in #1555
- Fix clippy + feature test management. by @Narsil in #1580
- Bump spm_precompiled to 0.1.3 by @MikeIvanichev in #1571
- Add benchmark vs tiktoken by @Narsil in #1582
- Fixing the benchmark. by @Narsil in #1583
- Tiny improvement by @Narsil in #1585
- Enable fancy regex by @Narsil in #1586
- Fixing release CI strict (taken from safetensors). by @Narsil in #1593
- Adding some serialization testing around the wrapper. by @Narsil in #1594
- Add-legacy-tests by @ArthurZucker in #1597
- Adding a few tests for decoder deserialization. by @Narsil in #1598
- Better serialization error by @Narsil in #1595
- Add test normalizers by @ArthurZucker in #1600
- Improve decoder deserialization by @Narsil in #1599
- Using serde (serde_pyo3) to get str and repr easily. by @Narsil in #1588
- Merges cannot handle tokens containing spaces. by @Narsil in #909
- Fix doc about split by @ArthurZucker in #1591
- Support `None` to reset pre_tokenizers and normalizers, and index sequences by @ArthurZucker in #1590
- Fix strip python type by @ArthurZucker in #1602
- Tests + Deserialization improvement for normalizers. by @Narsil in #1604
- add deserialize for pre tokenizers by @ArthurZucker in #1603
- Perf improvement 16% by removing offsets. by @Narsil in #1587
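One behavioral note from the list above: with #1550, `dropout = 0.0` is accepted and behaves like `None`. A minimal sketch, constructing an empty BPE model purely for illustration:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Both now mean "no BPE dropout" (previously only None disabled it).
tok_zero = Tokenizer(BPE(dropout=0.0))
tok_none = Tokenizer(BPE(dropout=None))
```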
New Contributors
- @nathaniel-daniel made their first contribution in #1532
- @nprisbrey made their first contribution in #1511
- @mcognetta made their first contribution in #1550
- @MikeIvanichev made their first contribution in #1571
Full Changelog: v0.19.1...v0.20.0rc1