Ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x. Just way faster and more accurate!
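Call sites stay identical to chardet 5.x/6.x. A minimal example using the preserved public API:

```python
import chardet

# Same public API as chardet 5.x/6.x: detect() takes bytes and returns
# a dict with 'encoding', 'confidence', and 'language'.
result = chardet.detect("Grüße aus Köln".encode("utf-8"))
print(result["encoding"], result["confidence"], result["language"])

# detect_all() returns every plausible candidate, best first.
candidates = chardet.detect_all("Grüße aus Köln".encode("utf-8"))
```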
Highlights:
- MIT license (previous versions were LGPL)
- 96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)
- 41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer
- Language detection for every result (90.5% accuracy across 49 languages)
- 99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)
- 12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing
- Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs
- Optional mypyc compilation — 1.49x additional speedup on CPython
- Thread-safe `detect()` and `detect_all()` with no measurable overhead; scales on free-threaded Python 3.13t+
- Negligible import memory (96 B)
- Zero runtime dependencies
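To illustrate the first stage of the pipeline above, a BOM check can be as simple as the sketch below (a hypothetical helper, not the library's actual internals):

```python
# Illustrative BOM-detection stage (hypothetical helper, not the
# library's real implementation).
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32LE"),  # 4-byte BOMs must be checked
    (b"\x00\x00\xfe\xff", "UTF-32BE"),  # before their 2-byte prefixes
    (b"\xef\xbb\xbf", "UTF-8-SIG"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None
```

A BOM match short-circuits the remaining stages, since it identifies the encoding unambiguously.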
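The statistical scoring stage mentioned above compares candidate decodings against bigram frequency tables. Conceptually it works like this toy sketch (made-up frequencies standing in for the trained CulturaX models):

```python
# Toy bigram model: log-probabilities for a few byte pairs, standing in
# for the trained per-language/encoding tables (numbers are made up).
TOY_MODEL = {
    (0x74, 0x68): -2.0,  # "th"
    (0x68, 0x65): -2.1,  # "he"
    (0x65, 0x20): -2.3,  # "e "
}
FLOOR = -12.0  # penalty for bigrams absent from the model

def bigram_score(data: bytes, model: dict) -> float:
    """Average log-probability of consecutive byte pairs under the model."""
    if len(data) < 2:
        return FLOOR
    total = sum(model.get(pair, FLOOR) for pair in zip(data, data[1:]))
    return total / (len(data) - 1)
```

A higher (less negative) score means the bytes look more like the modeled language/encoding pair; the detector would compare such scores across candidates and pick the best.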
Breaking changes vs 6.0.0:
- `detect()` and `detect_all()` now default to `encoding_era=EncodingEra.ALL` (6.0.0 defaulted to `MODERN_WEB`)
- Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
- `LanguageFilter` is accepted but ignored (deprecation warning emitted)
- `chunk_size` is accepted but ignored (deprecation warning emitted)