Features
- Added PEP 263 encoding declaration detection: `# -*- coding: ... -*-` and `# coding=...` declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (#249)
- Added `chardet.universaldetector` backward-compatibility stub so that `from chardet.universaldetector import UniversalDetector` works with a deprecation warning (#341)
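
As an illustration of the PEP 263 item, here is a minimal sketch of how a declaration on the first two lines can be located. It uses the regular expression published in PEP 263; the function name `find_declared_encoding` is hypothetical and this is not chardet's actual implementation, which may apply further checks.

```python
import re
from typing import Optional

# Regular expression from PEP 263, applied to raw bytes.
PEP263_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding declared on line 1 or 2 of the source, or None."""
    for line in data.splitlines()[:2]:
        match = PEP263_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


print(find_declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))
print(find_declared_encoding(b"#!/usr/bin/env python\n# coding=utf-8\n"))
```

A declaration on line 3 or later is ignored, which matches the "lines 1–2" rule described above.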
Fixes
- Fixed false UTF-7 detection of ASCII text containing `++` or `+word` patterns (#332)
- Fixed 0.5s startup cost on first `detect()` call: model norms are now computed during loading instead of lazily iterating 21M entries (#333)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0: `detect()` now returns chardet 5.x-compatible names by default (#338)
- Improved ISO-2022-JP family detection: recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
- Fixed silent truncation of corrupt model data (`iter_unpack` yielded fewer tuples instead of raising)
- Fixed incorrect date in LICENSE
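
The truncation fix can be pictured with a small sketch. `struct.iter_unpack` only raises when the buffer length is not a multiple of the entry size, so a body that is short by whole entries parses without complaint. The binary layout below (a `uint32` entry count followed by `uint16`/`uint8` pairs) and the name `parse_entries` are assumptions made for illustration, not chardet's real model format.

```python
import struct

COUNT = struct.Struct("<I")   # hypothetical header: number of entries
ENTRY = struct.Struct("<HB")  # hypothetical entry: uint16 key, uint8 weight


def parse_entries(blob: bytes) -> list:
    """Parse all entries, raising if the body does not match the declared count."""
    (count,) = COUNT.unpack_from(blob, 0)
    body = memoryview(blob)[COUNT.size:]
    expected = count * ENTRY.size
    if len(body) != expected:
        # Without this check, a short body would simply yield fewer tuples.
        raise ValueError(
            f"corrupt model data: expected {expected} bytes, got {len(body)}"
        )
    return list(ENTRY.iter_unpack(body))


good = COUNT.pack(2) + ENTRY.pack(1, 10) + ENTRY.pack(2, 20)
print(parse_entries(good))
```

Checking the declared count against the body length before iterating turns a silent short read into an explicit error.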
Performance
- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of `load_models()`
- ~40% faster model parsing via `struct.iter_unpack` for bulk entry extraction (eliminates ~305K individual `unpack` calls)
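
A rough sketch of the two ideas above: entries are extracted in bulk with `struct.iter_unpack`, and the norm is accumulated in the same pass rather than in a second, lazy iteration. The entry layout, the choice of an L2 norm, and the function name `load_model` are all assumptions for illustration; the notes do not specify them.

```python
import math
import struct

ENTRY = struct.Struct("<HB")  # hypothetical entry: uint16 key, uint8 weight


def load_model(body: bytes):
    """Parse entries in bulk and compute the (assumed L2) norm in one pass."""
    weights = {}
    sum_of_squares = 0
    # One iter_unpack call replaces a separate unpack call per entry.
    for key, weight in ENTRY.iter_unpack(body):
        weights[key] = weight
        sum_of_squares += weight * weight
    return weights, math.sqrt(sum_of_squares)


weights, norm = load_model(ENTRY.pack(1, 3) + ENTRY.pack(2, 4))
print(weights, norm)
```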
New API parameters
- Added `compat_names` parameter (default `True`) to `detect()`, `detect_all()`, and `UniversalDetector`: set to `False` to get raw Python codec names instead of chardet 5.x/6.x compatible display names
- Added `prefer_superset` parameter (default `False`): remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). This will default to `True` in the next major version (8.0).
- Deprecated `should_rename_legacy` in favor of `prefer_superset`: a deprecation warning is emitted when used
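
The interaction between `prefer_superset` and the deprecated `should_rename_legacy` might look roughly like the sketch below. The helper `resolve_name` and the `SUPERSETS` table are hypothetical stand-ins, and the table contains only the two remappings named above; chardet's real table and internals may differ.

```python
import warnings

# Only the two remappings mentioned in these notes; hypothetical table.
SUPERSETS = {"ascii": "Windows-1252", "iso-8859-1": "Windows-1252"}


def resolve_name(name, prefer_superset=False, should_rename_legacy=None):
    """Optionally remap a detected encoding name to its superset."""
    if should_rename_legacy is not None:
        # The old flag still works but warns and forwards to the new one.
        warnings.warn(
            "should_rename_legacy is deprecated; use prefer_superset",
            DeprecationWarning,
            stacklevel=2,
        )
        prefer_superset = should_rename_legacy
    if prefer_superset:
        return SUPERSETS.get(name.lower(), name)
    return name


print(resolve_name("ISO-8859-1"))
print(resolve_name("ISO-8859-1", prefer_superset=True))
```

Names without a known superset pass through unchanged, so enabling the option only affects the legacy encodings in the table.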
Improvements
- Switched internal canonical encoding names to Python codec names (e.g., `"utf-8"` instead of `"UTF-8"`), with `compat_names` controlling the public output format
- Added `lookup_encoding()` to `registry` for case-insensitive resolution of arbitrary encoding name input to canonical names
- Achieved 100% line coverage across all source modules (+31 tests)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files
- Pinned test-data cloning to chardet release version tags for reproducible builds
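
To give a feel for case-insensitive resolution to Python codec names, the sketch below uses the standard library's `codecs.lookup()` as a stand-in. It is not the `lookup_encoding()` function from chardet's `registry`, whose signature and behavior are not described here; the name `lookup_encoding_sketch` is hypothetical.

```python
import codecs
from typing import Optional


def lookup_encoding_sketch(name: str) -> Optional[str]:
    """Resolve an arbitrary encoding name to Python's canonical codec name."""
    try:
        # codecs.lookup normalizes case and common aliases.
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None


print(lookup_encoding_sketch("UTF-8"))
print(lookup_encoding_sketch("Utf8"))
print(lookup_encoding_sketch("no-such-encoding"))
```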
Full changelog: https://chardet.readthedocs.io/en/latest/changelog.html