pypi chardet 7.1.0
chardet 7.1.0

9 hours ago

Features

  • Added PEP 263 encoding declaration detection — # -*- coding: ... -*- and # coding=... declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (#249)
  • Added chardet.universaldetector backward-compatibility stub so that from chardet.universaldetector import UniversalDetector works with a deprecation warning (#341)

Fixes

  • Fixed false UTF-7 detection of ASCII text containing ++ or +word patterns (#332)
  • Fixed 0.5s startup cost on first detect() call — model norms are now computed during loading instead of lazily iterating 21M entries (#333)
  • Fixed undocumented encoding name changes between chardet 5.x and 7.0 — detect() now returns chardet 5.x-compatible names by default (#338)
  • Improved ISO-2022-JP family detection — recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
  • Fixed silent truncation of corrupt model data (iter_unpack yielded fewer tuples instead of raising)
  • Fixed incorrect date in LICENSE

Performance

  • 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of load_models()
  • ~40% faster model parsing via struct.iter_unpack for bulk entry extraction (eliminates ~305K individual unpack calls)

New API parameters

  • Added compat_names parameter (default True) to detect(), detect_all(), and UniversalDetector — set to False to get raw Python codec names instead of chardet 5.x/6.x compatible display names
  • Added prefer_superset parameter (default False) — remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). This will default to True in the next major version (8.0).
  • Deprecated should_rename_legacy in favor of prefer_superset — a deprecation warning is emitted when used

Improvements

  • Switched internal canonical encoding names to Python codec names (e.g., "utf-8" instead of "UTF-8"), with compat_names controlling the public output format
  • Added lookup_encoding() to registry for case-insensitive resolution of arbitrary encoding name input to canonical names
  • Achieved 100% line coverage across all source modules (+31 tests)
  • Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files
  • Pinned test-data cloning to chardet release version tags for reproducible builds

Full changelog: https://chardet.readthedocs.io/en/latest/changelog.html

Don't miss a new chardet release

NewReleases is sending notifications on new releases.