Features
- Unified single-byte charset detection: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case `Latin1Prober` and `MacRomanProber` heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
- `EncodingEra` filtering: The new `encoding_era` parameter to `detect` accepts an `EncodingEra` flag enum (`MODERN_WEB`, `LEGACY_ISO`, `LEGACY_MAC`, `LEGACY_REGIONAL`, `DOS`, `MAINFRAME`, `ALL`), which lets callers restrict detection to encodings from a specific era. `detect()` and `detect_all()` default to `MODERN_WEB`. The new `MODERN_WEB` default should drastically improve accuracy for users who are not working with legacy data. The tiers are:
  - `MODERN_WEB`: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
  - `LEGACY_ISO`: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
  - `LEGACY_MAC`: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
  - `LEGACY_REGIONAL`: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
  - `DOS`: DOS/OEM code pages (CP437, CP850, CP866, etc.)
  - `MAINFRAME`: EBCDIC variants (CP037, CP500, etc.)
- `--encoding-era` CLI flag: The `chardetect` CLI now accepts `-e`/`--encoding-era` to control which encoding eras are considered during detection.
- `max_bytes` and `chunk_size` parameters: `detect()`, `detect_all()`, and `UniversalDetector` now accept `max_bytes` (default 200 KB) and `chunk_size` (default 64 KB) parameters for controlling how much data is examined. (#314, @bysiber)
- Encoding era preference tie-breaking: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
- Charset metadata registry: New `chardet.metadata.charsets` module provides structured metadata about all supported encodings, including their era classification and language filter.
- `should_rename_legacy` now defaults intelligently: When set to `None` (the new default), legacy renaming is automatically enabled when `encoding_era` is `MODERN_WEB`.
- Direct GB18030 support: Replaced the redundant GB2312 prober with a proper GB18030 prober.
- EBCDIC detection: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
- Binary file detection: Added basic binary file detection to abort analysis earlier on non-text files.
- Python 3.12, 3.13, and 3.14 support (#283, @hugovk; #311)
- GitHub Codespace support (#312, @oxygen-dioxide)
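
To illustrate how a flag enum lets callers combine eras, here is a minimal standard-library sketch. It is not chardet's implementation: the member names come from the list above, but the numeric values, the `ERA_BY_ENCODING` table, and the `candidates_for` helper are assumptions made up for this example.

```python
from enum import Flag, auto


class EncodingEra(Flag):
    """Stand-in for the flag enum described above; values are illustrative."""

    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()
    ALL = MODERN_WEB | LEGACY_ISO | LEGACY_MAC | LEGACY_REGIONAL | DOS | MAINFRAME


# Hypothetical era assignments, based on the tier descriptions in these notes.
ERA_BY_ENCODING = {
    "UTF-8": EncodingEra.MODERN_WEB,
    "Windows-1252": EncodingEra.MODERN_WEB,
    "ISO-8859-1": EncodingEra.LEGACY_ISO,
    "KOI8-R": EncodingEra.LEGACY_ISO,
    "MacRoman": EncodingEra.LEGACY_MAC,
    "KZ1048": EncodingEra.LEGACY_REGIONAL,
    "CP437": EncodingEra.DOS,
    "CP037": EncodingEra.MAINFRAME,
}


def candidates_for(eras: EncodingEra) -> list[str]:
    """Return the encodings whose era overlaps the requested flags."""
    return [name for name, era in ERA_BY_ENCODING.items() if era & eras]


# Flags combine with "|", so a caller can widen the search one tier at a time.
web_only = candidates_for(EncodingEra.MODERN_WEB)
web_and_iso = candidates_for(EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO)
```

Because the eras are flags rather than a single ordered level, a caller working with, say, old DOS archives could request `MODERN_WEB | DOS` without also pulling in the Mac or mainframe tiers.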
Fixes
- Fix CP949 state machine: Corrected the state machine for Korean CP949 encoding detection. (#268, @nenw)
- Fix SJIS distribution analysis: Fixed `SJISDistributionAnalysis` discarding valid second-byte range >= 0x80. (#315, @bysiber)
- Fix UTF-16/32 detection for non-ASCII-heavy text: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a `MIN_RATIO` threshold alongside the existing `EXPECTED_RATIO`.
- Fix `get_charset` crash: Resolved a crash when looking up unknown charset names.
- Fix GB18030 `char_len_table`: Corrected the character length table for GB18030 multi-byte sequences.
- Fix UTF-8 state machine: Updated to be more spec-compliant.
- Fix `detect_all()` returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.
- Fix early cutoff bug: Resolved an issue where detection could terminate prematurely.
- Default UTF-8 fallback: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.
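
The inactive-prober exclusion and the UTF-8 fallback can be pictured with the following sketch. It is an illustration of the described behavior only, not the detector's actual code: `ProberResult`, `pick_encoding`, and the `MINIMUM_THRESHOLD` value of 0.2 are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff for illustration; the real threshold is not stated in these notes.
MINIMUM_THRESHOLD = 0.2


@dataclass
class ProberResult:
    """Hypothetical summary of what one prober concluded."""

    encoding: str
    confidence: float
    ruled_out: bool = False  # True when the prober decided "definitely not this encoding"


def pick_encoding(results: list[ProberResult]) -> Optional[str]:
    # Inactive probers are dropped first, so they can never be reported.
    active = [r for r in results if not r.ruled_out]

    # Prefer the most confident result that clears the threshold.
    confident = [r for r in active if r.confidence > MINIMUM_THRESHOLD]
    if confident:
        return max(confident, key=lambda r: r.confidence).encoding

    # Nothing cleared the threshold: fall back to UTF-8 if it is still possible.
    if any(r.encoding == "utf-8" for r in active):
        return "utf-8"
    return None
```

In this sketch the fallback only applies when UTF-8 is still among the active candidates, which mirrors the "has not been ruled out" condition above.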
Breaking changes
- Dropped Python 3.7, 3.8, and 3.9 support: Now requires Python 3.10+. (#283, @hugovk)
- Removed `Latin1Prober` and `MacRomanProber`: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by `SingleByteCharSetProber` with trained language models, giving better accuracy and language identification.
- Removed EUC-TW support: EUC-TW encoding detection has been removed as it is extremely rare in practice.
- `LanguageFilter.NONE` removed: Use specific language filters or `LanguageFilter.ALL` instead.
- Enum types changed: `InputState`, `ProbingState`, `MachineState`, `SequenceLikelihood`, and `CharacterCategory` are now `IntEnum` (previously plain classes or `Enum`). `LanguageFilter` values changed from hardcoded hex to `auto()`.
- `detect()` default behavior change: `detect()` now defaults to `encoding_era=EncodingEra.MODERN_WEB` and `should_rename_legacy=None` (auto-enabled for `MODERN_WEB`), whereas previously it defaulted to considering all encodings with no legacy renaming.
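
For code that inspects these enums directly, the practical difference between `Enum` and `IntEnum` is how members relate to plain integers. The sketch below uses two stand-in classes with made-up member names and values; they are assumptions for illustration and are not copied from chardet.

```python
from enum import Enum, IntEnum


class StateAsEnum(Enum):
    """Illustrative stand-in for a state type defined as a plain Enum."""

    DETECTING = 0
    FOUND_IT = 1


class StateAsIntEnum(IntEnum):
    """The same members, defined as an IntEnum."""

    DETECTING = 0
    FOUND_IT = 1


# A plain Enum member never compares equal to its underlying integer...
enum_equals_int = StateAsEnum.FOUND_IT == 1

# ...whereas an IntEnum member is an int and compares like one.
int_enum_equals_int = StateAsIntEnum.FOUND_IT == 1
int_enum_is_int = isinstance(StateAsIntEnum.FOUND_IT, int)
```

Comparing against the named member (for example `state is StateAsIntEnum.FOUND_IT`) rather than a literal number behaves the same under either base class, and is also the safer habit for `LanguageFilter` now that its values come from `auto()`.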
Misc changes
- Switched from Poetry/setuptools to uv + hatchling: Build system modernized with `hatch-vcs` for version management.
- License text updated: Updated LGPLv2.1 license text and FSF notices to use URL instead of mailing address. (#304, #307, @musicinmybrain)
- CulturaX-based model training: The `create_language_model.py` training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher-quality bigram frequency models.
- `Language` class converted to frozen dataclass: The language metadata class now uses `@dataclass(frozen=True)` with `num_training_docs` and `num_training_chars` fields replacing `wiki_start_pages`.
- Test infrastructure: Added `pytest-timeout` and `pytest-xdist` for faster parallel test execution. Reorganized test data directories.
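
As a rough picture of what the frozen dataclass conversion implies for anyone touching language metadata, here is a minimal sketch. Only `num_training_docs` and `num_training_chars` are named in these notes; the `name` and `iso_code` fields, the sample numbers, and the `try_to_mutate` helper are assumptions for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Language:
    """Stand-in for the language metadata class; the field set is partly assumed."""

    name: str
    iso_code: str
    num_training_docs: int
    num_training_chars: int


# Made-up numbers, purely for demonstration.
welsh = Language(
    name="Welsh",
    iso_code="cy",
    num_training_docs=1000,
    num_training_chars=2_500_000,
)


def try_to_mutate(language: Language) -> bool:
    """Return True if an attribute assignment succeeds, False if it is rejected."""
    try:
        language.num_training_docs = 0
    except FrozenInstanceError:
        return False
    return True
```

Frozen instances reject attribute assignment and are hashable by default, so they can be used safely as dictionary keys or set members.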
Contributors
Thank you to everyone who contributed to this release!
- @dan-blanchard (Dan Blanchard)
- @bysiber (Kadir Can Ozden)
- @musicinmybrain (Ben Beasley)
- @hugovk (Hugo van Kemenade)
- @oxygen-dioxide
- @nenw
And a special thanks to @helour, whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.