This package is reaching its second year of existence, so now is a good time for a nice refresh.
Changes: See PR #45
- Performance: ⚡ 4 to 5 times faster than the previous 1.4.0 release.
- Performance: ⚡ At least 2x faster than Chardet.
- Performance: ⚡ Emphasis has been put on UTF-8 detection, which should now be nearly instantaneous.
- Improvement: 🔙 The backward compatibility with Chardet has been greatly improved. The legacy `detect` function returns an identical charset name whenever possible.
- Improvement: ❇️ The detection mechanism has been slightly improved; Turkish content is now detected correctly (most of the time).
- Code: 🎨 The program has been rewritten to improve readability and maintainability. (+Using static typing)
- Tests: ✔️ New workflows are now in place to verify the following aspects: Performance, Backward-Compatibility with Chardet, and Detection Coverage, in addition to the current tests. (+CodeQL)
- Dependency: ➖ This package no longer requires anything when used with Python 3.5 (Dropped `cached_property`)
- Docs: ✏️ The performance claims, the guide to contributing, and the issue template have been updated.
- Improvement: ❇️ Add `--version` argument to CLI
- Bugfix: 🐛 The CLI output used the relative path of the file(s). Should be absolute.
- Deprecation: 🔴 Methods `coherence_non_latin`, `w_counter`, `chaos_secondary_pass` of the class `CharsetMatch` are now deprecated and scheduled for removal in v3.0
- Improvement: ❇️ If no language was detected in content, the package now tries to infer it using the encoding name/alphabets used.
- Removal: 🔥 Removed support for these languages: Catalan, Esperanto, Kazakh, Basque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbocroatian.
- Improvement: ❇️ `utf_7` detection has been reinstated.
- Removal: 🔥 The exception hook on UnicodeDecodeError has been removed.
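
For code that still calls the deprecated `CharsetMatch` methods listed above, one way to find those calls before v3.0 is to record warnings in a test run. The sketch below is only illustrative: `LegacyMatch` and `uses_deprecated_api` are hypothetical names, and it assumes the deprecation is signalled through a standard `DeprecationWarning`, which these notes do not confirm.

```python
import warnings


class LegacyMatch:
    """Hypothetical stand-in for a class exposing a deprecated method."""

    def w_counter(self):
        # Assumption: the deprecation is announced via the stdlib warnings module.
        warnings.warn(
            "w_counter is deprecated and scheduled for removal in v3.0",
            DeprecationWarning,
            stacklevel=2,
        )
        return {}


def uses_deprecated_api(func):
    """Return True if calling func() emits at least one DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones that are ignored by default.
        warnings.simplefilter("always")
        func()
    return any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Wrapping a suspicious call site, for example `uses_deprecated_api(LegacyMatch().w_counter)`, reports whether it would need to change before the removal.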
After much consideration, this release won't drop Python 3.5 in v2.
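
The language fallback mentioned in the list above (inferring a language from the encoding name when the content gives no answer) can be pictured with a small sketch. This is not the package's implementation: the function name and the mapping are hypothetical, and only a few single-language code pages are listed for illustration.

```python
import codecs

# Hypothetical mapping from a normalized codec name to the language it
# most commonly encodes. Multi-language encodings are deliberately left out.
ENCODING_LANGUAGE_HINTS = {
    "shift_jis": "Japanese",
    "euc_kr": "Korean",
    "cp1253": "Greek",
    "cp1255": "Hebrew",
}


def infer_language_from_encoding(encoding_name):
    """Return a likely language for an encoding name, or None if unknown."""
    try:
        # codecs.lookup resolves aliases, e.g. "windows-1253" becomes "cp1253".
        normalized = codecs.lookup(encoding_name).name
    except LookupError:
        return None
    return ENCODING_LANGUAGE_HINTS.get(normalized)
```

Encodings that serve many languages, such as UTF-8, yield `None` here, which is where a check on the alphabets actually used in the text would have to take over.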