This is the biggest release in BinaryOrNot's history. I rebuilt the detection engine from the ground up. The original used byte ratio heuristics with chardet as a second opinion for ambiguous files. I replaced all of that with a trained decision tree operating on 23 features, covering 49 binary formats and 37 text encodings, with zero external dependencies. It's backed by 211 tests and a training pipeline you can re-run yourself. If you've ever had BinaryOrNot misidentify a UTF-16 file, choke on a CJK-encoded document, or crash because chardet changed its API, this release is for you.
BinaryOrNot now has zero dependencies. The chardet library (2.1 MB installed) is gone, replaced by a decision tree that reads 128 bytes of a file and classifies it as binary or text using 23 features computed from those bytes alone. The API is unchanged: is_binary("file.png") still returns True.
pip install --upgrade binaryornot
By the numbers
| Before (0.4.4) | After (0.5.0) |
|---|---|
| 1 dependency (chardet, 2.1 MB) | 0 dependencies |
| 1024 bytes read per file | 128 bytes read per file |
| Byte ratio heuristics + chardet | Trained classifier, 23 features |
| ~12 binary formats | 49 binary formats |
| ASCII + whatever chardet detected | 37 text encodings |
| 48 tests | 211 tests |
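To make the "128 bytes, 23 features" row more concrete, here is a minimal sketch of what byte-level feature extraction on a 128-byte chunk can look like. The function name toy_features and its three features are assumptions chosen for illustration; they are not the 23 features the trained tree actually uses.

```python
CHUNK_SIZE = 128  # bytes examined per file, per the table above


def toy_features(chunk: bytes) -> dict[str, float]:
    """Compute three illustrative byte-level features (hypothetical, not the real 23)."""
    sample = chunk[:CHUNK_SIZE]
    n = len(sample)
    if n == 0:
        return {"null_ratio": 0.0, "control_ratio": 0.0, "high_ratio": 0.0}
    nulls = sample.count(0)
    # Control bytes below 0x20, excluding tab, line feed, and carriage return.
    control = sum(1 for b in sample if b < 0x20 and b not in (0x09, 0x0A, 0x0D))
    # Bytes with the top bit set (non-ASCII).
    high = sum(1 for b in sample if b >= 0x80)
    return {
        "null_ratio": nulls / n,
        "control_ratio": control / n,
        "high_ratio": high / n,
    }


print(toy_features(b"plain ASCII text\n"))
print(toy_features("中文".encode("big5")))
print(toy_features(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```

The Big5 sample consists largely of high bytes even though it is text, which hints at why a single byte-ratio threshold tends to struggle with CJK encodings and why combining several features in a trained classifier can help.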
What's new
- CLI tool. Run binaryornot myfile.png from the command line and get True or False. Thanks @moluwole! (#49)
- 49 binary formats recognized. PNG, JPEG, GIF, BMP, TIFF, ICO, WebP, PSD, HEIF, PDF, OLE2 (.doc/.xls), SQLite, ZIP, gzip, xz, bzip2, 7z, RAR, Zstandard, ELF, Mach-O, MZ/PE, Java class, WebAssembly, Dalvik DEX, RIFF, Ogg, FLAC, MP4/MOV, MP3, Matroska/WebM, MIDI, WOFF, WOFF2, OTF, TTF, EOT, Apache Parquet, .pyc, .DS_Store, LLVM bitcode, Git packfiles, and more. Every format cites its specification and is verified by magic-byte tests and real file fixtures.
- 37 text encodings covered. UTF-8, UTF-16, UTF-32, all major single-byte encodings (ISO-8859, Windows code pages, KOI8-R, Mac encodings), and CJK encodings (GB2312, GBK, GB18030, Big5, Shift-JIS, EUC-JP, EUC-KR, ISO-2022-JP). A Big5-encoded Chinese document is correctly identified as text, not binary.
- Encoding and format coverage tracked in CSVs. encodings.csv and binary_formats.csv are the single source of truth, feeding training data, parametrized tests, and documentation. Four gaps are documented with reasons (ISO-2022-KR and three EBCDIC code pages).
What's better
- 8x fewer bytes read per file. The detector reads 128 bytes instead of 1024. The decision tree's features stabilize well within that range.
- 211 tests, up from 48. Encoding round-trips, binary format magic bytes, real file fixtures for 16 formats, tiny-chunk edge cases, and boundary conditions. The decision tree is trained with balanced class weights and 5 targeted Hypothesis strategies (structured binary, binary with embedded strings, compressed binary, CJK text, whitespace-heavy text).
- SQLite databases correctly detected as binary. Thanks @pombredanne! (#44)
- Proper error logging for file I/O issues. Uses logger.exception() for better diagnostics when a file can't be read. Thanks @MarshalX! (#629)
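For readers unfamiliar with the pattern, here is a minimal sketch of reading a 128-byte chunk while logging I/O failures with logger.exception(). The helper name read_starting_chunk and the logger name are hypothetical, and the sketch assumes the error is re-raised after logging; it is not BinaryOrNot's actual source.

```python
import logging

logger = logging.getLogger("binaryornot_sketch")  # hypothetical logger name


def read_starting_chunk(path: str, size: int = 128) -> bytes:
    """Read up to `size` bytes from the start of a file (hypothetical helper)."""
    try:
        with open(path, "rb") as f:
            return f.read(size)
    except OSError:
        # logger.exception() logs at ERROR level and appends the traceback.
        # FileNotFoundError and PermissionError are both OSError subclasses.
        logger.exception("Could not read %s", path)
        raise
```

Logging and then re-raising keeps the traceback in the log while still letting the caller decide how to handle the failure.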
What's fixed
- chardet 7.0.0 crash (#634). chardet 7 returns {'encoding': None, 'confidence': 0.99}, which crashed is_binary_string() with a TypeError, then crashed the error handler with a NameError from a Python 2 unicode() call. Both crash paths are structurally impossible now because chardet is gone. Thanks @wesleybl for the report!
- Unreadable files raise instead of returning False. is_binary() on a nonexistent or permission-denied file now raises FileNotFoundError or PermissionError. Previously it silently returned False, making broken paths indistinguishable from text files.
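The two-step chardet failure can be reproduced in a few lines. The sketch below is a reconstruction of the failure mode as described in #634, not the original source; old_style_decode is a hypothetical name, and the detection result is hard-coded rather than produced by chardet.

```python
# Shape of the detection result described in #634 (hard-coded here).
detected = {"encoding": None, "confidence": 0.99}
data = b"caf\xc3\xa9"


def old_style_decode(data: bytes, detected: dict) -> None:
    """Hypothetical reconstruction of the crash path; not the original code."""
    try:
        # bytes.decode() rejects encoding=None with a TypeError.
        data.decode(encoding=detected["encoding"])
    except TypeError:
        # A Python 2 era fallback: unicode() does not exist on Python 3,
        # so the error handler itself fails with a NameError.
        unicode(data, encoding=detected["encoding"])  # noqa: F821


try:
    old_style_decode(data, detected)
except NameError as exc:
    print(f"error handler crashed: {exc!r}")
```

With no encoding detector in the loop, there is no detection result to be None and no fallback branch left to fail.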
What's changed
- Zero dependencies. pip install binaryornot installs nothing else. chardet is no longer needed.
- Python 3.12+ only. Python 2 and older Python 3 versions are no longer supported. All Python 2 compatibility code has been removed.
- MIT license (previously BSD).
- src/ layout with hatchling build system, replacing setup.py/setup.cfg.
Contributors
@audreyfeldroy (Audrey M. Roy Greenfeld) designed and built this release: the trained decision tree, encoding and binary format coverage matrices, Hypothesis-based training pipeline, fixture generation, documentation, and the complete modernization from Cookiecutter PyPackage.
Thanks to @pombredanne (Philippe Ombredanne) for SQLite detection and binary stream improvements, @moluwole for the CLI tool, @MarshalX (Ilya Siamionau) for better error logging, @thebaptiste for pyproject.toml migration (#633), @wesleybl for reporting the chardet 7 crash (#634), @alcuin2 for binary detection improvements (#48), @olaoluwa-98 for CI updates (#50), and @cosmic-byte for test fixes (#52).