mhx/dwarfs v0.7.0 on GitHub

This release took much longer than anticipated, but comes with a rather big surprise (for me, at least): Windows support! I didn't expect this to happen just yet, especially given that I haven't really used Windows over the past two decades. My biggest worries were all the dependencies, but fortunately I came across vcpkg and all of a sudden, porting DwarFS to Windows seemed feasible. So here we are, and all the different tools (mkdwarfs, dwarfsck, dwarfsextract and the FUSE driver dwarfs) are now working on Windows.

As of this release, in addition to the "classic" statically linked binaries, DwarFS is also available as a universal binary for each platform. The universal binaries bundle the four main tools (mkdwarfs, dwarfsck, dwarfsextract, dwarfs) in a single, compressed binary that is between 2.5 and 4 MiB in size, a fraction of the size of the standalone binaries. The tools can be accessed either by passing the --tool=<name> option as the first argument, or, more conveniently, by creating symbolic links to the universal binary using the name of the respective tool.
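The two ways of selecting a tool from the universal binary can be pictured as a classic multi-call dispatch. The following Python sketch is only an illustration of that idea, not the actual DwarFS implementation: the function name `select_tool`, the precedence of `--tool=` over the invocation name, and the stripping of an `.exe` suffix are all assumptions.

```python
import os

# The four tools bundled in the universal binary.
TOOL_NAMES = ("mkdwarfs", "dwarfsck", "dwarfsextract", "dwarfs")


def select_tool(argv):
    """Return (tool_name, remaining_args) for a multi-call binary.

    Hypothetical helper: tool_name is None when no tool can be determined.
    """
    args = list(argv[1:])
    # 1. An explicit --tool=<name> as the first argument wins (assumed order).
    if args and args[0].startswith("--tool="):
        name = args[0][len("--tool="):]
        return (name if name in TOOL_NAMES else None), args[1:]
    # 2. Otherwise fall back to the name the binary was invoked as,
    #    e.g. through a symbolic link called "mkdwarfs".
    invoked_as = os.path.splitext(os.path.basename(argv[0]))[0]
    if invoked_as in TOOL_NAMES:
        return invoked_as, args
    return None, args
```

With this scheme, a symbolic link named `mkdwarfs` pointing at the universal binary behaves like the standalone tool, while invoking the binary under its own name requires `--tool=<name>`.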

New Features

Windows support. All tools are fully working on Windows, including features such as hard links, symbolic links and Unicode file names. Thanks to WinFsp, the FUSE driver is also working, albeit with a few quirks (1, 2, 3, 4) compared to the Linux version.
Universal binaries that bundle all tools in a single binary. On Windows, the universal binary supports delayed loading of the WinFsp DLL. This makes the mkdwarfs, dwarfsck and dwarfsextract tools usable without the WinFsp DLL.
Added support for Brotli compression. This is generally much slower at compression than ZSTD or LZMA, but faster than LZMA at decompression, while offering a compression ratio better than ZSTD. Fixes github #76.
Added --filter option to support simple (rsync-like) filter rules. This resulted from a discussion on github #6.
Added --compress-niceness option to mkdwarfs. This lowers the priority of the compression worker threads, which has two advantages: a system running mkdwarfs will generally be more responsive, and the compression threads won't starve themselves by taking processing power away from the segmenter.
Added --stdout-progress option to dwarfsextract for use with tools such as yad. Fixes github #117.
Added --chmod option to mkdwarfs. Fixes github #7.
Added --input-list option to support reading a list of input files from a file or stdin. At least partially fixes github #6.
Added support for choosing the file hashing algorithm using the --file-hash option. This allows you to pick a secure hash instead of the default XXH3 hash. Also fixes github #92.
Added --max-similarity-size option to prevent similarity hashing of huge files. This saves scanning time, especially on slow file systems, while it shouldn't affect compression ratio too much.
Added --num-scanner-workers option.
Added support for extracting corrupted file systems with dwarfsextract. This is enabled using the --continue-on-error and, if really needed, --disable-integrity-check options. Fixes github #51.
Show throughput in the scanning and segmenting phases in mkdwarfs.
Show how much of a file has been consumed in the segmenting phase in mkdwarfs. Useful primarily for large files.
New metadata format (v2.5). The only change is the addition of a "preferred path separator". This is used to correctly interpret symbolic links, as this is the only place where path separators are stored in DwarFS at all.
dwarfs and dwarfsextract now have options to enable performance monitoring. This can provide insight into the latency of various file system operations.
Unreadable files are now added as empty files instead of being ignored. Fixes github #40.
Honour user locale settings when formatting numbers.
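To illustrate why the metadata format now records a "preferred path separator": a symbolic link target stored in an image built on Windows may contain backslashes, which need to be translated when the image is read on a system that uses forward slashes, and vice versa. The sketch below assumes that interpreting a link target amounts to a plain separator substitution; the function name `localize_link_target` is hypothetical and this is not DwarFS's actual code.

```python
def localize_link_target(target, stored_sep, local_sep):
    """Rewrite a symlink target from the image's separator to the local one.

    stored_sep: the preferred path separator recorded in the image metadata.
    local_sep:  the separator used by the system reading the image.
    (Assumption: a plain substitution is sufficient for this illustration.)
    """
    if stored_sep == local_sep:
        # Image and host agree, nothing to translate.
        return target
    return target.replace(stored_sep, local_sep)


# A link created on Windows, read back on Linux.
print(localize_link_target("..\\lib\\foo.dll", "\\", "/"))
```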

Performance improvements

Added a small offset cache to improve random access as well as sequential read latency for large, fragmented files. This gave a 100x higher throughput for a case where DwarFS was used to compress raw file system images. The DwarFS FUSE driver is now capable of achieving read throughput of more than 6 GB/s on a Xeon(R) E-2286M machine.
Bypass the block cache for uncompressed blocks. This saves copying block data to memory unnecessarily and allows us to keep all uncompressed blocks accessible directly through the memory mapping. Partially addresses github #139.
Improved de-duplication algorithm to only hash files with the same size. File hashing is delayed until at least one more file with the same size is discovered. This happens automatically and should improve scanning speed, especially on slow file systems.
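The delayed hashing used by the improved de-duplication can be sketched roughly as follows. This is a simplified Python illustration and not the mkdwarfs implementation: files are modelled as in-memory `(name, data)` pairs, the function name `find_duplicates` is made up, and SHA-256 from `hashlib` stands in for the default XXH3 hash, which is not part of the Python standard library.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files, algorithm="sha256"):
    """Group files with identical content, hashing only when sizes collide.

    `files` is an iterable of (name, data) pairs standing in for a scan.
    Returns (groups_of_duplicates, names_that_were_hashed).
    """
    first_of_size = {}            # size -> first (name, data) seen, or None once hashed
    by_digest = defaultdict(list)  # (size, digest) -> names
    hashed = []

    def hash_file(name, data):
        digest = hashlib.new(algorithm, data).hexdigest()
        by_digest[(len(data), digest)].append(name)
        hashed.append(name)

    for name, data in files:
        size = len(data)
        if size not in first_of_size:
            # First file of this size: remember it, but do not hash it yet.
            first_of_size[size] = (name, data)
            continue
        pending = first_of_size[size]
        if pending is not None:
            # A second file of this size showed up: hash the deferred one now.
            hash_file(*pending)
            first_of_size[size] = None
        hash_file(name, data)

    groups = [names for names in by_digest.values() if len(names) > 1]
    return groups, hashed


files = [("a", b"hello"), ("b", b"xyz"), ("c", b"hello"), ("d", b"world"), ("e", b"four")]
groups, hashed = find_duplicates(files)
```

In this example only the three 5-byte files are hashed. Files `b` and `e` have unique sizes and are never read for hashing, which is where the scanning time is saved on slow file systems.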

Bugfixes

Use folly::hardware_concurrency(). Fixes github #130.
Handle ARCHIVE_FAILED status from libarchive, which could be triggered by trying to write long path names to old archive formats (e.g. USTAR, which limits path names to at most 255 characters).
Properly handle Unicode path truncation.
Support LZ4 compression levels above 9.
Fix heap-use-after-free in dwarfsextract due to missing archive_write_close() call.
Fix heap-use-after-free in brotli decompressor due to re-allocation of the decompressed block data.
Default FUSE driver debuglevel to warn in background mode. Fixes github #113.
Fixed extract_block.py, which was incorrectly using printf instead of print.

Documentation

Updated file system format documentation to cover headers and section indices.
Documented how to produce bit-identical images.
Updated internal operation section of mkdwarfs manpage.

Testing

Lots of new tools tests.
Removed dependency on tar and diff binaries, mainly driven by their unavailability on Windows.
Added GitHub workflow based CI pipeline to avoid regressions and simplify builds.

Other

The compression code has been made more modular. This should make it much easier to add support for more compression algorithms in the future.
Started using C++20 features.
Versioning files are no longer written to the git source tree.
