After more than 600 commits, it's time for another major release. In addition to a long list of fixes, there are quite a few new features, most notably a categorization framework that allows identifying different categories of files and treating them differently. Right now, there are only two categorizers — pcmaudio and incompressible — but there are hopefully more to come. Along with the pcmaudio categorizer, support for FLAC compression has been added. This allows for large collections of uncompressed audio files to be archived efficiently, and also accessed efficiently: the DwarFS FUSE driver can decode a large audio file using multiple cores, something that cannot be done with a single compressed FLAC file.
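The multi-core point above can be illustrated with a minimal sketch. It is not DwarFS code: `zlib` stands in for FLAC, and `BLOCK_SIZE`, `compress_blocks` and `decompress_parallel` are hypothetical names. The only idea it shows is that data stored as independently compressed blocks can be decoded by several workers at once, whereas a single compressed stream has to be decoded front to back.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size for this sketch; DwarFS block sizes are configurable
# and are not modelled here.
BLOCK_SIZE = 1 << 16


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size pieces and compress each piece independently."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def decompress_parallel(blocks: list[bytes], workers: int = 4) -> bytes:
    """Decompress independent blocks concurrently and reassemble them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the output is deterministic.
        return b"".join(pool.map(zlib.decompress, blocks))


# Stand-in for uncompressed audio data (1 MiB of repeating bytes).
pcm = bytes(range(256)) * 4096
blocks = compress_blocks(pcm)
restored = decompress_parallel(blocks)
```

Because every block is self-contained, the number of workers only affects speed, never the result.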
The project code is now tested much more thoroughly; various new abstractions allow the command line interfaces to actually be covered by the unit tests.
Also, unlike many previous releases, images produced by this release will be compatible with older releases as long as they don't use new features like FLAC compression or history sections, which are unsupported by older releases. The 0.7.3 and later releases will even deal with unknown sections and compression algorithms. Going forward, use of new features will be tracked by feature flags, so older releases can determine if the feature set used by a file system image is fully or partially supported.
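As a rough illustration of the feature-flag idea, a reader could compare the features recorded in an image against the set it knows about. The sketch below is an assumption-laden toy: the feature names and the `unsupported_features` / `support_level` helpers are hypothetical and do not reflect the actual DwarFS metadata format or API.

```python
# Hypothetical set of features known to this (imaginary) reader.
SUPPORTED_FEATURES = frozenset({"example-feature-a", "example-feature-b"})


def unsupported_features(image_features, supported=SUPPORTED_FEATURES):
    """Return the sorted list of features used by the image but unknown to the reader."""
    return sorted(set(image_features) - set(supported))


def support_level(image_features, supported=SUPPORTED_FEATURES):
    """Classify an image as 'full' or 'partial' depending on unknown features."""
    return "full" if not unsupported_features(image_features, supported) else "partial"


old_image = ["example-feature-a"]
new_image = ["example-feature-a", "example-feature-z"]
```

An image that uses none of the newer features is reported as fully supported, which matches the compatibility behaviour described above.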
Last but not least, the binaries can now be built with the manual pages built in. This is particularly useful on Windows, where `man` is not a thing, but also with the universal binaries if you don't have a full install and need to quickly check the manual. The manuals can be read using the `--man` option.
New Features
- Categorizer framework. The initially supported categorizers are `pcmaudio` (detects audio data and metadata and provides context for the FLAC compressor) and `incompressible` (detects "incompressible" data). Categorizers are enabled using the `--categorize` option.
- Multiple segmenters can now run in parallel and write to the same filesystem image in a fully deterministic way. Currently, one segmenter instance is used per category/subcategory. This can make segmenting multi-threaded in cases where there are multiple categories. The number of segmenter worker threads can be configured using `--num-segmenter-workers`.
- The segmenter now supports different "granularities". The granularity is determined by the categorizer. For example, when segmenting the audio data in a 16-bit stereo PCM file, the granularity is 4 (bytes). This ensures that the segmenter will only produce chunks that start/end on a sample boundary.
- The segmenter now also features simple "repeating sequence detection". Under certain conditions, these sequences could cause the segmenter to slow down dramatically. See github #161 for details.
- FLAC compression. This can only be used along with the `pcmaudio` categorizer. Due to the way data is spread across different blocks, both FLAC compression and decompression can likely make use of multiple CPU cores for large audio files, meaning that loading a `.wav` file from a DwarFS image using FLAC compression will likely be much faster than loading the same data from a single FLAC file.
- Completely new similarity ordering implementation that supports multi-threaded and fully deterministic nilsimsa ordering. Also, nilsimsa options are now ever so slightly more user friendly.
- The `--recompress` feature of `mkdwarfs` has been largely rewritten. It now ensures the input filesystem is checked before an attempt is made to recompress it. Decompression is now using multiple threads. Also, recompression can be applied only to a subset of categories and compression options can be selected per category.
- `mkdwarfs` now stores a history block in the output image by default. The history block contains information about the version of `mkdwarfs`, all command line arguments, and a time stamp. A new history entry will be added whenever the image is altered (i.e. by using `--recompress`). The history can be displayed using `dwarfsck`. History timestamps can be disabled using `--no-history-timestamps` for bit-identical images. History creation can also be completely disabled using `--no-history`.
- All tools now come with built-in manual pages. This is valuable especially on Windows, which doesn't have `man` at all, or for the universal binaries, which are usually not installed alongside the manual pages. Running each tool with `--man` will show the manual page for the tool, using the configured pager. On Windows, if `less.exe` is in the PATH, it'll also be used as a pager.
- New `verbose` logging level (between `info` and `debug`).
- Logging now properly supports multi-line strings.
- Show compression library versions as part of the `--help` output. For `dwarfsextract`, also show the `libarchive` version.
- `--set-time` now supports time strings in different formats (e.g. `20240101T0530`).
- `mkdwarfs` can now write the filesystem image to `stdout`, making it possible to directly stream the output image to e.g. `netcat`.
- Progress display for `mkdwarfs` has been completely overhauled. Different components (e.g. hashing, categorization, segmenting, ...) can now display their own progress in addition to a "global" progress.
- `mkdwarfs` now supports ordering by "reverse path" with `--order=revpath`. This is like `path` ordering, but with the path components reversed (i.e. `foo/bar/baz.xyz` will be ordered as if it were `baz.xyz/bar/foo`).
- It is now possible to configure larger bloom filters in `mkdwarfs`.
- The `mkdwarfs` segmenter can now be fully disabled using `-W 0`.
- `mkdwarfs` now adds "feature sets" to the filesystem metadata. These can be used to introduce new features without necessarily breaking compatibility with older tools. As long as a filesystem image doesn't actively use the new features, it can still be read by old tools. Addresses github #158.
- `dwarfsck` has a new `--quiet` option that will only report errors.
- `dwarfsck` with `--print-header` will exit with a special exit code (2) if the image has no header. In all other cases, the exit code will be 0 (no error) or 1 (error).
- The `--json` option of `dwarfsck` now outputs filesystem information in JSON format.
- `dwarfsck` has a new `--no-check` option that skips checking all block hashes. This is useful for quickly accessing filesystem information.
- The FUSE driver exposes a new `dwarfs.inodeinfo` xattr on Linux that contains a JSON object with information about the inode, e.g. a list of chunks and associated categories.
- Don't enable `readlink` in the FUSE driver if the filesystem has no symlinks. This is mainly useful for Windows, where symlink support increases the number of `getattr` calls issued by `WinFsp`.
- As an experimental feature, CPU affinity for each worker group can be configured via the `DWARFS_WORKER_GROUP_AFFINITY` environment variable. This works for all tools, but is really only useful if you have different types of cores (e.g. performance and efficiency cores) and would like to e.g. always run the segmenter on a performance core.
- The universal binaries are now compressed with a different `upx` compression level, making them slightly bigger but much faster to decompress.
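The "reverse path" ordering from the list above can be sketched in a few lines. This is only an illustration of the description given there (`foo/bar/baz.xyz` is ordered as if it were `baz.xyz/bar/foo`); `revpath_key` is a hypothetical helper, and the exact comparison used inside `mkdwarfs` may differ.

```python
def revpath_key(path: str) -> str:
    """Build a sort key by reversing the order of the path components."""
    return "/".join(reversed(path.split("/")))


paths = ["foo/bar/baz.xyz", "abc/def/aaa.txt", "zzz/bar/baz.xyz"]

# Files with the same name end up next to each other, regardless of
# which top-level directory they live in.
ordered = sorted(paths, key=revpath_key)
```

With this key, the two `baz.xyz` files become neighbours even though their full paths sort far apart.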
Bugfixes
- Allow version override for nixpkgs. Fixes github #155.
- Resize progress bar when terminal size changes. Fixes github #159.
- Add Extended Attributes section to README. Fixes github #160.
- Support 32-bit uid/gid/mode. Also support more than 65536 uids/gids/modes in a filesystem image. Fixes github #173.
- Add workaround for broken `utf8cpp` release. Fixes github #182.
- Don't call `check_section()` in filesystem ctor, as it renders the section index useless. Also add regression test to ensure this won't be accidentally reintroduced. Fixes github #183.
- Ensure timely exit in progress dtor. This could occasionally block command line tools for a few seconds before exiting.
- `--set-owner` and `--set-group` did not work properly with non-zero ids. There were two distinct issues: (1) when building a DwarFS image with `--set-owner` and/or `--set-group`, the single uid/gid was stored in place of the index and the respective lookup vectors were left empty and (2) when reading such a DwarFS image, the uid/gid was always set to zero. The issue with (1) is not only that it's a special case, but it also wastes metadata space by repeatedly storing a potentially wide integer value. This fix addresses both issues. The uid/gid information is now stored more efficiently and, when reading an image using the old representation, the correct uid/gid will be reported. Unit tests were added to ensure both old and new formats are read correctly.
- `mkdwarfs` is now much better at handling inaccessible or vanishing files. In particular on Windows, where a successful `access()` call doesn't necessarily mean it'll be possible to open a file, this will make it possible to create a DwarFS file system from hierarchies containing inaccessible files. On other platforms, this means `mkdwarfs` can now handle files that are vanishing while the file system is being built.
- `mkdwarfs` progress updates are now "atomic", i.e. one update is always written with a single system call. This didn't make much of a difference on Linux, but the notoriously slow Windows terminal, along with somewhat interesting thread scheduling, would sometimes make the updates look like a typewriter in slow-motion.
- `utf8_truncate()` didn't handle zero-width characters properly. This could cause issues when truncating certain UTF-8 strings.
- A race condition in `simple` progress mode was fixed.
- A race condition in `filesystem_writer` was fixed.
- The `--no-create-timestamp` option in `mkdwarfs` was always enabled and thus useless.
- Common options (like `--log-level`) were inconsistent between tools.
- Progress was incorrect when `mkdwarfs` was copying sections with `--recompress`.
- Treat NTFS junctions like directories.
- Fix canonical path on Windows when accessing mounted DwarFS image.
- Fix slow sorting in `file_scanner` due to path comparison.
- On Windows, don't crash with an assertion if the input path for `mkdwarfs` is not found.
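The `--set-owner`/`--set-group` fix in the list above mentions an index into a lookup vector. The general idea can be sketched as follows. This is a simplified illustration, not the actual DwarFS metadata layout; `build_id_table` and `resolve` are hypothetical names.

```python
def build_id_table(ids):
    """Return (lookup, indices): unique ids in first-seen order, plus one index per entry."""
    lookup = []      # each distinct (potentially wide) id is stored once
    position = {}    # id -> position in lookup
    indices = []     # small per-inode index into lookup
    for value in ids:
        if value not in position:
            position[value] = len(lookup)
            lookup.append(value)
        indices.append(position[value])
    return lookup, indices


def resolve(lookup, index):
    """Map a stored index back to the original id."""
    return lookup[index]


# Hypothetical uids of five inodes.
uids = [1000, 1000, 0, 1000, 33]
lookup, indices = build_id_table(uids)
```

Storing the wide id directly in place of the index, as the old special case did, defeats this scheme because the same value is repeated for every entry.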
Removed Features
- Python scripting support has been completely removed.
Documentation
- Add mkdwarfs sequence diagram.
- Document known issues with WinFsp.
- Update README with extended attributes information.
- Add script to check if all options are documented in manpage.
Building
- Factor out repetitive thrift library code in CMakeLists.txt.
- Use FetchContent for both `fmt` and `googletest`.
- Use `mold` for linking when available.
- The CI workflow now uploads coverage information to codecov.io with every commit.
Testing
- A ton of tests were added (from 4 kLOC to more than 10 kLOC) and, unsurprisingly, a number of bugs were found in the process.
- Introduced I/O abstraction layer for all `*_main()` functions. This allows testing of almost all tool functionality without the need to start the tool as a subprocess. It also makes it easier to inject errors and to change properties such as the terminal size.
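The I/O abstraction idea can be sketched like this. The real abstraction layer is C++ and is not shown here; `tool_main` and its parameters are hypothetical. The point is that a main function which receives its streams and terminal properties as arguments can be called directly from a test, with no subprocess involved.

```python
import io
import sys


def tool_main(args, out=None, err=None, terminal_width=80):
    """Toy entry point: echo each argument, truncated to the terminal width."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    if not args:
        err.write("error: no input given\n")
        return 1
    for arg in args:
        out.write(arg[:terminal_width] + "\n")
    return 0


# In a test, the streams are replaced with in-memory buffers and the
# terminal size is set to whatever the test needs.
fake_out, fake_err = io.StringIO(), io.StringIO()
exit_code = tool_main(["abcdefghij"], out=fake_out, err=fake_err, terminal_width=4)
```

The error path can be exercised the same way, by calling `tool_main([])` and inspecting the captured error stream.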