Dataset Features
- feat: use content defined chunking by @kszucs in #7589
  - pass use_content_defined_chunking=True when writing Parquet files; this enables fast deduplicated uploads to Hugging Face!

        # Now faster thanks to content defined chunking
        ds.push_to_hub("username/dataset_name")

  - this optimizes Parquet for Xet, the deduplication-based storage backend of Hugging Face. Data that already exists somewhere on HF (in another file or version, for example) does not have to be uploaded again. Parquet content defined chunking sets Parquet page boundaries based on the content of the data, so that duplicate data is easy to detect.
- HDF5 support by @klamike in #7690
  - load HDF5 datasets in one line of code

        ds = load_dataset("username/dataset-with-hdf5-files")

  - each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
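
To make the content defined chunking idea from the first feature above more concrete, here is a toy sketch in plain Python. It is not how Parquet, pyarrow, or Xet implement chunking: the gear-style rolling hash, the `mask` and `min_size` values, and every function name below are assumptions chosen for illustration. It only shows why boundaries derived from content let most chunks survive a small insertion, while fixed-size boundaries shift and invalidate everything after the edit.

```python
import hashlib
import random

# Hypothetical gear table: one pseudo-random 32-bit value per possible byte.
# Seeded so the example is deterministic; real implementations differ.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, mask: int = 0x3F, min_size: int = 16) -> list:
    """Cut wherever a rolling hash of the most recent bytes hits a fixed pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear-style update: older bytes are shifted out of the low bits.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if i + 1 - start >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def reused_fraction(old_chunks, new_chunks) -> float:
    """Fraction of the new bytes that sit in chunks already present in the old version."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    reused = sum(len(c) for c in new_chunks if hashlib.sha256(c).digest() in known)
    return reused / sum(len(c) for c in new_chunks)


# Simulate a new file version: the same data with a few bytes inserted near the start.
data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(20_000))
edited = original[:1000] + b"INSERTED" + original[1000:]

fixed_reuse = reused_fraction(fixed_chunks(original), fixed_chunks(edited))
cdc_reuse = reused_fraction(content_defined_chunks(original), content_defined_chunks(edited))
print(f"fixed-size chunks reused: {fixed_reuse:.1%}")
print(f"content-defined chunks reused: {cdc_reuse:.1%}")
```

In this toy setup, the fixed-size scheme can only reuse the chunks that come before the insertion point, whereas the content-defined scheme is expected to resynchronize shortly after it, so a deduplicating store would need to receive only the few chunks around the edit.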
Other improvements and bug fixes
- Convert to string when needed + faster .zstd by @lhoestq in #7683
- fix audio cast storage from array + sampling_rate by @lhoestq in #7684
- Fix misleading add_column() usage example in docstring by @ArjunJagdale in #7648
- Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in #7438
- Update fsspec max version to current release 2025.7.0 by @rootAvish in #7701
- Update dataset_dict push_to_hub by @lhoestq in #7711
- Retry intermediate commits too by @lhoestq in #7712
- num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in #7702
- Update cli.mdx to refer to the new "hf" CLI by @evalstate in #7713
- fix num_proc=1 ci test by @lhoestq in #7714
- Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in #7715
- typo by @lhoestq in #7716
- fix largelist repr by @lhoestq in #7735
- Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in #7730
- Fix type hint train_test_split by @qgallouedec in #7736
- fix(webdataset): don't .lower() field_name by @YassineYousfi in #7726
- Refactor HDF5 and preserve tree structure by @klamike in #7743
- docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in #7737
- Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in #7761
- Support pathlib.Path for feature input by @Joshua-Chin in #7755
- add support for pyarrow string view in features by @onursatici in #7718
- Fix typo in error message for cache directory deletion by @brchristian in #7749
- update torchcodec in ci by @lhoestq in #7764
- Bump dill to 0.4.0 by @Bomme in #7763
New Contributors
- @DavidRConnell made their first contribution in #7438
- @rootAvish made their first contribution in #7701
- @tanuj-rai made their first contribution in #7702
- @evalstate made their first contribution in #7713
- @brchristian made their first contribution in #7730
- @klamike made their first contribution in #7690
- @YassineYousfi made their first contribution in #7726
- @Sanjaykumar030 made their first contribution in #7737
- @kszucs made their first contribution in #7589
- @Joshua-Chin made their first contribution in #7755
- @onursatici made their first contribution in #7718
- @Bomme made their first contribution in #7763
Full Changelog: 4.0.0...4.1.0