huggingface/datasets 4.1.0 on GitHub

Dataset Features

feat: use content defined chunking by @kszucs in #7589
- pass use_content_defined_chunking=True when writing Parquet files
- this enables fast, deduplicated uploads to Hugging Face!
```
# Now faster thanks to content defined chunking
ds.push_to_hub("username/dataset_name")
```
- this optimizes Parquet for Xet, the deduplication-based storage backend of Hugging Face. Data that already exists somewhere on HF (in another file or version, for example) does not have to be uploaded again. Parquet content defined chunking sets Parquet page boundaries based on the content of the data, so duplicate data is easy to detect.
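To see why content-based boundaries help deduplication, here is a toy sketch in pure Python. It is not the Parquet or Xet implementation: the `WINDOW` and `MASK` values, the use of `zlib.crc32` as the boundary hash, and all function names are assumptions chosen for illustration. It compares how many chunks survive a small insertion at the start of a byte stream under content-defined versus fixed-size chunking.

```python
import random
import zlib

WINDOW = 16   # bytes hashed to decide on a boundary (illustrative value)
MASK = 0x3F   # cut when the low 6 hash bits are zero, roughly 1 position in 64


def content_defined_chunks(data: bytes) -> list[bytes]:
    """Cut after every position whose trailing WINDOW bytes hash to a masked zero."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def fixed_size_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def reused(old: list[bytes], new: list[bytes]) -> int:
    """Count distinct old chunks that also appear in the new version."""
    return len(set(old) & set(new))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"new!" + original  # a small insertion shifts every later byte

cdc_old, cdc_new = content_defined_chunks(original), content_defined_chunks(edited)
fix_old, fix_new = fixed_size_chunks(original), fixed_size_chunks(edited)

print("content-defined:", reused(cdc_old, cdc_new), "of", len(set(cdc_old)), "chunks reused")
print("fixed-size:     ", reused(fix_old, fix_new), "of", len(set(fix_old)), "chunks reused")
```

Because each boundary depends only on the bytes just before it, every original chunk after the first reappears unchanged in the edited stream, whereas fixed-size chunks typically all change once the data shifts. Production chunkers generally use a rolling hash and enforce minimum and maximum chunk sizes, which this sketch omits.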
Concurrent push_to_hub by @lhoestq in #7708
Concurrent IterableDataset push_to_hub by @lhoestq in #7710
HDF5 support by @klamike in #7690
- load HDF5 datasets in one line of code
```
from datasets import load_dataset
ds = load_dataset("username/dataset-with-hdf5-files")
```
- each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
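The "first dimension used for rows" rule can be pictured with a small stand-in that needs no HDF5 library. In this sketch a nested dictionary of lists plays the role of HDF5 groups and datasets. The `rows_from_tree` helper and the field names are hypothetical, and how the real loader names or nests the resulting columns is not shown here.

```python
def rows_from_tree(tree: dict):
    """Yield one dict per row by indexing every leaf along its first dimension."""

    def leaf_lengths(node):
        for value in node.values():
            if isinstance(value, dict):
                yield from leaf_lengths(value)
            else:
                yield len(value)

    def take(node, i):
        # Recurse into nested groups, index leaves at row i.
        return {
            key: take(value, i) if isinstance(value, dict) else value[i]
            for key, value in node.items()
        }

    lengths = set(leaf_lengths(tree))
    if len(lengths) > 1:
        raise ValueError(f"fields disagree on the number of rows: {sorted(lengths)}")
    num_rows = lengths.pop() if lengths else 0
    for i in range(num_rows):
        yield take(tree, i)


# Hypothetical file layout: one top-level field and one nested group.
fake_file = {
    "label": [0, 1, 2],                      # shape (3,)
    "sensors": {
        "temperature": [20.5, 21.0, 19.8],   # shape (3,)
        "image": [                           # shape (3, 2, 2)
            [[0, 1], [2, 3]],
            [[4, 5], [6, 7]],
            [[8, 9], [10, 11]],
        ],
    },
}

rows = list(rows_from_tree(fake_file))
print(len(rows))
print(rows[1])
```

Each field contributes one value per row, and any remaining dimensions (here the 2x2 `image`) stay inside that row's value.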

Other improvements and bug fixes

Convert to string when needed + faster .zstd by @lhoestq in #7683
fix audio cast storage from array + sampling_rate by @lhoestq in #7684
Fix misleading add_column() usage example in docstring by @ArjunJagdale in #7648
Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in #7438
Update fsspec max version to current release 2025.7.0 by @rootAvish in #7701
Update dataset_dict push_to_hub by @lhoestq in #7711
Retry intermediate commits too by @lhoestq in #7712
num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in #7702
Update cli.mdx to refer to the new "hf" CLI by @evalstate in #7713
fix num_proc=1 ci test by @lhoestq in #7714
Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in #7715
typo by @lhoestq in #7716
fix largelist repr by @lhoestq in #7735
Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in #7730
Fix type hint train_test_split by @qgallouedec in #7736
fix(webdataset): don't .lower() field_name by @YassineYousfi in #7726
Refactor HDF5 and preserve tree structure by @klamike in #7743
docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in #7737
Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in #7761
Support pathlib.Path for feature input by @Joshua-Chin in #7755
add support for pyarrow string view in features by @onursatici in #7718
Fix typo in error message for cache directory deletion by @brchristian in #7749
update torchcodec in ci by @lhoestq in #7764
Bump dill to 0.4.0 by @Bomme in #7763

New Contributors

@DavidRConnell made their first contribution in #7438
@rootAvish made their first contribution in #7701
@tanuj-rai made their first contribution in #7702
@evalstate made their first contribution in #7713
@brchristian made their first contribution in #7730
@klamike made their first contribution in #7690
@YassineYousfi made their first contribution in #7726
@Sanjaykumar030 made their first contribution in #7737
@kszucs made their first contribution in #7589
@Joshua-Chin made their first contribution in #7755
@onursatici made their first contribution in #7718
@Bomme made their first contribution in #7763

Full Changelog: 4.0.0...4.1.0