huggingface/datasets 2.11.0 on GitHub

Important

Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in #5573
- this removes the need for a PyTorch dependency to decode audio files
- this was made possible by soundfile 0.12, which bundles recent libsndfile binaries with MP3 support
Deprecated batch_size on Dataset.to_dict()

Datasets Features

Add writer_batch_size for ArrowBasedBuilder by @lhoestq in #5565
- allows specifying the row group / record batch size when you download_and_prepare() a dataset
Experimental support of cloud storage in load_dataset():
- Support cloud storage in load_dataset via fsspec by @dwyatte in #5580
- Pass down storage options by @dwyatte in #5673
Support PyArrow arrays as column values in from_dict by @mariosasko in #5643
Allow direct cast from binary to Audio/Image by @mariosasko in #5644
Add column_names to IterableDataset by @patrickloeber in #5582
pass the dataset features to the IterableDataset.from_generator function by @Hubert-Bonisseur in #5569
add Dataset.to_list by @kyoto7250 in #5611

General improvements and bug fixes

Update csv.py by @XDoubleU in #5562
Remove instructions for ffmpeg system package installation on Colab by @polinaeterna in #5558
Apply ruff flake8-comprehension checks by @Skylion007 in #5549
Fix datasets.load_from_disk, DatasetDict.load_from_disk and Dataset.load_from_disk by @alvarobartt in #5529
Add pre-commit config yaml file to enable automatic code formatting by @polinaeterna in #5561
Add huggingface_hub version to env cli command by @mariosasko in #5578
Do not write index by default when exporting a dataset by @mariosasko in #5583
Flatten dataset on the fly in save_to_disk by @mariosasko in #5588
Fix sort with indices mapping by @mariosasko in #5587
Fix docstring example by @stevhliu in #5592
Fix push_to_hub with no dataset_infos by @lhoestq in #5598
Don't compute checksums if not necessary in datasets-cli test by @lhoestq in #5603
Update README logo by @gary149 in #5605
Fix CI by temporarily pinning fsspec < 2023.3.0 by @albertvillanova in #5617
Fix archive fs test by @lhoestq in #5614
unpin fsspec by @lhoestq in #5619
Bump pyarrow to 8.0.0 by @lhoestq in #5620
Remove set_access_token usage + fail tests if FutureWarning by @Wauplin in #5623
Fix outdated verification_mode values by @polinaeterna in #5607
Adding Oracle Cloud to docs by @ahosler in #5621
Fix CI: ignore C901 ("some_func" is too complex) in ruff by @polinaeterna in #5636
add kwargs to index search by @SaulLu in #5628
Less zip false positives by @lhoestq in #5640
Allow self as key in Features by @mariosasko in #5646
Bump hfh to 0.11.0 by @lhoestq in #5642
Support streaming datasets with numpy.load by @albertvillanova in #5626
Fix unnecessary dict comprehension by @albertvillanova in #5662
Fix CI by temporarily pinning tensorflow < 2.12.0 by @albertvillanova in #5664
Copy features by @lhoestq in #5652
Improve features decoding in to_iterable_dataset by @lhoestq in #5655
Fix fsspec.open when using an HTTP proxy by @bryant1410 in #5656
Jax requires jaxlib by @lhoestq in #5667
docs: Update num_shards docs to mention num_proc on Dataset and DatasetDict by @connor-henderson in #5658
Allow loading/saving of FAISS index using fsspec by @Dref360 in #5526
Fix verification_mode when ignore_verifications is passed by @albertvillanova in #5683
Release: 2.11.0 by @lhoestq in #5684

New Contributors

@XDoubleU made their first contribution in #5562
@Skylion007 made their first contribution in #5549
@Hubert-Bonisseur made their first contribution in #5569
@ahosler made their first contribution in #5621
@patrickloeber made their first contribution in #5582
@SaulLu made their first contribution in #5628
@connor-henderson made their first contribution in #5658
@kyoto7250 made their first contribution in #5611

Full Changelog: 2.10.0...2.11.0