Important
- Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in #5573
  - this removes the dependency on PyTorch for decoding audio files
  - this is possible since soundfile 0.12, which bundles libsndfile binaries at a recent version with MP3 support
- Deprecated `batch_size` on `Dataset.to_dict()`
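As a rough illustration of the soundfile-based MP3 decoding mentioned above, the sketch below checks whether the installed soundfile build reports MP3 support and decodes a file with it. The helper names `mp3_decoding_available` and `decode_audio` are hypothetical and not part of `datasets`; the sketch only assumes the public `soundfile.available_formats()` and `soundfile.read()` functions.

```python
def mp3_decoding_available() -> bool:
    """Return True if soundfile is importable and its libsndfile lists MP3."""
    try:
        import soundfile as sf
    except (ImportError, OSError):
        # soundfile is not installed, or the libsndfile binary could not be loaded
        return False
    # available_formats() maps format names (e.g. "WAV", "FLAC") to descriptions
    return "MP3" in sf.available_formats()


def decode_audio(path):
    """Decode an audio file into (samples, sample_rate) without PyTorch."""
    import soundfile as sf

    samples, sample_rate = sf.read(path)
    return samples, sample_rate
```

If `mp3_decoding_available()` returns `False`, upgrading soundfile to 0.12 or later is the likely fix, since that is the version the note above says bundles a libsndfile with MP3 support.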
Datasets Features
- Add writer_batch_size for ArrowBasedBuilder by @lhoestq in #5565
  - allows specifying the row group / record batch size when you `download_and_prepare()` a dataset
- Experimental support of cloud storage in `load_dataset()`
- Support PyArrow arrays as column values in `from_dict` by @mariosasko in #5643
- Allow direct cast from binary to Audio/Image by @mariosasko in #5644
- Add column_names to IterableDataset by @patrickloeber in #5582
- pass the dataset features to the IterableDataset.from_generator function by @Hubert-Bonisseur in #5569
- add Dataset.to_list by @kyoto7250 in #5611
General improvements and bug fixes
- Update csv.py by @XDoubleU in #5562
- Remove instructions for `ffmpeg` system package installation on Colab by @polinaeterna in #5558
- Apply ruff flake8-comprehension checks by @Skylion007 in #5549
- Fix `datasets.load_from_disk`, `DatasetDict.load_from_disk` and `Dataset.load_from_disk` by @alvarobartt in #5529
- Add pre-commit config yaml file to enable automatic code formatting by @polinaeterna in #5561
- Add `huggingface_hub` version to env cli command by @mariosasko in #5578
- Do not write index by default when exporting a dataset by @mariosasko in #5583
- Flatten dataset on the fly in `save_to_disk` by @mariosasko in #5588
- Fix `sort` with indices mapping by @mariosasko in #5587
- Fix docstring example by @stevhliu in #5592
- Fix push_to_hub with no dataset_infos by @lhoestq in #5598
- Don't compute checksums if not necessary in `datasets-cli test` by @lhoestq in #5603
- Update README logo by @gary149 in #5605
- Fix CI by temporarily pinning fsspec < 2023.3.0 by @albertvillanova in #5617
- Fix archive fs test by @lhoestq in #5614
- unpin fsspec by @lhoestq in #5619
- Bump pyarrow to 8.0.0 by @lhoestq in #5620
- Remove set_access_token usage + fail tests if FutureWarning by @Wauplin in #5623
- Fix outdated `verification_mode` values by @polinaeterna in #5607
- Adding Oracle Cloud to docs by @ahosler in #5621
- Fix CI: ignore C901 ("some_func" is too complex) in `ruff` by @polinaeterna in #5636
- add kwargs to index search by @SaulLu in #5628
- Less zip false positives by @lhoestq in #5640
- Allow self as key in `Features` by @mariosasko in #5646
- Bump hfh to 0.11.0 by @lhoestq in #5642
- Support streaming datasets with numpy.load by @albertvillanova in #5626
- Fix unnecessary dict comprehension by @albertvillanova in #5662
- Fix CI by temporarily pinning tensorflow < 2.12.0 by @albertvillanova in #5664
- Copy features by @lhoestq in #5652
- Improve features decoding in to_iterable_dataset by @lhoestq in #5655
- Fix `fsspec.open` when using an HTTP proxy by @bryant1410 in #5656
- Jax requires jaxlib by @lhoestq in #5667
- docs: Update num_shards docs to mention num_proc on Dataset and DatasetDict by @connor-henderson in #5658
- Allow loading/saving of FAISS index using fsspec by @Dref360 in #5526
- Fix verification_mode when ignore_verifications is passed by @albertvillanova in #5683
- Release: 2.11.0 by @lhoestq in #5684
New Contributors
- @XDoubleU made their first contribution in #5562
- @Skylion007 made their first contribution in #5549
- @Hubert-Bonisseur made their first contribution in #5569
- @ahosler made their first contribution in #5621
- @patrickloeber made their first contribution in #5582
- @SaulLu made their first contribution in #5628
- @connor-henderson made their first contribution in #5658
- @kyoto7250 made their first contribution in #5611
Full Changelog: 2.10.0...2.11.0