huggingface/datasets 2.13.0 on GitHub

Dataset Features

Add IterableDataset.from_spark by @maddiedawson in #5770

Stream the data from your Spark DataFrame directly to your training pipeline

from datasets import IterableDataset
from torch.utils.data import DataLoader

ids = IterableDataset.from_spark(df)  # df: an existing pyspark.sql.DataFrame
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
    ...

IterableDataset formatting for PyTorch, TensorFlow, JAX, NumPy and Arrow:

IterableDataset Arrow formatting by @lhoestq in #5821
Iterable torch formatting by @lhoestq in #5852

from datasets import load_dataset

ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.

Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in #5893

from datasets import IterableDataset

ids = IterableDataset.from_file("path/to/data.arrow")

Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in #5944

from datasets import load_dataset

ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})

Experimental

Add parallel module using joblib for Spark by @es94129 in #5924

General improvements and bug fixes

Preserve stopping_strategy of shuffled interleaved dataset (random cycling case) by @mariosasko in #5816
Fix incomplete docstring for BuilderConfig by @Laurent2916 in #5824
[docs] Custom decoding transforms by @stevhliu in #5836
Add accelerate as metric's test dependency to fix CI error by @mariosasko in #5848
Add date_format param to the CSV reader by @mariosasko in #5845
[docs] Redirects, migrated from nginx by @julien-c in #5853
Fix infer module for uppercase extensions by @albertvillanova in #5872
Minor tqdm optim by @lhoestq in #5860
Always set nullable fields in the writer by @lhoestq in #5835
Add fn_kwargs to map and filter of IterableDataset and IterableDatasetDict by @yuukicammy in #5810
Better error message when combining dataset dicts instead of datasets by @lhoestq in #5861
Force overwrite existing filesystem protocol by @baskrahmer in #5894
Support working_dir in from_spark by @maddiedawson in #5826
Raise TypeError when indexing a dataset with bool by @albertvillanova in #5859
Fix minor typo in docs loading.mdx by @albertvillanova in #5900
Fix FixedSizeListArray casting by @mariosasko in #5897
Unpin responses by @mariosasko in #5916
Validate name parameter in make_file_instructions by @albertvillanova in #5904
Raise error in DatasetBuilder.as_dataset when file_format is not "arrow" by @mariosasko in #5915
Refactor extensions by @albertvillanova in #5917
Use more efficient and idiomatic way to construct list by @ttsugriy in #5909
Add flatten_indices to DatasetDict by @maximxlss in #5907
Optimize IterableDataset.from_file using ArrowExamplesIterable by @lhoestq in #5920
Make prepare_split more robust if errors in metadata dataset_info splits by @albertvillanova in #5901
Fix streaming parquet with image feature in schema by @lhoestq in #5921
canonicalize data dir in config ID hash by @kylrth in #5899
Fix link to quickstart docs in README.md by @mariosasko in #5928
Fix string-encoding, make batch_size optional, and minor improvements in Dataset.to_tf_dataset by @alvarobartt in #5883
Use a new low-memory approach for tf dataset index shuffling by @Rocketknight1 in #5863
[doc build] Use secrets by @mishig25 in #5932
Fix to_numpy when None values in the sequence by @qgallouedec in #5933
Better row group size in push_to_hub by @lhoestq in #5935
Avoid parallel redownload in cache by @albertvillanova in #5937
Better filenotfound for gated by @lhoestq in #5954
Make get_from_cache use custom temp filename that is locked by @albertvillanova in #5938
Fix ArrowExamplesIterable.shard_data_sources by @lhoestq in #5956
Add Arrow builder docs by @lhoestq in #5952
Fix sequence of array support for most dtype by @qgallouedec in #5948

New Contributors

@Laurent2916 made their first contribution in #5824
@yuukicammy made their first contribution in #5810
@baskrahmer made their first contribution in #5894
@ttsugriy made their first contribution in #5909
@maximxlss made their first contribution in #5907
@mariusz-jachimowicz-83 made their first contribution in #5893
@kylrth made their first contribution in #5899
@qgallouedec made their first contribution in #5933
@es94129 made their first contribution in #5924

Full Changelog: 2.12.0...2.13.0