Dataset Features
- Add IterableDataset.from_spark by @maddiedawson in #5770
  - Stream the data from your Spark DataFrame directly to your training pipeline

        from datasets import IterableDataset
        from torch.utils.data import DataLoader

        # df is an existing Spark DataFrame
        ids = IterableDataset.from_spark(df)
        ids = ids.map(...).filter(...).with_format("torch")
        for batch in DataLoader(ids, batch_size=16, num_workers=4):
            ...
- IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:
  - IterableDataset Arrow formatting by @lhoestq in #5821
  - Iterable torch formatting by @lhoestq in #5852

        from datasets import load_dataset

        ids = load_dataset("c4", "en", split="train", streaming=True)
        ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
- Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in #5893

      from datasets import IterableDataset

      ids = IterableDataset.from_file("path/to/data.arrow")
- Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in #5944

      from datasets import load_dataset

      ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
Experimental
General improvements and bug fixes
- Preserve `stopping_strategy` of shuffled interleaved dataset (random cycling case) by @mariosasko in #5816
- Fix incomplete docstring for `BuilderConfig` by @Laurent2916 in #5824
- [docs] Custom decoding transforms by @stevhliu in #5836
- Add `accelerate` as metric's test dependency to fix CI error by @mariosasko in #5848
- Add `date_format` param to the CSV reader by @mariosasko in #5845
- [docs] Redirects, migrated from nginx by @julien-c in #5853
- Fix infer module for uppercase extensions by @albertvillanova in #5872
- Minor tqdm optim by @lhoestq in #5860
- Always set nullable fields in the writer by @lhoestq in #5835
- Add `fn_kwargs` to `map` and `filter` of `IterableDataset` and `IterableDatasetDict` by @yuukicammy in #5810
- Better error message when combining dataset dicts instead of datasets by @lhoestq in #5861
- Force overwrite existing filesystem protocol by @baskrahmer in #5894
- Support working_dir in from_spark by @maddiedawson in #5826
- Raise TypeError when indexing a dataset with bool by @albertvillanova in #5859
- Fix minor typo in docs loading.mdx by @albertvillanova in #5900
- Fix `FixedSizeListArray` casting by @mariosasko in #5897
- Unpin responses by @mariosasko in #5916
- Validate name parameter in make_file_instructions by @albertvillanova in #5904
- Raise error in `DatasetBuilder.as_dataset` when `file_format` is not `"arrow"` by @mariosasko in #5915
- Refactor extensions by @albertvillanova in #5917
- Use more efficient and idiomatic way to construct list. by @ttsugriy in #5909
- Add `flatten_indices` to `DatasetDict` by @maximxlss in #5907
- Optimize IterableDataset.from_file using ArrowExamplesIterable by @lhoestq in #5920
- Make prepare_split more robust if errors in metadata dataset_info splits by @albertvillanova in #5901
- Fix streaming parquet with image feature in schema by @lhoestq in #5921
- canonicalize data dir in config ID hash by @kylrth in #5899
- Fix link to quickstart docs in README.md by @mariosasko in #5928
- Fix string-encoding, make `batch_size` optional, and minor improvements in `Dataset.to_tf_dataset` by @alvarobartt in #5883
- Use a new low-memory approach for tf dataset index shuffling by @Rocketknight1 in #5863
- [doc build] Use secrets by @mishig25 in #5932
- Fix `to_numpy` when None values in the sequence by @qgallouedec in #5933
- Better row group size in push_to_hub by @lhoestq in #5935
- Avoid parallel redownload in cache by @albertvillanova in #5937
- Better filenotfound for gated by @lhoestq in #5954
- Make get_from_cache use custom temp filename that is locked by @albertvillanova in #5938
- Fix ArrowExamplesIterable.shard_data_sources by @lhoestq in #5956
- Add Arrow builder docs by @lhoestq in #5952
- Fix sequence of array support for most dtype by @qgallouedec in #5948
New Contributors
- @Laurent2916 made their first contribution in #5824
- @yuukicammy made their first contribution in #5810
- @baskrahmer made their first contribution in #5894
- @ttsugriy made their first contribution in #5909
- @maximxlss made their first contribution in #5907
- @mariusz-jachimowicz-83 made their first contribution in #5893
- @kylrth made their first contribution in #5899
- @qgallouedec made their first contribution in #5933
- @es94129 made their first contribution in #5924
Full Changelog: 2.12.0...zef