Datasets Features
- Add Dataset.from_spark by @maddiedawson in #5701
  - Get a Dataset from a Spark DataFrame (docs):
    >>> from datasets import Dataset
    >>> ds = Dataset.from_spark(df)
- Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in #5689
  - Stream data from Wikipedia:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
    >>> next(iter(ds["train"]))
    {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}
- Implement sharding on merged iterable datasets by @Hubert-Bonisseur in #5735
  - Use interleaved datasets in a distributed setup or with a DataLoader:
    >>> from datasets import load_dataset, interleave_datasets
    >>> from torch.utils.data import DataLoader
    >>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
    >>> c4 = load_dataset("c4", "en", split="train", streaming=True)
    >>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
    >>> dataloader = DataLoader(merged, num_workers=4)
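To make the `probabilities`, `seed`, and `stopping_strategy="all_exhausted"` arguments in the example above more concrete, here is a simplified pure-Python sketch of probabilistic interleaving with oversampling. It is an illustration only, not the library's implementation: the function name `interleave_all_exhausted` is hypothetical, it works on in-memory sequences rather than streaming datasets, and the exact stopping point may differ from what `interleave_datasets` produces.

```python
import random
from typing import Iterator, List, Sequence


def interleave_all_exhausted(
    sources: List[Sequence], probabilities: List[float], seed: int
) -> Iterator:
    """Sample items from `sources` until every source has been exhausted once.

    Hypothetical helper for illustration; not part of the `datasets` API.
    """
    if len(sources) != len(probabilities):
        raise ValueError("need one probability per source")
    if any(len(s) == 0 for s in sources) or any(p <= 0 for p in probabilities):
        raise ValueError("sources must be non-empty and probabilities positive")

    rng = random.Random(seed)  # a fixed seed makes the sampling order reproducible
    iterators = [iter(s) for s in sources]
    exhausted = [False] * len(sources)

    while True:
        # Pick which source to draw from, weighted by its probability.
        i = rng.choices(range(len(sources)), weights=probabilities)[0]
        try:
            item = next(iterators[i])
        except StopIteration:
            exhausted[i] = True
            if all(exhausted):
                return  # every source has been fully seen at least once
            # Oversampling: restart the exhausted source and keep going.
            iterators[i] = iter(sources[i])
            item = next(iterators[i])
        yield item


wiki = ["wiki-0", "wiki-1", "wiki-2"]
c4 = ["c4-0", "c4-1", "c4-2", "c4-3"]
merged = list(interleave_all_exhausted([wiki, c4], probabilities=[0.1, 0.9], seed=42))
```

Because the lower-probability source is drawn rarely, the higher-probability one is typically restarted several times before the loop ends, which is why `merged` usually contains repeated items.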
- Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in #5751
  - Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
  - Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
  - Allow converting the variable-shaped ArrayND to Pandas
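The Python and NumPy formatting rules above can be sketched with plain NumPy. This is a rough approximation rather than the library's code: the helper names `to_python` and `to_numpy` are hypothetical, and "equal offsets" is approximated here as "all arrays share the same shape". The Pandas conversion is not covered.

```python
import numpy as np


def to_python(arrays):
    """Convert each array to nested Python lists (a list of lists, not of arrays)."""
    return [a.tolist() for a in arrays]


def to_numpy(arrays):
    """Return one numeric array when shapes match, otherwise an object array."""
    shapes = {a.shape for a in arrays}
    if len(shapes) == 1:
        # Same shape everywhere: stack into a single numeric array.
        return np.stack(arrays)
    # Shapes differ: keep each array as one element of a 1-D object array.
    out = np.empty(len(arrays), dtype=object)
    for i, a in enumerate(arrays):
        out[i] = a
    return out


fixed = [np.zeros((2, 3)), np.ones((2, 3))]   # same first dimension
ragged = [np.zeros((1, 3)), np.ones((2, 3))]  # variable first dimension

fixed_np = to_numpy(fixed)    # numeric array of shape (2, 2, 3)
ragged_np = to_numpy(ragged)  # object array holding two arrays
ragged_py = to_python(ragged)  # nested Python lists
```

Filling the object array element by element avoids NumPy trying to broadcast the ragged inputs into a single block.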
General improvements and bug fixes
- Fix a description error for interleave_datasets. by @QizhiPei in #5680
- [docs] Split pattern search order by @stevhliu in #5693
- Raise an error on missing distributed seed by @lhoestq in #5697
- Fix xnumpy_load for .npz files by @albertvillanova in #5714
- Temporarily pin fsspec by @albertvillanova in #5731
- Unpin fsspec by @albertvillanova in #5733
- Fix CI warnings by @albertvillanova in #5741
- Fix CI mock filesystem fixtures by @albertvillanova in #5740
- Fix link in docs by @bbbxyz in #5746
- fix typo: "mow" -> "now" by @csris in #5763
- [docs] Compress data files by @stevhliu in #5691
- Fix style by @lhoestq in #5774
- Minor tqdm fixes by @mariosasko in #5754
- Fixes #5757 by @eli-osherovich in #5758
- Fix JSON builder when missing keys in first row by @albertvillanova in #5772
- Warning specifying future change in to_tf_dataset behaviour by @amyeroberts in #5742
- Prepare tests for hfh 0.14 by @Wauplin in #5788
- Call fs.makedirs in save_to_disk by @lhoestq in #5779
- Allow to run CI on push to ci-branch by @albertvillanova in #5790
- Fix nondeterministic sharded data split order by @albertvillanova in #5729
- Raise subprocesses traceback when interrupting by @lhoestq in #5784
- Fix spark imports by @lhoestq in #5795
- Change downloaded file permission based on umask by @albertvillanova in #5800
- Fix inferring module for unsupported data files by @albertvillanova in #5787
- Reorder default data splits to have validation before test by @albertvillanova in #5718
- Validate non-empty data_files by @albertvillanova in #5802
- Spark docs by @lhoestq in #5796
- Release: 2.12.0 by @lhoestq in #5803
New Contributors
- @QizhiPei made their first contribution in #5680
- @bbbxyz made their first contribution in #5746
- @csris made their first contribution in #5763
- @eli-osherovich made their first contribution in #5758
- @maddiedawson made their first contribution in #5701
Full Changelog: 2.11.0...2.12.0