huggingface/datasets 2.12.0 on GitHub

Datasets Features

Add Dataset.from_spark by @maddiedawson in #5701
- Get a Dataset from a Spark DataFrame:
```
>>> from datasets import Dataset
>>> # df is an existing pyspark.sql.DataFrame
>>> ds = Dataset.from_spark(df)
```

Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in #5689

- Stream data from Wikipedia:
```
>>> from datasets import load_dataset
>>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
>>> next(iter(ds["train"]))
{'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}
```

Implement sharding on merged iterable datasets by @Hubert-Bonisseur in #5735

- Use interleaved datasets in a distributed setup or with a DataLoader:
```
>>> from datasets import load_dataset, interleave_datasets
>>> from torch.utils.data import DataLoader
>>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
>>> c4 = load_dataset("c4", "en", split="train", streaming=True)
>>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
>>> dataloader = DataLoader(merged, num_workers=4)
```

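As a rough mental model of what `probabilities`, `seed` and `stopping_strategy="all_exhausted"` do in the interleaving example, here is a minimal pure-Python sketch. It is not the library's implementation: `interleave_all_exhausted` is a hypothetical helper, the inputs are plain in-memory lists rather than streaming datasets, and sharding across DataLoader workers is not modeled at all.

```python
import random
from typing import Iterator, List, Sequence


def interleave_all_exhausted(
    sources: Sequence[Sequence[str]],
    probabilities: Sequence[float],
    seed: int,
) -> Iterator[str]:
    """Hypothetical helper: sample from `sources` until each one has run out at least once."""
    if len(sources) != len(probabilities):
        raise ValueError("need exactly one probability per source")
    if any(len(source) == 0 for source in sources):
        raise ValueError("every source must contain at least one item")

    rng = random.Random(seed)  # seeded, so the order is reproducible
    iterators = [iter(source) for source in sources]
    exhausted: List[bool] = [False] * len(sources)

    while True:
        # Pick which source to read from, weighted by `probabilities`.
        idx = rng.choices(range(len(sources)), weights=probabilities, k=1)[0]
        try:
            item = next(iterators[idx])
        except StopIteration:
            exhausted[idx] = True
            if all(exhausted):
                return  # every source has been fully consumed at least once
            # Oversampling: restart the finished source and keep going.
            iterators[idx] = iter(sources[idx])
            item = next(iterators[idx])
        yield item


# Toy stand-ins for the two streaming datasets.
wiki = ["wiki-0", "wiki-1", "wiki-2"]
c4 = ["c4-0", "c4-1", "c4-2", "c4-3", "c4-4"]
merged = list(interleave_all_exhausted([wiki, c4], probabilities=[0.1, 0.9], seed=42))
print(len(merged), merged[:5])
```

In this sketch the smaller or less frequently sampled source decides when iteration stops, and the other source is repeated in the meantime, which is why the merged stream can be longer than the two inputs combined.
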
Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in #5751
- Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
- Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
- Allow converting the variable-shaped ArrayND to Pandas
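
The NumPy rule in the second bullet can be pictured with a small sketch. It is only an illustration written against plain nested lists, not the library's code: `to_numpy_sketch` is a hypothetical name, and comparing array shapes stands in here for comparing the underlying offsets.

```python
import numpy as np


def to_numpy_sketch(rows):
    """Hypothetical helper: numeric array if all rows share a shape, else an object array."""
    arrays = [np.asarray(row) for row in rows]
    shapes = {array.shape for array in arrays}

    if len(shapes) == 1:
        # Fixed shape: stack into one numeric array of shape (n_rows, ...).
        return np.stack(arrays)

    # Variable shape (or no rows): fall back to a 1-D object array
    # whose elements are the individual NumPy arrays.
    out = np.empty(len(arrays), dtype=object)
    for i, array in enumerate(arrays):
        out[i] = array
    return out


fixed = to_numpy_sketch([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
ragged = to_numpy_sketch([[[1, 2]], [[3, 4], [5, 6]]])
print(fixed.dtype, fixed.shape)
print(ragged.dtype, ragged.shape)
```

The object-array fallback keeps each row as its own array, so rows with different first dimensions can still be indexed individually.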

General improvements and bug fixes

Fix a description error for interleave_datasets. by @QizhiPei in #5680
[docs] Split pattern search order by @stevhliu in #5693
Raise an error on missing distributed seed by @lhoestq in #5697
Fix xnumpy_load for .npz files by @albertvillanova in #5714
Temporarily pin fsspec by @albertvillanova in #5731
Unpin fsspec by @albertvillanova in #5733
Fix CI warnings by @albertvillanova in #5741
Fix CI mock filesystem fixtures by @albertvillanova in #5740
Fix link in docs by @bbbxyz in #5746
fix typo: "mow" -> "now" by @csris in #5763
[docs] Compress data files by @stevhliu in #5691
Fix style by @lhoestq in #5774
Minor tqdm fixes by @mariosasko in #5754
Fixes #5757 by @eli-osherovich in #5758
Fix JSON builder when missing keys in first row by @albertvillanova in #5772
Warning specifying future change in to_tf_dataset behaviour by @amyeroberts in #5742
Prepare tests for hfh 0.14 by @Wauplin in #5788
Call fs.makedirs in save_to_disk by @lhoestq in #5779
Allow to run CI on push to ci-branch by @albertvillanova in #5790
Fix nondeterministic sharded data split order by @albertvillanova in #5729
Raise subprocesses traceback when interrupting by @lhoestq in #5784
Fix spark imports by @lhoestq in #5795
Change downloaded file permission based on umask by @albertvillanova in #5800
Fix inferring module for unsupported data files by @albertvillanova in #5787
Reorder default data splits to have validation before test by @albertvillanova in #5718
Validate non-empty data_files by @albertvillanova in #5802
Spark docs by @lhoestq in #5796
Release: 2.12.0 by @lhoestq in #5803

New Contributors

@QizhiPei made their first contribution in #5680
@bbbxyz made their first contribution in #5746
@csris made their first contribution in #5763
@eli-osherovich made their first contribution in #5758
@maddiedawson made their first contribution in #5701

Full Changelog: 2.11.0...2.12.0