github huggingface/datasets 2.12.0

18 months ago

Datasets Features

  • Add Dataset.from_spark by @maddiedawson in #5701

    • Get a Dataset from a Spark DataFrame:
    >>> from datasets import Dataset
    >>> # df is an existing pyspark.sql.DataFrame
    >>> ds = Dataset.from_spark(df)
  • Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in #5689

    • Stream data from Wikipedia:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
    >>> next(iter(ds["train"]))
    {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...'}
  • Implement sharding on merged iterable datasets by @Hubert-Bonisseur in #5735

    • Use interleaved datasets in a distributed setup or with a DataLoader:
    >>> from datasets import load_dataset, interleave_datasets
    >>> from torch.utils.data import DataLoader
    >>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
    >>> c4 = load_dataset("c4", "en", split="train", streaming=True)
    >>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
    >>> dataloader = DataLoader(merged, num_workers=4)
  • Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in #5751

    • Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
    • Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
    • Allow converting the variable-shaped ArrayND to Pandas
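
The NumPy conversion rule in the last item can be illustrated with a small sketch. This is not the library's code: `rows_to_numpy` is a hypothetical helper, and it compares row shapes as a stand-in for the equal-offsets check that the library performs on the underlying Arrow data.

```python
import numpy as np


def rows_to_numpy(rows):
    """Hypothetical helper: stack equal-shaped rows, else fall back to an object array."""
    arrays = [np.asarray(row) for row in rows]
    shapes = {a.shape for a in arrays}
    if arrays and len(shapes) == 1:
        # Every row has the same shape, so one numeric array can hold them all.
        return np.stack(arrays)
    # Shapes differ (or there are no rows): keep one array per row in an object array.
    out = np.empty(len(arrays), dtype=object)
    for i, a in enumerate(arrays):
        out[i] = a
    return out


fixed = rows_to_numpy([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
ragged = rows_to_numpy([[[1, 2]], [[3, 4], [5, 6]]])
```

Here `fixed` is a single numeric array of shape `(2, 2, 2)`, while `ragged` is a length-2 object array whose elements keep their own shapes.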

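To make the `probabilities`, `seed` and `stopping_strategy="all_exhausted"` arguments in the interleaving example more concrete, here is a pure-Python sketch of the idea. It is an illustration under assumptions, not the implementation in `datasets`: `interleave_all_exhausted` is a hypothetical function, it assumes every probability is positive, and its exact stopping point may differ from the library's.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel returned by next() when an iterator is exhausted


def interleave_all_exhausted(
    sources: Sequence[Sequence[T]],
    probabilities: Sequence[float],
    seed: int,
) -> Iterator[T]:
    """Sample from the sources until each one has been fully consumed at least once."""
    rng = random.Random(seed)  # seeded, so the interleaving order is reproducible
    iterators = [iter(s) for s in sources]
    exhausted = [len(s) == 0 for s in sources]
    while not all(exhausted):
        # Pick which source to read from, weighted by its probability.
        (i,) = rng.choices(range(len(sources)), weights=probabilities)
        item = next(iterators[i], _DONE)
        if item is _DONE:
            exhausted[i] = True
            if all(exhausted):
                return
            # Oversample: restart the exhausted source while the others catch up.
            iterators[i] = iter(sources[i])
            item = next(iterators[i], _DONE)
            if item is _DONE:  # the source is empty, so there is nothing to yield
                continue
        yield item


a = ["a0", "a1", "a2"]
b = ["b0", "b1", "b2", "b3", "b4"]
out = list(interleave_all_exhausted([a, b], [0.1, 0.9], seed=42))
```

Every element of both lists appears in `out` at least once, and the larger, more probable source is typically repeated several times before the rarer one runs out.
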
Full Changelog: 2.11.0...2.12.0
