github huggingface/datasets 2.8.0


Important

  • Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
    • From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
    • The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
    • Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.
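
To make the key change concrete, here is a minimal sketch of the integer-to-string conversion described above. It is not the library's own code: the helper name stringify_label_ids and the example label names are hypothetical, and the real YAML model contains more than this mapping.

```python
def stringify_label_ids(id_to_name):
    """Return a copy of an id -> name mapping with every key converted to str."""
    return {str(label_id): name for label_id, name in id_to_name.items()}


# Hypothetical label mapping with integer ids, as in the old model.
old_names = {0: "negative", 1: "positive"}

# The same mapping with string ids, as in the new model.
new_names = stringify_label_ids(old_names)
print(new_names)  # {'0': 'negative', '1': 'positive'}
```

The label names are unchanged; only the type of the keys differs, which is why older versions of datasets that expect integer keys cannot reload the new metadata.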

Datasets Features

  • Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in #5287
    • Datasets in streaming mode now update their features after column renaming or removal
  • Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
    • Use multiprocessing to load multiple files in parallel
  • Add features param to IterableDataset.map by @alvarobartt in #5311
  • Sharded save_to_disk + multiprocessing by @lhoestq in #5268
    • Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
    • Pass num_proc to use multiprocessing.
  • Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
  • Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
    • You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
    from datasets import load_dataset
    from torch.utils.data import DataLoader

    # Stream the dataset rather than downloading it in full
    ds = load_dataset("c4", "en", streaming=True, split="train")
    dataloader = DataLoader(ds, batch_size=32, num_workers=4)
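
As a rough illustration of what the num_shards option listed above implies, the sketch below splits a row count into contiguous, near-equal shards. This is an assumption about one reasonable way to partition rows, not the library's implementation; the function name contiguous_shard_bounds is hypothetical.

```python
def contiguous_shard_bounds(num_rows, num_shards):
    """Return (start, end) row ranges, end exclusive, for contiguous shards.

    The first `num_rows % num_shards` shards receive one extra row, so
    shard sizes differ by at most one.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(num_rows, num_shards)
    bounds = []
    start = 0
    for index in range(num_shards):
        size = base + (1 if index < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


# Ten rows split across four shards.
bounds = contiguous_shard_bounds(10, 4)
print(bounds)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Because each range is independent of the others, shards like these can be written by separate worker processes, which is the idea behind combining sharding with num_proc.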

Docs

General improvements and bug fixes

New Contributors

Full Changelog: 2.7.0...2.8.0
