github huggingface/datasets 2.9.0

17 months ago

Datasets Features

  • Parallel implementation of to_tf_dataset() by @Rocketknight1 in #5377

    • Pass num_workers= to .to_tf_dataset() to speed up data loading with multiprocessing
  • Distributed support by @lhoestq in #5369

    • Split your dataset across nodes for distributed training
    • It supports both Dataset and IterableDataset (e.g. in streaming mode)
    • See the documentation for more details
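    import os
    from datasets.distributed import split_dataset_by_node

    # ds is an existing Dataset or IterableDataset.
    # RANK and WORLD_SIZE are set by the distributed launcher (e.g. torchrun).
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)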
  • Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in #5400

  • Tqdm progress bar for to_parquet by @zanussbaum in #5456

  • ZIP files support in iter_archive with better compression type check by @Mehdi2402 in #3379

  • Support other formats than uint8 for image arrays by @vigsterkr in #5365
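
To picture what the distributed support listed above does with the examples, here is a rough, library-free sketch of two partitioning strategies: one contiguous chunk per node (the approach the library documentation describes for a map-style Dataset) and keeping one example out of every world_size (the documented fallback for an IterableDataset whose shards cannot be divided evenly). The helper names contiguous_chunk and every_nth are hypothetical, and the remainder handling is an assumption; this is not the code in datasets.distributed.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def contiguous_chunk(items: List[T], rank: int, world_size: int) -> List[T]:
    """Return one contiguous slice per rank (hypothetical helper).

    Assumption: when len(items) is not divisible by world_size, the
    leftover examples go to the lowest ranks, one each.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    base, extra = divmod(len(items), world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return items[start:stop]


def every_nth(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Keep one example out of every world_size, starting at offset rank."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return islice(stream, rank, None, world_size)


# Ten toy examples split across three nodes.
examples = list(range(10))
chunks = [contiguous_chunk(examples, r, 3) for r in range(3)]
strided = [list(every_nth(iter(examples), r, 3)) for r in range(3)]
```

In both strategies every example lands on exactly one node. The strided variant never needs to know the stream's length, which is why it suits streaming, at the cost of each node still iterating past the examples it skips.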

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: 2.8.0...2.9.0
