github huggingface/datasets 3.1.0

Dataset Features

  • Video support by @lhoestq in #7230
    >>> from datasets import Dataset, Video, load_dataset
    >>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video())
    >>> # or from the hub
    >>> ds = load_dataset("username/dataset_name", split="train")
    >>> ds[0]["video"]
    <decord.video_reader.VideoReader at 0x105525c70>
  • Add IterableDataset.shard() by @lhoestq in #7252
    >>> from datasets import load_dataset
    >>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
    >>> full_ds.num_shards
    2360
    >>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
    >>> ds.num_shards
    1
    >>> ds = full_ds.shard(num_shards=8, index=0)
    >>> ds.num_shards
    295
  • Basic XML support by @lhoestq in #7250
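
The shard counts in the IterableDataset.shard() example (2360 source shards, 295 per piece when split 8 ways, 1 per piece when split 2360 ways) are consistent with an even, contiguous split. The sketch below reproduces only that arithmetic in plain Python. The helper name `contiguous_shard_indices` is hypothetical and is not part of the datasets API, and the contiguous assignment is an assumption made for illustration; the library's actual assignment of source shards may differ.

```python
def contiguous_shard_indices(total_shards: int, num_shards: int, index: int) -> range:
    """Return the source-shard indices that piece `index` would receive.

    Hypothetical helper: assumes a contiguous split that is as even as
    possible, with any remainder going to the lowest-numbered pieces.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    div, mod = divmod(total_shards, num_shards)
    # The first `mod` pieces each take one extra source shard.
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return range(start, end)


# Counts matching the example above: 2360 source shards.
print(len(contiguous_shard_indices(2360, 8, 0)))     # 295
print(len(contiguous_shard_indices(2360, 2360, 0)))  # 1
```

When the total does not divide evenly, the piece sizes under this assumption differ by at most one, so 10 source shards split 3 ways would give pieces of 4, 3, and 3.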

What's Changed

New Contributors

Full Changelog: 3.0.2...3.1.0
