Dataset Features
- Video support by @lhoestq in #7230
    >>> from datasets import Dataset, Video, load_dataset
    >>> ds = Dataset.from_dict({"video": ["path/to/Screen Recording.mov"]}).cast_column("video", Video())
    >>> # or from the hub
    >>> ds = load_dataset("username/dataset_name", split="train")
    >>> ds[0]["video"]
    <decord.video_reader.VideoReader at 0x105525c70>
- Add IterableDataset.shard() by @lhoestq in #7252
    >>> from datasets import load_dataset
    >>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
    >>> full_ds.num_shards
    2360
    >>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
    >>> ds.num_shards
    1
    >>> ds = full_ds.shard(num_shards=8, index=0)
    >>> ds.num_shards
    295
- Basic XML support by @lhoestq in #7250
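
The shard counts in the `IterableDataset.shard()` example above (2360 source shards giving 1 or 295 per split) follow from simple division. The sketch below reproduces that arithmetic in plain Python. It assumes a contiguous assignment of source shards to splits; `shard_indices` is a hypothetical helper written for illustration, not a function from `datasets`, and the library's actual assignment may differ.

```python
def shard_indices(total_shards: int, num_shards: int, index: int) -> list[int]:
    """Return the source-shard indices assigned to split `index`.

    Hypothetical helper: assumes contiguous assignment, with any
    remainder spread one-per-split over the first splits.
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    div, mod = divmod(total_shards, num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return list(range(start, end))


total = 2360  # number of source shards reported in the example above

# One split per source shard: each split holds a single shard.
print(len(shard_indices(total, total, 0)))  # 1

# Eight splits: 2360 / 8 source shards each.
print(len(shard_indices(total, 8, 0)))  # 295
```

When the total is not evenly divisible, this scheme gives the first `total_shards % num_shards` splits one extra shard each, so every source shard is assigned exactly once.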
What's Changed
- (Super tiny doc update) Mention to_polars by @fzyzcjy in #7232
- [MINOR:TYPO] Update arrow_dataset.py by @cakiki in #7236
- Missing video docs by @lhoestq in #7251
- fix decord import by @lhoestq in #7255
- fix ci for pyarrow 18 by @lhoestq in #7257
- Retry all requests timeouts by @lhoestq in #7256
- Always set non-null writer batch size by @lhoestq in #7258
- Don't embed videos by @lhoestq in #7259
- Allow video with disabled decoding without decord by @lhoestq in #7262
- Small addition to video docs by @lhoestq in #7263
- fix docs relative links by @lhoestq in #7264
- Disallow video push_to_hub by @lhoestq in #7265
New Contributors
Full Changelog: 3.0.2...3.1.0