Dataset Features
- Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in #7424
- /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is not maintained anymore and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.
    from datasets import load_dataset, Video

    dataset = load_dataset("path/to/video/folder", split="train")
    dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
- faster streaming for image/audio/video folder from Hugging Face
- support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
- Add IterableDataset.decode with multithreading by @lhoestq in #7450
- even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:
    dataset = dataset.decode(num_threads=num_threads)
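
The multithreaded decoding mentioned above can be sketched end to end as follows. This is a minimal illustration, not an official recipe: the helper name `stream_with_threaded_decoding`, the folder path, and the thread-count heuristic are all assumptions made for the example. Only `load_dataset(..., streaming=True)` and `IterableDataset.decode(num_threads=...)` come from the library.

```python
import os


def stream_with_threaded_decoding(path: str, num_threads: int):
    """Hypothetical helper: stream a media folder and decode its files in threads."""
    # Imported lazily so that defining this helper does not itself require `datasets`.
    from datasets import load_dataset

    # streaming=True returns an IterableDataset, which provides .decode().
    dataset = load_dataset(path, split="train", streaming=True)
    # Decode image/audio/video data using `num_threads` worker threads.
    return dataset.decode(num_threads=num_threads)


# Assumed heuristic (not a library recommendation): one thread per CPU core, capped at 8.
num_threads = min(8, os.cpu_count() or 1)
```

Iterating over `stream_with_threaded_decoding("path/to/video/folder", num_threads)` would then yield examples whose media fields are already decoded. The best thread count depends on the media type and the machine, so treat the cap of 8 as a starting point to tune.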
General improvements and bug fixes
- fix: None default with bool type on load creates typing error by @stephantul in #7426
- Use pyupgrade --py39-plus by @cyyever in #7428
- Refactor string_to_dict to return None if there is no match instead of raising ValueError by @ringohoffman in #7435
- Fix small bugs with async map by @lhoestq in #7445
- Fix resuming after ds.set_epoch(new_epoch) by @lhoestq in #7451
- minor docs changes by @lhoestq in #7452
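
For code that relied on the old string_to_dict behavior, the change in #7435 means checking for None instead of catching ValueError. The sketch below illustrates that calling pattern with a simplified stand-in: `string_to_dict_sketch` is a hypothetical function written for this example, and its `{name}` template handling is an assumption, not the library's implementation.

```python
import re
from typing import Optional


def string_to_dict_sketch(string: str, pattern: str) -> Optional[dict[str, str]]:
    """Hypothetical stand-in: extract {name} placeholders from `string`, or return None."""
    # Escape the literal text, then turn each escaped {name} placeholder into a named group.
    regex = re.sub(r"\\{(\w+)\\}", r"(?P<\1>.+)", re.escape(pattern))
    match = re.fullmatch(regex, string)
    if match is None:
        # No match: signal it with None rather than raising ValueError.
        return None
    return match.groupdict()


parsed = string_to_dict_sketch("data/train-00001.parquet", "data/{split}-{shard}.parquet")
if parsed is not None:
    split_name = parsed["split"]
```

Callers that previously wrapped the call in `try`/`except ValueError` would switch to an `is None` check like the one shown here.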
New Contributors
- @stephantul made their first contribution in #7426
- @cyyever made their first contribution in #7428
- @jp1924 made their first contribution in #7368
Full Changelog: 3.3.2...3.4.0