## Dataset Features

- Support Image, Video and Audio types in Lance datasets

```python
>>> from datasets import load_dataset
>>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
>>> ds.features
{'video_blob': Video(), 'video_path': Value('string'), 'caption': Value('string'), 'aesthetic_score': Value('float64'), 'motion_score': Value('float64'), 'temporal_consistency_score': Value('float64'), 'camera_motion': Value('string'), 'frame': Value('int64'), 'fps': Value('float64'), 'seconds': Value('float64'), 'embedding': List(Value('float32'), length=1024)}
```
- Push to hub now supports Video types

```python
>>> from datasets import Dataset, Video
>>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
>>> ds = ds.cast_column("video", Video())
>>> ds.push_to_hub("username/my-video-dataset")
```
- Write image/audio/video blobs as-is in Parquet (PLAIN) in `push_to_hub()` by @lhoestq in #7976
  - This enables cross-format Xet deduplication for image/audio/video, e.g. deduplicating videos between Lance, WebDataset, Parquet files and plain video files, and makes downloads from and uploads to Hugging Face faster
  - E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload will be much faster since the videos don't need to be re-uploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format
  - See more info here: https://huggingface.co/docs/hub/en/xet/deduplication
- Add `IterableDataset.reshard()` by @lhoestq in #7992

Reshard the dataset if possible, i.e. split the current shards further into more shards. This increases the number of shards, and the resulting dataset has `num_shards >= previous_num_shards`. Equality may happen if no shard can be split further.

The resharding mechanism depends on the dataset file format:
  - Parquet: shard per row group instead of per file
  - Other: not implemented yet (contributions are welcome!)

```python
>>> from datasets import load_dataset
>>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
>>> ds
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 4
})
>>> ds.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})
```
## What's Changed
- Fix load_from_disk progress bar with redirected stdout by @omarfarhoud in #7919
- Revert "feat: avoid some copies in torch formatter (#7787)" by @lhoestq in #7961
- docs: fix grammar and add type hints in splits.py by @Edge-Explorer in #7960
- Fix interleave_datasets with all_exhausted_without_replacement strategy by @prathamk-tw in #7955
- Add examples for Lance datasets by @prrao87 in #7950
- Support null in json string cols by @lhoestq in #7963
- handle blob lance by @lhoestq in #7964
- Count examples in lance by @lhoestq in #7969
- Use temp files in push_to_hub to save memory by @lhoestq in #7979
- Drop python 3.9 by @lhoestq in #7980
- Support pandas 3 by @lhoestq in #7981
- Remove unused data files optims by @lhoestq in #7985
- Remove pre-release workaround in CI for `transformers v5` and `huggingface_hub v1` by @hanouticelina in #7989
- very basic support for more hf urls by @lhoestq in #8003
- Bump fsspec upper bound to 2026.2.0 (fixes #7994) by @jayzuccarelli in #7995
- Fix: make environment variable naming consistent (issue #7998) by @AnkitAhlawat7742 in #8000
- More IterableDataset.from_x methods and docs and polars.Lazyframe support by @lhoestq in #8009
- Support empty shard in from_generator by @lhoestq in #8023
- Allow import polars in map() by @lhoestq in #8024
## New Contributors
- @omarfarhoud made their first contribution in #7919
- @Edge-Explorer made their first contribution in #7960
- @prathamk-tw made their first contribution in #7955
- @prrao87 made their first contribution in #7950
- @hanouticelina made their first contribution in #7989
- @jayzuccarelli made their first contribution in #7995
- @AnkitAhlawat7742 made their first contribution in #8000
**Full Changelog**: 4.5.0...4.6.0