Dataset Features
- Read (and write) from HF Storage Buckets: load raw data, process, and save to Dataset repos by @lhoestq in #8064
```python
from datasets import load_dataset

# load raw data from a Storage Bucket on HF
ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])

# or manually, using hf:// paths
ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])

# process, filter
ds = ds.map(...).filter(...)

# publish the AI-ready dataset
ds.push_to_hub("username/my-dataset-ready-for-training")
```
This also fixes multiprocessed `push_to_hub` on macOS, which was causing segfaults (it now uses spawn instead of fork). It also bumps the `dill` and `multiprocess` versions to support Python 3.14.
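For context on the spawn-vs-fork change: forking copies the parent process in place, which can crash inside macOS system libraries, while spawn starts each worker in a fresh interpreter. A minimal stdlib sketch of selecting the spawn start method (this is not the `datasets` internals, just an illustration):

```python
import multiprocessing


def square(x):
    # runs in a worker process
    return x * x


if __name__ == "__main__":
    # spawn launches a fresh Python interpreter per worker instead of
    # fork()ing the parent, avoiding fork-related segfaults on macOS
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```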
- Datasets streaming / iterable packed improvements and fixes by @Michael-RDev in #8068
  - added `max_shard_size` to `IterableDataset.push_to_hub`
  - more Arrow-native iterable operations for `IterableDataset`
  - better support of glob patterns in archives, e.g. `zip://*.jsonl::hf://datasets/username/dataset-name/data.zip`
  - fixes for `to_pandas`, videofolder, and `load_dataset_builder` kwargs
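The chained-URL form above (glob pattern, then `::`, then the archive's location) resolves the pattern against the archive's member names. A stdlib-only sketch of that idea using an in-memory zip (file names and contents are illustrative, not from the release):

```python
import fnmatch
import io
import json
import zipfile

# build a small in-memory archive with mixed contents (illustrative data)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("train.jsonl", json.dumps({"text": "hello"}) + "\n")
    zf.writestr("test.jsonl", json.dumps({"text": "world"}) + "\n")
    zf.writestr("README.txt", "not data\n")

# the zip://*.jsonl:: prefix amounts to globbing the archive's member names
with zipfile.ZipFile(buf) as zf:
    matches = fnmatch.filter(zf.namelist(), "*.jsonl")
print(sorted(matches))  # ['test.jsonl', 'train.jsonl']
```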
What's Changed
- fix reshard_data_sources by @lhoestq in #8061
- Improve error message for invalid data_files pattern format by @kushalkkb in #8060
- fix null filling in missing jsonl columns by @lhoestq in #8069
New Contributors
- @kushalkkb made their first contribution in #8060
- @Michael-RDev made their first contribution in #8068
Full Changelog: 4.7.0...4.8.0