New Features
- Add IterableDataset.push_to_hub() by @lhoestq in #7595

      # Build streaming data pipelines in a few lines of code!
      from datasets import load_dataset

      ds = load_dataset(..., streaming=True)
      ds = ds.map(...).filter(...)
      ds.push_to_hub(...)
- Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in #7606

      # Faster push to Hub! Available for both Dataset and IterableDataset
      ds.push_to_hub(..., num_proc=8)
- New Column object
  - Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in #7564
  - Lazy column by @lhoestq in #7614

        # Syntax:
        ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

        # Iterate on a column:
        for text in ds["text"]:
            ...

        # Load one cell without bringing the full column in memory
        first_text = ds["text"][0]  # equivalent to ds[0]["text"]
- Torchcodec decoding by @TyTodd in #7616
  - Enables streaming only the ranges you need!

        # Don't download full audios/videos when it's not necessary
        # Now with torchcodec it only streams the required ranges/frames:
        from datasets import load_dataset

        ds = load_dataset(..., streaming=True)
        for example in ds:
            video = example["video"]
            frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
  - Requires torch>=2.7.0 and FFmpeg >= 4
  - Not available for Windows yet, but it is coming soon. In the meantime, please use datasets<4.0
  - Load audio data with AudioDecoder:

        audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
        samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
        samples.data  # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
        samples.sample_rate  # 16000

        # old syntax is still supported
        array, sr = audio["array"], audio["sampling_rate"]
  - Load video data with VideoDecoder:

        video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
        first_frame = video.get_frame_at(0)
        first_frame.data.shape  # (3, 240, 320)
        first_frame.pts_seconds  # 0.0
        frames = video.get_frames_in_range(0, 6, 1)
        frames.data.shape  # torch.Size([5, 3, 240, 320])
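
The lazy column behaviour described above ("load one cell without bringing the full column in memory") can be pictured with a small pure-Python sketch. This is only an illustration of the idea, not the library's implementation: the `LazyColumn` class and the `load_row` helper are hypothetical names invented for this example.

```python
from typing import Any, Callable, Iterator


class LazyColumn:
    """Hypothetical stand-in for a lazy column: cells are read on demand."""

    def __init__(self, load_row: Callable[[int], dict], num_rows: int, name: str):
        self._load_row = load_row
        self._num_rows = num_rows
        self._name = name

    def __len__(self) -> int:
        return self._num_rows

    def __getitem__(self, index: int) -> Any:
        if index < 0:
            index += self._num_rows
        if not 0 <= index < self._num_rows:
            raise IndexError(index)
        # Only the requested row is loaded, never the whole column.
        return self._load_row(index)[self._name]

    def __iter__(self) -> Iterator[Any]:
        # Rows are loaded one at a time while iterating.
        for i in range(self._num_rows):
            yield self._load_row(i)[self._name]


# Toy storage plus a log of which rows were actually read.
rows = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
loads = []


def load_row(i: int) -> dict:
    loads.append(i)
    return rows[i]


column = LazyColumn(load_row, num_rows=len(rows), name="text")
first_text = column[0]  # reads row 0 only
```

In this sketch, indexing the column touches a single row, which is the property that makes `ds["text"][0]` cheap compared with materializing the whole column first.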
Breaking changes
- Remove scripts altogether by @lhoestq in #7592
  - trust_remote_code is no longer supported
- Torchcodec decoding by @TyTodd in #7616
  - torchcodec replaces soundfile for audio decoding
  - torchcodec replaces decord for video decoding
- Replace Sequence by List by @lhoestq in #7634
  - Introduction of the List type

        from datasets import Features, List, Value

        features = Features({
            "texts": List(Value("string")),
            "four_paragraphs": List(Value("string"), length=4)
        })

  - Sequence was a legacy type from TensorFlow Datasets that converted lists of dicts to dicts of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature

        from datasets import Sequence, Value

        Sequence(Value("string"))  # List(Value("string"))
        Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
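
To make the "list of dicts to dict of lists" conversion mentioned above concrete, here is a minimal pure-Python sketch of that kind of transposition. It does not use the library and is not its implementation: the helper name `list_of_dicts_to_dict_of_lists` is hypothetical, and the sketch assumes every dict has the same keys.

```python
def list_of_dicts_to_dict_of_lists(rows):
    """Transpose a list of dicts into a dict of lists, keyed by field name.

    Assumes every dict in `rows` has the same keys.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Data shaped as a list of dicts (one dict per item)...
paragraphs = [{"text": "a", "score": 0.1}, {"text": "b", "score": 0.9}]

# ...becomes a dict of lists (one list per field).
transposed = list_of_dicts_to_dict_of_lists(paragraphs)
```

With the List type, data of the first shape stays a list of dicts, so this implicit reshaping no longer applies.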
Other improvements and bug fixes
- Refactor Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in #7434
- fix string_to_dict test by @lhoestq in #7571
- Preserve formatting in concatenated IterableDataset by @francescorubbo in #7522
- Fix typos in PDF and Video documentation by @AndreaFrancis in #7579
- fix: Add embed_storage in Pdf feature by @AndreaFrancis in #7582
- load_dataset splits typing by @lhoestq in #7587
- Fixed typos by @TopCoder2K in #7572
- Fix regex library warnings by @emmanuel-ferdman in #7576
- [MINOR:TYPO] Update save_to_disk docstring by @cakiki in #7575
- Add missing property on RepeatExamplesIterable by @SilvanCodes in #7581
- Avoid multiple default config names by @albertvillanova in #7585
- Fix broken link to albumentations by @ternaus in #7593
- fix string_to_dict usage for windows by @lhoestq in #7598
- No TF in win tests by @lhoestq in #7603
- Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in #7604
- Tests typing and fixes for push_to_hub by @lhoestq in #7608
- fix parallel push_to_hub in dataset_dict by @lhoestq in #7613
- remove unused code by @lhoestq in #7615
- Update _dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in #7609
- Fixes in docs by @lhoestq in #7620
- Add albumentations to use dataset by @ternaus in #7596
- minor docs data aug by @lhoestq in #7621
- fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in #7623
- fix save_infos by @lhoestq in #7639
- better features repr by @lhoestq in #7640
- update docs and docstrings by @lhoestq in #7641
- fix length for ci by @lhoestq in #7642
- Backward compat sequence instance by @lhoestq in #7643
- fix sequence ci by @lhoestq in #7644
- Custom metadata filenames by @lhoestq in #7663
- Update the beans dataset link in Preprocess by @HJassar in #7659
- Backward compat list feature by @lhoestq in #7666
- Fix infer list of images by @lhoestq in #7667
- Fix audio bytes by @lhoestq in #7670
- Fix double sequence by @lhoestq in #7672
New Contributors
- @TopCoder2K made their first contribution in #7564
- @francescorubbo made their first contribution in #7522
- @emmanuel-ferdman made their first contribution in #7576
- @SilvanCodes made their first contribution in #7581
- @ternaus made their first contribution in #7593
- @ArjunJagdale made their first contribution in #7623
- @TyTodd made their first contribution in #7616
- @HJassar made their first contribution in #7659
Full Changelog: 3.6.0...4.0.0