huggingface/datasets 4.0.0 on GitHub

New Features

Add IterableDataset.push_to_hub() by @lhoestq in #7595

# Build streaming data pipelines in a few lines of code!
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)

Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in #7606

# Faster push to the Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)

New Column object

Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in #7564
Lazy column by @lhoestq in #7614

# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without loading the full column into memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
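
To make the idea of a lazy column concrete, here is a minimal pure-Python sketch. It is not the library's implementation: `ToyTable` and `LazyColumn` are hypothetical names used only for illustration. The column object keeps a reference to the table and the column name, and reads a row only when a cell is indexed or iterated.

```python
class ToyTable:
    """Hypothetical row store standing in for a dataset (illustration only)."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0  # counts how many rows were actually accessed

    def __len__(self):
        return len(self._rows)

    def get_row(self, index):
        self.reads += 1
        return self._rows[index]

    def __getitem__(self, key):
        # A string key returns a lazy column; an int key returns a row.
        if isinstance(key, str):
            return LazyColumn(self, key)
        return self.get_row(key)


class LazyColumn:
    """Hypothetical lazy view over one column; nothing is read at creation."""

    def __init__(self, table, name):
        self._table = table
        self._name = name

    def __len__(self):
        return len(self._table)

    def __getitem__(self, index):
        # Only the requested row is read.
        return self._table.get_row(index)[self._name]

    def __iter__(self):
        # Values are produced one at a time instead of building a full list.
        for i in range(len(self._table)):
            yield self._table.get_row(i)[self._name]


table = ToyTable([{"text": "a"}, {"text": "b"}, {"text": "c"}])
column = table["text"]            # no rows read yet
first_text = column[0]            # reads exactly one row
reads_after_first_cell = table.reads
all_texts = list(column)          # iterating reads each row once
```

In this toy model, `table["text"][0]` and `table[0]["text"]` return the same value, mirroring the equivalence noted in the snippet above.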

Torchcodec decoding by @TyTodd in #7616

Enables streaming only the ranges you need!

# Don't download full audio/video files when it's not necessary
# With torchcodec, only the required ranges/frames are streamed:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames

Requires torch>=2.7.0 and FFmpeg >= 4
Not yet available on Windows, but support is coming soon. In the meantime, please use datasets<4.0.

Load audio data with AudioDecoder:

audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]

Load video data with VideoDecoder:

video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([6, 3, 240, 320]) (stop is exclusive, so frames 0 to 5)

Breaking changes

Remove scripts altogether by @lhoestq in #7592
- trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in #7616
- torchcodec replaces soundfile for audio decoding
- torchcodec replaces decord for video decoding

Replace Sequence by List by @lhoestq in #7634

Introduction of the List type

from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4)
})

Sequence was a legacy type from TensorFlow Datasets that converted lists of dicts into dicts of lists. It is no longer a type: it is now a utility that returns a List or a dict, depending on the subfeature.

from datasets import Sequence

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
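
The legacy conversion mentioned above, turning a list of dicts into a dict of lists, can be illustrated in plain Python. This is a sketch of the data transformation only, not code from the library: `list_of_dicts_to_dict_of_lists` is a hypothetical helper name, and it assumes every dict has the same keys.

```python
def list_of_dicts_to_dict_of_lists(items):
    """Turn [{"a": 1}, {"a": 2}] into {"a": [1, 2]} (illustration only)."""
    if not items:
        return {}
    # Assumption: all dicts share the keys of the first one.
    keys = items[0].keys()
    return {key: [item[key] for item in items] for key in keys}


answers = [
    {"text": "Paris", "start": 12},
    {"text": "France", "start": 30},
]
converted = list_of_dicts_to_dict_of_lists(answers)
```

The resulting shape, one list per key, matches the `{"texts": List(Value("string"))}` form that `Sequence` now returns for a dict subfeature.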

Other improvements and bug fixes

Refactor Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in #7434
fix string_to_dict test by @lhoestq in #7571
Preserve formatting in concatenated IterableDataset by @francescorubbo in #7522
Fix typos in PDF and Video documentation by @AndreaFrancis in #7579
fix: Add embed_storage in Pdf feature by @AndreaFrancis in #7582
load_dataset splits typing by @lhoestq in #7587
Fixed typos by @TopCoder2K in #7572
Fix regex library warnings by @emmanuel-ferdman in #7576
[MINOR:TYPO] Update save_to_disk docstring by @cakiki in #7575
Add missing property on RepeatExamplesIterable by @SilvanCodes in #7581
Avoid multiple default config names by @albertvillanova in #7585
Fix broken link to albumentations by @ternaus in #7593
fix string_to_dict usage for windows by @lhoestq in #7598
No TF in win tests by @lhoestq in #7603
Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in #7604
Tests typing and fixes for push_to_hub by @lhoestq in #7608
fix parallel push_to_hub in dataset_dict by @lhoestq in #7613
remove unused code by @lhoestq in #7615
Update _dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in #7609
Fixes in docs by @lhoestq in #7620
Add albumentations to use dataset by @ternaus in #7596
minor docs data aug by @lhoestq in #7621
fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in #7623
fix save_infos by @lhoestq in #7639
better features repr by @lhoestq in #7640
update docs and docstrings by @lhoestq in #7641
fix length for ci by @lhoestq in #7642
Backward compat sequence instance by @lhoestq in #7643
fix sequence ci by @lhoestq in #7644
Custom metadata filenames by @lhoestq in #7663
Update the beans dataset link in Preprocess by @HJassar in #7659
Backward compat list feature by @lhoestq in #7666
Fix infer list of images by @lhoestq in #7667
Fix audio bytes by @lhoestq in #7670
Fix double sequence by @lhoestq in #7672

New Contributors

@TopCoder2K made their first contribution in #7564
@francescorubbo made their first contribution in #7522
@emmanuel-ferdman made their first contribution in #7576
@SilvanCodes made their first contribution in #7581
@ternaus made their first contribution in #7593
@ArjunJagdale made their first contribution in #7623
@TyTodd made their first contribution in #7616
@HJassar made their first contribution in #7659

Full Changelog: 3.6.0...4.0.0