Dataset Features
- Use Polars functions in `.map()`
  - Example:

        >>> import polars as pl
        >>> from datasets import load_dataset
        >>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
        >>> cols = [pl.col("content").str.len_bytes().alias("length")]
        >>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
        >>> ds_with_length[:5]
        shape: (5, 5)
        ┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
        │ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
        │ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
        │ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
        ╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
        │ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
        │ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
        │ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
        │ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
        │ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
        └─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
- Support NumPy 2
  - Allow numpy-2.1 and test it without audio extra by @albertvillanova in #7118
Cache Changes
- Use `huggingface_hub` cache by @lhoestq in #7105
  - use the `huggingface_hub` cache for files downloaded from HF, by default at `~/.cache/huggingface/hub`
  - cached datasets (Arrow files) will still be reloaded from the `datasets` cache, by default at `~/.cache/huggingface/datasets`
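
To make the split between the two caches concrete, here is a minimal sketch of how the default locations listed above could be resolved. The helper `resolve_cache_dirs` is hypothetical, and the override order through the `HF_HOME`, `HF_HUB_CACHE` and `HF_DATASETS_CACHE` environment variables is an assumption based on the libraries' documented settings; these notes only state the two default paths.

```python
import os
from pathlib import Path


def resolve_cache_dirs(env=None):
    """Return (hub_cache, datasets_cache) for a given environment mapping.

    Hypothetical helper: it mirrors the defaults from the notes above and
    assumes the usual environment-variable overrides.
    """
    env = os.environ if env is None else env
    # Assumed base directory; HF_HOME replaces it when set.
    hf_home = Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    # Files downloaded from HF (huggingface_hub cache).
    hub_cache = Path(env.get("HF_HUB_CACHE", hf_home / "hub")).expanduser()
    # Processed datasets stored as Arrow files (datasets cache).
    datasets_cache = Path(env.get("HF_DATASETS_CACHE", hf_home / "datasets")).expanduser()
    return hub_cache, datasets_cache


hub_cache, datasets_cache = resolve_cache_dirs()
print(f"downloaded files: {hub_cache}")
print(f"Arrow files:      {datasets_cache}")
```

Passing an explicit mapping (for example `resolve_cache_dirs({"HF_HOME": "/data/hf"})`) shows where both caches would land without touching the real environment.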
Breaking changes
- Remove deprecated code by @albertvillanova in #6996
  - removed deprecated arguments like `use_auth_token`, `fs` or `ignore_verifications`
- Remove beam by @albertvillanova in #6987
  - removed deprecated apache beam datasets support
- Remove metrics by @albertvillanova in #6983
  - remove deprecated `load_metric`, please use the `evaluate` library instead
- Remove tasks by @albertvillanova in #6999
  - remove deprecated `task` argument in `load_dataset()`, `.prepare_for_task()` method, `datasets.tasks` module
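
Code that still passes the removed arguments will need updating. The sketch below shows one possible way to rewrite old keyword arguments before calling `load_dataset()`. The helper `migrate_load_dataset_kwargs` is hypothetical, and the replacements it picks (`token` for `use_auth_token`, `verification_mode="no_checks"` for `ignore_verifications=True`) are assumptions drawn from the earlier deprecation messages, not from these notes.

```python
import warnings


def migrate_load_dataset_kwargs(kwargs):
    """Return a copy of kwargs without the arguments removed in 3.0.0.

    Hypothetical helper; the replacement names are assumptions.
    """
    new = dict(kwargs)
    # Assumed replacement: `use_auth_token` -> `token`.
    if "use_auth_token" in new:
        new.setdefault("token", new.pop("use_auth_token"))
    # Assumed replacement: `ignore_verifications=True` -> `verification_mode="no_checks"`.
    if new.pop("ignore_verifications", False):
        new.setdefault("verification_mode", "no_checks")
    # `task` was removed outright, so it is dropped with a warning.
    if "task" in new:
        warnings.warn("`task` is no longer supported; dropping it", stacklevel=2)
        del new["task"]
    return new


old = {"split": "train", "use_auth_token": True, "ignore_verifications": True}
print(migrate_load_dataset_kwargs(old))
```

The result can then be splatted into the call, e.g. `load_dataset(name, **migrate_load_dataset_kwargs(old))`, where `name` stands for whatever dataset the existing code loads.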
General improvements and bug fixes
- Improved the tutorial by adding a link for loading datasets by @AmboThom in #7042
- Automatically create `cache_dir` from `cache_file_name` by @ringohoffman in #7096
- remove more script docs by @lhoestq in #7104
- Fix args of feature docstrings by @albertvillanova in #7103
- Temporarily pin numpy<2.1 to fix CI by @albertvillanova in #7114
- Fix ConnectionError for gated datasets and unauthenticated users by @albertvillanova in #7110
- Install transformers with numpy-2 CI by @albertvillanova in #7119
- don't mention the script if trust_remote_code=False by @severo in #7120
- Fix typed examples iterable state dict by @lhoestq in #7121
- Rename LargeList.dtype to LargeList.feature by @albertvillanova in #7106
- Fix wrong SHA in CI tests of HubDatasetModuleFactoryWithParquetExport by @albertvillanova in #7125
- Disable implicit token in CI by @albertvillanova in #7126
- Test get_dataset_config_info with non-existing/gated/private dataset by @albertvillanova in #7124
- fix streaming from arrow files by @fschlatt in #7083
New Contributors
Full Changelog: 2.21.0...3.0.0