github huggingface/datasets 3.0.0

7 days ago

Dataset Features

  • Use Polars functions in .map()
    • Allow Polars as valid output type by @psmyth94 in #6762

    • Example:

      >>> import polars as pl
      >>> from datasets import load_dataset
      >>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
      >>> cols = [pl.col("content").str.len_bytes().alias("length")]
      >>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
      >>> ds_with_length[:5]
      shape: (5, 5)
      ┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
      │ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
      │ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
      │ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
      ╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
      │ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
      │ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
      │ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
      │ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
      │ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
      └─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
  • Support NumPy 2

Cache Changes

  • Use huggingface_hub cache by @lhoestq in #7105
    • use the huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
    • cached datasets (Arrow files) will still be reloaded from the datasets cache, by default at ~/.cache/huggingface/datasets
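
To make the two cache locations above concrete, here is a minimal sketch that resolves both default directories using only the standard library. The helper name `default_cache_dirs` is hypothetical. The override variables `HF_HOME`, `HF_HUB_CACHE` and `HF_DATASETS_CACHE` are assumptions taken from the libraries' general documentation, not from these release notes, and the libraries' own resolution logic may consult additional settings.

```python
import os
from pathlib import Path


def default_cache_dirs(env=None):
    """Return (hub_cache, datasets_cache) under the assumed default layout."""
    env = os.environ if env is None else env
    # Assumed root directory: HF_HOME, falling back to ~/.cache/huggingface.
    hf_home = Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    # Files downloaded from HF live in the huggingface_hub cache.
    hub_cache = Path(env.get("HF_HUB_CACHE", hf_home / "hub")).expanduser()
    # Prepared datasets (Arrow files) live in the datasets cache.
    datasets_cache = Path(env.get("HF_DATASETS_CACHE", hf_home / "datasets")).expanduser()
    return hub_cache, datasets_cache


hub_cache, datasets_cache = default_cache_dirs()
print("hub cache:     ", hub_cache)
print("datasets cache:", datasets_cache)
```

Printing both paths before and after upgrading is one way to see which directory holds raw downloads and which holds the Arrow files.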

Breaking changes

  • Remove deprecated code by @albertvillanova in #6996
    • removed deprecated arguments like use_auth_token, fs or ignore_verifications
  • Remove beam by @albertvillanova in #6987
    • removed deprecated apache beam datasets support
  • Remove metrics by @albertvillanova in #6983
    • removed deprecated load_metric; please use the evaluate library instead
  • Remove tasks by @albertvillanova in #6999
    • removed deprecated task argument in load_dataset(), .prepare_for_task() method, datasets.tasks module
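
For code that still passes the removed arguments, one possible migration aid is a small keyword-argument rewriter, sketched below. The helper `migrate_load_kwargs` is hypothetical and not part of datasets. The replacements it applies (`token` for `use_auth_token`, `verification_mode` for `ignore_verifications`) are assumptions based on the earlier deprecation notices, not on these release notes, so check them against the version you have installed.

```python
def migrate_load_kwargs(kwargs):
    """Rewrite pre-3.0 keyword arguments into their assumed replacements."""
    new = dict(kwargs)  # never mutate the caller's dict
    # Assumption: `token` replaces the removed `use_auth_token`.
    if "use_auth_token" in new:
        new.setdefault("token", new.pop("use_auth_token"))
    # Assumption: `verification_mode` replaces the removed boolean
    # `ignore_verifications` ("no_checks" when it was True).
    if "ignore_verifications" in new:
        skip = new.pop("ignore_verifications")
        new.setdefault("verification_mode", "no_checks" if skip else "all_checks")
    # The notes name no replacement for `task`, so fail loudly instead of guessing.
    if "task" in new:
        raise ValueError("`task` was removed in datasets 3.0.0; drop it from the call")
    return new


old = {"split": "train", "use_auth_token": True, "ignore_verifications": True}
print(migrate_load_kwargs(old))
```

The rewritten dict could then be unpacked into a call such as `load_dataset(name, **migrate_load_kwargs(old))`. An explicit new-style argument already present in the input takes precedence over the translated old one.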

General improvements and bug fixes

New Contributors

Full Changelog: 2.21.0...3.0.0
