github huggingface/datasets 2.14.0

15 months ago

Important: caching

  • Datasets downloaded and cached using datasets>=2.14.0 may not be reloadable from the cache by older versions of datasets, which will re-download them instead.
  • Datasets that were already cached are still supported.
  • This affects datasets on Hugging Face without dataset scripts, e.g. those made only of parquet, csv, or jsonl files.
  • This is because the default configuration name for those datasets has been fixed (from "username--dataset_name" to "default") in #5331.
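
As a rough illustration of the rename described above, the sketch below models which default configuration name a script-less dataset would get on either side of the 2.14.0 boundary. The helpers `parse_version` and `default_config_name` are hypothetical names introduced here, not part of the datasets API, and the real cache layout involves more than this one name.

```python
def parse_version(version: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split(".")[:3])


def default_config_name(repo_id: str, datasets_version: str) -> str:
    """Hypothetical model of the default config name for a script-less dataset.

    Before 2.14.0 the name was derived from the repository id
    ("username--dataset_name"); from 2.14.0 on it is "default".
    """
    if parse_version(datasets_version) >= (2, 14, 0):
        return "default"
    return repo_id.replace("/", "--")


old_name = default_config_name("username/dataset_name", "2.13.1")
new_name = default_config_name("username/dataset_name", "2.14.0")
print(old_name)  # username--dataset_name
print(new_name)  # default
```

Because the two names differ, an older version would look for a cache entry under a name that a newer version never wrote, which is consistent with the re-download behaviour noted above.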

Dataset Configuration

  • Support for multiple configs via metadata yaml info by @polinaeterna in #5331

    • Configure your dataset using YAML at the top of your dataset card (docs here)
    • Choose which file goes into which split
      ---
      configs:
      - config_name: default
        data_files:
        - split: train
          path: data.csv
        - split: test
          path: holdout.csv
      ---
    • Define multiple dataset configurations
      ---
      configs:
      - config_name: main_data
        data_files: main_data.csv
      - config_name: additional_data
        data_files: additional_data.csv
      ---
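
To make the YAML layout concrete, the sketch below mirrors the two examples above as plain Python data and resolves a config name to its split-to-file mapping. The `CONFIGS` structure and the `data_files_for` helper are illustrative assumptions, not the library's internals; a bare file name is assumed here to mean a single `train` split.

```python
# Python mirror of the two YAML examples above, combined (illustrative data only).
CONFIGS = [
    {
        "config_name": "default",
        "data_files": [
            {"split": "train", "path": "data.csv"},
            {"split": "test", "path": "holdout.csv"},
        ],
    },
    {"config_name": "main_data", "data_files": "main_data.csv"},
    {"config_name": "additional_data", "data_files": "additional_data.csv"},
]


def data_files_for(configs, config_name):
    """Return a {split: path} mapping for the named config (hypothetical helper)."""
    for config in configs:
        if config["config_name"] != config_name:
            continue
        data_files = config["data_files"]
        if isinstance(data_files, str):
            # Assumption: a single file with no split given is treated as "train".
            return {"train": data_files}
        return {entry["split"]: entry["path"] for entry in data_files}
    raise KeyError(f"unknown config: {config_name}")


print(data_files_for(CONFIGS, "default"))
print(data_files_for(CONFIGS, "main_data"))
```

With the library installed, a specific configuration would be selected by passing its name as the second argument, e.g. load_dataset("username/dataset_name", "main_data").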

Dataset Features

  • Support for multiple configs via metadata yaml info by @polinaeterna in #5331

    • Push additional dataset configurations with push_to_hub()
      from datasets import load_dataset
      # ds is an existing Dataset or DatasetDict
      ds.push_to_hub("username/dataset_name", config_name="additional_data")
      # reload later
      ds = load_dataset("username/dataset_name", "additional_data")
  • Support returning dataframe in map transform by @mariosasko in #5995

What's Changed

New Contributors

Full Changelog: 2.13.1...2.14.0
