Important: caching
- Datasets downloaded and cached using `datasets>=2.14.0` may not be reloaded from cache using older versions of `datasets` (and are therefore re-downloaded).
- Datasets that were already cached are still supported.
- This affects datasets on Hugging Face without dataset scripts, e.g. made of pure parquet, csv, jsonl, etc. files.
- This is because the default configuration name for those datasets has been fixed (from "username--dataset_name" to "default") in #5331.
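For illustration, a minimal sketch of what the new default configuration name looks like in practice, assuming a placeholder script-less repo `username/dataset_name` with a `train` split:

```python
from datasets import load_dataset

# Script-less dataset made of plain csv/parquet/jsonl files on the Hub
# (the repo id is a placeholder).
ds = load_dataset("username/dataset_name")

# With datasets>=2.14.0 the configuration is named "default" instead of
# "username--dataset_name", so cache entries written here are not found
# by older versions of the library, which re-download the data instead.
print(ds["train"].config_name)  # "default"
```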
Dataset Configuration
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - Configure your dataset using YAML at the top of your dataset card (docs here)
  - Choose which file goes into which split

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data.csv
  - split: test
    path: holdout.csv
---
```

  - Define multiple dataset configurations

```yaml
---
configs:
- config_name: main_data
  data_files: main_data.csv
- config_name: additional_data
  data_files: additional_data.csv
---
```
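Configurations declared this way can then be loaded by name. A minimal sketch, assuming a placeholder repo `username/dataset_name` whose card contains the YAML above:

```python
from datasets import load_dataset

# Each config_name from the card YAML becomes a loadable configuration.
main_data = load_dataset("username/dataset_name", "main_data")
additional_data = load_dataset("username/dataset_name", "additional_data")
```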
Dataset Features
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - `push_to_hub()` additional dataset configurations

```python
from datasets import load_dataset

# push an additional configuration of an existing Dataset `ds`
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")
```
- Support returning dataframe in map transform by @mariosasko in #5995
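A minimal sketch of what this allows: a batched `map` function can return a `pandas.DataFrame` rather than a dict of lists (the column names here are purely illustrative):

```python
import pandas as pd
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello world", "hi"]})

# The batched map function returns a pandas DataFrame; its columns
# become the columns of the resulting dataset.
def add_length(batch):
    df = pd.DataFrame(batch)
    df["length"] = df["text"].str.len()
    return df

ds = ds.map(add_length, batched=True)
print(ds.column_names)  # ['text', 'length']
```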
What's Changed
- Deprecate `errors` param in favor of `encoding_errors` in text builder by @mariosasko in #5974
- Fix select_columns columns order by @lhoestq in #5994
- Replace metadata utils with `huggingface_hub`'s RepoCard API by @mariosasko in #5949
- Pin `joblib` to avoid `joblibspark` test failures by @mariosasko in #6000
- Align `column_names` type check with type hint in `sort` by @mariosasko in #6001
- Deprecate `use_auth_token` in favor of `token` by @mariosasko in #5996
- Drop Python 3.7 support by @mariosasko in #6005
- Misc improvements by @mariosasko in #6004
- Make IterableDataset.from_spark more efficient by @mathewjacob1002 in #5986
- Fix cast for dictionaries with no keys by @mariosasko in #6009
- Avoid stuck map operation when subprocesses crashes by @pappacena in #5976
- Deprecate task api by @mariosasko in #5865
- Add metadata ui screenshot in docs by @lhoestq in #6015
- Fix `ClassLabel` min max check for `None` values by @mariosasko in #6023
- [docs] Update return statement of index search by @stevhliu in #6021
- Improve logging by @mariosasko in #6019
- Fix style with ruff 0.0.278 by @lhoestq in #6026
- Don't reference self in Spark._validate_cache_dir by @maddiedawson in #6024
- Delete `task_templates` in `IterableDataset` when they are no longer valid by @mariosasko in #6027
- [docs] Fix link by @stevhliu in #6029
- fixed typo in comment by @NightMachinery in #6030
- Fix legacy_dataset_infos by @lhoestq in #6040
- Flatten repository_structure docs on yaml by @lhoestq in #6041
- Use new hffs by @lhoestq in #6028
- Bump dev version by @lhoestq in #6047
- Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in #6042
- Rename "pattern" to "path" in YAML data_files configs by @lhoestq in #6044
- Remove `HfFileSystem` and deprecate `S3FileSystem` by @mariosasko in #6052
- Dill 3.7 support by @mariosasko in #6061
- Improve `Dataset.from_list` docstring by @mariosasko in #6062
- Check if column names match in Parquet loader only when config `features` are specified by @mariosasko in #6045
- Release: 2.14.0 by @lhoestq in #6063
New Contributors
- @mathewjacob1002 made their first contribution in #5986
- @pappacena made their first contribution in #5976
Full Changelog: 2.13.1...2.14.0