Important: caching
- Datasets downloaded and cached using `datasets>=2.14.0` may not be reloaded from cache using older versions of `datasets` (and are therefore re-downloaded).
- Datasets that were already cached are still supported.
- This affects datasets on Hugging Face without dataset scripts, e.g. made of pure parquet, csv, jsonl, etc. files.
- This is because the default configuration name for those datasets has been fixed (from "username--dataset_name" to "default") in #5331.
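For illustration, a minimal sketch of what the new default configuration name looks like in practice, assuming a placeholder script-less repo `username/dataset_name` with a `train` split:

```python
from datasets import load_dataset

# Script-less dataset made of plain csv/parquet/jsonl files on the Hub
# (the repo id is a placeholder).
ds = load_dataset("username/dataset_name")

# With datasets>=2.14.0 the configuration is named "default" instead of
# "username--dataset_name", so cache entries written here are not found
# by older versions of the library, which re-download the data instead.
print(ds["train"].config_name)  # "default"
```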
Dataset Configuration
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - Configure your dataset using YAML at the top of your dataset card (docs here)
  - Choose which file goes into which split

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data.csv
  - split: test
    path: holdout.csv
---
```

  - Define multiple dataset configurations

```yaml
---
configs:
- config_name: main_data
  data_files: main_data.csv
- config_name: additional_data
  data_files: additional_data.csv
---
```
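Configurations declared this way can then be loaded by name. A minimal sketch, assuming a placeholder repo `username/dataset_name` whose card contains the YAML above:

```python
from datasets import load_dataset

# Each config_name from the card YAML becomes a loadable configuration.
main_data = load_dataset("username/dataset_name", "main_data")
additional_data = load_dataset("username/dataset_name", "additional_data")
```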
Dataset Features
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - `push_to_hub()` additional dataset configurations

```python
from datasets import load_dataset

# push an additional configuration of an existing Dataset `ds`
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")
```
- Support returning dataframe in map transform by @mariosasko in #5995
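A minimal sketch of what this allows: a batched `map` function can return a `pandas.DataFrame` rather than a dict of lists (the column names here are purely illustrative):

```python
import pandas as pd
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello world", "hi"]})

# The batched map function returns a pandas DataFrame; its columns
# become the columns of the resulting dataset.
def add_length(batch):
    df = pd.DataFrame(batch)
    df["length"] = df["text"].str.len()
    return df

ds = ds.map(add_length, batched=True)
print(ds.column_names)  # ['text', 'length']
```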
What's Changed
- Deprecate `errors` param in favor of `encoding_errors` in text builder by @mariosasko in #5974
- Fix select_columns columns order by @lhoestq in #5994
- Replace metadata utils with `huggingface_hub`'s RepoCard API by @mariosasko in #5949
- Pin `joblib` to avoid `joblibspark` test failures by @mariosasko in #6000
- Align `column_names` type check with type hint in `sort` by @mariosasko in #6001
- Deprecate `use_auth_token` in favor of `token` by @mariosasko in #5996
- Drop Python 3.7 support by @mariosasko in #6005
- Misc improvements by @mariosasko in #6004
- Make IterableDataset.from_spark more efficient by @mathewjacob1002 in #5986
- Fix cast for dictionaries with no keys by @mariosasko in #6009
- Avoid stuck map operation when subprocesses crashes by @pappacena in #5976
- Deprecate task api by @mariosasko in #5865
- Add metadata ui screenshot in docs by @lhoestq in #6015
- Fix `ClassLabel` min max check for `None` values by @mariosasko in #6023
- [docs] Update return statement of index search by @stevhliu in #6021
- Improve logging by @mariosasko in #6019
- Fix style with ruff 0.0.278 by @lhoestq in #6026
- Don't reference self in Spark._validate_cache_dir by @maddiedawson in #6024
- Delete `task_templates` in `IterableDataset` when they are no longer valid by @mariosasko in #6027
- [docs] Fix link by @stevhliu in #6029
- fixed typo in comment by @NightMachinery in #6030
- Fix legacy_dataset_infos by @lhoestq in #6040
- Flatten repository_structure docs on yaml by @lhoestq in #6041
- Use new hffs by @lhoestq in #6028
- Bump dev version by @lhoestq in #6047
- Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in #6042
- Rename "pattern" to "path" in YAML data_files configs by @lhoestq in #6044
- Remove `HfFileSystem` and deprecate `S3FileSystem` by @mariosasko in #6052
- Dill 3.7 support by @mariosasko in #6061
- Improve `Dataset.from_list` docstring by @mariosasko in #6062
- Check if column names match in Parquet loader only when config `features` are specified by @mariosasko in #6045
- Release: 2.14.0 by @lhoestq in #6063
New Contributors
- @mathewjacob1002 made their first contribution in #5986
- @pappacena made their first contribution in #5976
Full Changelog: 2.13.1...2.14.0