Security features
- Add trust_remote_code argument by @lhoestq in #6429
  - Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument trust_remote_code=True.
  - Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
  - Using the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0 you can already disable custom code by default without waiting for the next release of datasets.
- Use parquet export if possible by @lhoestq in #6448
  - This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
  - You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
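The two security items above can be sketched roughly as follows. This is a minimal illustration, not the library's own code: the helper names parquet_export_url and load_with_custom_code are hypothetical, and it assumes the environment variable is read when datasets is imported, so it is set before the import happens.

```python
import os
from urllib.parse import quote

# Opt out of custom code by default. Assumption: the variable must be set
# before `datasets` is imported for the library to pick it up.
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"


def parquet_export_url(repo_id: str) -> str:
    """Build the URL of a dataset's Parquet export branch (hypothetical helper)."""
    # The branch name contains slashes, so it is percent-encoded in the URL.
    branch = quote("refs/convert/parquet", safe="")
    return f"https://hf.co/datasets/{repo_id}/tree/{branch}"


def load_with_custom_code(repo_id: str):
    """Load a script-based dataset, explicitly accepting its custom code."""
    # Imported lazily so the rest of this sketch runs without `datasets` installed.
    from datasets import load_dataset

    # Only pass trust_remote_code=True after inspecting the repository content.
    return load_dataset(repo_id, trust_remote_code=True)


print(parquet_export_url("username/dataset_name"))
```

Calling load_with_custom_code requires the datasets package and network access; the URL helper only formats a string.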
Features
- Webdataset dataset builder by @lhoestq in #6391
- Implement get dataset default config name by @albertvillanova in #6511
- Lazy data files resolution and offline cache reload by @lhoestq in #6493
  - This speeds up the load_dataset step that lists the data files of big repositories (up to x100) but requires huggingface_hub 0.20 or newer
  - Fix load_dataset that used to reload data from cache even if the dataset was updated on Hugging Face
  - Reload a dataset from your cache even if you don't have internet connection
  - New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
  - Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
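To make the new cache directory scheme concrete, here is a small sketch that assembles a path of that shape. The function no_script_cache_dir is a hypothetical helper written for illustration, not part of datasets, and the argument values in the example are made up. It also assumes the default cache root shown above; the actual root may differ if it has been configured otherwise.

```python
from pathlib import Path

# Default cache root as written in the scheme above (assumed, may be overridden).
DEFAULT_CACHE_ROOT = Path("~/.cache/huggingface/datasets")


def no_script_cache_dir(
    username: str,
    dataset_name: str,
    config_name: str,
    version: str,
    commit_sha: str,
    cache_root: Path = DEFAULT_CACHE_ROOT,
) -> Path:
    """Assemble username___dataset_name/config_name/version/commit_sha under the root."""
    # The repository owner and dataset name are joined with three underscores.
    repo_dir = f"{username}___{dataset_name}"
    return cache_root.expanduser() / repo_dir / config_name / version / commit_sha


# Example with made-up values, printed relative to an explicit root.
example = no_script_cache_dir(
    "username", "dataset_name", "default", "0.0.0", "abc123",
    cache_root=Path("/tmp/hf_cache"),
)
print(example.as_posix())
```

Because the commit SHA is the last path component, each dataset revision gets its own directory under this scheme.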
General improvements and bug fixes
- Remove unused argument in _get_data_files_patterns by @lhoestq in #6343
- Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in #6414
- Use ruff for formatting by @mariosasko in #6434
- Create DatasetNotFoundError and DataFilesNotFoundError by @albertvillanova in #6431
- Fix multi gpu map example by @lhoestq in #6415
- Better tqdm wrapper by @mariosasko in #6433
- Remove Table.__getstate__ and Table.__setstate__ by @LZHgrla in #6444
- Use filelock package for file locking by @mariosasko in #6445
- Fix metadata file resolution when inferred pattern is ** by @mariosasko in #6449
- Update hub-docs reference by @mishig25 in #6453
- Refactor dill logic by @mariosasko in #6454
- Don't require trust_remote_code in inspect_dataset by @lhoestq in #6456
- [docs] troubleshooting guide by @MKhalusova in #6424
- Missing DatasetNotFoundError by @lhoestq in #6462
- Disable benchmarks in PRs by @lhoestq in #6463
- More robust temporary directory deletion by @mariosasko in #6426
- Fix shard retry mechanism in push_to_hub by @mariosasko in #6461
- Use auth to get parquet export by @lhoestq in #6468
- Remove delete doc CI by @lhoestq in #6471
- Fix CI quality by @albertvillanova in #6473
- Fix PermissionError on Windows CI by @albertvillanova in #6477
- More robust preupload retry mechanism by @mariosasko in #6479
- Add IterableDataset __repr__ by @lhoestq in #6480
- Fix max lock length on unix by @lhoestq in #6482
- Fix ArrayXD YAML conversion by @mariosasko in #6168
- Fix docs phrasing about supported formats when sharing a dataset by @albertvillanova in #6486
- Fix deprecation warning when building conda package by @albertvillanova in #6425
- Make push_to_hub return CommitInfo by @albertvillanova in #6492
- docs: add reference Git over SSH by @severo in #6499
- Fallback on dataset script if user wants to load default config by @lhoestq in #6498
- Don't expand_info in HF glob by @lhoestq in #6469
- Fix streaming xnli by @lhoestq in #6503
- Pickle support for torch.Generator objects by @mariosasko in #6502
- Enable setting config as default when push_to_hub by @albertvillanova in #6500
- Better cast error when generating dataset by @lhoestq in #6509
- Replace list_files_info with list_repo_tree in push_to_hub by @mariosasko in #6510
- Remove deprecated HfFolder by @lhoestq in #6512
- Support huggingface-hub pre-releases by @albertvillanova in #6516
- Support push_to_hub canonical datasets by @albertvillanova in #6519
- Support commit_description parameter in push_to_hub by @albertvillanova in #6520
- fix get_metadata_patterns function args error by @d710055071 in #6518
- Fix metrics dead link by @qgallouedec in #6491
- fix tests by @lhoestq in #6523
- Cache backward compatibility with 2.15.0 by @lhoestq in #6514
- Preserve order of configs and splits when using Parquet exports by @albertvillanova in #6526
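As an illustration of the usedforsecurity=False item in the list above, the sketch below hashes bytes for a non-security purpose such as fingerprinting. The function name fingerprint and the choice of SHA-256 are assumptions made for the example, not a description of what datasets does internally. The usedforsecurity keyword is available in Python 3.9 and newer.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hex digest used only as an identifier, not for security."""
    # usedforsecurity=False marks the hash as non-cryptographic use, which is
    # what lets such calls pass on FIPS-restricted Python builds.
    return hashlib.sha256(data, usedforsecurity=False).hexdigest()


print(fingerprint(b"example bytes"))
```

The flag does not change the digest value; it only declares how the result is used.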
New Contributors
- @LZHgrla made their first contribution in #6444
- @d710055071 made their first contribution in #6518
Full Changelog: 2.15.0...2.16.0