🤗 Datasets 2.0.0
We're happy to announce that our new documentation is available at hf.co/docs/datasets!
Dataset Features
- Load a folder of images using the `imagefolder` dataset loader:
  - Add imagefolder dataset by @nateraw in #2830
  - Faster ImageFolder + add option to drop labels by @mariosasko in #3887
- Push your image and audio datasets on the Hugging Face Hub with `push_to_hub`:
  - Add support for `Audio` and `Image` feature in `push_to_hub` by @mariosasko in #3685
- New processing methods for streaming datasets:
- And more:
  - Add more compression types for `to_json` by @bhavitvyamalik in #3551
  - Multi-GPU support for `FaissIndex` by @rentruewang in #3721
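As a rough illustration of the `imagefolder` loader listed above, the sketch below assumes a folder with one subdirectory per class, from which the loader infers labels. The helper names `make_example_layout` and `load_image_folder` are hypothetical, and the placeholder files are empty rather than real images, so only the layout helper is meant to be run as-is.

```python
from pathlib import Path


def make_example_layout(root):
    """Create a class-per-subfolder layout with empty placeholder files.

    The files are NOT real images; they only show the directory structure
    (root/<label>/<file>) that the loader reads labels from.
    """
    paths = []
    for label in ("cat", "dog"):
        class_dir = Path(root) / label
        class_dir.mkdir(parents=True, exist_ok=True)
        placeholder = class_dir / f"{label}_0.jpg"
        placeholder.touch()
        paths.append(placeholder)
    return paths


def load_image_folder(data_dir, drop_labels=False):
    """Load a folder of real images with the `imagefolder` loader.

    `drop_labels=True` skips the folder-name labels (the option added in #3887).
    """
    # Imported lazily so the layout helper can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("imagefolder", data_dir=data_dir, drop_labels=drop_labels)
```

Once the folder contains real images, the returned dataset can be uploaded with its `push_to_hub` method, assuming you are logged in to the Hugging Face Hub.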
Breaking changes
- API changes for `map` and `shuffle` for datasets loaded in streaming mode:
- Rename GenerateMode to DownloadMode by @albertvillanova in #3759
- Remove deprecated methods/params (preparation for v2.0) by @mariosasko in #3803
- Remove deprecated `remove_columns` param in `filter` by @mariosasko in #3827
- Module namespace cleanup for v2.0 by @mariosasko in #3875
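Two of the breaking changes above can be handled with small code changes. The following is a minimal migration sketch, not an official recipe: the helper names `filter_then_drop` and `force_redownload` are hypothetical, and the second helper assumes `DownloadMode` is importable from the top-level `datasets` package.

```python
def filter_then_drop(dataset, predicate, columns):
    """Stand-in for the removed `remove_columns` parameter of `filter`.

    Filters first, then drops the columns with an explicit `remove_columns` call.
    """
    return dataset.filter(predicate).remove_columns(columns)


def force_redownload(path):
    """Reload a dataset from scratch using the renamed `DownloadMode` enum."""
    # Imported lazily so this snippet can be imported without the library installed.
    from datasets import DownloadMode, load_dataset

    return load_dataset(path, download_mode=DownloadMode.FORCE_REDOWNLOAD)
```

Code that previously imported `GenerateMode` would import `DownloadMode` instead; the sketch assumes the enum members kept their names.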
Dataset Changes
- New: CFPB Consumer Complaints by @kayvane1 in #3617
- New: told-br (brazilian hate speech) by @JAugusto97 in #3683
- New: electricity load diagram by @kashif in #3722
- New: MIT Scene Parsing Benchmark by @mariosasko in #3607
- New: ElkarHizketak v1.0 by @antxa in #3780
- New: wikitablequestions by @SivilTaram in #3870
- New: ontonotes_conll by @richarddwang in #3853
- Update: BnL Historical Newspapers - make the dataset streamable by @albertvillanova in #3616
- Update: Common voice - add validated partition by @shalymin-amzn in #3669
- Update: Common Voice - add local paths to audio files by @lhoestq in #3736
- Update: Common Voice - simplify code by @lhoestq in #3817
- Update: Natural Questions - add dev-only configuration by @albertvillanova in #3699
- Update: pubmed - update data url by @albertvillanova in #3692
- Update: pubmed - make the dataset streamable by @abhi-mosaic in #3740
- Update: RedCaps - make the dataset streamable by @mariosasko in #3737
- Update: cats_vs_dogs - update metadata by @albertvillanova in #3752
- Update: newsroom - update manual download url by @albertvillanova in #3779
- Update: xcopa - update to new version by @albertvillanova in #3810
- Update: cats_vs_dogs size by @mariosasko in #3878
- Fix: sem_eval_2018_task_1 - fix download location by @maxpel in #3643
- Fix: newsqa - fix unique keys by @albertvillanova in #3696
- Fix: The Pile datasets - fix host urls by @albertvillanova in #3627
- Fix: Evidence Infer Treatment - fix dataset script by @albertvillanova in #3718
- Fix: NewsQA - fix dataset script by @albertvillanova in #3734
- Fix: head_qa - fix data url by @albertvillanova in #3766
- Fix: msr_sqa - fix unique keys by @albertvillanova in #3771
- Fix: reddit_tifu - fix data url by @albertvillanova in #3774
- Fix: wiki_lingua - fix spanish data file url by @albertvillanova in #3806
- Fix: beans - fix data urls by @mariosasko in #3890
- Fix: CRD3 - fix NonMatchingChecksumError by @albertvillanova in #3921
- Fix: MultiWOZ 2.2 - fix NonMatchingChecksumError by @albertvillanova in #3922
Dataset cards
- Add code example in wikipedia card by @lhoestq in #3678
- Fix Multi-News dataset metadata and card by @albertvillanova in #3731
- Reddit dataset card additions by @anna-kay in #3781
- Update gigaword card and info by @mariosasko in #3775
- Reddit dataset card contribution by @anna-kay in #3797
Metric Changes
- New: FrugalScore by @moussaKam in #3674
- New: Mahalanobis distance by @JoaoLages in #3794
- New: mIoU by @NielsRogge in #3745
- New: MSE and MAE - V2 by @dnaveenr in #3874
- Fix: METEOR - fix bug due to nltk version by @albertvillanova in #3884
Metric cards
- Add perplexity to metrics by @emibaylor in #3757
- Create SQuAD metric README.md by @sashavor in #3873
- SQuAD v2 metric: create README.md by @sashavor in #3879
- Update README.md for SQuAD v2 metric by @sashavor in #3908
- Update README.md for SQuAD metric by @sashavor in #3907
- Create README.md for WER metric by @sashavor in #3898
- Create README.md for GLUE by @sashavor in #3916
New documentation
General improvements and bug fixes
- Better TQDM output by @mariosasko in #3654
- Prioritize `module.builder_kwargs` over defaults in `TestCommand` by @lvwerra in #3672
- Extend support for streaming datasets that use os.path.relpath by @albertvillanova in #3623
- Add Fon language tag by @albertvillanova in #3620
- Remove unnecessary 'r' arg in by @bryant1410 in #3661
- Fix TestCommand to copy dataset_infos to local dir with only data files by @albertvillanova in #3680
- Upgrade black to version ~=22.0 by @LysandreJik in #3691
- Fix streaming for servers not supporting HTTP range requests by @albertvillanova in #3689
- Pin ElasticSearch by @lhoestq in #3701
- Raise informative error when loading a save_to_disk dataset by @albertvillanova in #3705
- Fix ClassLabel to/from dict when passed names_file by @albertvillanova in #3695
- Fix CI code quality issue by @albertvillanova in #3710
- Check if indices values in `Dataset.select` are within bounds by @mariosasko in #3719
- Pin pandas to avoid bug in streaming mode by @albertvillanova in #3725
- Use config pandas version in CSV dataset builder by @albertvillanova in #3726
- Set base path to hub url for canonical datasets by @lhoestq in #3709
- Fix ValueError message formatting in int2str by @akulchik in #3742
- Patch all module attributes in its namespace by @albertvillanova in #3727
- Fix typo in train split name by @albertvillanova in #3751
- feat: 🎸 generate info if dataset_infos.json does not exist by @severo in #3670
- Support streaming in size estimation function in `push_to_hub` by @mariosasko in #3732
- Expose method and fix param by @severo in #3767
- Fix HfFileSystem docstring by @lhoestq in #3768
- process .opus files (for Multilingual Spoken Words) by @polinaeterna in #3666
- Fix: dataset name is stored in keys by @thomasw21 in #3772
- Use the same seed to shuffle shards and metadata in streaming mode by @lhoestq in #3746
- Start removing canonical datasets logic by @lhoestq in #3777
- Support passing str to iter_files by @albertvillanova in #3783
- Fix Google Drive URL to avoid Virus scan warning by @albertvillanova in #3787
- Skip checksum computation if `ignore_verifications` is `True` by @mariosasko in #3796
- Fix error message in CSV loader for newer Pandas versions by @mariosasko in #3798
- Add `data_dir` to `data_files` resolution and misc improvements to HfFileSystem by @mariosasko in #3791
- Error of writing with different schema, due to nonpreservation of nullability by @richarddwang in #3782
- Handle Nones in PyArrow struct by @mariosasko in #3814
- Fix iter_archive getting reset by @lhoestq in #3815
- Added computer vision tasks by @merveenoyan in #3800
- Fix typo in doc build yml by @mishig25 in #3819
- Allow not specifying feature cols other than `predictions`/`references` in `Metric.compute` by @mariosasko in #3824
- Logo float left by @mishig25 in #3836
- Pin responses to fix CI for Windows by @albertvillanova in #3840
- Fix dead dataset scripts creation link. by @dnaveenr in #3834
- Remove decode: true for image feature in head_qa by @craffel in #3805
- Update faiss device docstring by @lhoestq in #3846
- Update index.mdx margins by @gary149 in #3858
- Fix push_to_hub with null images by @lhoestq in #3856
- Redundant add dataset information and dead link. by @dnaveenr in #3852
- Update image dataset tags by @mariosasko in #3864
- Bring back imgs so that forks don't get broken by @mishig25 in #3866
- Small typos in How-to-train tutorial. by @lkhphuc in #3833
- Small doc fixes by @mishig25 in #3860
- add pandas to env command by @patrickvonplaten in #3871
- Ignore duplicate keys if `ignore_verifications=True` by @mariosasko in #3868
- Update code blocks by @lhoestq in #3863
- Fix download_mode in dataset_module_factory by @albertvillanova in #3876
- Fix some shuffle docs by @lhoestq in #3885
- Fix race condition in doc build by @lhoestq in #3891
- Add default branch for doc building by @sgugger in #3893
- [docs] make dummy data creation optional by @lhoestq in #3894
- Fix code examples indentation by @lhoestq in #3895
- Align tqdm control/cache control with Transformers by @mariosasko in #3897
- Fix CLI test checksums by @albertvillanova in #3892
- Fix Google Drive URL to avoid Virus scan warning in streaming mode by @mariosasko in #3843
- Change the framework switches to the new syntax by @sgugger in #3880
New Contributors
- @kayvane1 made their first contribution in #3617
- @JAugusto97 made their first contribution in #3683
- @shalymin-amzn made their first contribution in #3669
- @kashif made their first contribution in #3722
- @akulchik made their first contribution in #3742
- @abhi-mosaic made their first contribution in #3740
- @emibaylor made their first contribution in #3757
- @anna-kay made their first contribution in #3781
- @JoaoLages made their first contribution in #3794
- @mishig25 made their first contribution in #3690
- @antxa made their first contribution in #3780
- @dnaveenr made their first contribution in #3834
- @lkhphuc made their first contribution in #3833
- @rentruewang made their first contribution in #3721
- @gary149 made their first contribution in #3858
- @NielsRogge made their first contribution in #3745
- @sashavor made their first contribution in #3873
- @SivilTaram made their first contribution in #3870
- Document cases for github datasets by @lhoestq in #3924
- Fix text loader to split only on universal newlines by @albertvillanova in #3910
- Retry HfApi call inside push_to_hub when 504 error by @albertvillanova in #3886
Full Changelog: 1.18.3...2.0.0