Important
- Drop Python 3.6 support by @mariosasko in #4460
- Deprecate metrics by @albertvillanova in #4739
- Metrics are now deprecated and have been moved to evaluate:
  !pip install evaluate
  import evaluate
  metric = evaluate.load("accuracy")
- Load GitHub datasets from Hub by @albertvillanova in #4059
- Datasets with no namespace, like "squad", were previously loaded from this GitHub repository; they are now loaded from https://huggingface.co/datasets
- Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
- The latest version of torchaudio (0.12) now requires ffmpeg (version 4) to read MP3 files; please downgrade to a torchaudio version below 0.12 for now, or use librosa
- Use HTTP requests to access data and metadata through the Datasets REST API (see the REST API docs)
Datasets features
No-code loaders
- Add AudioFolder packaged loader by @polinaeterna in #4530
- Add support for CSV metadata files to ImageFolder by @mariosasko in #4837
- Add support for parsing JSON files in array form by @mariosasko in #4997
Dataset methods
- Add Dataset.from_list by @sanderland in #4890
- Add Dataset.from_generator by @mariosasko in #4957
- Add oversampling strategies to interleave datasets by @ylacombe in #4831
- Preserve non-input_columns in Dataset.map if input_columns are specified by @mariosasko in #4971
- Add fn_kwargs param to IterableDataset.map by @mariosasko in #4975
- More rigorous shape inference in to_tf_dataset by @Rocketknight1 in #4763
Parquet support
- Download and prepare as Parquet for cloud storage by @lhoestq in #4724
- Shard parquet in download_and_prepare by @lhoestq in #4747
- Embed image/audio data in dl_and_prepare parquet by @lhoestq in #4987
Datasets changes
- Update: natural questions - Add long answer candidates by @seirasto in #4368
- Update: opus_paracrawl - update version by @albertvillanova in #4816
- Update: ReCoRD - Include entity positions as feature by @richarddwang in #4479
- Update: swda - Support streaming by @albertvillanova in #4914
- Update: Enwik8 - update broken link and information by @mtanghu in #4
- Update: compguesswhat - Support streaming by @albertvillanova in #4968
- Update: nli_tr - Support streaming by @albertvillanova in #4970
- Update: IndicGLUE - update download links by @sumanthd17 in #4978
- Update: iwslt2017 - Support streaming by @albertvillanova in #4992
- Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in #4788
- Fix: mkqa - Update data URL by @albertvillanova in #4823
- Fix: exams - fix bug and checksums by @albertvillanova in #4853
- Fix: trec - use fine classes by @albertvillanova in #4801
- Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in #4871
- Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in #4904
- Fix: compguesswhat - fix data URLs by @albertvillanova in #4959
- Fix: vivos - fix data URL and metadata by @albertvillanova in #4969
- Fix: MBPP - Add splits by @cwarny in #4943
Dataset cards
- Add language_bcp47 tag by @lhoestq in #4753
- Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in #4701
- Remove "unkown" language tags by @lhoestq in #4754
- Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in #4712
- Added dataset information in clinic oos dataset card by @Arnav-Ladkat in #4751
- Fix opus_gnome dataset card by @gojiteji in #4806
- Complete the mlqa dataset card by @eldhoittangeorge in #4809
- Fix loading example in opus dataset cards by @albertvillanova in #4813
- Add missing language tags to resources by @albertvillanova in #4819
- Fix titles in dataset cards by @albertvillanova in #4824
- Fix language tags in dataset cards by @albertvillanova in #4826
- Add license metadata to pg19 by @julien-c in #4827
- Fix task tags in dataset cards by @albertvillanova in #4830
- Fix tags in dataset cards by @albertvillanova in #4832
- Fix missing tags in dataset cards by @albertvillanova in #4833
- Fix documentation card of recipe_nlg dataset by @albertvillanova in #4834
- Fix documentation card of ethos dataset by @albertvillanova in #4835
- Update documentation card of miam dataset by @PierreColombo in #4846
- Update stackexchange license by @cakiki in #4842
- Update ted_talks_iwslt license to include ND by @cakiki in #4841
- Fix documentation card of adv_glue dataset by @albertvillanova in #4838
- Complete tags of superglue dataset card by @richarddwang in #4867
- Fix license tag and Source Data section in billsum dataset card by @kashif in #4851
- Fix documentation card of covid_qa_castorini dataset by @albertvillanova in #4877
- Fix Citation Information section in dataset cards by @albertvillanova in #4879
- Fix documentation card of math_qa dataset by @albertvillanova in #4884
- Added names of less-studied languages by @BenjaminGalliot in #4880
- Fix language tags resource file by @albertvillanova in #4882
- Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in #4892
- Add citation information to makhzan dataset by @albertvillanova in #4894
- Fix missing tags in dataset cards by @albertvillanova in #4891
- Fix missing tags in dataset cards by @albertvillanova in #4896
- Re-add code and und language tags by @albertvillanova in #4899
- Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in #4887
- Update GLUE evaluation metadata by @lewtun in #4909
- Fix missing tags in dataset cards by @albertvillanova in #4908
- Add license and citation information to cosmos_qa dataset by @albertvillanova in #4913
- Fix missing tags in dataset cards by @albertvillanova in #4921
- Add cc-by-nc-2.0 to list of licenses by @albertvillanova in #4930
- Fix missing tags in dataset cards by @albertvillanova in #4931
- Add Papers with Code ID to scifact dataset by @albertvillanova in #4941
- Fix license information in qasc dataset card by @albertvillanova in #4951
- Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in #4940
- Fix missing tags in dataset cards by @albertvillanova in #4979
- Fix missing tags in dataset cards by @albertvillanova in #4991
Documentation
- Update map docs by @stevhliu in #4743
- Add image classification processing guide by @stevhliu in #4748
- Fix train_test_split docs by @NielsRogge in #4821
- Update local loading script docs by @stevhliu in #4778
- Docs for creating a loading script for image datasets by @stevhliu in #4783
- Docs for creating an audio dataset by @stevhliu in #4872
General improvements and bug fixes
- Use CI unit/integration tests by @albertvillanova in #4738
- Fix multiprocessing in map_nested by @albertvillanova in #4740
- Add 2.4.0 version added to docstrings by @albertvillanova in #4767
- Update CI badge by @mariosasko in #4764
- Fix version in map_nested docstring by @albertvillanova in #4765
- fix typo by @xwwwwww in #4770
- Unpin rouge_score test dependency by @albertvillanova in #4768
- Remove apache_beam import from module level in natural_questions dataset by @albertvillanova in #4780
- Require torchaudio<0.12.0 to avoid RuntimeError by @albertvillanova in #4777
- Remove dummy data generation docs by @stevhliu in #4771
- Require torchaudio<0.12.0 in docs by @albertvillanova in #4785
- Fix bug in function validate_type for Python >= 3.9 by @albertvillanova in #4812
- Fix typo in streaming docs by @flozi00 in #4843
- Fix test of _get_extraction_protocol for TAR files by @albertvillanova in #4850
- Fix typos in documentation by @fl-lo in https://github.com/huggingface/datasets/pull/
- Mark CI tests as xfail if Hub HTTP error by @albertvillanova in #4845
- [Windows] Fix Access Denied when using os.rename() by @DougTrajano in #4825
- [docs] Some tiny doc tweaks by @julien-c in #4874
- Document loading from relative path by @stevhliu in #4773
- Fix CI reporting by @albertvillanova in #4903
- Add 'val' to VALIDATION_KEYWORDS. by @akt42 in #4844
- Raise ManualDownloadError from get_dataset_config_info by @albertvillanova in #4901
- feat: improve error message on Keys mismatch. closes #4917 by @PaulLerner in #4919
- Fixes a typo in loading documentation by @sighingnow in #4929
- Remove main branch rename notice by @lhoestq in #4938
- Fix NonMatchingChecksumError in adv_glue dataset by @albertvillanova in #4939
- Remove deprecated identical_ok by @lhoestq in #4937
- Pin TensorFlow temporarily by @albertvillanova in #4954
- Fix minor typo in error message for missing imports by @mariosasko in #4948
- Fix TF tests for 2.10 by @Rocketknight1 in #4956
- fix BLEU metric card by @antoniolanza1996 in #4927
- Update doc upload_dataset.mdx by @mishig25 in #4789
- Improve features resolution in streaming by @lhoestq in #4762
- Fix label renaming and add a battery of tests by @Rocketknight1 in #4781
- Strip "/" in local dataset path to avoid empty dataset name error by @apohllo in #4967
- Introduce regex check when pushing as well by @LysandreJik in #4946
- [doc] Fix broken snippet that had too many quotes by @tomaarsen in #4986
- Fix map batched with torch output by @lhoestq in #4972
- fix: avoid casting tuples after Dataset.map by @szmoro in #4993
- decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
- Don't add a tag on the Hub on release by @lhoestq in #4998
- Add EmptyDatasetError by @lhoestq in #4999
New Contributors
- @seirasto made their first contribution in #4368
- @sbroadhurst-hf made their first contribution in #4712
- @nawarhalabi made their first contribution in #4701
- @Arnav-Ladkat made their first contribution in #4751
- @xwwwwww made their first contribution in #4770
- @gojiteji made their first contribution in #4806
- @eldhoittangeorge made their first contribution in #4809
- @flozi00 made their first contribution in #4843
- @fl-lo made their first contribution in #4869
- @BenjaminGalliot made their first contribution in #4880
- @DougTrajano made their first contribution in #4825
- @ylacombe made their first contribution in #4831
- @osanseviero made their first contribution in #4887
- @akt42 made their first contribution in #4844
- @sanderland made their first contribution in #4890
- @sighingnow made their first contribution in #4929
- @mtanghu made their first contribution in #4950
- @antoniolanza1996 made their first contribution in #4927
- @apohllo made their first contribution in #4967
- @cwarny made their first contribution in #4943
- @tomaarsen made their first contribution in #4986
- @szmoro made their first contribution in #4993
Full Changelog: 2.4.0...2.5.0