huggingface/datasets 2.5.0 on GitHub

Important

Drop Python 3.6 support by @mariosasko in #4460
Deprecate metrics by @albertvillanova in #4739
- Metrics are now deprecated and have been moved to evaluate:
```
# pip install evaluate  (in a notebook: !pip install evaluate)
import evaluate
metric = evaluate.load("accuracy")
```
Load GitHub datasets from Hub by @albertvillanova in #4059
- Datasets with no namespace, like "squad", were previously loaded from this GitHub repository; they are now loaded from https://huggingface.co/datasets
Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
- The latest version of torchaudio (0.12) now requires ffmpeg (version 4) to read MP3 files; please downgrade to torchaudio<0.12.0 for now or use librosa
Use HTTP requests to access data and metadata through the Datasets REST API (docs here)

Datasets features

No-code loaders

Add AudioFolder packaged loader by @polinaeterna in #4530
Add support for CSV metadata files to ImageFolder by @mariosasko in #4837
Add support for parsing JSON files in array form by @mariosasko in #4997

Dataset methods

add Dataset.from_list by @sanderland in #4890
Add Dataset.from_generator by @mariosasko in #4957
Add oversampling strategies to interleave datasets by @ylacombe in #4831
Preserve non-input_columns in Dataset.map if input_columns are specified by @mariosasko in #4971
Add fn_kwargs param to IterableDataset.map by @mariosasko in #4975
More rigorous shape inference in to_tf_dataset by @Rocketknight1 in #4763

Parquet support

Download and prepare as Parquet for cloud storage by @lhoestq in #4724
Shard parquet in download_and_prepare by @lhoestq in #4747
Embed image/audio data in dl_and_prepare parquet by @lhoestq in #4987

Datasets changes

Update: natural questions - Add long answer candidates by @seirasto in #4368
Update: opus_paracrawl - update version by @albertvillanova in #4816
Update: ReCoRD - Include entity positions as feature by @richarddwang in #4479
Update: swda - Support streaming by @albertvillanova in #4914
Update: Enwik8 - update broken link and information by @mtanghu in #4
Update: compguesswhat - Support streaming by @albertvillanova in #4968
Update: nli_tr - Support streaming by @albertvillanova in #4970
Update: IndicGLUE - update download links by @sumanthd17 in #4978
Update: iwslt2017 - Support streaming by @albertvillanova in #4992
Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in #4788
Fix: mkqa - Update data URL by @albertvillanova in #4823
Fix: exams - fix bug and checksums by @albertvillanova in #4853
Fix: trec - use fine classes by @albertvillanova in #4801
Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in #4871
Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in #4904
Fix: compguesswhat - fix data URLs by @albertvillanova in #4959
Fix: vivos - fix data URL and metadata by @albertvillanova in #4969
Fix: MBPP - Add splits by @cwarny in #4943

Dataset cards

Add language_bcp47 tag by @lhoestq in #4753
Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in #4701
Remove "unkown" language tags by @lhoestq in #4754
Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in #4712
Added dataset information in clinic oos dataset card by @Arnav-Ladkat in #4751
Fix opus_gnome dataset card by @gojiteji in #4806
Complete the mlqa dataset card by @eldhoittangeorge in #4809
Fix loading example in opus dataset cards by @albertvillanova in #4813
Add missing language tags to resources by @albertvillanova in #4819
Fix titles in dataset cards by @albertvillanova in #4824
Fix language tags in dataset cards by @albertvillanova in #4826
Add license metadata to pg19 by @julien-c in #4827
Fix task tags in dataset cards by @albertvillanova in #4830
Fix tags in dataset cards by @albertvillanova in #4832
Fix missing tags in dataset cards by @albertvillanova in #4833
Fix documentation card of recipe_nlg dataset by @albertvillanova in #4834
Fix documentation card of ethos dataset by @albertvillanova in #4835
Update documentation card of miam dataset by @PierreColombo in #4846
Update stackexchange license by @cakiki in #4842
Update ted_talks_iwslt license to include ND by @cakiki in #4841
Fix documentation card of adv_glue dataset by @albertvillanova in #4838
Complete tags of superglue dataset card by @richarddwang in #4867
Fix license tag and Source Data section in billsum dataset card by @kashif in #4851
Fix documentation card of covid_qa_castorini dataset by @albertvillanova in #4877
Fix Citation Information section in dataset cards by @albertvillanova in #4879
Fix documentation card of math_qa dataset by @albertvillanova in #4884
Added names of less-studied languages by @BenjaminGalliot in #4880
Fix language tags resource file by @albertvillanova in #4882
Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in #4892
Add citation information to makhzan dataset by @albertvillanova in #4894
Fix missing tags in dataset cards by @albertvillanova in #4891
Fix missing tags in dataset cards by @albertvillanova in #4896
Re-add code and und language tags by @albertvillanova in #4899
Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in #4887
Update GLUE evaluation metadata by @lewtun in #4909
Fix missing tags in dataset cards by @albertvillanova in #4908
Add license and citation information to cosmos_qa dataset by @albertvillanova in #4913
Fix missing tags in dataset cards by @albertvillanova in #4921
Add cc-by-nc-2.0 to list of licenses by @albertvillanova in #4930
Fix missing tags in dataset cards by @albertvillanova in #4931
Add Papers with Code ID to scifact dataset by @albertvillanova in #4941
Fix license information in qasc dataset card by @albertvillanova in #4951
Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in #4940
Fix missing tags in dataset cards by @albertvillanova in #4979
Fix missing tags in dataset cards by @albertvillanova in #4991

Documentation

Update map docs by @stevhliu in #4743
Add image classification processing guide by @stevhliu in #4748
Fix train_test_split docs by @NielsRogge in #4821
Update local loading script docs by @stevhliu in #4778
Docs for creating a loading script for image datasets by @stevhliu in #4783
Docs for creating an audio dataset by @stevhliu in #4872

General improvements and bug fixes

Use CI unit/integration tests by @albertvillanova in #4738
Fix multiprocessing in map_nested by @albertvillanova in #4740
Add 2.4.0 version added to docstrings by @albertvillanova in #4767
Update CI badge by @mariosasko in #4764
Fix version in map_nested docstring by @albertvillanova in #4765
fix typo by @xwwwwww in #4770
Unpin rouge_score test dependency by @albertvillanova in #4768
Remove apache_beam import from module level in natural_questions dataset by @albertvillanova in #4780
Require torchaudio<0.12.0 to avoid RuntimeError by @albertvillanova in #4777
Remove dummy data generation docs by @stevhliu in #4771
Require torchaudio<0.12.0 in docs by @albertvillanova in #4785
Fix bug in function validate_type for Python >= 3.9 by @albertvillanova in #4812
Fix typo in streaming docs by @flozi00 in #4843
Fix test of _get_extraction_protocol for TAR files by @albertvillanova in #4850
Fix typos in documentation by @fl-lo in https://github.com/huggingface/datasets/pull/
Mark CI tests as xfail if Hub HTTP error by @albertvillanova in #4845
[Windows] Fix Access Denied when using os.rename() by @DougTrajano in #4825
[docs] Some tiny doc tweaks by @julien-c in #4874
Document loading from relative path by @stevhliu in #4773
Fix CI reporting by @albertvillanova in #4903
Add 'val' to VALIDATION_KEYWORDS. by @akt42 in #4844
Raise ManualDownloadError from get_dataset_config_info by @albertvillanova in #4901
feat: improve error message on Keys mismatch. closes #4917 by @PaulLerner in #4919
Fixes a typo in loading documentation by @sighingnow in #4929
Remove main branch rename notice by @lhoestq in #4938
Fix NonMatchingChecksumError in adv_glue dataset by @albertvillanova in #4939
Remove deprecated identical_ok by @lhoestq in #4937
Pin TensorFlow temporarily by @albertvillanova in #4954
Fix minor typo in error message for missing imports by @mariosasko in #4948
Fix TF tests for 2.10 by @Rocketknight1 in #4956
fix BLEU metric card by @antoniolanza1996 in #4927
Update doc upload_dataset.mdx by @mishig25 in #4789
Improve features resolution in streaming by @lhoestq in #4762
Fix label renaming and add a battery of tests by @Rocketknight1 in #4781
Strip "/" in local dataset path to avoid empty dataset name error by @apohllo in #4967
Introduce regex check when pushing as well by @LysandreJik in #4946
[doc] Fix broken snippet that had too many quotes by @tomaarsen in #4986
Fix map batched with torch output by @lhoestq in #4972
fix: avoid casting tuples after Dataset.map by @szmoro in #4993
decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
Don't add a tag on the Hub on release by @lhoestq in #4998
Add EmptyDatasetError by @lhoestq in #4999

New Contributors

@seirasto made their first contribution in #4368
@sbroadhurst-hf made their first contribution in #4712
@nawarhalabi made their first contribution in #4701
@Arnav-Ladkat made their first contribution in #4751
@xwwwwww made their first contribution in #4770
@gojiteji made their first contribution in #4806
@eldhoittangeorge made their first contribution in #4809
@flozi00 made their first contribution in #4843
@fl-lo made their first contribution in #4869
@BenjaminGalliot made their first contribution in #4880
@DougTrajano made their first contribution in #4825
@ylacombe made their first contribution in #4831
@osanseviero made their first contribution in #4887
@akt42 made their first contribution in #4844
@sanderland made their first contribution in #4890
@sighingnow made their first contribution in #4929
@mtanghu made their first contribution in #4950
@antoniolanza1996 made their first contribution in #4927
@apohllo made their first contribution in #4967
@cwarny made their first contribution in #4943
@tomaarsen made their first contribution in #4986
@szmoro made their first contribution in #4993

Full Changelog: 2.4.0...2.5.0