datasets 2.2.0 on Python PyPI

Dataset Changes

New: ImageNet by @apsdehal in #4178
- Manual download only for now
New: Google Conceptual Captions by @abhishekkrthakur in #1459
New: Conceptual 12M by @thomasw21 in #4162
New: Visual Genome by @thomasw21 in #4161
New: RVL-CDIP by @dnaveenr in #4050
New: Text-based NP Enrichment (TNE) by @yanaiela in #4153
New: TextVQA by @apsdehal in #3967
New: ETT time series dataset by @kashif in #4213
Update: assin2 - update metadata by @lhoestq in #4172
Update: Librispeech - Add 'all' config by @patrickvonplaten in #4184
Update: XGLUE - Support streaming dataset by @albertvillanova in #4249
Update: crd3 - group all the turns in one example by @shanyas10 in #4240
Update: pubmed_qa - Remove google drive URL by @lhoestq in #4255
Update: SAMSum - Replace data URL dataset and support streaming by @albertvillanova in #4254
Update: SAMSum - Replace data URL dataset within the same repository by @albertvillanova in #4267
Update: big_patent - Replace data URL in dataset and support streaming by @albertvillanova in #4236
Update: openbookqa - Add missing features for additional config by @albertvillanova in #4278
Update: commonsense_qa - Add missing features by @albertvillanova in #4280
Fix: Common Voice - Make sure bytes are correctly deleted if path exists by @patrickvonplaten in #4212
Fix: openbookqa - fix bug in choices labels by @manandey in #4259
Fix: openbookqa - fix style in openbookqa dataset by @albertvillanova in #4270

Dataset Features

Add support for metadata files to imagefolder by @mariosasko in #4069
- Load a folder of images together with metadata stored in metadata.jsonl; see the documentation on how to load an image dataset for more info
Infer splits from the data_dir parameter when loading datasets without script by @polinaeterna in #4144
- Splits are inferred from the directory and file names; see the documentation on how to structure your repository for more info
Enable label alignment for token classification datasets by @lewtun in #4277
Add drop_last_batch to IterableDataset.map by @mariosasko in #4215
Load dataset with TSV files by @albertvillanova in #4246
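
As an illustration of the imagefolder metadata support listed above, here is a minimal sketch. It assumes the folder already contains the image files and that each line of metadata.jsonl is a JSON object whose "file_name" key names one of them. The helper names `write_metadata` and `load_image_folder` and the "caption" column are hypothetical and not part of the library.

```python
import json
from pathlib import Path


def write_metadata(image_dir, rows):
    """Write one JSON object per line to <image_dir>/metadata.jsonl.

    Each row needs a "file_name" key naming an image in the folder;
    any other keys (for example a hypothetical "caption") become extra columns.
    """
    path = Path(image_dir) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            if "file_name" not in row:
                raise ValueError("every metadata row needs a 'file_name' key")
            f.write(json.dumps(row) + "\n")
    return path


def load_image_folder(image_dir):
    """Load the folder with the packaged imagefolder builder.

    Assumes datasets>=2.2.0 and Pillow are installed.
    """
    # Imported lazily so write_metadata works without the library installed.
    from datasets import load_dataset

    return load_dataset("imagefolder", data_dir=str(image_dir))
```

When a metadata file is present, labels are not inferred from file names, and an error is raised if metadata rows and images are not linked correctly (see #4311 under Documentation).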

Dataset Cards

Autoeval config by @nrajani in #4234
- Add train-eval-index metadata to automate evaluation on your datasets based on their tasks
Adding license information for Openbookcorpus by @meg-huggingface in #3525
Make code for image downloading from image urls cacheable by @mariosasko in #4218
Fix description links in dataset cards by @albertvillanova in #4222
Add YAML tags to Dataset Card rotten tomatoes by @mo6zes in #4262
Remove a copy-paste sentence in dataset cards by @albertvillanova in #4281
Update LexGLUE README.md by @iliaschalkidis in #4285
Leaderboard info added for TNE by @yanaiela in #4273
Add Lahnda language tag by @mariosasko in #4286
Add license and point of contact to big_patent dataset by @albertvillanova in #4269
Add HF Speech Bench to Librispeech Dataset Card by @sanchit-gandhi in #4266

Metrics Changes

Perplexity Speedup by @emibaylor in #4108
Add AUC ROC Metric by @emibaylor in #4158
Small fixes in ROC AUC docs by @wschella in #4239
Fix/start token mask issue and update documentation by @TristanThrush in #4258
Add pearsonr metric card, update functionality to match the original docs by @emibaylor in #4226
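
For readers unfamiliar with the ROC AUC score added above, the following standard-library sketch shows what the binary case measures: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. The function name `binary_roc_auc` is hypothetical, and this is not the library's implementation.

```python
def binary_roc_auc(references, prediction_scores):
    """Return the binary ROC AUC as a pairwise ranking probability.

    references: iterable of 0/1 labels.
    prediction_scores: iterable of scores, higher meaning "more likely positive".
    """
    pairs = list(zip(references, prediction_scores))
    positives = [score for label, score in pairs if label == 1]
    negatives = [score for label, score in pairs if label == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative reference")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive ranked above negative
            elif pos == neg:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of examples, so it is meant to clarify the definition rather than to replace an optimized implementation.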

Metric Cards

Metric card for the XTREME-S dataset by @sashavor in #4251
Creating metric card for MAE by @sashavor in #4252
Create metric cards for mean IOU by @sashavor in #4253
Create metric card for Mahalanobis Distance by @sashavor in #4257
Create metric card for MSE by @sashavor in #4256
Fix exact match by @emibaylor in #4166
Fix google bleu typos, examples by @emibaylor in #4165
Add f1 metric card, update docstring in py file by @emibaylor in #4227
Add Recall Metric Card by @emibaylor in #4204
Matthews Correlation Metric Card by @emibaylor in #4110
Add Precision Metric Card by @emibaylor in #4203
Add Accuracy Metric Card by @emibaylor in #4223
Add Spearmanr Metric Card by @emibaylor in #4109
Metric card template by @emibaylor in #3915

Documentation

Document save_to_disk and push_to_hub on images and audio files by @lhoestq in #4193
Add to docs how to load from local script by @albertvillanova in #4200
Add code examples to API docs by @stevhliu in #4168
Add code examples for DatasetDict by @stevhliu in #4245
Add API code examples for IterableDataset by @stevhliu in #4274
Add packaged builder configs to the documentation by @lhoestq in #4307
[Imagefolder] Docs + Don't infer labels from file names when there are metadata + Error messages when metadata and images aren't linked correctly by @lhoestq in #4311

General improvements and bug fixes

Generate tasks.json taxonomy from huggingface_hub by @julien-c in #4154
Fix when map function modifies input in-place by @thomasw21 in #4174
Support streaming cnn_dailymail dataset by @albertvillanova in #4188
Don't duplicate data when encoding audio or image by @lhoestq in #4187
Fix outdated docstring about default dataset config by @lhoestq in #4186
Deprecate shard_size in push_to_hub in favor of max_shard_size by @mariosasko in #4190
Fix some type annotation in doc by @thomasw21 in #4202
Update GH template for dataset viewer issues by @albertvillanova in #4201
Update auth when mirroring datasets on the hub by @lhoestq in #4242
Rename imagenet2012 -> imagenet-1k by @lhoestq in #4263
Skip checksum computation in Imagefolder by default by @mariosasko in #4214
Fix convert_file_size_to_int for kilobits and megabits by @mariosasko in #4205
Fix typo in logging docs by @stevhliu in #4272
Bump PyArrow Version to 6 by @dnaveenr in #4250
task id update by @nrajani in #4244
Avoid recursion error in map if example is returned as dict value by @mariosasko in #4216
Update minimal PyArrow version warning by @mariosasko in #4279
[Minor edit] Fix typo in class name by @cakiki in #4207
Stream private zipped images by @lhoestq in #4173
Fix filesystem docstring by @stevhliu in #4283
Document how to use FAISS index for special operations by @albertvillanova in #4189
Contributing MedMCQA dataset by @monk1337 in #4064
Don't do unnecessary list type casting to avoid replacing None values by empty lists by @lhoestq in #4282
Fix missing lz4 dependency for tests by @albertvillanova in #4295
Altered faiss installation comment by @vishalsrao in #4220
Fix CLI run_beam save_infos by @albertvillanova in #4294
Add missing faiss import to fix #4287 by @alvarobartt in #4288
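
The max_shard_size change and the kilobit/megabit fix above both concern human-readable size strings. The sketch below shows one way such strings could be converted to bytes, assuming a trailing "B" means bytes and a trailing "b" means bits. The function `parse_size` is hypothetical and is not the library's convert_file_size_to_int.

```python
import re

# Decimal and binary multipliers, keyed by upper-cased prefix.
_MULTIPLIERS = {
    "K": 10**3, "M": 10**6, "G": 10**9,
    "KI": 2**10, "MI": 2**20, "GI": 2**30,
}


def parse_size(size):
    """Convert a size such as "500MB", "1GiB", or "8Kb" to a number of bytes.

    Integers are returned unchanged. A trailing "B" is read as bytes and a
    trailing "b" as bits, which are divided by 8.
    """
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"(\d+)\s*([KMGkmg]i?)([Bb])", size.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    number, prefix, unit = match.groups()
    value = int(number) * _MULTIPLIERS[prefix.upper()]
    return value // 8 if unit == "b" else value
```

A string in this style is what push_to_hub accepts for its max_shard_size argument, for example max_shard_size="500MB", in place of the deprecated shard_size.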

New Contributors

@shanyas10 made their first contribution in #4240
@apsdehal made their first contribution in #4178
@wschella made their first contribution in #4239
@TristanThrush made their first contribution in #4258
@yanaiela made their first contribution in #4153
@mo6zes made their first contribution in #4262
@nrajani made their first contribution in #4244
@sanchit-gandhi made their first contribution in #4266
@cakiki made their first contribution in #4207
@monk1337 made their first contribution in #4064
@alvarobartt made their first contribution in #4288

Full Changelog: 2.1.0...2.2.0