Dataset Changes
- New: ImageNet by @apsdehal in #4178
  - Manual download only for now
- New: Google Conceptual Captions by @abhishekkrthakur in #1459
- New: Conceptual 12M by @thomasw21 in #4162
- New: Visual Genome by @thomasw21 in #4161
- New: RVL-CDIP by @dnaveenr in #4050
- New: Text-based NP Enrichment (TNE) by @yanaiela in #4153
- New: TextVQA by @apsdehal in #3967
- New: ETT time series dataset by @kashif in #4213
- Update: assin2 - update metadata by @lhoestq in #4172
- Update: Librispeech - Add 'all' config by @patrickvonplaten in #4184
- Update: XGLUE - Support streaming dataset by @albertvillanova in #4249
- Update: crd3 - group all the turns in one example by @shanyas10 in #4240
- Update: pubmed_qa - Remove google drive URL by @lhoestq in #4255
- Update: SAMSum - Replace data URL dataset and support streaming by @albertvillanova in #4254
- Update: SAMSum - Replace data URL dataset within the same repository by @albertvillanova in #4267
- Update: big_patent - Replace data URL in dataset and support streaming by @albertvillanova in #4236
- Update: openbookqa - Add missing features for additional config by @albertvillanova in #4278
- Update: commonsense_qa - Add missing features by @albertvillanova in #4280
- Fix: Common Voice - Make sure bytes are correctly deleted if path exists by @patrickvonplaten in #4212
- Fix: openbookqa - fix bug in choices labels by @manandey in #4259
- Fix: openbookqa - fix style in openbookqa dataset by @albertvillanova in #4270
Dataset Features
- Add support for metadata files to imagefolder by @mariosasko in #4069
  - load a folder of images and metadata stored in metadata.jsonl, more info in the documentation on how to load an image dataset
- Infer splits from the data_dir parameter when loading datasets without script by @polinaeterna in #4144
  - splits are inferred from the directory and file names, see more info in the documentation on how to structure your repository
- Enable label alignment for token classification datasets by @lewtun in #4277
- Add drop_last_batch to IterableDataset.map by @mariosasko in #4215
- Load dataset with TSV files by @albertvillanova in #4246
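
As a rough illustration of the imagefolder metadata support listed above, the sketch below writes a metadata.jsonl file with one JSON object per line. The helper name `write_metadata`, the file names, and the `caption` column are hypothetical; the `file_name` key is assumed to be the column that links each row to an image, so check the documentation on loading an image dataset for the exact requirements.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_metadata(folder: Path, rows: List[Dict[str, str]]) -> Path:
    """Write one JSON object per line to <folder>/metadata.jsonl."""
    path = folder / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


# Hypothetical rows: the file names and captions are made up for illustration.
rows = [
    {"file_name": "0001.png", "caption": "a placeholder caption"},
    {"file_name": "0002.png", "caption": "another placeholder caption"},
]

folder = Path(tempfile.mkdtemp())
metadata_path = write_metadata(folder, rows)

# With the real image files placed next to metadata.jsonl, loading would
# look roughly like this (not executed here, since no images exist):
#   from datasets import load_dataset
#   ds = load_dataset("imagefolder", data_dir=str(folder))
```

Keeping the metadata in a single JSON Lines file next to the images means extra columns can be added without renaming or moving any image.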
Dataset Cards
- Autoeval config by @nrajani in #4234
  - Add train-eval-index metadata to automate evaluation on your datasets based on their tasks
- Adding license information for Openbookcorpus by @meg-huggingface in #3525
- Make code for image downloading from image urls cacheable by @mariosasko in #4218
- Fix description links in dataset cards by @albertvillanova in #4222
- Add YAML tags to Dataset Card rotten tomatoes by @mo6zes in #4262
- Remove a copy-paste sentence in dataset cards by @albertvillanova in #4281
- Update LexGLUE README.md by @iliaschalkidis in #4285
- leaderboard info added for TNE by @yanaiela in #4273
- Add Lahnda language tag by @mariosasko in #4286
- Add license and point of contact to big_patent dataset by @albertvillanova in #4269
- Add HF Speech Bench to Librispeech Dataset Card by @sanchit-gandhi in #4266
Metrics Changes
- Perplexity Speedup by @emibaylor in #4108
- Add AUC ROC Metric by @emibaylor in #4158
- Small fixes in ROC AUC docs by @wschella in #4239
- Fix/start token mask issue and update documentation by @TristanThrush in #4258
- Add pearsonr mc, update functionality to match the original docs by @emibaylor in #4226
Metric Cards
- Metric card for the XTREME-S dataset by @sashavor in #4251
- Creating metric card for MAE by @sashavor in #4252
- Create metric cards for mean IOU by @sashavor in #4253
- Create metric card for Mahalanobis Distance by @sashavor in #4257
- Create metric card for MSE by @sashavor in #4256
- Fix exact match by @emibaylor in #4166
- Fix google bleu typos, examples by @emibaylor in #4165
- Add f1 metric card, update docstring in py file by @emibaylor in #4227
- Add Recall Metric Card by @emibaylor in #4204
- Matthews Correlation Metric Card by @emibaylor in #4110
- Add Precision Metric Card by @emibaylor in #4203
- Add Accuracy Metric Card by @emibaylor in #4223
- Add Spearmanr Metric Card by @emibaylor in #4109
- Metric card template by @emibaylor in #3915
Documentation
- Document save_to_disk and push_to_hub on images and audio files by @lhoestq in #4193
- Add to docs how to load from local script by @albertvillanova in #4200
- Add code examples to API docs by @stevhliu in #4168
- Add code examples for DatasetDict by @stevhliu in #4245
- Add API code examples for IterableDataset by @stevhliu in #4274
- Add packaged builder configs to the documentation by @lhoestq in #4307
- [Imagefolder] Docs + Don't infer labels from file names when there are metadata + Error messages when metadata and images aren't linked correctly by @lhoestq in #4311
General improvements and bug fixes
- Generate tasks.json taxonomy from huggingface_hub by @julien-c in #4154
- Fix when map function modifies input in-place by @thomasw21 in #4174
- Support streaming cnn_dailymail dataset by @albertvillanova in #4188
- Don't duplicate data when encoding audio or image by @lhoestq in #4187
- Fix outdated docstring about default dataset config by @lhoestq in #4186
- Deprecate shard_size in push_to_hub in favor of max_shard_size by @mariosasko in #4190
- Fix some type annotation in doc by @thomasw21 in #4202
- Update GH template for dataset viewer issues by @albertvillanova in #4201
- Update auth when mirroring datasets on the hub by @lhoestq in #4242
- Rename imagenet2012 -> imagenet-1k by @lhoestq in #4263
- Skip checksum computation in Imagefolder by default by @mariosasko in #4214
- Fix convert_file_size_to_int for kilobits and megabits by @mariosasko in #4205
- Fix typo in logging docs by @stevhliu in #4272
- Bump PyArrow Version to 6 by @dnaveenr in #4250
- task id update by @nrajani in #4244
- Avoid recursion error in map if example is returned as dict value by @mariosasko in #4216
- Update minimal PyArrow version warning by @mariosasko in #4279
- [Minor edit] Fix typo in class name by @cakiki in #4207
- Stream private zipped images by @lhoestq in #4173
- Fix filesystem docstring by @stevhliu in #4283
- Document how to use FAISS index for special operations by @albertvillanova in #4189
- Contributing MedMCQA dataset by @monk1337 in #4064
- Don't do unnecessary list type casting to avoid replacing None values by empty lists by @lhoestq in #4282
- Fix missing lz4 dependency for tests by @albertvillanova in #4295
- Altered faiss installation comment by @vishalsrao in #4220
- Fix CLI run_beam save_infos by @albertvillanova in #4294
- Add missing faiss import to fix #4287 by @alvarobartt in #4288
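
The max_shard_size and convert_file_size_to_int entries above both deal with human-readable size strings. The sketch below is a minimal illustration of why kilobits and megabits need their own handling. The `parse_size` helper is hypothetical and is not the library's implementation; the unit table and the convention that a trailing lowercase `b` means bits are assumptions made for this example.

```python
import re

# Assumed multipliers: decimal (K, M, G) and binary (Ki, Mi, Gi) prefixes.
_MULTIPLIERS = {
    "K": 10**3, "M": 10**6, "G": 10**9,
    "KI": 2**10, "MI": 2**20, "GI": 2**30,
}

_SIZE_RE = re.compile(r"(\d+)\s*([KkMmGg][Ii]?)([Bb])")


def parse_size(size: str) -> int:
    """Convert a string such as '500MB' or '8Kb' to a number of bytes.

    Hypothetical helper: an uppercase 'B' suffix is read as bytes and a
    lowercase 'b' suffix as bits, which are divided by 8.
    """
    match = _SIZE_RE.fullmatch(size.strip())
    if match is None:
        raise ValueError(f"Unrecognized size string: {size!r}")
    number, prefix, unit = match.groups()
    value = int(number) * _MULTIPLIERS[prefix.upper()]
    # Bits are converted to whole bytes with integer division.
    return value // 8 if unit == "b" else value


print(parse_size("500MB"))
print(parse_size("1Mb"))
```

Treating `Mb` the same as `MB` would overstate the size by a factor of eight, which is the kind of mistake the case-sensitive suffix check is meant to avoid.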
New Contributors
- @shanyas10 made their first contribution in #4240
- @apsdehal made their first contribution in #4178
- @wschella made their first contribution in #4239
- @TristanThrush made their first contribution in #4258
- @yanaiela made their first contribution in #4153
- @mo6zes made their first contribution in #4262
- @nrajani made their first contribution in #4244
- @sanchit-gandhi made their first contribution in #4266
- @cakiki made their first contribution in #4207
- @monk1337 made their first contribution in #4064
- @alvarobartt made their first contribution in #4288
Full Changelog: 2.1.0...2.2.0