huggingface/datasets 1.17.0 on GitHub

Dataset Changes

New: The Pile
- Add The Pile dataset and PubMed Central subset by @albertvillanova in #3287
- Add The Pile Free Law subset by @albertvillanova in #3359
- Add The Pile USPTO subset by @albertvillanova in #3360
- Add The Pile subsets by @albertvillanova in #3378
- Add The Pile Enron Emails subset by @albertvillanova in #3427
New: British Library Books Genre by @davanstrien in #3312
New: Americas NLI by @fdschmidt93 in #3371
New: Speech commands by @polinaeterna in #3335
New: eli5_category by @jingshenSN2 in #3420
New: OneStopQa by @scaperex in #3436
Update: LABR - make the dataset streamable by @albertvillanova in #3352
Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in #3376
Update: beans, cats_vs_dogs, cifar10, cifar100, fashion_mnist, mnist, head_qa: use the new Image feature type + streaming support by @mariosasko in #3362
Update: CC100 - add Georgian data by @AnzorGozalishvili in #3383
Update: disaster_response_messages - update download urls (+ add validation split) by @mariosasko in #3426
Update: swahili_news - update to new version by @albertvillanova in #3463
Fix: WikiAuto, Jeopardy, definite_pronoun_resolution - fix URLs by @LashaO in #3266
Fix: QED - fix type of bridge field by @mariosasko in #3417
Fix: ASSET - fix dataset data URLs by @tianjianjiang in #3342

Dataset Features

Add Image feature by @mariosasko in #3163
to_tf_dataset() refactor by @Rocketknight1 in #3356
More robust None handling by @mariosasko in #3195
Add cast_column to IterableDataset by @mariosasko in #3439
Support streaming zipped dataset repo by passing only repo name by @albertvillanova in #3375
Extend support for streaming datasets that use pd.read_excel by @albertvillanova in #3355
Extend iter_archive to support file object input by @albertvillanova in #3443
Extend text to support yielding lines, paragraphs or documents by @albertvillanova in #3442
Push dataset_infos.json to Hub to preserve feature types by @lhoestq in #3467
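
As a rough illustration of the three granularities named in #3442 (lines, paragraphs, documents), the standalone sketch below splits a string at each level. It is not the library's implementation: the helper `sample_text` is hypothetical, the parameter name `sample_by` with the values "line", "paragraph" and "document" is an assumption made for this sketch, and paragraphs are assumed to be separated by a blank line.

```python
from typing import Iterator


def sample_text(text: str, sample_by: str = "line") -> Iterator[str]:
    """Yield pieces of `text` at the requested granularity (illustrative only)."""
    if sample_by == "line":
        # One example per line, without the trailing newline.
        yield from text.splitlines()
    elif sample_by == "paragraph":
        # Assumption: paragraphs are separated by a blank line.
        for paragraph in text.split("\n\n"):
            if paragraph:
                yield paragraph
    elif sample_by == "document":
        # The whole input becomes a single example.
        yield text
    else:
        raise ValueError(f"unknown sample_by: {sample_by!r}")


corpus = "first line\nsecond line\n\nnew paragraph"
lines = list(sample_text(corpus, "line"))
paragraphs = list(sample_text(corpus, "paragraph"))
documents = list(sample_text(corpus, "document"))
```

In this sketch, line mode keeps the blank separator line as an empty example, while paragraph mode drops empty pieces; the actual loader may treat these edge cases differently.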

Dataset Cards

Change TriviaQA license (#3313) by @avinashsai in #3330
Add missing tags to XTREME by @mariosasko in #3322
Remove duplicate name from dataset cards by @albertvillanova in #3354
Fix typos in dataset cards by @albertvillanova in #3386
Fix duplicated tag in wikicorpus dataset card by @lhoestq in #3458

Dataset Tasks

Create Language Modeling task by @albertvillanova in #3387

Metric Changes

BLEURT: Match key names to correspond with filename by @jaehlee in #3348
Fix links in metrics description by @albertvillanova in #3461
Fix METEOR missing NLTK's omw-1.4 by @lhoestq in #3469

Docs

Add ArrayXD docs by @stevhliu in #3344
Document a training loop for streaming dataset by @lhoestq in #3370
Fix formatting in IterableDataset.map docs by @mariosasko in #3395
Correctly indent builder config in dataset script docs by @mariosasko in #3432
Update BLEURT hyperlink by @lewtun in #3437

Additional improvements and bug fixes

Quick fix error formatting by @NouamaneTazi in #3328
Fix error message and add extension fallback by @mariosasko in #3332
Avoid content-encoding issue while streaming datasets by @albertvillanova in #3350
Fix JSON ClassLabel casting for integers by @lhoestq in #3340
Better error message when download fails by @lhoestq in #3343
Fix dict source_datasets tagset validator by @albertvillanova in #3368
Fix typo in other-structured-to-text task tag by @albertvillanova in #3367
Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in #3296
Fix flaky test of the temporary directory used by load_from_disk by @lhoestq in #3388
More robust first elem check in encode/cast example by @mariosasko in #3402
Fix module inference for archive with a directory by @albertvillanova in #3406
Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in #3410
Pass new_fingerprint in multiprocessing by @lhoestq in #3409
Fix flaky test again for s3 serialization by @lhoestq in #3412
Skip None encoding (line deleted by accident in #3195) by @mariosasko in #3414
Clean squad dummy data by @lhoestq in #3428
#3337 Add typing overloads to Dataset.__getitem__ for mypy by @Dref360 in #3382
Make cast cacheable (again) on Windows by @mariosasko in #3429
Use max number of data files to infer module by @albertvillanova in #3407
Fix iter_archive generator by @albertvillanova in #3454
[Staging] Update dataset repos automatically on the Hub by @lhoestq in #3451
Update supported versions of Python in setup.py by @mariosasko in #3438
Raise exception instead of using assertions by @manisnesan in #3349

New Contributors

@avinashsai made their first contribution in #3330
@NouamaneTazi made their first contribution in #3328
@davanstrien made their first contribution in #3312
@francisco-perez-sorrosal made their first contribution in #3296
@LashaO made their first contribution in #3266
@fdschmidt93 made their first contribution in #3371
@polinaeterna made their first contribution in #3335
@AnzorGozalishvili made their first contribution in #3383
@tianjianjiang made their first contribution in #3342
@jingshenSN2 made their first contribution in #3420
@scaperex made their first contribution in #3436

Full Changelog: 1.16.1...1.17.0