Dataset Changes
- New: The Pile
  - Add The Pile dataset and PubMed Central subset by @albertvillanova in #3287
  - Add The Pile Free Law subset by @albertvillanova in #3359
  - Add The Pile USPTO subset by @albertvillanova in #3360
  - Add The Pile subsets by @albertvillanova in #3378
  - Add The Pile Enron Emails subset by @albertvillanova in #3427
- New: British Library Books Genre by @davanstrien in #3312
- New: Americas NLI by @fdschmidt93 in #3371
- New: Speech commands by @polinaeterna in #3335
- New: eli5_category by @jingshenSN2 in #3420
- New: OneStopQA by @scaperex in #3436
- Update: LABR - make the dataset streamable by @albertvillanova in #3352
- Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in #3376
- Update: beans, cats_vs_dogs, cifar10, cifar100, fashion_mnist, mnist, head_qa - use the new Image feature type + streaming support by @mariosasko in #3362
- Update: CC100 - add Georgian data by @AnzorGozalishvili in #3383
- Update: disaster_response_messages - update download URLs (+ add validation split) by @mariosasko in #3426
- Update: swahili_news - update to new version by @albertvillanova in #3463
- Fix: WikiAuto, Jeopardy, definite_pronoun_resolution - fix URLs by @LashaO in #3266
- Fix: QED - fix type of bridge field by @mariosasko in #3417
- Fix: ASSET - fix dataset data URLs by @tianjianjiang in #3342
Dataset Features
- Add Image feature by @mariosasko in #3163
- to_tf_dataset() refactor by @Rocketknight1 in #3356
- More robust None handling by @mariosasko in #3195
- Add cast_column to IterableDataset by @mariosasko in #3439
- Support streaming zipped dataset repo by passing only repo name by @albertvillanova in #3375
- Extend support for streaming datasets that use pd.read_excel by @albertvillanova in #3355
- Extend iter_archive to support file object input by @albertvillanova in #3443
- Extend text to support yielding lines, paragraphs or documents by @albertvillanova in #3442
- Push dataset_infos.json to Hub to preserve feature types by @lhoestq in #3467
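
As a rough illustration of the "lines, paragraphs or documents" granularities mentioned for the text loader above, here is a minimal pure-Python sketch. It is not the library's implementation: the function name `split_text`, the parameter `granularity`, and the assumption that paragraphs are separated by a blank line are all hypothetical and chosen only for this example.

```python
from typing import List


def split_text(text: str, granularity: str = "line") -> List[str]:
    """Split raw text into examples at the requested granularity.

    Hypothetical helper for illustration only; not part of the library.
    """
    if granularity == "line":
        # One example per line; line endings are dropped.
        return text.splitlines()
    if granularity == "paragraph":
        # Assumption: paragraphs are separated by one blank line.
        return [p for p in text.split("\n\n") if p]
    if granularity == "document":
        # The whole input becomes a single example.
        return [text]
    raise ValueError(f"unknown granularity: {granularity!r}")


sample = "first line\nsecond line\n\nthird line"
lines = split_text(sample, "line")
paragraphs = split_text(sample, "paragraph")
documents = split_text(sample, "document")
```

With this sample, the line mode keeps the blank separator as an empty example, the paragraph mode yields two examples, and the document mode yields the input unchanged as one example.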
Dataset Cards
- Change TriviaQA license (#3313) by @avinashsai in #3330
- Add missing tags to XTREME by @mariosasko in #3322
- Remove duplicate name from dataset cards by @albertvillanova in #3354
- Fix typos in dataset cards by @albertvillanova in #3386
- Fix duplicated tag in wikicorpus dataset card by @lhoestq in #3458
Dataset Tasks
- Create Language Modeling task by @albertvillanova in #3387
Metric Changes
- BLEURT: Match key names to correspond with filename by @jaehlee in #3348
- Fix links in metrics description by @albertvillanova in #3461
- Fix METEOR missing NLTK's omw-1.4 by @lhoestq in #3469
Docs
- Add ArrayXD docs by @stevhliu in #3344
- Document a training loop for streaming dataset by @lhoestq in #3370
- Fix formatting in IterableDataset.map docs by @mariosasko in #3395
- Correctly indent builder config in dataset script docs by @mariosasko in #3432
- Update BLEURT hyperlink by @lewtun in #3437
Additional Improvements and Bug Fixes
- Quick fix error formatting by @NouamaneTazi in #3328
- Fix error message and add extension fallback by @mariosasko in #3332
- Avoid content-encoding issue while streaming datasets by @albertvillanova in #3350
- Fix JSON ClassLabel casting for integers by @lhoestq in #3340
- Better error message when download fails by @lhoestq in #3343
- Fix dict source_datasets tagset validator by @albertvillanova in #3368
- Fix typo in other-structured-to-text task tag by @albertvillanova in #3367
- Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in #3296
- Fix flaky test of the temporary directory used by load_from_disk by @lhoestq in #3388
- More robust first elem check in encode/cast example by @mariosasko in #3402
- Fix module inference for archive with a directory by @albertvillanova in #3406
- Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in #3410
- Pass new_fingerprint in multiprocessing by @lhoestq in #3409
- Fix flaky test again for s3 serialization by @lhoestq in #3412
- Skip None encoding (line deleted by accident in #3195) by @mariosasko in #3414
- Clean squad dummy data by @lhoestq in #3428
- #3337 Add typing overloads to Dataset.__getitem__ for mypy by @Dref360 in #3382
- Make cast cacheable (again) on Windows by @mariosasko in #3429
- Use max number of data files to infer module by @albertvillanova in #3407
- Fix iter_archive generator by @albertvillanova in #3454
- [Staging] Update dataset repos automatically on the Hub by @lhoestq in #3451
- Update supported versions of Python in setup.py by @mariosasko in #3438
- Raise exceptions instead of using assertions by @manisnesan in #3349
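
For context on the last item above: `assert` statements are skipped entirely when Python runs with the `-O` flag, so input validation written as an assertion can silently disappear. The sketch below shows the general pattern of replacing an assertion with an explicit exception. The names `VALID_SPLITS` and `validate_split` are hypothetical and do not come from the library or from the PR.

```python
# Hypothetical set of accepted values, for illustration only.
VALID_SPLITS = ("train", "validation", "test")


def validate_split(name: str) -> str:
    """Return the split name unchanged, or raise if it is not recognised."""
    # Before (fragile): assert name in VALID_SPLITS, "unknown split"
    # The assertion above would be removed under `python -O`.
    if name not in VALID_SPLITS:
        raise ValueError(
            f"Unknown split {name!r}; expected one of {VALID_SPLITS}"
        )
    return name
```

An explicit `ValueError` is raised regardless of interpreter flags and gives callers a specific exception type to catch.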
New Contributors
- @avinashsai made their first contribution in #3330
- @NouamaneTazi made their first contribution in #3328
- @davanstrien made their first contribution in #3312
- @francisco-perez-sorrosal made their first contribution in #3296
- @LashaO made their first contribution in #3266
- @fdschmidt93 made their first contribution in #3371
- @polinaeterna made their first contribution in #3335
- @AnzorGozalishvili made their first contribution in #3383
- @tianjianjiang made their first contribution in #3342
- @jingshenSN2 made their first contribution in #3420
- @scaperex made their first contribution in #3436
Full Changelog: 1.16.1...1.17.0