Datasets Changes
- New: VCTK
- New: CPPE-5 dataset by @mariosasko in #3517
- New: RedCaps dataset by @mariosasko in #3424
- New: WIDER FACE dataset by @mariosasko in #3413
- New: SVHN dataset by @mariosasko in #3535
- New: BNL newspapers by @davanstrien in #3397
- New: PASS dataset by @mariosasko in #3576
- New: Text2log Dataset by @apergo-ai in #3579
- Update: beans, cats_vs_dogs - Use iter_files instead of str(Path(...)) in image dataset by @mariosasko in #3477
- Update: PIB - update version and make it streamable by @albertvillanova in #3496
- Update: code_x_glue_tt_text_to_text, compguesswhat - Remove print statements in datasets by @mariosasko in #3546
- Update: MuchoCine - add missing tasks by @mariosasko in #3571
- Fix: Tashkeela - fix to yield stripped text by @albertvillanova in #3471
- Fix: asset - change to raw.githubusercontent.com URLs by @VictorSanh in #3516
- Fix: CC100 - use HTTPS for the data source URL by @aajanki in #3519
- Fix: vision datasets - Fix bug in ImageClassification task template by @mariosasko in #3557
- Fix: tweet_qa - fix DuplicatedKeysError and improve card by @mariosasko in #3559
- Fix: mC4 - fix multiple language downloading by @polinaeterna in #3594
- Fix: CoNLL2003:
Datasets Features
- [Time series] Add support for time, date, duration, and decimal dtypes by @mariosasko in #3591
- [Image][Audio] Add flexible casting for Image and Audio + Support nested casting by @lhoestq in #3575
- Allows DatasetDict.filter to have batching option by @thomasw21 in #3506
- Add desc parameter to filter by @mariosasko in #3513
- Add gzip for to_json by @bhavitvyamalik in #3492
- Allow multiple task templates of the same type by @mariosasko in #3562
- Add parameter preserve_index to from_pandas by @Sorrow321 in #3565
- Dataset Streaming:
- Fix str(Path(...)) conversion in streaming on Linux by @mariosasko in #3472
- Extend support for streaming datasets that use ET.parse by @albertvillanova in #3476
- Extend support for streaming datasets that use os.walk by @albertvillanova in #3478
- Fix
Metrics Changes
- Add Mauve metric by @jthickstun in #3573
Dataset cards
- update pretty_name for first 200 datasets by @bhavitvyamalik in #3498
- update pretty_name for all the other datasets by @bhavitvyamalik in #3536
- pib: Update pib dataset card by @albertvillanova in #3501
- arabic_speech_corpus: Adding link to license. by @meg-huggingface in #3524
- Covost2: Update README.md by @meg-huggingface in #3528
- librispeech_asr: Update README.md by @meg-huggingface in #3529
- vivos: Update README.md by @meg-huggingface in #3530
- audio datasets: Audio datacard update - first pass by @meg-huggingface in #3520
- common_language: Update README.md by @meg-huggingface in #3527
- wiki_dpr: Update wiki_dpr README.md by @lhoestq in #3534
- qa4mre: Fix qa4mre tags by @lhoestq in #3574
- HellaSwag: Update HellaSwag README.md by @borgr in #3588
- ANLI: Update ANLI README.md by @borgr in #3590
- tweet_eval: Update README.md by @borgr in #3593
Documentation
- Fix rendering of docs by @albertvillanova in #3470
- Fix to_tf_dataset references in docs by @mariosasko in #3514
- added PII statements and license links to data cards by @mcmillanmajora in #3537
- Readme usage update by @meg-huggingface in #3538
- Update the CC-100 dataset card by @aajanki in #3542
- Research wording for nc licenses by @meg-huggingface in #3539
- Added links to licensing and PII message in vctk dataset by @mcmillanmajora in #3523
- Give clearer instructions to add the YAML tags by @albertvillanova in #3532
General improvements and bug fixes
- Fix overriding of filesystem info by @albertvillanova in #3481
- Update ADD_NEW_DATASET.md by @apergo-ai in #3487
- Fix weird spacing in ManualDownloadError message by @bryant1410 in #3486
- Clone full repo to detect new tags when mirroring datasets on the Hub by @lhoestq in #3494
- Remove unused phony rule from Makefile by @bryant1410 in #3483
- fix: 🐛 pass token when retrieving the split names by @severo in #3545
- Pin torchmetrics to fix the COMET test by @lhoestq in #3589
- Preserve encoding/decoding with features in Iterable.map call by @mariosasko in #3556
New Contributors
- @apergo-ai made their first contribution in #3487
- @bryant1410 made their first contribution in #3486
- @meg-huggingface made their first contribution in #3527
- @aajanki made their first contribution in #3519
- @Sorrow321 made their first contribution in #3565
- @jthickstun made their first contribution in #3573
- @borgr made their first contribution in #3588
Full Changelog: 1.17.0...1.18.0