Datasets Features
- Support remote data files #2616 (@albertvillanova)
This allows passing URLs of remote data files to any dataset loader:
load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]})
This works for all these dataset loaders:
- text
- csv
- json
- parquet
- pandas
- Streaming from remote text/json/csv/parquet/pandas files:
When you pass URLs to a dataset loader, you can enable streaming mode with streaming=True.
- Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
- Delete extracted files when loading dataset #2631 (@albertvillanova)
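The remote data files and streaming features above can be sketched as follows. This is a minimal illustration, not code from the release: the base URL and file names are hypothetical placeholders, and `build_data_files`, `load_remote_csv`, and `demo` are helper names introduced only for this example. Only `load_dataset("csv", data_files=..., streaming=...)` comes from the notes themselves.

```python
# Hypothetical location of the CSV files; replace with real, reachable URLs.
BASE_URL = "https://example.com/data"


def build_data_files(base_url, files_per_split):
    """Map each split name to a list of full URLs for that split."""
    return {
        split: [f"{base_url}/{name}" for name in names]
        for split, names in files_per_split.items()
    }


data_files = build_data_files(
    BASE_URL,
    {"train": ["train_0.csv", "train_1.csv"], "test": ["test.csv"]},
)


def load_remote_csv(data_files, streaming=False):
    """Load remote CSV files, optionally in streaming mode."""
    # Imported lazily so build_data_files works without the library installed.
    from datasets import load_dataset

    return load_dataset("csv", data_files=data_files, streaming=streaming)


def demo():
    """Not called automatically: needs network access and real URLs."""
    dataset = load_remote_csv(data_files)  # downloads and caches the files
    streamed = load_remote_csv(data_files, streaming=True)  # iterates lazily
    print(dataset)
    print(next(iter(streamed["train"])))
```

Calling `demo()` requires the `datasets` package and URLs that actually resolve. The same pattern should apply to the other loaders listed above (text, json, parquet, pandas) by changing the first argument.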
Datasets Changes
- Fix: C4 - fix expected files list #2682 (@lhoestq)
- Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
- Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
- Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
- Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
- Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
- Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)
Dataset Tasks
- Add speech processing tasks #2620 (@lewtun)
- Update ASR tags #2633 (@lewtun)
- Inject ASR template for lj_speech dataset #2634 (@albertvillanova)
- Add ASR task for SUPERB #2619 (@lewtun)
- Add image-classification task template #2632 (@nateraw)
Metrics Changes
- New: wiki_split #2623 (@bhadreshpsavani)
- Update: accuracy, f1, precision, recall - Support multilabel metrics #2589 (@albertvillanova)
- Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)
General improvements and bug fixes
- Fix BibTeX entry #2594 (@albertvillanova)
- Fix test_is_small_dataset #2588 (@albertvillanova)
- Remove import of transformers #2602 (@albertvillanova)
- Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
- Fix filter with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
- Remove redundant prepare_module #2597 (@albertvillanova)
- Create ExtractManager #2295 (@albertvillanova)
- Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
- Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
- Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
- Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
- Use correct logger in metrics.py #2626 (@mariosasko)
- Minor fix tests with Windows paths #2627 (@albertvillanova)
- Use ETag of remote data files #2628 (@albertvillanova)
- More consistent naming #2611 (@mariosasko)
- Refactor patching to specific submodule #2639 (@albertvillanova)
- Fix docstrings #2640 (@albertvillanova)
- Fix anchor in README #2647 (@mariosasko)
- Fix logging docstring #2652 (@mariosasko)
- Allow dataset config kwargs to be None #2659 (@lhoestq)
- Use prefix to allow exceeding Windows MAX_PATH #2621 (@albertvillanova)
- Use tqdm from tqdm_utils #2667 (@mariosasko)
- Increase json reader block_size automatically #2676 (@lhoestq)
- Parallelize ETag requests #2675 (@lhoestq)
- Fix bad config ids that name cache directories #2686 (@lhoestq)
- Minor documentation fix #2687 (@slowwavesleep)
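Several entries above (#2612, #2613, #2614) concern returning built-in Python floats from metrics rather than NumPy scalars. The sketch below illustrates the general idea with `ndarray.item`; it is not the library's actual code, and `to_python_float` is a hypothetical helper name used only here.

```python
import numpy as np


def to_python_float(value):
    """Convert a NumPy scalar, 0-d array, or 1-element array to a built-in float."""
    # ndarray.item() returns the single element as a standard Python scalar.
    return float(np.asarray(value).item())


# np.mean returns a NumPy scalar (numpy.float64), not a built-in float.
score = np.mean([1.0, 0.0, 1.0, 1.0])
print(type(score).__name__)
print(type(to_python_float(score)).__name__)
```

Returning plain floats keeps metric outputs independent of NumPy types, which tends to simplify serialization and exact type checks in downstream code.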
Dataset Cards
- Add missing WikiANN language tags #2610 (@albertvillanova)
- feat: 🎸 add paperswithcode id for qasper dataset #2680 (@severo)
Docs
- Update processing.rst with other export formats #2599 (@TevenLeScao)