huggingface/datasets 1.10.0 on GitHub

Datasets Features

Support remote data files #2616 (@albertvillanova)
This allows passing URLs of remote data files to any dataset loader:
```
from datasets import load_dataset
load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file, ...]})
```
This works for all these dataset loaders:
- text
- csv
- json
- parquet
- pandas
Streaming from remote text/json/csv/parquet/pandas files:
When you pass URLs to a dataset loader, you can enable streaming mode with streaming=True. Main contributions:
- Streaming for the Pandas loader #2636 (@lhoestq)
- Streaming for the CSV loader #2635 (@lhoestq)
- Streaming for the Json loader #2608 (@albertvillanova) #2638 (@lhoestq)
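
A minimal sketch of streaming from a remote file, assuming network access and a reachable URL. The helper names `take` and `stream_first_rows` are hypothetical and not part of the library; only `load_dataset`, `data_files` and `streaming=True` come from the notes above.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(iterable, n))


def stream_first_rows(url, n=5, loader="csv"):
    """Stream the first n examples of a remote data file (needs network access)."""
    # Imported lazily so this sketch can be imported without `datasets` installed.
    from datasets import load_dataset

    # With streaming=True the loader returns an iterable dataset that yields
    # examples while iterating, so only the first n are materialized here.
    dataset = load_dataset(
        loader,
        data_files={"train": [url]},
        split="train",
        streaming=True,
    )
    return take(dataset, n)
```

Calling `stream_first_rows(some_csv_url)` would then return a list of up to five example dicts, where `some_csv_url` stands for any CSV URL you have access to.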
Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
Delete extracted files when loading dataset #2631 (@albertvillanova)
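
The threaded search_batch entry above can be illustrated with a generic sketch. This is not the code from #2581: the `search` callable and the `max_workers` parameter are assumptions made for illustration, and the point is only that independent, I/O-bound queries can be issued concurrently while keeping the result order.

```python
from concurrent.futures import ThreadPoolExecutor


def search_batch(search, queries, k=10, max_workers=10):
    """Run search(query, k) for every query in a thread pool.

    `search` is a hypothetical callable that performs one (I/O-bound) query.
    Results are returned in the same order as `queries`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order even if queries finish out of order.
        return list(executor.map(lambda query: search(query, k), queries))
```

Threads help here because each query spends most of its time waiting on the network rather than computing.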

Datasets Changes

Fix: C4 - fix expected files list #2682 (@lhoestq)
Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)

Metrics Changes

New: wiki_split #2623 (@bhadreshpsavani)
Update: accuracy,f1,precision,recall - Support multilabel metrics #2589 (@albertvillanova)
Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)

General improvements and bug fixes

Fix BibTeX entry #2594 (@albertvillanova)
Fix test_is_small_dataset #2588 (@albertvillanova)
Remove import of transformers #2602 (@albertvillanova)
Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
Fix filter with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
Remove redundant prepare_module #2597 (@albertvillanova)
Create ExtractManager #2295 (@albertvillanova)
Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
Use correct logger in metrics.py #2626 (@mariosasko)
Minor fix tests with Windows paths #2627 (@albertvillanova)
Use ETag of remote data files #2628 (@albertvillanova)
More consistent naming #2611 (@mariosasko)
Refactor patching to specific submodule #2639 (@albertvillanova)
Fix docstrings #2640 (@albertvillanova)
Fix anchor in README #2647 (@mariosasko)
Fix logging docstring #2652 (@mariosasko)
Allow dataset config kwargs to be None #2659 (@lhoestq)
Use prefix to allow exceed Windows MAX_PATH #2621 (@albertvillanova)
Use tqdm from tqdm_utils #2667 (@mariosasko)
Increase json reader block_size automatically #2676 (@lhoestq)
Parallelize ETag requests #2675 (@lhoestq)
Fix bad config ids that name cache directories #2686 (@lhoestq)
Minor documentation fix #2687 (@slowwavesleep)
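
For context on the Windows MAX_PATH entry above: Windows accepts absolute paths longer than the legacy limit when they carry the extended-length prefix `\\?\`. The sketch below shows the idea only; the function name is hypothetical, it is not the code from #2621, and it assumes the input is already an absolute path using backslashes.

```python
def to_extended_length_path(path: str) -> str:
    """Add the Windows extended-length prefix to an absolute backslash path."""
    prefix = "\\\\?\\"  # the literal characters \\?\
    if path.startswith(prefix):
        return path  # already prefixed
    if path.startswith("\\\\"):
        # UNC paths (\\server\share\...) use the \\?\UNC\ form instead.
        return prefix + "UNC\\" + path[2:]
    return prefix + path
```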