datasets 2.4.0 on Python PyPI

Dataset Features

Add concatenate_datasets for iterable datasets by @lhoestq in #4500
Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in #4625
Support using PCM audio files (#4323) by @YooSungHyun in #4409
[data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in #4633
Support extracting 7-zip compressed data files by @albertvillanova in #4672
Support extracting lz4 compressed data files by @albertvillanova in #4700
Support metadata.jsonl from parent directories in imagefolder by @mariosasko in #4576
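
The data_files disambiguation entry above (#4633) says split names are matched only when they sit between separators. The following is a minimal sketch of that idea, not the library's actual implementation: the function name infer_split, the keyword table, and the choice of "any non-alphanumeric character" as a separator are all assumptions made for illustration.

```python
import re

# Hypothetical keyword table for illustration; the library's real list may differ.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "test": ["test", "testing", "eval", "evaluation"],
    "validation": ["validation", "valid", "dev", "val"],
}

# Assumed separator set: any character that is not an ASCII letter or digit.
SEP = r"[^a-zA-Z0-9]"


def infer_split(filename):
    """Return the split whose keyword appears between separators, else None."""
    for split, keywords in SPLIT_KEYWORDS.items():
        for keyword in keywords:
            # The keyword must be bounded by a separator or by the start/end
            # of the string, so "test" matches "my_test.csv" but not
            # "contest.csv" or "latest_results.csv".
            pattern = rf"(^|{SEP}){re.escape(keyword)}($|{SEP})"
            if re.search(pattern, filename):
                return split
    return None


if __name__ == "__main__":
    for name in ["train.csv", "data/my_test_file.csv", "contest.csv", "dev.jsonl"]:
        print(name, "->", infer_split(name))
```

Requiring a separator on both sides is what keeps a file such as contest.csv from being assigned to the test split merely because its name contains the substring "test".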

Dataset changes

Update: allocine - Support streaming by @albertvillanova in #4563
Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in #4585
Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in #4586
Update: financial_phrasebank - Host data on the Hub by @albertvillanova in #4598
Update: cfq - Support streaming by @albertvillanova in #4579
Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in #4588
Update: bookcorpus - Support streaming dataset by @albertvillanova in #4564
Update: fever - Refactor and add metadata by @albertvillanova in #4503
Update: mlsum - Support streaming dataset by @albertvillanova in #4574
Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in #4523
Fix: conll2003 - fix empty example by @lhoestq in #4662
Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in #4554
Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in #4706
Fix: crd3 - fix splits that were containing the same data by @lhoestq in #4705

Dataset Cards

Add action names in schema_guided_dstc8 dataset card by @lhoestq in #4559
Add evaluation data to acronym_identification by @lewtun in #4561
Update WinoBias README by @sashavor in #4631
Support "tags" yaml tag by @lhoestq in #4716
Fix POS tags by @lhoestq in #4715
AESLC dataset: Add summarization tags by @hobson in #4517

Documentation

Update docs around audio and vision by @stevhliu in #4440
Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in #4513
Remove multiple config section by @stevhliu in #4600
Create new sections for audio and vision in guides by @stevhliu in #4519
Document installation of sox OS dependency for audio by @albertvillanova in #4713

General improvements and bug fixes

Add regression test for ArrowWriter.write_batch when batch is empty by @alvarobartt in #4510
Support all negative values in ClassLabel by @lhoestq in #4511
Add uppercased versions of image file extensions for automatic module inference by @mariosasko in #4515
Patch tests for hfh v0.8.0 by @LysandreJik in #4518
Replace deprecated logging.warn with logging.warning by @hugovk in #4539
[CI] Fix upstream hub test url by @lhoestq in #4543
Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in #4541
[CI] Fix seqeval install in CI by pinning setuptools-scm by @lhoestq in #4546
Tell users to upload on the hub directly by @lhoestq in #4552
Add batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in #4535
Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in #4545
Properly raise FileNotFound even if the dataset is private by @lhoestq in #4536
Fix hashing for python 3.9 by @lhoestq in #4516
[CI] Fix some warnings by @lhoestq in #4547
Validate new_fingerprint passed by user by @lhoestq in #4587
Update CI Windows orb by @albertvillanova in #4604
Perform hidden file check on relative data file path by @mariosasko in #4551
Align more metadata with other repo types (models,spaces) by @julien-c in #4607
Align/fix license metadata info by @julien-c in #4613
Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in #4611
Add authentication tip to load_dataset by @mariosasko in #4577
Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in #4553
fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in #4630
Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in #4608
Rename master to main by @lhoestq in #4643
Set HF_SCRIPTS_VERSION to main by @lhoestq in #4645
[Minor fix] Typo correction by @cakiki in #4644
Fix duplicate calculation of spearmanr function in metrics wrapper by @benlipkin in #4627
Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in #4590
Fix time type _arrow_to_datasets_dtype conversion by @mariosasko in #4628
Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in #4660
Replace assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in #4496
Fix embed_storage on features inside lists/sequences by @mariosasko in #4615
Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in #4512
Transfer CI to GitHub Actions by @albertvillanova in #4659
Fix mock fsspec by @albertvillanova in #4685
Trigger CI also on push to main by @albertvillanova in #4687
Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in #4622
Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in #4688
Test extractors for all compression formats by @albertvillanova in #4689
Refactor base extractors by @albertvillanova in #4690
Update create dataset card docs by @stevhliu in #4683
Add text decorators by @stevhliu in #4663
Skip tests only for lz4/zstd params if not installed by @albertvillanova in #4704
Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in #4614
Docs: Fix same-page hashlinks by @mishig25 in #4722
Fix broken link to the Hub by @stevhliu in #4726
Refactor conftest fixtures by @albertvillanova in #4723
Add object detection processing tutorial by @nateraw in #4710
Fix require torchaudio and refactor test requirements by @albertvillanova in #4708
docs: ✏️ fix TranslationVariableLanguages example by @severo in #4731
Pin rouge_score test dependency by @albertvillanova in #4735
Fix named split sorting and remove unnecessary casting by @albertvillanova in #4714
Make cast in from_pandas more robust by @mariosasko in #4703
Make Extractor accept Path as input by @albertvillanova in #4718
Refactor Hub tests by @albertvillanova in #4729
Fix to dict conversion of DatasetInfo/Features by @mariosasko in #4741

New Contributors

@hugovk made their first contribution in #4539
@VijayKalmath made their first contribution in #4545
@gugarosa made their first contribution in #4630
@benlipkin made their first contribution in #4627
@YooSungHyun made their first contribution in #4409
@hobson made their first contribution in #4517
@khushmeeet made their first contribution in #4554
@dtuit made their first contribution in #4614

Full Changelog: 2.3.2...2.4.0