Dataset Features
- Add concatenate_datasets for iterable datasets by @lhoestq in #4500
- Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in #4625
- Support using PCM audio files (#4323) by @YooSungHyun in #4409
- [data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in #4633
- Support extract 7-zip compressed data files by @albertvillanova in #4672
- Support extract lz4 compressed data files by @albertvillanova in #4700
- Support metadata.jsonl from parent directories in imagefolder by @mariosasko in #4576
Dataset changes
- Update: allocine - Support streaming by @albertvillanova in #4563
- Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in #4585
- Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in #4586
- Update: financial_phrasebank - Host data on the Hub by @albertvillanova in #4598
- Update: cfq - Support streaming by @albertvillanova in #4579
- Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in #4588
- Update: bookcorpus - Support streaming dataset by @albertvillanova in #4564
- Update: fever - Refactor and add metadata by @albertvillanova in #4503
- Update: mlsum - Support streaming dataset by @albertvillanova in #4574
- Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in #4523
- Fix: conll2003 - fix empty example by @lhoestq in #4662
- Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in #4554
- Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in #4706
- Fix: crd3 - fix splits that were containing the same data by @lhoestq in #4705
Dataset Cards
- Add action names in schema_guided_dstc8 dataset card by @lhoestq in #4559
- Add evaluation data to acronym_identification by @lewtun in #4561
- Update WinoBias README by @sashavor in #4631
- Support "tags" yaml tag by @lhoestq in #4716
- Fix POS tags by @lhoestq in #4715
- AESLC dataset: Add summarization tags by @hobson in #4517
Documentation
- Update docs around audio and vision by @stevhliu in #4440
- Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in #4513
- Remove multiple config section by @stevhliu in #4600
- Create new sections for audio and vision in guides by @stevhliu in #4519
- Document installation of sox OS dependency for audio by @albertvillanova in #4713
General improvements and bug fixes
- Add regression test for ArrowWriter.write_batch when batch is empty by @alvarobartt in #4510
- Support all negative values in ClassLabel by @lhoestq in #4511
- Add uppercased versions of image file extensions for automatic module inference by @mariosasko in #4515
- Patch tests for hfh v0.8.0 by @LysandreJik in #4518
- Replace deprecated logging.warn with logging.warning by @hugovk in #4539
- [CI] Fix upstream hub test url by @lhoestq in #4543
- Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in #4541
- [CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in #4546
- Tell users to upload on the hub directly by @lhoestq in #4552
- Add batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in #4535
- Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in #4545
- Properly raise FileNotFound even if the dataset is private by @lhoestq in #4536
- Fix hashing for python 3.9 by @lhoestq in #4516
- [CI] Fix some warnings by @lhoestq in #4547
- Validate new_fingerprint passed by user by @lhoestq in #4587
- Update CI Windows orb by @albertvillanova in #4604
- Perform hidden file check on relative data file path by @mariosasko in #4551
- Align more metadata with other repo types (models,spaces) by @julien-c in #4607
- Align/fix license metadata info by @julien-c in #4613
- Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in #4611
- Add authentication tip to load_dataset by @mariosasko in #4577
- Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in #4553
- fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in #4630
- Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in #4608
- Rename master to main by @lhoestq in #4643
- Set HF_SCRIPTS_VERSION to main by @lhoestq in #4645
- [Minor fix] Typo correction by @cakiki in #4644
- fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in #4627
- Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in #4590
- Fix time type _arrow_to_datasets_dtype conversion by @mariosasko in #4628
- Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in #4660
- Replace assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in #4496
- Fix embed_storage on features inside lists/sequences by @mariosasko in #4615
- Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in #4512
- Transfer CI to GitHub Actions by @albertvillanova in #4659
- Fix mock fsspec by @albertvillanova in #4685
- Trigger CI also on push to main by @albertvillanova in #4687
- Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in #4622
- Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in #4688
- Test extractors for all compression formats by @albertvillanova in #4689
- Refactor base extractors by @albertvillanova in #4690
- Update create dataset card docs by @stevhliu in #4683
- Add text decorators by @stevhliu in #4663
- Skip tests only for lz4/zstd params if not installed by @albertvillanova in #4704
- Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in #4614
- Docs: Fix same-page hashlinks by @mishig25 in #4722
- Fix broken link to the Hub by @stevhliu in #4726
- Refactor conftest fixtures by @albertvillanova in #4723
- Add object detection processing tutorial by @nateraw in #4710
- Fix require torchaudio and refactor test requirements by @albertvillanova in #4708
- docs: ✏️ fix TranslationVariableLanguages example by @severo in #4731
- Pin rouge_score test dependency by @albertvillanova in #4735
- Fix named split sorting and remove unnecessary casting by @albertvillanova in #4714
- Make cast in from_pandas more robust by @mariosasko in #4703
- Make Extractor accept Path as input by @albertvillanova in #4718
- Refactor Hub tests by @albertvillanova in #4729
- Fix to dict conversion of DatasetInfo/Features by @mariosasko in #4741
New Contributors
- @hugovk made their first contribution in #4539
- @VijayKalmath made their first contribution in #4545
- @gugarosa made their first contribution in #4630
- @benlipkin made their first contribution in #4627
- @YooSungHyun made their first contribution in #4409
- @hobson made their first contribution in #4517
- @khushmeeet made their first contribution in #4554
- @dtuit made their first contribution in #4614
Full Changelog: 2.3.2...2.4.0