Important
- [GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on
Datasets features
- Add ability to read-write to SQL databases. by @Dref360 in #4928
- Read from sqlite file:

      from datasets import Dataset
      dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
- Allow connection objects in `from_sql` + small doc improvement by @mariosasko in #5091

      from datasets import Dataset
      from sqlite3 import connect
      con = connect(...)
      dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
- Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072
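- return numpy/torch/tf/jax tensors with `with_format`:

      from datasets import load_dataset
      # select a split so that ds[0] indexes a Dataset rather than a DatasetDict
      ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
      ds[0]["image"]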
- Added `IterableDataset.from_generator` by @hamid-vakilzadeh in #5052
- Fast dataset iter by @mariosasko in #5030
- speed up by a factor of 2 using the Arrow Table reader
- Dataset infos in yaml by @lhoestq in #4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
- Add `kwargs` to `Dataset.from_generator` by @mariosasko in #5049
- Support `converters` in `CsvBuilder` by @mariosasko in #5057
- Restore saved format state in `load_from_disk` by @asofiaoliveira in #5073
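
For readers who want to try the connection-object form of `from_sql` end to end, here is a minimal sketch. It assumes an in-memory SQLite database; the table contents and the helper names `build_demo_connection` and `load_long_texts` are hypothetical and only serve the illustration. The `datasets` import is deferred to the helper so that the SQLite part runs on its own.

```python
import sqlite3

# Same query shape as in the notes above, against a hypothetical table.
QUERY = "SELECT text FROM data_table WHERE length(text) > 100 LIMIT 10"


def build_demo_connection():
    """Create an in-memory SQLite database with one short and two long rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data_table (text TEXT)")
    con.executemany(
        "INSERT INTO data_table (text) VALUES (?)",
        [("short",), ("x" * 150,), ("y" * 200,)],
    )
    con.commit()
    return con


def load_long_texts(con, query=QUERY):
    """Pass an open connection and a SQL query to Dataset.from_sql."""
    # Imported lazily: requires the `datasets` package (2.6.0 or later).
    from datasets import Dataset

    return Dataset.from_sql(query, con)


con = build_demo_connection()
# Plain sqlite3 check of what the query selects before handing it to datasets.
rows = con.execute(QUERY).fetchall()
```

With `datasets` installed, `load_long_texts(con)` should return a `Dataset` with a single `text` column holding the rows selected by the query.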
Dataset changes
- Update: hendrycks_test - support streaming by @albertvillanova in #5041
- Update: swiss judgment prediction by @JoelNiklaus in #5019
- Update swiss judgment prediction by @JoelNiklaus in #5042
- Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in #5022
- Fix: sbu_captions - fix URLs by @donglixp in #5020
- Fix: xcsr - fix string features by @albertvillanova in #5024
- Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in #5040
- Fix: cats_vs_dogs - fix number of samples by @lhoestq in #5047
- Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in #5048
- Fix: msr_sqa - fix dataset generation by @Timothyxxx in #3715
Dataset cards
- Add description to hellaswag dataset by @julien-c in #4810
- Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in #5010
- Update languages in aeslc dataset card by @apergo-ai in #3357
- Update license to bookcorpus dataset card by @meg-huggingface in #3526
- Update paper link in medmcqa dataset card by @monk1337 in #4290
- Add oversampling strategy iterable datasets interleave by @ylacombe in #5036
- Fix license/citation information of squadshifts dataset card by @albertvillanova in #5054
General improvements and bug fixes
- Fix missing use_auth_token in streaming docstrings by @albertvillanova in #5003
- Add some note about running the transformers ci before a release by @lhoestq in #5007
- Remove license tag file and validation by @albertvillanova in #5004
- Re-apply input columns change by @mariosasko in #5008
- patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in #5026
- Fix typo in error message by @severo in #5027
- Fix import in `ClassLabel` docstring example by @alvarobartt in #5029
- Remove redundant code from some dataset module factories by @albertvillanova in #5033
- Fix typos in load docstrings and comments by @albertvillanova in #5035
- Prefer split patterns from directories over split patterns from filenames by @polinaeterna in #4985
- Fix tar extraction vuln by @lhoestq in #5016
- Support hfh 0.10 implicit auth by @lhoestq in #5031
- Fix `flatten_indices` with empty indices mapping by @mariosasko in #5043
- Improve CI performance speed of PackagedDatasetTest by @albertvillanova in #5037
- Revert task removal in folder-based builders by @mariosasko in #5051
- Fix backward compatibility for dataset_infos.json by @lhoestq in #5055
- Fix typo by @stevhliu in #5059
- Fix CI hfh token warning by @albertvillanova in #5062
- Mark CI tests as xfail when 502 error by @albertvillanova in #5058
- Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in #5077
- Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in #5067
- Fix header level in Audio docs by @stevhliu in #5078
- Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in #5071
- Support streaming gzip.open by @albertvillanova in #5066
- adding keep in memory by @Mustapha-AJEGHRIR in #5082
- refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in #5079
- fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in #5076
- Align signature of list_repo_files with latest hfh by @albertvillanova in #5063
- Align signature of create/delete_repo with latest hfh by @albertvillanova in #5064
- Fix filter with empty indices by @Mouhanedg56 in #5087
- Fix tutorial (#5093) by @riccardobucco in #5095
- Use HTML relative paths for tiles in the docs by @lewtun in #5092
- Fix loading how to guide (#5102) by @riccardobucco in #5104
- url encode hub url (#5099) by @riccardobucco in #5103
- Free the "hf" filesystem protocol for `hffs` by @lhoestq in #5101
- Fix task template reload from dict by @lhoestq in #5106
New Contributors
- @Wauplin made their first contribution in #5026
- @donglixp made their first contribution in #5020
- @Timothyxxx made their first contribution in #3715
- @hamid-vakilzadeh made their first contribution in #5052
- @Mustapha-AJEGHRIR made their first contribution in #5082
- @galbwe made their first contribution in #5079
- @rahulXs made their first contribution in #5076
- @Mouhanedg56 made their first contribution in #5087
- @riccardobucco made their first contribution in #5095
- @asofiaoliveira made their first contribution in #5073
Full Changelog: 2.5.1...2.6.0