Important
- Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
  - From now on, datasets that use ClassLabel and are pushed to the Hub will use a new YAML model to store the feature types.
  - The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
  - Old versions of `datasets` are not able to reload datasets pushed with this new model, so we encourage everyone to update.
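
To make the id change concrete, here is a minimal sketch of what the new label name mapping looks like compared with the old one. It is not the library's own code: the label names and the `stringify_label_ids` helper are hypothetical and only illustrate the `0 -> "0"` conversion described above.

```python
# Hypothetical label names, used only for illustration.
old_mapping = {0: "negative", 1: "positive"}


def stringify_label_ids(mapping):
    """Return a copy of an id -> name mapping whose ids are strings."""
    return {str(label_id): name for label_id, name in mapping.items()}


# Same names, but every id is now a string key.
new_mapping = stringify_label_ids(old_mapping)
print(new_mapping)
```

The names are unchanged; only the key type differs, which is why older versions that expect integer ids cannot reload the new metadata.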
Datasets Features
- Fix methods using `IterableDataset.map` that lead to `features=None` by @alvarobartt in #5287
  - Datasets in streaming mode now update their `features` after column renaming or removal
- Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
  - Use multiprocessing to load multiple files in parallel
- Add `features` param to `IterableDataset.map` by @alvarobartt in #5311
- Sharded save_to_disk + multiprocessing by @lhoestq in #5268
  - Pass `num_shards` or `max_shard_size` to `ds.save_to_disk()` or `ds.push_to_hub()`
  - Pass `num_proc` to use multiprocessing.
- Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
- Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
  - You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:

        from datasets import load_dataset
        from torch.utils.data import DataLoader

        ds = load_dataset("c4", "en", streaming=True, split="train")
        dataloader = DataLoader(ds, batch_size=32, num_workers=4)
Docs
General improvements and bug fixes
- typo by @WrRan in #5253
- typo by @WrRan in #5254
- remove an unused statement by @WrRan in #5257
- fix wrong print by @WrRan in #5256
- Fix `max_shard_size` docs by @lhoestq in #5267
- Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in #5266
- Change release procedure to use only pull requests by @albertvillanova in #5250
- Warn about checksums by @lhoestq in #5279
- Tweak readme by @lhoestq in #5210
- Save file name in embed_storage by @lhoestq in #5285
- Use correct dataset type in `from_generator` docs by @mariosasko in #5307
- Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in #5294
- Fix xjoin for Windows pathnames by @albertvillanova in #5297
- Fix xopen for Windows pathnames by @albertvillanova in #5299
- Ci py3.10 by @lhoestq in #5065
- Update Overview.ipynb google colab by @lhoestq in #5211
- Support xPath for Windows pathnames by @albertvillanova in #5310
- Fix description of streaming in the docs by @polinaeterna in #5313
- Fix Text sample_by paragraph by @albertvillanova in #5319
- [Extract] Place the lock file next to the destination directory by @lhoestq in #5320
- Fix loading from HF GCP cache by @lhoestq in #5321
  - This was affecting datasets like `wikipedia` or `natural_questions`
- Fix docs building for main by @albertvillanova in #5328
- Origin/fix missing features error by @eunseojo in #5318
- fix: 🐛 pass the token to get the list of config names by @severo in #5333
- Clarify imagefolder is for small datasets by @stevhliu in #5329
- Close stream in `ArrowWriter.finalize` before inference error by @mariosasko in #5309
- Use same `num_proc` for dataset download and generation by @mariosasko in #5300
- Set `IterableDataset.map` param `batch_size` typing as optional by @alvarobartt in #5336
- fix: dataset path should be absolute by @vigsterkr in #5234
- Clean up DatasetInfo and Dataset docstrings by @stevhliu in #5340
- Clean up docstrings by @stevhliu in #5334
- Remove tasks.json by @lhoestq in #5341
- Support `topdown` parameter in `xwalk` by @mariosasko in #5308
- Improve `use_auth_token` docstring and deprecate `use_auth_token` in `download_and_prepare` by @mariosasko in #5302
- Clean up Loading methods docstrings by @stevhliu in #5350
- Clean up remaining Main Classes docstrings by @stevhliu in #5349
- Clean up Dataset and DatasetDict by @stevhliu in #5344
- Clean up Table class docstrings by @stevhliu in #5355
- Raise error for `.tar` archives in the same way as for `.tar.gz` and `.tgz` in `_get_extraction_protocol` by @polinaeterna in #5322
- Clean filesystem and logging docstrings by @stevhliu in #5356
- ExamplesIterable fixes by @lhoestq in #5366
- Simplify skipping by @Muennighoff in #5373
- Release: 2.8.0 by @lhoestq in #5375
New Contributors
- @WrRan made their first contribution in #5253
- @eunseojo made their first contribution in #5318
- @vigsterkr made their first contribution in #5234
- @Muennighoff made their first contribution in #5373
Full Changelog: 2.7.0...2.8.0