Datasets Features
- Parallel implementation of to_tf_dataset() by @Rocketknight1 in #5377
  - Pass num_workers= to .to_tf_dataset() to make your dataset faster with multiprocessing
- Distributed support by @lhoestq in #5369
  - Split your dataset for each node for distributed training
  - It supports both Dataset and IterableDataset (e.g. in streaming mode)
  - See the documentation for more details

      import os
      from datasets.distributed import split_dataset_by_node

      # RANK and WORLD_SIZE are expected to be set by the distributed launcher
      rank = int(os.environ["RANK"])
      world_size = int(os.environ["WORLD_SIZE"])
      ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
- Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in #5400
- Tqdm progress bar for to_parquet by @zanussbaum in #5456
- ZIP files support in iter_archive with better compression type check by @Mehdi2402 in #3379
- Support other formats than uint8 for image arrays by @vigsterkr in #5365
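
To make the distributed support item above more concrete, here is a minimal pure-Python sketch of what "splitting a dataset for each node" can mean: every rank gets its own disjoint, contiguous range of example indices. The helper `split_indices_by_node` is hypothetical and exists only for illustration. It is not the implementation of `split_dataset_by_node`, and the library's actual assignment strategy may differ (in particular for `IterableDataset`).

```python
def split_indices_by_node(num_examples: int, rank: int, world_size: int) -> range:
    """Hypothetical helper: contiguous index range handled by one node."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Spread the remainder over the first nodes so chunk sizes differ by at most one.
    base, extra = divmod(num_examples, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# Illustrative setup (assumption): 10 examples shared by 3 nodes.
chunks = [list(split_indices_by_node(10, rank, 3)) for rank in range(3)]
for rank, chunk in enumerate(chunks):
    print(f"rank {rank}: {chunk}")
```

The point of the sketch is the invariant: the per-node chunks do not overlap and together cover the whole dataset, so no example is processed twice or skipped during a pass.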
Documentation
- Depth estimation dataset guide by @sayakpaul in #5379
- Imagefolder docs: mention support of CSV and ZIP by @lhoestq in #5463
- Update docs of S3 filesystem with async aiobotocore by @maheshpec in #5411
General improvements and bug fixes
- Raise error if ClassLabel names is not python list by @freddyheppell in #5359
- Temporarily pin pydantic test dependency by @albertvillanova in #5395
- Unpin pydantic test dependency by @albertvillanova in #5397
- Replace one letter import in docs by @MKhalusova in #5403
- Fix Colab notebook link by @albertvillanova in #5392
- Fix fs.open resource leaks by @tkukurin in #5358
- Fix deprecation warning when use_auth_token passed to download_and_prepare by @albertvillanova in #5409
- Fix streaming pandas.read_excel by @albertvillanova in #5372
- ci: 🎡 remove two obsolete issue templates by @severo in #5420
- Handle 0-dim tensors in cast_to_python_objects by @mariosasko in #5384
- Fix CI by temporarily pinning apache-beam < 2.44.0 by @albertvillanova in #5429
- Fix CI benchmarks by temporarily pinning Docker image version by @albertvillanova in #5432
- Revert container image pin in CI benchmarks by @0x2b3bfa0 in #5436
- Finish deprecating the fs argument by @dconathan in #5393
- Update actions/checkout in CD Conda release by @albertvillanova in #5438
- Fix RuntimeError: Sharding is ambiguous for this dataset by @albertvillanova in #5416
- Fix documentation about batch samplers by @thomasw21 in #5440
- Fix CI by temporarily pinning fsspec < 2023.1.0 by @albertvillanova in #5447
- Support fsspec 2023.1.0 in CI by @albertvillanova in #5449
- Update share tutorial by @stevhliu in #5443
- Swap log messages for symbolic/hard links in tar extractor by @albertvillanova in #5452
- Fix base directory while extracting insecure TAR files by @albertvillanova in #5453
- Fix link in load_dataset docstring by @mariosasko in #5389
- Document that removing all the columns returns an empty document and the num_row is lost by @thomasw21 in #5460
- Concatenate on axis=1 with misaligned blocks by @lhoestq in #5462
- Raise from disconnect error in xopen by @lhoestq in #5382
- remove pathlib.Path with URIs by @jonny-cyberhaven in #5466
- Remove deprecated shard_size arg from .push_to_hub() by @polinaeterna in #5469
New Contributors
- @freddyheppell made their first contribution in #5359
- @MKhalusova made their first contribution in #5403
- @tkukurin made their first contribution in #5358
- @0x2b3bfa0 made their first contribution in #5436
- @maheshpec made their first contribution in #5411
- @dconathan made their first contribution in #5393
- @zanussbaum made their first contribution in #5456
- @jonny-cyberhaven made their first contribution in #5466
Full Changelog: 2.8.0...2.9.0