huggingface/datasets 2.10.0 on GitHub

Important

Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
- Big improvements on the speed of .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
Skip dataset verifications by default by @mariosasko in #5303
- introduces multiple verification_mode values you can pass to `load_dataset()`
- the new default verification steps are much faster (no need to compute expensive checksums)

Datasets features

Single TQDM bar in multi-proc map by @mariosasko in #5455
- No more stacked TQDM bars when calling .map() in multiprocessing
Map-style Dataset to IterableDataset by @lhoestq in #5410
- introduces .to_iterable_dataset() to get an IterableDataset from a Dataset
- see all the advantages of IterableDataset in the documentation about the differences between Dataset and IterableDataset
Select columns of Dataset or DatasetDict by @daskol in #5480
- introduces .select_columns() to return a dataset only containing the requested columns
Added functionality: sort datasets by multiple keys by @MichlF in #5502
- introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
Add JAX device selection when formatting by @alvarobartt in #5547
- introduces ds = ds.with_format("jax", device=device)
Reload features from Parquet metadata by @MFreidank in #5516
Speed up batched PyTorch DataLoader by @lhoestq in #5512

Documentation

Add section in tutorial for IterableDataset by @stevhliu in #5485
- https://huggingface.co/docs/datasets/main/en/access#iterabledataset
Tutorial for creating a dataset by @stevhliu in #5540
- https://huggingface.co/docs/datasets/main/en/create_dataset
Add JAX-formatting documentation by @alvarobartt in #5535
- https://huggingface.co/docs/datasets/main/en/use_with_jax

General improvements and bug fixes

Pin sqlalchemy by @lhoestq in #5476
Update dataset card creation by @stevhliu in #5470
Add num_test_batches option by @amyeroberts in #5471
Tip for recomputing metadata by @stevhliu in #5478
Disable aiohttp requoting of redirection URL by @albertvillanova in #5459
[MINOR] Typo by @cakiki in #5491
Pin dill lower version by @albertvillanova in #5489
Improved error message for gated/private repos by @osanseviero in #5497
Update docs for nyu_depth_v2 dataset by @awsaf49 in #5484
don't zero copy timestamps by @dwyatte in #5504
Remove unused load_from_cache_file arg from Dataset.shard() docstring by @polinaeterna in #5493
Do not add index column by default when exporting to CSV by @albertvillanova in #5490
Fix bug when casting empty array to class labels by @marioga in #5521
Fix benchmarks CI - pin protobuf by @lhoestq in #5527
Remove py.typed by @mariosasko in #5518
Add missing license in NumpyFormatter by @alvarobartt in #5530
Unify load_from_cache_file type and logic by @HallerPatrick in #5515
Format code with ruff by @mariosasko in #5519
Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in #5522
Resolve four broken refs in the docs by @tomaarsen in #5550
Use default audio resampling type by @lhoestq in #5556
- resampy is no longer needed to resample audio data
improved message error row formatting by @Plutone11011 in #5553
Make tiktoken tokenizers hashable by @mariosasko in #5552
Suggest scikit-learn instead of sklearn by @osbm in #5551
Add filter desc by @lhoestq in #5557
Fix map suffix_template by @lhoestq in #5559
Ensure last tqdm update in map by @mariosasko in #5560

New Contributors

@amyeroberts made their first contribution in #5471
@awsaf49 made their first contribution in #5484
@dwyatte made their first contribution in #5504
@marioga made their first contribution in #5521
@MFreidank made their first contribution in #5516
@daskol made their first contribution in #5480
@Plutone11011 made their first contribution in #5553
@osbm made their first contribution in #5551
@MichlF made their first contribution in #5502

Full Changelog: 2.9.0...ef