Important
- Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
  - Big improvements on the speed of `.flatten_indices()` (x2) + `save/load_from_disk` (x100) on selected/shuffled datasets
- Skip dataset verifications by default by @mariosasko in #5303
  - introduces multiple `verification_mode` values you can pass to `load_dataset()`:
    - the new default verification steps are much faster (no need to compute expensive checksums)
Datasets features
- Single TQDM bar in multi-proc map by @mariosasko in #5455
  - No more stacked TQDM bars when calling `.map()` in multiprocessing
- Map-style Dataset to IterableDataset by @lhoestq in #5410
  - introduces `.to_iterable_dataset()` to get an `IterableDataset` from a `Dataset`
  - see all the advantages of `IterableDataset` in the documentation about the differences between Dataset and IterableDataset
- Select columns of Dataset or DatasetDict by @daskol in #5480
  - introduces `.select_columns()` to return a dataset only containing the requested columns
- Added functionality: sort datasets by multiple keys by @MichlF in #5502
  - introduces `ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])`
- Add JAX device selection when formatting by @alvarobartt in #5547
  - introduces `ds = ds.with_format("jax", device=device)`
- Reload features from Parquet metadata by @MFreidank in #5516
- Speed up batched PyTorch DataLoader by @lhoestq in #5512
Documentation
- Add section in tutorial for IterableDataset by @stevhliu in #5485
- Tutorial for creating a dataset by @stevhliu in #5540
- Add JAX-formatting documentation by @alvarobartt in #5535
General improvements and bug fixes
- Pin sqlalchemy by @lhoestq in #5476
- Update dataset card creation by @stevhliu in #5470
- Add num_test_batches option by @amyeroberts in #5471
- Tip for recomputing metadata by @stevhliu in #5478
- Disable aiohttp requoting of redirection URL by @albertvillanova in #5459
- [MINOR] Typo by @cakiki in #5491
- Pin dill lower version by @albertvillanova in #5489
- Improved error message for gated/private repos by @osanseviero in #5497
- Update docs for `nyu_depth_v2` dataset by @awsaf49 in #5484
- don't zero copy timestamps by @dwyatte in #5504
- Remove unused `load_from_cache_file` arg from `Dataset.shard()` docstring by @polinaeterna in #5493
- Do not add index column by default when exporting to CSV by @albertvillanova in #5490
- Fix bug when casting empty array to class labels by @marioga in #5521
- Fix benchmarks CI - pin protobuf by @lhoestq in #5527
- Remove py.typed by @mariosasko in #5518
- Add missing license in `NumpyFormatter` by @alvarobartt in #5530
- Unify `load_from_cache_file` type and logic by @HallerPatrick in #5515
- Format code with `ruff` by @mariosasko in #5519
- Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in #5522
- Resolve four broken refs in the docs by @tomaarsen in #5550
- Use default audio resampling type by @lhoestq in #5556
  - resampy is no longer needed to resample audio data
- improved message error row formatting by @Plutone11011 in #5553
- Make tiktoken tokenizers hashable by @mariosasko in #5552
- Suggest scikit-learn instead of sklearn by @osbm in #5551
- Add filter desc by @lhoestq in #5557
- Fix map suffix_template by @lhoestq in #5559
- Ensure last tqdm update in map by @mariosasko in #5560
New Contributors
- @amyeroberts made their first contribution in #5471
- @awsaf49 made their first contribution in #5484
- @dwyatte made their first contribution in #5504
- @marioga made their first contribution in #5521
- @MFreidank made their first contribution in #5516
- @daskol made their first contribution in #5480
- @Plutone11011 made their first contribution in #5553
- @osbm made their first contribution in #5551
- @MichlF made their first contribution in #5502
Full Changelog: 2.9.0...ef