github huggingface/datasets 2.10.0

20 months ago

Important

  • Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
    • Big improvements on the speed of .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
  • Skip dataset verifications by default by @mariosasko in #5303
    • introduces multiple verification_mode values you can pass to load_dataset()
    • the new default verification steps are much faster (no need to compute expensive checksums)

Datasets features

  • Single TQDM bar in multi-proc map by @mariosasko in #5455
    • No more stacked TQDM bars when calling .map() in multiprocessing
  • Map-style Dataset to IterableDataset by @lhoestq in #5410
  • Select columns of Dataset or DatasetDict by @daskol in #5480
    • introduces .select_columns() to return a dataset only containing the requested columns
  • Added functionality: sort datasets by multiple keys by @MichlF in #5502
    • introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
  • Add JAX device selection when formatting by @alvarobartt in #5547
    • introduces ds = ds.with_format("jax", device=device)
  • Reload features from Parquet metadata by @MFreidank in #5516
  • Speed up batched PyTorch DataLoader by @lhoestq in #5512

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: 2.9.0...ef
