huggingface/datasets 4.1.0

Dataset Features

  • feat: use content defined chunking by @kszucs in #7589

    • pass use_content_defined_chunking=True when writing Parquet files
    • this enables fast deduplicated uploads to Hugging Face!
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    • this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking sets Parquet page boundaries based on the content of the data, so duplicate data is easy to detect.
  • Concurrent push_to_hub by @lhoestq in #7708

  • Concurrent IterableDataset push_to_hub by @lhoestq in #7710

  • HDF5 support by @klamike in #7690

    • load HDF5 datasets in one line of code
    ds = load_dataset("username/dataset-with-hdf5-files")
    • each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
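As a rough illustration of the HDF5 row-mapping rule (a hypothetical sketch, not the actual loader, which reads real HDF5 files and builds Arrow-backed datasets): nested groups flatten into columns and the first dimension indexes rows. The dotted column names and plain-dict "file" below are assumptions made for the example.

```python
def flatten_fields(group, prefix=""):
    """Flatten a nested mapping (stand-in for HDF5 groups) into flat columns.

    Nested groups get dotted names (an assumption for this sketch); leaf
    datasets become columns whose first dimension is the row axis.
    """
    cols = {}
    for name, value in group.items():
        if isinstance(value, dict):                # nested group -> recurse
            cols.update(flatten_fields(value, f"{prefix}{name}."))
        else:                                      # dataset: first dim = rows
            cols[f"{prefix}{name}"] = value
    return cols

def to_rows(columns):
    """Slice every column along its first dimension to produce row dicts."""
    lengths = {len(v) for v in columns.values()}
    assert len(lengths) == 1, "all fields must share the same first dimension"
    return [{k: v[i] for k, v in columns.items()} for i in range(lengths.pop())]

# A toy "file": one 2-D dataset plus one nested group, all with 3 rows.
h5_like = {
    "pixels": [[0, 1], [2, 3], [4, 5]],            # shape (3, 2) -> 3 rows
    "meta": {"label": [0, 1, 0], "score": [0.2, 0.9, 0.5]},
}
columns = flatten_fields(h5_like)
rows = to_rows(columns)
```

Here `columns` holds `pixels`, `meta.label`, and `meta.score`, and `rows` has three entries, one per index along the shared first dimension.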

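To make the dedupe idea behind content defined chunking concrete, here is a toy sketch (not the library's implementation, which relies on the Parquet writer): chunk boundaries are derived from a rolling hash of the data itself, so inserting bytes near the start only changes the first chunk, later boundaries re-synchronize, and most chunks still match for deduplication.

```python
import hashlib

def cdc_chunks(data: bytes, mask: int = 0x3F, min_size: int = 16) -> list[bytes]:
    """Split bytes into content-defined chunks with a toy rolling hash.

    A boundary is declared wherever the low bits of a rolling hash over the
    last ~16 bytes are all ones, so boundaries depend only on local content.
    (Illustrative only; real CDC schemes use Gear/FastCDC-style hashes.)
    """
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFF          # 16-bit window: old bytes shift out
        if i - start + 1 >= min_size and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Deterministic pseudo-random payload, then the same payload with bytes inserted.
base = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(100))
edited = b"NEW BYTES" + base

shared = set(cdc_chunks(base)) & set(cdc_chunks(edited))
```

Because later boundaries realign after the insertion, `shared` is non-empty: only the chunks that actually changed would need to be uploaded again.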
Other improvements and bug fixes

New Contributors

Full Changelog: 4.0.0...4.1.0