github huggingface/datasets 3.2.0

4 days ago

Dataset Features

  • Faster parquet streaming + filters with predicate pushdown by @lhoestq in #7309
    • Up to 2x streaming speed (+100%)
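    • Fast filtering via predicate pushdown: files and row groups that cannot match the predicate are skipped instead of being downloaded in full, e.g.
      from datasets import load_dataset
      # Each filter is a (column, operator, value) tuple; keep rows dated 2023 or later
      filters = [('date', '>=', '2023')]
      # streaming=True reads the data lazily instead of downloading the whole dataset
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)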
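
To make the idea of predicate pushdown concrete, here is a toy, pure-Python sketch of the general technique. It is not the code from #7309 and does not use the `datasets` or Parquet APIs. The names `may_match`, `scan`, and `make_group`, and the in-memory "row groups" with min/max statistics, are all hypothetical stand-ins for what Parquet metadata provides.

```python
import operator

# Comparison operators supported by this toy example.
OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def may_match(stats, op, value):
    """Return False only when the min/max statistics prove no row can match."""
    lo, hi = stats
    if op in (">=", ">"):
        return OPS[op](hi, value)  # the largest value must pass the test
    if op in ("<=", "<"):
        return OPS[op](lo, value)  # the smallest value must pass the test
    return lo <= value <= hi  # "==": value must lie within [lo, hi]


def scan(row_groups, filters):
    """Return the matching rows and the number of row groups actually read."""
    groups_read = 0
    rows = []
    for group in row_groups:
        # Pushdown step: check the metadata before touching the rows.
        if not all(may_match(group["stats"][col], op, val) for col, op, val in filters):
            continue
        groups_read += 1
        # Row-level filtering is still needed inside a group that may match.
        for row in group["rows"]:
            if all(OPS[op](row[col], val) for col, op, val in filters):
                rows.append(row)
    return rows, groups_read


def make_group(dates):
    """Build a hypothetical row group with min/max statistics for 'date'."""
    return {
        "stats": {"date": (min(dates), max(dates))},
        "rows": [{"date": d} for d in dates],
    }


row_groups = [
    make_group(["2021-03-01", "2021-11-20"]),
    make_group(["2022-06-15", "2023-01-02"]),
    make_group(["2023-05-09", "2024-02-28"]),
]
filters = [("date", ">=", "2023")]
matches, groups_read = scan(row_groups, filters)
print(groups_read, [row["date"] for row in matches])
```

In this sketch the first group is skipped purely from its statistics, so only two of the three groups are read. The same reasoning, applied to remote files, is what lets a filter avoid downloading data that cannot match.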

Other improvements and bug fixes

New Contributors

Full Changelog: 3.1.0...3.2.0
