github huggingface/datasets 4.2.0

2 days ago

Dataset Features

  • Sample without replacement option when interleaving datasets by @radulescupetru in #7786

    ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
  • Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in #7806

    ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
  • Add parquet scan options and docs by @lhoestq in #7801

    • docs to select columns and filter data efficiently

      ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
      ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])

    • new argument to control buffering and caching when streaming

      fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
          cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20)
      )
      ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
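
To make the "without replacement" idea from the first item concrete, here is a minimal plain-Python sketch. It is not the library's implementation: the helper name `interleave_without_replacement` is hypothetical, and it assumes a simple round-robin order with no sampling probabilities. The point it illustrates is that every sample is emitted exactly once, and an exhausted source is skipped rather than restarted. The ordering produced by `interleave_datasets` itself may differ.

```python
def interleave_without_replacement(sources):
    """Round-robin over `sources`, dropping each source once it is exhausted.

    Hypothetical helper for illustration only; assumes plain round-robin order.
    """
    iterators = [iter(source) for source in sources]
    while iterators:
        still_active = []
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                # Exhausted: skip it from now on instead of restarting it.
                continue
            still_active.append(iterator)
        iterators = still_active


# Three toy "datasets" of different lengths.
d1 = [0, 1, 2]
d2 = [10, 11, 12, 13]
d3 = [20, 21, 22, 23, 24]

result = list(interleave_without_replacement([d1, d2, d3]))
print(result)
```

Under these assumptions the output has `len(d1) + len(d2) + len(d3)` items with no repeats. An oversampling strategy would instead restart the shorter sources until the longest one is exhausted.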

What's Changed

New Contributors

Full Changelog: 4.1.1...4.2.0
