pypi datasets 3.3.0

6 days ago

Dataset Features

  • Support async functions in map() by @lhoestq in #7384

    • Especially useful to download content like images or call inference APIs
    # query_model stands in for your own async client call (e.g. an inference API)
    prompt = "Answer the following question: {question}. You should think step by step."
    async def ask_llm(example):
        return await query_model(prompt.format(question=example["question"]))
    ds = ds.map(ask_llm)
  • Add repeat method to datasets by @alex-hh in #7198

    ds = ds.repeat(10)
  • Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in #7370

    • Add support for "pandas" and "polars" formats in IterableDatasets
    • This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
    import polars as pl
    from datasets import load_dataset
    ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
    ds = ds.with_format("polars")
    # Capture the text inside boxed{...} into a new "value_solution" column
    expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
    ds = ds.map(lambda df: df.with_columns(expr), batched=True)
  • Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in #7207

    • IterableDatasets with "numpy" format are now much faster
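To illustrate why async functions suit network-bound work such as the ask_llm example, here is a minimal standard-library sketch. It does not show how map() schedules coroutines internally; it only demonstrates that awaiting many calls together lets their waiting times overlap. The query_model body is a hypothetical stand-in that sleeps instead of calling a real API, and run_all is a helper introduced only for this illustration.

```python
import asyncio

PROMPT = "Answer the following question: {question}. You should think step by step."

async def query_model(text: str) -> str:
    """Hypothetical stand-in for a real inference API call."""
    await asyncio.sleep(0.05)  # simulate network latency
    return f"echo: {text}"

async def ask_llm(example: dict) -> dict:
    # Return a dict so the result could be merged into the example as a new column.
    answer = await query_model(PROMPT.format(question=example["question"]))
    return {"answer": answer}

async def run_all(examples: list) -> list:
    # gather() schedules every call at once, so the waits overlap;
    # results come back in the same order as the inputs.
    return await asyncio.gather(*(ask_llm(ex) for ex in examples))

examples = [{"question": f"What is {i} + {i}?"} for i in range(3)]
results = asyncio.run(run_all(examples))
print(results[0]["answer"])
```

With a real client, the sleep would be replaced by the actual request, and the same ask_llm function could be passed to ds.map().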

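The pattern used in the polars streaming example can be checked on plain strings before running it over a dataset. The sketch below uses Python's re module as an approximation: polars uses a different regex engine, but for a pattern this simple the captured group should be the same. The helper name extract_boxed and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the polars expression, written as a raw string.
PATTERN = re.compile(r"boxed\{(.*)\}")

def extract_boxed(solution: str):
    """Return the text captured inside boxed{...}, or None when there is no match."""
    match = PATTERN.search(solution)
    return match.group(1) if match else None

simple = extract_boxed(r"The answer is \boxed{42}.")
# (.*) is greedy, so it runs up to the LAST closing brace on the line.
greedy = extract_boxed(r"\boxed{1} or {2}")
missing = extract_boxed("no boxed answer here")

print(simple, greedy, missing)
```

Because the capture group is greedy, a solution containing several braces after the boxed answer yields more than the answer itself; a stricter pattern such as `boxed\{([^}]*)\}` may be preferable when that matters.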
What's Changed

New Contributors

Full Changelog: 3.2.0...3.3.0
