github huggingface/datasets 2.6.0

19 months ago

Important

  • [GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
    • All dataset scripts and dataset cards now live on https://hf.co/datasets
    • From now on, users and contributors are invited to open discussions or pull requests on the Hugging Face Hub

Datasets features

  • Add ability to read-write to SQL databases. by @Dref360 in #4928
    • Read from a SQLite file:
      from datasets import Dataset
      # Load the whole table "data_table" from the SQLite database at the given URI.
      dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
    • Allow connection objects in from_sql + small doc improvement by @mariosasko in #5091
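      from sqlite3 import connect
      from datasets import Dataset
      con = connect(...)
      # Select up to 10 rows whose text is longer than 100 characters.
      # The table is named data_table because "table" is a reserved word in SQLite.
      dataset = Dataset.from_sql("SELECT text FROM data_table WHERE length(text) > 100 LIMIT 10", con)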
  • Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072
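    • Return numpy/torch/tf/jax tensors for image and audio columns using with_format:
      from datasets import load_dataset
      # Pass a split so that ds is a Dataset and can be indexed by row number.
      ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
      ds[0]["image"]  # a torch tensor instead of a PIL image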
  • Added IterableDataset.from_generator by @hamid-vakilzadeh in #5052
  • Fast dataset iter by @mariosasko in #5030
    • speed up by a factor of 2 using the Arrow Table reader
  • Dataset infos in yaml by @lhoestq in #4926
  • Add kwargs to Dataset.from_generator by @mariosasko in #5049
  • Support converters in CsvBuilder by @mariosasko in #5057
  • Restore saved format state in load_from_disk by @asofiaoliveira in #5073
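
The from_sql examples in the list above need an existing database to run against. The following is a minimal sketch, not taken from the release notes, that builds a throwaway in-memory SQLite database with the standard sqlite3 module and defines a helper that hands the open connection to Dataset.from_sql. The table name data_table, the column layout, the sample rows, and the helper names build_demo_db and load_long_texts are illustrative assumptions.

```python
import sqlite3

# Hypothetical table name, chosen to match the examples in the feature list.
TABLE = "data_table"
QUERY = f"SELECT text FROM {TABLE} WHERE length(text) > 100 LIMIT 10"


def build_demo_db(path=":memory:"):
    """Create a SQLite database holding one table of short and long texts."""
    con = sqlite3.connect(path)
    con.execute(f"CREATE TABLE {TABLE} (id INTEGER PRIMARY KEY, text TEXT)")
    # "word " is 5 characters, so the rows have lengths 50, 100, 150, 200, 250.
    rows = [(i, "word " * (10 * i)) for i in range(1, 6)]
    con.executemany(f"INSERT INTO {TABLE} VALUES (?, ?)", rows)
    con.commit()
    return con


def load_long_texts(con):
    """Run QUERY through Dataset.from_sql on an open connection.

    Assumes datasets >= 2.6.0 and pandas are installed. The import is lazy
    so that the SQLite part of this sketch runs without them.
    """
    from datasets import Dataset

    return Dataset.from_sql(QUERY, con)


con = build_demo_db()
# Check the query with plain sqlite3 before handing it to datasets.
print(len(con.execute(QUERY).fetchall()))  # 3 rows are longer than 100 characters
```

With datasets installed, load_long_texts(con) should return a Dataset containing those same three rows. Checking the query with sqlite3 first keeps SQL mistakes separate from any issue in the datasets call.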

Dataset changes

Dataset cards

General improvements and bug fixes

New Contributors

Full Changelog: 2.5.1...2.6.0
