github huggingface/datasets 2.20.0

15 days ago

Important

  • Remove default trust_remote_code=True by @lhoestq in #6954
    • datasets with a python loading script now require passing trust_remote_code=True to be used

Datasets features

  • [Resumable IterableDataset] Add IterableDataset state_dict by @lhoestq in #6658
    • checkpoint and resume an iterable dataset (e.g. when streaming):

      >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
      >>> for idx, example in enumerate(iterable_dataset):
      ...     print(example)
      ...     if idx == 2:
      ...         state_dict = iterable_dataset.state_dict()
      ...         print("checkpoint")
      ...         break
      >>> iterable_dataset.load_state_dict(state_dict)
      >>> print(f"restart from checkpoint")
      >>> for example in iterable_dataset:
      ...     print(example)

      Returns:

      {'a': 0}
      {'a': 1}
      {'a': 2}
      checkpoint
      restart from checkpoint
      {'a': 3}
      {'a': 4}
      {'a': 5}
      

General improvements and bug fixes

New Contributors

Full Changelog: 2.19.0...2.20.0

Don't miss a new datasets release

NewReleases is sending notifications on new releases.