github huggingface/datasets 0.2.0
New datasets + Apache Beam, new metrics, bug fixes

3 years ago

Datasets changes:

  • New: germeval14
  • New: wmt
  • New: Ubuntu dialog corpus
  • New: squad spanish
  • New: Quanta
  • New: arcd
  • New: Natural Questions (needs to be processed using an Apache Beam pipeline)
  • New: C4 (needs to be processed using an Apache Beam pipeline)
  • Skip the processing: wikipedia (the English and French versions are now already processed)
  • Skip the processing: wiki40b (the English version is now already processed)
  • Renamed: anli -> art
  • Better instructions: xsum
  • Add .filter() for arrow datasets
  • Show an instruction message when a dataset requires manually downloaded data
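
To illustrate what a row filter over column-oriented (Arrow-style) data does, here is a minimal pure-Python sketch. It is not the library's implementation: `filter_rows`, the `Columns` alias, and the sample table are hypothetical names chosen for this example, and a plain dict of lists stands in for an Arrow table.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a columnar table: column name -> list of values.
Columns = Dict[str, List[Any]]


def filter_rows(columns: Columns, keep: Callable[[Dict[str, Any]], bool]) -> Columns:
    """Return a new columnar table containing only the rows where keep(row) is true."""
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    kept: Columns = {name: [] for name in names}
    for i in range(num_rows):
        # Rebuild one row as a dict so the predicate can look up fields by name.
        row = {name: columns[name][i] for name in names}
        if keep(row):
            for name in names:
                kept[name].append(row[name])
    return kept


# Illustrative data only.
table = {
    "text": ["a short one", "a much longer example sentence", ""],
    "label": [0, 1, 0],
}
non_empty = filter_rows(table, lambda row: len(row["text"]) > 0)
```

The predicate receives one example at a time as a dict, and the original table is left unmodified.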

Metrics changes:

  • New: BERTScore
  • Allow adding examples one at a time or in batches when computing a metric score
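
The idea of feeding a metric either element by element or batch by batch can be sketched as follows. This is a toy accumulator written for illustration, not the library's metric class: `ExactMatchAccumulator` and its method names are assumptions made for this example, and exact match is used only because it is simple to verify.

```python
from typing import List


class ExactMatchAccumulator:
    """Toy metric (hypothetical): collects examples, then scores them all at once."""

    def __init__(self) -> None:
        self._predictions: List[str] = []
        self._references: List[str] = []

    def add(self, prediction: str, reference: str) -> None:
        # Add a single (prediction, reference) pair.
        self._predictions.append(prediction)
        self._references.append(reference)

    def add_batch(self, predictions: List[str], references: List[str]) -> None:
        # Add several pairs at once; both lists must line up.
        if len(predictions) != len(references):
            raise ValueError("predictions and references must have the same length")
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self) -> float:
        # Fraction of stored predictions that equal their reference exactly.
        if not self._predictions:
            raise ValueError("no examples were added")
        matches = sum(p == r for p, r in zip(self._predictions, self._references))
        return matches / len(self._predictions)


metric = ExactMatchAccumulator()
metric.add("paris", "paris")                               # one element
metric.add_batch(["london", "rome"], ["london", "milan"])  # one batch
score = metric.compute()
```

Both entry points append to the same internal buffers, so the final score does not depend on how the examples were grouped when they were added.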

Commands:

  • New: nlp-cli dummy_data: to help generate dummy data files to test dataset scripts
  • New: nlp-cli run_beam: to run an Apache Beam pipeline to process a dataset in the cloud

Bug fixes:

  • .map now returns the correct values when run on different splits of the same dataset
  • Fix the input format of the squad metric to match the format of the squad dataset
  • Fix downloads of small files from Google Drive
  • For datasets with several sub-datasets, like glue or scientific papers, force the user to pick one sub-dataset to make things less confusing

More tests:

  • Local tests of dataset processing scripts
  • AWS tests of dataset processing scripts
  • Tests for arrow dataset methods
  • Tests for arrow reader methods
