Datasets changes:
- New: germeval14
- New: wmt
- New: Ubuntu Dialogue Corpus
- New: Spanish SQuAD
- New: qanta
- New: arcd
- New: Natural Questions (needs to be processed with an Apache Beam pipeline)
- New: C4 (needs to be processed with an Apache Beam pipeline)
- Skip the processing: wikipedia (the English and French versions now come pre-processed)
- Skip the processing: wiki40b (the English version now comes pre-processed)
- Renamed: anli -> art
- Better instructions: xsum
- Add a .filter() method for Arrow datasets
- Add an instruction message when a dataset requires manually downloaded data
Metrics changes:
- New: BERTScore
- Allow adding examples element by element or in batches when computing a metric score
Commands:
- New: nlp-cli dummy_data: helps generate the dummy data files used to test dataset scripts
- New: nlp-cli run_beam: runs an Apache Beam pipeline to process a dataset in the cloud
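Both commands are invoked from the shell; the argument shown below (a path to a dataset script directory) is an assumption for illustration, not taken from the release notes:

```shell
# Generate dummy data files for testing a dataset script
nlp-cli dummy_data datasets/squad

# Run an Apache Beam pipeline to pre-process a large dataset
# (e.g. wikipedia or c4) in the cloud
nlp-cli run_beam datasets/wikipedia
```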
Bug fixes:
- .map() now returns the right values when run on different splits of the same dataset
- Fix the input format of the squad metric to match the format of the squad dataset
- Fix downloads of small files from Google Drive
- For datasets with several sub-datasets, like glue or scientific_papers, force the user to pick one sub-dataset to make things less confusing
More tests:
- Local tests of dataset processing scripts
- AWS tests of dataset processing scripts
- Tests for arrow dataset methods
- Tests for arrow reader methods