Datasets Features
- add from_pandas and from_dict
- add shard method
- add rename/remove/cast columns methods
- faster select method
- add concatenate datasets
- add support for taking samples using numpy arrays
- add export to TFRecords
- add features parameter when loading from text/json/pandas/csv or when using the map transform
- add support for nested features for json
- add DatasetDict object with map/filter/sort/shuffle, which is useful when loading several splits of a dataset
- add support for post-processing Dataset objects in dataset scripts. For example, Wiki DPR uses this to attach a FAISS index to the dataset so that passages can be queried for open-domain QA
- add indexing using FAISS or ElasticSearch:
  - add add_faiss_index and add_elasticsearch_index methods
  - add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
  - add search and search_batch to query the index and return example ids
  - add save_faiss_index/load_faiss_index to save/load a serialized FAISS index
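The difference between the two query styles listed above (returning example ids versus returning the examples themselves) can be illustrated with a small plain-Python sketch. This is not the library's implementation: `ToyIndexedDataset` is a hypothetical class, the brute-force inner-product scoring stands in for a real FAISS or ElasticSearch index, and the return shapes are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class ToyIndexedDataset:
    """Hypothetical stand-in for a dataset with a vector index (not the library API)."""

    def __init__(self, columns: Dict[str, list], vector_column: str):
        self.columns = columns
        self.vectors = columns[vector_column]

    def search(self, query: List[float], k: int = 1) -> Tuple[List[float], List[int]]:
        """Return the top-k inner-product scores and the matching example ids."""
        scored = [
            (sum(q * v for q, v in zip(query, vec)), idx)
            for idx, vec in enumerate(self.vectors)
        ]
        # Highest score first; ties are broken by the lowest id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = scored[:k]
        return [score for score, _ in top], [idx for _, idx in top]

    def get_nearest_examples(self, query: List[float], k: int = 1):
        """Return the top-k scores and the full examples, column by column."""
        scores, ids = self.search(query, k)
        examples = {
            name: [values[i] for i in ids] for name, values in self.columns.items()
        }
        return scores, examples


toy = ToyIndexedDataset(
    {
        "text": ["cats purr", "dogs bark", "birds sing"],
        "embedding": [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
    },
    vector_column="embedding",
)
scores, ids = toy.search([1.0, 0.2], k=2)
_, examples = toy.get_nearest_examples([1.0, 0.2], k=2)
print(ids, examples["text"])
```

In this sketch the id-returning call is the primitive and the example-returning call is built on top of it, which is one plausible way to layer the two.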
Datasets changes
- new: PG19
- new: ANLI
- new: WikiSQL
- new: qa_zre
- new: MWSC
- new: AG news
- new: SQuADShifts
- new: doc red
- new: Wiki DPR
- new: fever
- new: hyperpartisan news detection
- new: pandas
- new: text
- new: emotion
- new: quora
- new: BioMRC
- new: web questions
- new: search QA
- new: LinCE
- new: TREC
- new: Style Change Detection
- new: 20newsgroup
- new: social bias frames
- new: Emo
- new: web of science
- new: sogou news
- new: crd3
- update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
- update: xtreme - add PAWS-X.es
- update: xsum - manual download is no longer required.
- new processed: Natural Questions
Metrics Features
- add seed parameter for metrics that do sampling, such as rouge
- better installation messages
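Why a seed matters for sampling-based metrics can be shown with a short sketch. It assumes the sampling is a bootstrap-style resampling of per-example scores; `bootstrap_mean` and its parameters are hypothetical names for illustration, not the metric's actual code.

```python
import random
from statistics import mean
from typing import List, Optional


def bootstrap_mean(
    scores: List[float], n_samples: int = 1000, seed: Optional[int] = None
) -> float:
    """Average of resampled means; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)  # local generator, the global random state is untouched
    resampled_means = [
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_samples)
    ]
    return mean(resampled_means)


per_example_scores = [0.2, 0.5, 0.9, 0.4, 0.7]
first = bootstrap_mean(per_example_scores, seed=42)
second = bootstrap_mean(per_example_scores, seed=42)
print(first == second)
```

Without a seed, two runs over the same predictions can report slightly different aggregate values, which makes results harder to compare.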
Metrics changes
- new: bleurt
- update seqeval: fix entities extraction (more info here)
Bug fixes
- fix bug in map and select that was causing memory issues
- fix pyarrow version check
- fix text/json/pandas/csv caching when loading different files in a row
- fix metrics caching when they have different config names
- fix cache that was not discarded when there's a KeyboardInterrupt during .map
- fix sacrebleu tokenizer's parameter
- fix docstrings of metrics when multiple instances are created
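For the interrupted-cache fix above, a general pattern for avoiding stale cache files is to write to a temporary file and publish it only after the write completes. The sketch below shows that pattern in plain Python. It is an assumption about how such a problem can be handled, not the library's actual fix, and `write_cache_atomically` is a hypothetical helper.

```python
import os
import tempfile


def write_cache_atomically(cache_path, rows, transform):
    """Write transformed rows to cache_path, leaving nothing behind on failure."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            for row in rows:
                tmp_file.write(f"{transform(row)}\n")
        os.replace(tmp_path, cache_path)  # publish only after a complete write
    except BaseException:
        # BaseException also covers KeyboardInterrupt, which Exception would miss.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def interrupted(row):
    """Simulate the user pressing Ctrl+C in the middle of the transform."""
    raise KeyboardInterrupt


workdir = tempfile.mkdtemp()
cache_file = os.path.join(workdir, "cache.txt")
try:
    write_cache_atomically(cache_file, [1, 2, 3], interrupted)
except KeyboardInterrupt:
    pass
leftovers = os.listdir(workdir)  # no partial cache file remains
```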
More Tests
- add tests for features handling in dataset transforms
- add tests for dataset builders
- add tests for metrics loading
Backward compatibility
- because there are changes in the dataset_info.json file format, old versions of the lib (<0.4.0) won't be able to load datasets with a post processing field in dataset_info.json
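The compatibility rule above can be written as a small check. This is only a sketch: the key name `post_processed` and the helper names are hypothetical, chosen for illustration, and the real dataset_info.json layout may differ.

```python
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain 'major.minor.patch' string into a comparable tuple."""
    return tuple(int(part) for part in version.split(".")[:3])


def can_load(lib_version: str, dataset_info: Dict) -> bool:
    """Old library versions only fail on infos that carry a post-processing field."""
    # "post_processed" is a hypothetical key name used for this sketch.
    needs_new_format = "post_processed" in dataset_info
    return not needs_new_format or parse_version(lib_version) >= (0, 4, 0)


old_lib_new_info = can_load("0.3.0", {"post_processed": {}})
new_lib_new_info = can_load("0.4.0", {"post_processed": {}})
old_lib_old_info = can_load("0.3.0", {"description": "no post processing"})
```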