huggingface/datasets 0.4.0 on GitHub

Datasets Features

add from_pandas and from_dict
add shard method
add rename/remove/cast columns methods
faster select method
add concatenate datasets
add support for taking samples using numpy arrays
add export to TFRecords
add features parameter when loading from text/json/pandas/csv or when using the map transform
add support for nested features for json
add DatasetDict object with map/filter/sort/shuffle, which is useful when loading several splits of a dataset
add support for post-processing Dataset objects in dataset scripts. For example, Wiki DPR uses this to attach a FAISS index to the dataset so that passages can be queried for Open Domain QA
add indexing using FAISS or ElasticSearch:
- add add_faiss_index and add_elasticsearch_index methods
- add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
- add search and search_batch to query the index and return example ids
- add save_faiss_index/load_faiss_index to save/load a serialized FAISS index

new: PG19
new: ANLI
new: WikiSQL
new: qa_zre
new: MWSC
new: AG news
new: SQuADShifts
new: doc red
new: Wiki DPR
new: fever
new: hyperpartisan news detection
new: pandas
new: text
new: emotion
new: quora
new: BioMRC
new: web questions
new: search QA
new: LinCE
new: TREC
new: Style Change Detection
new: 20newsgroup
new: social bias frames
new: Emo
new: web of science
new: sogou news
new: crd3
update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
update: xtreme - add PAWS-X.es
update: xsum - manual download is no longer required.
new processed: Natural Questions
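
The PAN-X format change noted in the xtreme update above can be pictured with plain Python data. This is only a sketch of the two shapes: the field names (`word`, `tag`, `words`, `tags`), the example sentence, and the use of parallel lists for the new format are all assumptions, not the actual feature names of the dataset.

```python
# Old shape (assumed field names): one sample per word/tag pair.
old_samples = [
    {"word": "Paris", "tag": "B-LOC"},
    {"word": "is", "tag": "O"},
    {"word": "big", "tag": "O"},
]

# New shape (assumed field names): one sample per sentence,
# holding all of its words and their tags.
new_sample = {
    "words": ["Paris", "is", "big"],
    "tags": ["B-LOC", "O", "O"],
}


def to_pairs(sample):
    """Flatten a sentence-level sample back into word/tag pairs."""
    return [{"word": w, "tag": t} for w, t in zip(sample["words"], sample["tags"])]


print(to_pairs(new_sample))
```

Code that iterated over PAN-X one token at a time would need a similar flattening step, or would need to be rewritten to work on whole sentences.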

because the dataset_info.json file format has changed, old versions of the lib (<0.4.0) cannot load datasets that have a post processing field in dataset_info.json