github huggingface/datasets 0.4.0

4 years ago

Datasets Features

  • add from_pandas and from_dict
  • add shard method
  • add rename/remove/cast columns methods
  • faster select method
  • add concatenate datasets
  • add support for taking samples using numpy arrays
  • add export to TFRecords
  • add features parameter when loading from text/json/pandas/csv or when using the map transform
  • add support for nested features for json
  • add DatasetDict object with map/filter/sort/shuffle, which is useful when loading several splits of a dataset
  • add support for post-processing Dataset objects in dataset scripts. For example, Wiki DPR uses this to attach a faiss index to the dataset so that passages can be queried for Open Domain QA
  • add indexing using FAISS or ElasticSearch:
    • add add_faiss_index and add_elasticsearch_index methods
    • add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
    • add search and search_batch to query the index and return example ids
    • add save_faiss_index/load_faiss_index to save/load a serialized faiss index
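
The difference between the two query styles above (search returns ids, get_nearest_examples returns the examples themselves) can be sketched without any index backend. The class below is a hypothetical brute-force stand-in written for illustration, not the library's implementation: TinyIndexedDataset, its constructor arguments, and the simplified return shapes are all assumptions, and a real FAISS or ElasticSearch index would replace the linear scan.

```python
from typing import Dict, List, Tuple


class TinyIndexedDataset:
    """Hypothetical stand-in for a dataset with a vector index attached."""

    def __init__(self, examples: List[Dict], vectors: List[List[float]]):
        if len(examples) != len(vectors):
            raise ValueError("need exactly one vector per example")
        self.examples = examples
        self.vectors = vectors

    def search(self, query: List[float], k: int = 2) -> Tuple[List[float], List[int]]:
        """Return (scores, ids) of the k closest vectors.

        The score is the squared L2 distance, so lower means closer.
        """
        scored = sorted(
            (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
            for i, vec in enumerate(self.vectors)
        )[:k]
        return [s for s, _ in scored], [i for _, i in scored]

    def get_nearest_examples(self, query: List[float], k: int = 2) -> Tuple[List[float], List[Dict]]:
        """Same query as search, but resolve the ids to the stored examples."""
        scores, ids = self.search(query, k)
        return scores, [self.examples[i] for i in ids]


passages = TinyIndexedDataset(
    examples=[{"text": "a"}, {"text": "b"}, {"text": "c"}],
    vectors=[[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]],
)
scores, ids = passages.search([0.9, 0.0], k=2)
scores, nearest = passages.get_nearest_examples([0.9, 0.0], k=2)
```

Returning ids keeps the query cheap when only positions are needed; resolving them to examples is a separate step layered on top.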

Datasets changes

  • new: PG19
  • new: ANLI
  • new: WikiSQL
  • new: qa_zre
  • new: MWSC
  • new: AG news
  • new: SQuADShifts
  • new: doc red
  • new: Wiki DPR
  • new: fever
  • new: hyperpartisan news detection
  • new: pandas
  • new: text
  • new: emotion
  • new: quora
  • new: BioMRC
  • new: web questions
  • new: search QA
  • new: LinCE
  • new: TREC
  • new: Style Change Detection
  • new: 20newsgroup
  • new: social bias frames
  • new: Emo
  • new: web of science
  • new: sogou news
  • new: crd3
  • update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
  • update: xtreme - add PAWS-X.es
  • update: xsum - manual download is no longer required.
  • new processed: Natural Questions
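
For code written against the old PAN-X layout, the format change noted above can be bridged with a small adapter. This is a minimal sketch under assumptions: the field names tokens, ner_tags, word and ner_tag are hypothetical placeholders chosen for illustration, so check the actual feature names of the dataset before relying on them.

```python
from typing import Dict, List


def sentence_to_pairs(sample: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Flatten one sentence-level sample into word/tag pair samples.

    Assumes (hypothetically) that the new format stores parallel lists
    under "tokens" and "ner_tags".
    """
    tokens, tags = sample["tokens"], sample["ner_tags"]
    if len(tokens) != len(tags):
        raise ValueError("tokens and ner_tags must have the same length")
    return [{"word": word, "ner_tag": tag} for word, tag in zip(tokens, tags)]


# New-style sample: one sentence with all of its word/tag pairs.
new_sample = {"tokens": ["Paris", "is", "nice"], "ner_tags": ["B-LOC", "O", "O"]}

# Old-style view: one sample per word/tag pair.
old_samples = sentence_to_pairs(new_sample)
```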

Metrics Features

  • add seed parameter for metrics that do sampling, like rouge
  • better installation messages
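
Why a seed matters for a sampling-based metric can be shown with a plain bootstrap. The sketch below assumes the sampling in question is bootstrap-style resampling of per-example scores; bootstrap_interval and its parameters are hypothetical names for illustration and do not reproduce the metric's actual code.

```python
import random
from typing import List, Optional, Tuple


def bootstrap_interval(
    scores: List[float], n_samples: int = 1000, seed: Optional[int] = None
) -> Tuple[float, float]:
    """Return an approximate 95% interval for the mean of `scores`."""
    if not scores or n_samples < 1:
        raise ValueError("need at least one score and one resample")
    rng = random.Random(seed)  # a local generator, so global state is untouched
    means = []
    for _ in range(n_samples):
        # Draw len(scores) values with replacement and record their mean.
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    low = means[int(0.025 * n_samples)]
    high = means[int(0.975 * n_samples) - 1]
    return low, high


per_example_scores = [0.31, 0.42, 0.55, 0.28, 0.47, 0.39]
first = bootstrap_interval(per_example_scores, seed=42)
second = bootstrap_interval(per_example_scores, seed=42)
```

With the same seed, first and second are identical, which makes reported intervals reproducible across runs; with seed=None each call draws a fresh resampling.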

Metrics changes

  • new: bleurt
  • update: seqeval - fix entities extraction

Bug fixes

  • fix bug in map and select that was causing memory issues
  • fix pyarrow version check
  • fix text/json/pandas/csv caching when loading different files in a row
  • fix metrics caching when they have different config names
  • fix cache that was not discarded when there's a KeyboardInterrupt during .map
  • fix sacrebleu tokenizer's parameter
  • fix docstrings of metrics when multiple instances are created
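
The KeyboardInterrupt fix above concerns a general pattern: a partially written cache must not survive an interrupted run. One common way to get that behavior is to write to a temporary file and only move it into place on success. The sketch below is an illustration of that pattern, not the library's code; write_cache_atomically and its arguments are hypothetical.

```python
import os
import tempfile
from typing import Callable, Iterable


def write_cache_atomically(
    path: str, rows: Iterable[str], transform: Callable[[str], str]
) -> None:
    """Write transformed rows to `path`, leaving nothing behind on failure."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for row in rows:
                tmp.write(transform(row) + "\n")
        # Only a fully written file is moved to its final name.
        os.replace(tmp_path, path)
    except BaseException:
        # KeyboardInterrupt derives from BaseException, not Exception,
        # so a plain `except Exception` would miss it and leak the file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Catching BaseException and re-raising keeps Ctrl-C working as usual while still discarding the incomplete file.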

More Tests

  • add tests for features handling in dataset transforms
  • add tests for dataset builders
  • add tests for metrics loading

Backward compatibility

  • because there are changes in the dataset_info.json file format, old versions of the lib (<0.4.0) won't be able to load datasets with a post processing field in dataset_info.json
