1.0.0 Release: New name, Speed-ups, Multimodal, Serialization
Package Changes
- Rename: `nlp` -> `datasets`

Update now with `pip install datasets`
Dataset Features
- Keep the dataset format after dataset transforms (#607)
- Pickle support (#536)
- Save and load datasets to/from disk (#571)
- Multiprocessing in `map` and `filter` (#552)
- Multi-dimensional arrays support for multi-modal datasets (#533, #363)
- Speed up Tokenization by optimizing casting to python objects (#523)
- Speed up shuffle/shard/select methods - use indices mappings (#513)
- Add `input_column` parameter in `map` and `filter` (#475)
- Speed up download and processing (#563)
- Indexed datasets for hybrid models (REALM/RAG/MARGE) (#500)
Dataset Changes
- New: IWSLT 2017 (#470)
- New: CommonGen Dataset (#578)
- New: CLUE Benchmark (11 datasets) (#572)
- New: the KILT knowledge source and tasks (#559)
- New: DailyDialog (#556)
- New: DoQA dataset (ACL 2020) (#473)
- New: reuters21578 (#570)
- New: HANS (#551)
- New: MLSUM (#529)
- New: Guardian authorship (#452)
- New: web_questions (#401)
- New: MS MARCO (#364)
- Update: Germeval14 - update download url (#594)
- Update: LinCE - update download url (#550)
- Update: Hyperpartisan news detection - update download url, manual download no longer required (#504)
- Update: Rotten Tomatoes - update download url (#484)
- Update: Wiki DPR - Use HNSW faiss index (#500)
- Update: Text - Speed up using multi-threaded PyArrow loading (#548)
- Fix: GLUE, PAWS-X - skip header (#497)
[Breaking] Update Dataset and DatasetDict API (#459)
- Rename the `flatten`, `drop` and `dictionary_encode_column` methods to `flatten_`, `drop_` and `dictionary_encode_column_` to indicate that these methods have in-place effects
- Remove the `dataset.columns` property and `dataset.nbytes`
- Add a few more properties and methods to `DatasetDict`
Metric Features
- Disallow the use of positional arguments to avoid `predictions` vs `references` mistakes (#466)
- Allow directly feeding numpy/pytorch/tensorflow/pandas objects to metrics (#466)
Metric Changes
Loading script Features
- Pin the version of the scripts (reproducibility) (#603, #584)
- Specify a default `script_version` with the env variable `HF_SCRIPTS_VERSION` (#584)
- Save scripts in a modules cache directory that can be controlled with `HF_MODULES_CACHE` (#574)
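A small sketch of configuring these environment variables before importing the library (the variable names come from the notes above; the values are illustrative):

```python
import os

# Default script version used when loading datasets/metrics (#584).
os.environ["HF_SCRIPTS_VERSION"] = "1.0.0"
# Directory where downloaded loading scripts are cached as modules (#574).
os.environ["HF_MODULES_CACHE"] = os.path.expanduser("~/.cache/hf_modules")
```

Setting these before the first `import datasets` ensures the library picks them up when it initializes its module cache.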
Caching
- Better support for tokenizers when caching `map` results (#601)
- Faster caching for text datasets (#573, #502)
- Use dataset fingerprints, updated after each transform (#536)
- Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics (#518)
Documentation
- Metrics documentation (#579)
Miscellaneous
- Add centralized logging and bump cache-load messages up to the warning level (#538)
Bug fixes
- Datasets: [Breaking] fixed typo in the "formated_as" method: renamed `formated` to `formatted` (#516)
- Datasets: fixed the error message when loading text/csv/json without providing data files (#586)
- Datasets: fixed the `select` method for pyarrow < 1.0.0 (#585)
- Datasets: fixed elasticsearch result ids being returned as strings (#487)
- Datasets: fixed config used for slow test on real dataset (#527)
- Datasets: fixed tensorflow-formatted dataset outputs by using ragged tensors by default (#530)
- Datasets: fixed batched map for formatted dataset (#515)
- Datasets: fixed encodings issues on Windows - apply utf-8 encoding to all datasets (#481)
- Datasets: fixed dataset.map for function without outputs (#506)
- Datasets: fixed bad type in overflow check (#496)
- Datasets: fixed dataset info save - don't use the beam filesystem to save info for the local cache dir (#498)
- Datasets: fixed arrays outputs - stack vectors in numpy, pytorch and tensorflow (#495, #494)
- Metrics: fixed locking in distributed settings if one process finished before the other started writing (#564, #547)