Dataset Features
- On-the-fly data transforms (#1795)
- Add S3 support for downloading and uploading processed datasets (#1723)
- Allow loading dataset in-memory (#1792)
- Support future datasets (#1813)
- Enable/disable caching (#1703)
- Offline dataset loading (#1726)
Datasets Hub Features
- Loading from the Datasets Hub (#1860)
This allows users to create their own dataset repositories in the Datasets Hub and then load them using the library.
Repositories can be created on the website: https://huggingface.co/new-dataset or using the huggingface-cli. More information is available in the dataset sharing section of the documentation.
Dataset Changes
- New: LJ Speech (#1878)
- New: Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
- New: cord 19 (#1850)
- New: Tweet Eval Dataset (#1829)
- New: CIFAR-100 Dataset (#1812)
- New: SICK (#1804)
- New: BBC Hindi NLI Dataset (#1158)
- New: Freebase QA Dataset (#1814)
- New: Arabic sarcasm (#1798)
- New: Semantic Scholar Open Research Corpus (#1606)
- New: DuoRC Dataset (#1800)
- New: Aggregated dataset for the GEM benchmark (#1807)
- New: CC-News dataset of English language articles (#1323)
- New: irc disentangle (#1586)
- New: Narrative QA Manual (#1778)
- New: Universal Morphologies (#1174)
- New: SILICONE (#1761)
- New: Librispeech ASR (#1767)
- New: OSCAR (#1694, #1868, #1833)
- New: CANER Corpus (#1684)
- New: Arabic Speech Corpus (#1852)
- New: id_liputan6 (#1740)
- New: Structured Argument Extraction for Korean dataset (#1748)
- New: TurkCorpus (#1732)
- New: Hatexplain Dataset (#1716)
- New: adversarialQA (#1714)
- Update: Doc2dial - reading comprehension update to latest version (#1816)
- Update: OPUS Open Subtitles - add metadata information (#1865)
- Update: SWDA - use all metadata features (#1799)
- Update: SWDA - add metadata and correct splits (#1749)
- Update: CommonGen - update citation information (#1787)
- Update: SciFact - update URL (#1780)
- Update: BrWaC - update features name (#1736)
- Update: TLC - update URLs to GitHub links (#1737)
- Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
- Fix: multi_woz_v22 - fix checksums (#1880)
- Fix: limit - fix url (#1861)
- Fix: WebNLG - fix test set + more fields (#1739)
- Fix: PAWS-X - fix csv DictReader splitting data on quotes (#1763)
- Fix: reuters - add missing "brief" entries (#1744)
- Fix: thainer - empty token bug (#1734)
- Fix: lst20 - empty token bug (#1734)
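The PAWS-X fix listed above concerns `csv.DictReader` splitting rows incorrectly when the data contains quote characters. The following is a minimal sketch of that class of problem using only the standard library. The sample rows are invented, and `quoting=csv.QUOTE_NONE` is an assumption about a typical remedy, not a description of the change made in #1763.

```python
import csv
import io

# Invented tab-separated sample: the first data row has an unbalanced quote.
raw = (
    "id\tsentence1\tsentence2\n"
    '1\t"Unbalanced quote here\tsecond sentence\n'
    "2\tplain text\tanother one\n"
)

# Default quoting treats the leading '"' as the start of a quoted field,
# so tabs and newlines that follow are absorbed into that one field.
default_rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# QUOTE_NONE disables quote handling, so '"' is kept as an ordinary character.
literal_rows = list(
    csv.DictReader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), len(literal_rows))
```

With quote handling disabled, each line of the file maps to exactly one row, at the cost of no longer supporting fields that are quoted on purpose.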
Metrics Changes
- New: Word Error Metric (#1847)
- New: COMET (#1577, #1753)
- Fix: bert_score - set version dependency (#1851)
Metric Docs
- Add metrics usage examples and tests (#1820)
CLI Changes
- [BREAKING] Remove outdated commands (#1869):
  - remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
  - instead, use the huggingface-hub CLI
Bug fixes
- Fix writing GPU Faiss index (#1862)
- Update pyarrow import warning (#1782)
- Ignore definition line number of functions for caching (#1779)
- Update saving and loading methods for Faiss index to accept path-like objects (#1663)
- Print error message with filename when a CSV file is malformed (#1826)
- Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)
Refactoring
- Refactoring: Create config module (#1848)
- Use a config id in the cache directory names for custom configs (#1754)
Logging
- Enable logging propagation and remove logging handler (#1845)
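To show what logging propagation means in practice, here is a minimal sketch using only the standard `logging` module. The logger name `example_library` is hypothetical and stands in for a library logger. The point is that a library which attaches no handler of its own, and leaves propagation on, lets the application decide where log records go.

```python
import io
import logging

# Hypothetical library-side logger: no handler attached, propagation enabled.
lib_logger = logging.getLogger("example_library")
lib_logger.setLevel(logging.INFO)
lib_logger.propagate = True  # records are passed up to ancestor loggers

# Application side: a single handler on the root logger captures everything.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))
root = logging.getLogger()
root.addHandler(handler)

lib_logger.info("dataset loaded")
captured = stream.getvalue()

# Clean up so the sketch does not affect other logging configuration.
root.removeHandler(handler)
```

Because the record is emitted once, by the application's handler, this arrangement avoids the duplicated messages that can appear when both a library and the application install their own handlers.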