Datasets changes
- New: Europarl Bilingual #1874 (@lucadiliello)
- New: Stanford Sentiment Treebank #1961 (@patpizio)
- New: RO-STS #1978 (@lorinczb)
- New: newspop #1871 (@frankier)
- New: FashionMNIST #1999 (@gchhablani)
- New: Common Voice #1886 (@BirgerMoell), #2063 (@patrickvonplaten)
- New: Cryptonite #2013 (@theo-m)
- New: RoSent #2011 (@gchhablani)
- New: PersiNLU reading-comprehension #2028 (@danyaljj)
- New: conllpp #1991 (@ZihanWangKi)
- New: LaRoSeDa #2004 (@MihaelaGaman)
- Update: unnecessary docstart check in conll-like datasets #2020 (@mariosasko)
- Update: semeval 2020 task 11 - add article_id and process test set template #1979 (@hemildesai)
- Update: Md gender - card update #2018 (@mcmillanmajora)
- Update: XQuAD - add Romanian #2023 (@M-Salti)
- Update: DROP - all answers #1980 (@KaijuML)
- Fix: TIMIT ASR - Make sure not only the first sample is used #1995 (@patrickvonplaten)
- Fix: Wikipedia - save memory by replacing root.clear with elem.clear #2037 (@miyamonz)
- Fix: Doc2dial update data_infos and data_loaders #2041 (@songfeng)
- Fix: ZEST - update download link #2057 (@matt-peters)
- Fix: ted_talks_iwslt - fix version error #2064 (@mariosasko)
Datasets Features
- Implement Dataset from CSV #1946 (@albertvillanova)
- Implement Dataset from JSON and JSON Lines #1943 (@albertvillanova)
- Implement Dataset from text #2030 (@albertvillanova)
- Optimize int precision for tokenization #1985 (@albertvillanova)
  - This can save 75% or more of the space when tokenizing a dataset
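To give a sense of where a saving of that size can come from, here is a minimal sketch that uses only the Python standard library. It assumes, purely for illustration, that tokenizer outputs would otherwise be stored as 64-bit integers, that token IDs fit in 32 bits, and that mask-like columns fit in 8 bits. The column names, the sample values, and the `packed_size` helper are hypothetical; the types the library actually selects are not stated in these notes.

```python
import struct

# Hypothetical tokenizer output for one short example (values are made up).
columns = {
    "input_ids": [101, 7592, 2088, 102],
    "token_type_ids": [0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1],
}

# Assumed storage widths: 64-bit everywhere vs. narrower types per column.
# "q" = 8-byte int, "i" = 4-byte int, "b" = 1-byte int (standard sizes with "<").
WIDE = {"input_ids": "q", "token_type_ids": "q", "attention_mask": "q"}
NARROW = {"input_ids": "i", "token_type_ids": "b", "attention_mask": "b"}


def packed_size(cols, formats):
    """Return the number of bytes needed to pack every column with its format."""
    total = 0
    for name, values in cols.items():
        total += len(struct.pack(f"<{len(values)}{formats[name]}", *values))
    return total


wide_bytes = packed_size(columns, WIDE)      # 3 columns * 4 values * 8 bytes
narrow_bytes = packed_size(columns, NARROW)  # 4 * 4 bytes + 4 * 1 byte + 4 * 1 byte
saving = 1 - narrow_bytes / wide_bytes

print(f"{wide_bytes} bytes -> {narrow_bytes} bytes ({saving:.0%} saved)")
```

Under these assumptions the three columns shrink from 96 to 24 bytes, a 75% reduction; a dataset with more mask-like columns would see a larger relative saving.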
General Bug fixes and improvements
- Fix ArrowWriter closes stream at exit #1971 (@albertvillanova)
- feat(docs): navigate with left/right arrow keys #1974 (@ydcjeff)
- Fix various typos/grammar in the docs #2008 (@mariosasko)
- Update format columns in Dataset.rename_columns #2027 (@mariosasko)
- Replace print with logging in dataset scripts #2019 (@mariosasko)
- Raise an error for outdated sacrebleu versions #2033 (@lhoestq)
- Not all languages have 2-digit codes #2016 (@asiddhant)
- Fix arrow memory checks issue in tests #2042 (@lhoestq)
- Support pickle protocol for dataset splits defined as ReadInstruction #2043 (@mariosasko)
- Preserve column ordering in Dataset.rename_column #2045 (@mariosasko)
- Fix text-classification tags #2049 (@gchhablani)
- Fix docstring rendering of Dataset/DatasetDict.from_csv args #2066 (@albertvillanova)
- Fix check of TF_AVAILABLE and TORCH_AVAILABLE #2073 (@philschmid)
- Add and fix docstring for NamedSplit #2069 (@albertvillanova)
- Bump huggingface_hub version #2077 (@SBrandeis)
- Fix docstring issues #2072 (@albertvillanova)