Datasets Changes
- New: iapp_wiki_qa_squad #1873 (@cstorm125)
- New: Financial PhraseBank #1866 (@frankier)
- New: CoVoST2 #1935 (@patil-suraj)
- New: TIMIT #1903 (@vrindaprabhu)
- New: Mlama (multilingual lama) #1931 (@pdufter)
- New: FewRel #1823 (@gchhablani)
- New: CCAligned Multilingual Dataset #1815 (@gchhablani)
- New: Turkish News Category Lite #1967 (@yavuzKomecoglu)
- Update: WMT - use mirror links for better download speed #1912 (@lhoestq)
- Update: multi_nli - add missing fields #1950 (@bhavitvyamalik)
- Fix: ALT - fix duplicated examples in alt-parallel #1899 (@lhoestq)
- Fix: WMT datasets - fix download errors #1901 (@YangWang92), #1902 (@lhoestq)
- Fix: QA4MRE - fix download URLs #1918 (@M-Salti)
- Fix: Wiki_dpr - fix the case where with_embeddings is False or index_name is "no_index" #1925 (@lhoestq)
- Fix: Wiki_dpr - add missing scalar quantizer #1926 (@lhoestq)
- Fix: GEM - fix the URL filtering for bad MLSUM examples in GEM #1970 (@yjernite)
Datasets Features
- Add to_dict and to_pandas for Dataset #1889 (@SBrandeis)
- Add to_csv for Dataset #1887 (@SBrandeis)
- Add keep_linebreaks parameter to text loader #1913 (@lhoestq)
- Add not-in-place implementations for several dataset transforms #1883 (@SBrandeis):
  - This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
  - The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
- Make DownloadManager downloaded/extracted paths accessible #1846 (@albertvillanova)
- Add cross-platform support for datasets-cli #1951 (@mariosasko)
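As an illustration of the new not-in-place transforms and export helpers listed above, here is a minimal sketch. The toy column names, the helper name `demo_transforms_and_exports` and the file name `toy.csv` are made up for the example. It assumes `Dataset.from_dict` is available to build a small in-memory dataset.

```python
import os
import tempfile


def demo_transforms_and_exports():
    # Imported here so the sketch only needs `datasets` when it is actually run.
    from datasets import Dataset

    # Hypothetical toy data, used only for illustration.
    ds = Dataset.from_dict({"text": ["good", "bad"], "label": [1, 0], "idx": [0, 1]})

    # Not-in-place transforms return a new Dataset; `ds` keeps its columns.
    renamed = ds.rename_column("text", "sentence")
    trimmed = renamed.remove_columns(["idx"])

    # Export helpers: plain dict of lists, pandas DataFrame, and a CSV file.
    as_dict = trimmed.to_dict()
    as_frame = trimmed.to_pandas()
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "toy.csv")
        trimmed.to_csv(csv_path)
        csv_written = os.path.exists(csv_path)

    return ds.column_names, as_dict, as_frame, csv_written
```

Calling `demo_transforms_and_exports()` should show that the original dataset still has its `text` and `idx` columns, while the transformed copy exposes `sentence` and `label` only.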
Metrics Changes
Offline loading
- Handle timeouts #1952 (@lhoestq)
- Add datasets full offline mode with HF_DATASETS_OFFLINE #1976 (@lhoestq)
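A minimal sketch of turning on the full offline mode from Python. It assumes that the library reads `HF_DATASETS_OFFLINE` when it is imported, so the variable is set first. The helper name `load_cached` is hypothetical, and the dataset it loads is expected to be in the local cache already.

```python
import os

# Set the flag before `datasets` is imported (assumption: it is read at import time).
os.environ["HF_DATASETS_OFFLINE"] = "1"


def load_cached(name):
    # Imported lazily so the environment variable above is already in place.
    from datasets import load_dataset

    # With offline mode on, this is expected to rely on the local cache only.
    return load_dataset(name)
```

Setting the variable in the shell before launching the script (for example `HF_DATASETS_OFFLINE=1 python script.py`) should have the same effect and avoids any import-order concerns.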
General improvements and bugfixes
- Replace flatten_nested #1879 (@albertvillanova)
- Add missing info on how to add large files #1885 (@stas00)
- Docs for adding new column on formatted dataset #1888 (@lhoestq)
- Fix PandasArrayExtensionArray conversion to native type #1897 (@lhoestq)
- Bugfix for string_to_arrow timestamp[ns] support #1900 (@justin-yan)
- Fix to_pandas for boolean ArrayXD #1904 (@lhoestq)
- Fix logging imports and make all datasets use library logger #1914 (@albertvillanova)
- Standardize datasets dtypes #1921 (@justin-yan)
- Remove unused py_utils objects #1916 (@albertvillanova)
- Fix save_to_disk with relative path #1923 (@lhoestq)
- Update old cards #1928 (@mcmillanmajora)
- Improve typing and style and fix some inconsistencies #1929 (@mariosasko)
- Fix builder config creation with data_dir #1932 (@lhoestq)
- Disallow ClassLabel with no names #1938 (@lhoestq)
- Update documentation with not-in-place transforms and update DatasetDict #1947 (@lhoestq)
- Documentation for to_csv, to_pandas and to_dict #1953 (@lhoestq)
- Fix typos and grammar #1955 (@stas00)
- Fix unused arguments #1962 (@mariosasko)
- Fix metrics collision in separate multiprocessed experiments #1966 (@lhoestq)