Datasets Changes
- New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
- New: KLUE benchmark #2416 (@jungwhank)
- New: HendrycksTest #2370 (@andyzoujm)
- Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
- Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
- Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs #2445 (@lewtun)
- Fix: flores - fix download link #2448 (@mariosasko)
Datasets Features
- Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
- Support sliced list arrays in cast #2461 (@lhoestq)
  - Dataset.cast can now change the feature types of Sequence fields
- Revert default in-memory for small datasets #2460 (@albertvillanova)
  Breaking:
  - we used to set the datasets IN_MEMORY_MAX_SIZE to 250MB
  - we changed this to zero: by default, datasets are loaded from disk with memory mapping and are not copied into memory
  - users can still set keep_in_memory=True when loading a dataset to load it in memory
Datasets Cards
- Add license information for DailyDialog #2419 (@aditya2211)
- Add English language tags for ~100 datasets #2442 (@VictorSanh)
- Add copyright info to MLSUM dataset #2427 (@PhilipMay)
- Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
- Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)
General improvements and bug fixes
- Add DOI badge to README #2411 (@albertvillanova)
- Make datasets PEP-561 compliant #2417 (@SBrandeis)
- Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
- Fix CI six installation on linux #2432 (@lhoestq)
- Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
- Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
- doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
- Add utf-8 encoding while reading README #2418 (@bhavitvyamalik)
- Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
- Rename config and environment variable for in memory max size #2454 (@albertvillanova)
- Add version-specific BibTeX #2430 (@albertvillanova)
- Fix cross-reference typos in documentation #2456 (@albertvillanova)
- Better error message when using the wrong load_from_disk #2437 (@lhoestq)