Dataset Changes
- New: NLU evaluation data #2238 (@dkajtoch)
- New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
- New: Bbaw egyptian #2290 (@phiwi)
- New: GooAQ #2260 (@bhavitvyamalik)
- New: SubjQA #2302 (@lewtun)
- New: Ascent KB #2341, #2349 (@phongnt570)
- New: HLGD #2325 (@tingofurro)
- New: Qasper #2346 (@cceyda)
- New: ConvQuestions benchmark #2372 (@PhilippChr)
- Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
- Update: multi_woz_v22 - update checksum #2281 (@lhoestq)
- Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
- Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
- Update: GEM - Update the DART file checksums in GEM #2334 (@yjernite)
- Update: web_science - fixed download link #2338 (@bhavitvyamalik)
- Update: SNLI, MNLI - README updated for SNLI, MNLI #2364 (@bhavitvyamalik)
- Update: conll2003 - correct labels #2369 (@philschmid)
- Update: offenseval_dravidian - update citations #2385 (@adeepH)
- Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
- Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
- Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
- Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset #2397 (@albertvillanova)
- Fix: head_qa - Fix keys #2408 (@lhoestq)
Dataset Features
- Implement Dataset add_item #1870 (@albertvillanova)
- Implement Dataset add_column #2145 (@albertvillanova)
- Implement Dataset to JSON #2248, #2352 (@albertvillanova)
- Add rename_columnS method #2312 (@SBrandeis)
- add desc to tqdm in Dataset.map() #2374 (@bhavitvyamalik)
- Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)
Metric Changes
- New: CUAD metrics #2273 (@bhavitvyamalik)
- New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
- Update: CER - Docs, CER above 1 #2342 (@borisdayma)
General improvements and bug fixes
- Update black #2265 (@lhoestq)
- Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
- Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
- Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
- Fix query table with iterable #2269 (@lhoestq)
- Perform minor refactoring: use config #2253 (@albertvillanova)
- Update format, fingerprint and indices after add_item #2254 (@lhoestq)
- Always update metadata in arrow schema #2274 (@lhoestq)
- Make tests run faster #2266 (@lhoestq)
- Fix metadata validation with config names #2286 (@lhoestq)
- Fixed typo seperate->separate #2292 (@laksh9950)
- Allow collaborators to self-assign issues #2289 (@albertvillanova)
- Mapping in the distributed setting #2298 (@TevenLeScao)
- Fix conda release #2309 (@lhoestq)
- Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
- Set default name in init_dynamic_modules #2320 (@albertvillanova)
- Fix duplicate keys #2333 (@lhoestq)
- Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
- Metadata validation #2107 (@theo-m)
- Add Validation For README #2121 (@gchhablani)
- Fix overflow issue in interpolation search #2336 (@mariosasko)
- Datasets cli improvements #2315 (@mariosasko)
- Add key type and duplicates verification with hashing #2245 (@NikhilBartwal)
- More consistent copy logic #2340 (@mariosasko)
- Update README validation rules #2353 (@gchhablani)
- normalized TOCs and titles in data cards #2355 (@yjernite)
- simplify faiss index save #2351 (@Guitaricet)
- Allow "other-X" in licenses #2368 (@gchhablani)
- Improve ReadInstruction logic and update docs #2261 (@mariosasko)
- Disallow duplicate keys in yaml tags #2379 (@lhoestq)
- maintain YAML structure reading from README #2380 (@bhavitvyamalik)
- add dataset card title #2381 (@bhavitvyamalik)
- Add tests for dataset cards #2348 (@gchhablani)
- Improve example in rounding docs #2383 (@mariosasko)
- Paperswithcode dataset mapping #2404 (@julien-c)
- Free datasets with cache file in temp dir on exit #2403 (@mariosasko)
Experimental and work in progress: Format a dataset for specific tasks
- Task formatting for text classification & question answering #2255 (@SBrandeis)
- Add check for task templates on dataset load #2390 (@lewtun)
- Add args description to DatasetInfo #2384 (@lewtun)
- Improve task api code quality #2376 (@mariosasko)