Datasets Changes
- New: C4 #2575 #2592 (@lhoestq)
- New: mC4 #2576 (@lhoestq)
- New: MasakhaNER #2465 (@dadelani)
- New: Eduge #2492 (@enod)
- Update: xor_tydi_qa - update version #2455 (@cccntu)
- Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
- Update: udpos - change features structure #2466 (@jerryIsHere)
- Update: WebNLG - update checksums #2558 (@lhoestq)
- Fix: climate fever - adjust indexing for the labels #2464 (@drugilsberg)
- Fix: proto_qa - fix download link #2463 (@mariosasko)
- Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
- Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
- Fix: code_search_net - fix keys #2555 (@lhoestq)
- Fix: discofuse - fix link cc #2541 (@VictorSanh)
- Fix: fever - fix keys #2557 (@lhoestq)
Datasets Features
- Dataset Streaming #2375 #2582 (@lhoestq)
  - Download and process your data on the fly while iterating over your dataset
  - Works with huge datasets like OSCAR, C4 and mC4, as well as hundreds of other datasets
- JAX integration #2502 (@lhoestq)
- Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
- Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
- Set configurable downloaded datasets path #2488 (@albertvillanova)
- Set configurable extracted datasets path #2487 (@albertvillanova)
- Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
- Add interleave_datasets for map-style datasets #2568 (@lhoestq)
- Add load_dataset_builder #2500 (@mariosasko)
- Support Zstandard compressed files #2578 (@albertvillanova)
Task templates
- Add task templates for tydiqa and xquad #2518 (@lewtun)
- Insert text classification template for Emotion dataset #2521 (@lewtun)
- Add summarization template #2529 (@lewtun)
- Add task template for automatic speech recognition #2533 (@lewtun)
- Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
- Inject templates for ASR datasets #2565 (@lewtun)
General improvements and bug fixes
- Allow using tqdm>=4.50.0 #2482 (@lhoestq)
- Use gc.collect only when needed to avoid slowdowns #2483 (@lhoestq)
- Allow latest pyarrow version #2490 (@albertvillanova)
- Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
- Add Zenodo metadata file with license #2501 (@albertvillanova)
- Add tensorflow-macos support #2493 (@slayerjain)
- Keep original features order #2453 (@albertvillanova)
- Add course banner #2506 (@sgugger)
- Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
- Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
- Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
- Improve performance of pandas arrow extractor #2519 (@albertvillanova)
- Fix fingerprint when moving cache dir #2509 (@lhoestq)
- Replace bad n>1M size tag #2527 (@lhoestq)
- Fix dev version #2531 (@lhoestq)
- Sync with transformers disabling NOTSET #2534 (@albertvillanova)
- Fix logging levels #2544 (@albertvillanova)
- Add support for Split.ALL #2259 (@mariosasko)
- Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
- Make numpy arrow extractor faster #2505 (@lhoestq)
- Fix Dataset.map when num_proc > number of rows #2566 (@connor-mccarthy)
- Add ASR task and new languages to resources #2567 (@lewtun)
- Filter expected warning log from transformers #2571 (@albertvillanova)
- Fix BibTeX entry #2579 (@albertvillanova)
- Fix Counter import #2580 (@albertvillanova)
- Add aiohttp to tests extras require #2587 (@albertvillanova)
- Add language tags #2590 (@lewtun)
- Support pandas 1.3.0 read_csv #2593 (@lhoestq)
Dataset cards
- Updated Dataset Description #2420 (@binny-mathew)
- Update DatasetMetadata and ReadMe #2436 (@gchhablani)
- CRD3 dataset card #2515 (@wilsonyhlee)
- Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
- wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)
Docs
- no s at load_datasets #2479 (@julien-c)
- Fix docs custom stable version #2477 (@albertvillanova)
- Improve Features docs #2535 (@albertvillanova)
- Update README.md #2414 (@cryoff)
- Fix FileSystems documentation #2551 (@connor-mccarthy)
- Minor fix in loading metrics docs #2562 (@albertvillanova)
- Minor fix docs format for bertscore #2570 (@albertvillanova)
- Add streaming in load a dataset docs #2574 (@lhoestq)