New documentation
- New documentation structure #2718 (@stevhliu):
- New: Tutorials
- New: How-to guides
- New: Conceptual guides
- Update: Reference
See the new documentation here!
Datasets changes
- New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
- New: The Pile books3 #2801 (@richarddwang)
- New: The Pile stack exchange #2803 (@richarddwang)
- New: The Pile openwebtext2 #2802 (@richarddwang)
- New: Food-101 #2804 (@nateraw)
- New: Beans #2809 (@nateraw)
- New: cedr #2796 (@naumov-al)
- New: cats_vs_dogs #2807 (@nateraw)
- New: MultiEURLEX #2865 (@iliaschalkidis)
- New: BIOSSES #2881 (@bwang482)
- Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
- Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
- Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
- Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
- Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
- Update: SUPERB - Add SD task #2661 (@albertvillanova)
- Update: SUPERB - Add KS task #2783 (@anton-l)
- Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
- Update: Openwebtext - update size #2857 (@lhoestq)
- Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
- Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
- Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
- Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
- Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
- Fix: linnaeus - fix url #2852 (@lhoestq)
- Fix: ToTTo - fix data URL #2864 (@albertvillanova)
- Fix: wikicorpus - fix keys #2844 (@lhoestq)
- Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
- Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)
Datasets features
- Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
- Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
- Add multi-proc in to_json #2747 (@bhavitvyamalik)
- Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
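
The idea behind "only compute the indices to keep" (#2836) can be illustrated with a minimal, standard-library-only sketch. This is not the library's implementation: `indices_to_keep` and `IndexedView` are hypothetical names made up for illustration, and the rows are plain dictionaries rather than an Arrow table. The point is that filtering only records which positions pass the predicate, and reads go through that index list instead of copying rows.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def indices_to_keep(rows: List[Row], predicate: Callable[[Row], bool]) -> List[int]:
    """Return the positions of the rows that satisfy the predicate (no row is copied)."""
    return [i for i, row in enumerate(rows) if predicate(row)]


class IndexedView:
    """Hypothetical read-only view over `rows`, restricted to `indices`."""

    def __init__(self, rows: List[Row], indices: List[int]) -> None:
        self._rows = rows
        self._indices = indices

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int) -> Row:
        # Translate the view position into a position in the underlying data.
        return self._rows[self._indices[i]]


rows = [
    {"text": "a", "label": 0},
    {"text": "b", "label": 1},
    {"text": "c", "label": 1},
]
keep = indices_to_keep(rows, lambda row: row["label"] == 1)
view = IndexedView(rows, keep)
```

Here the filtered result shares storage with the original rows, so the cost of filtering is one pass over the data plus a list of integers.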
Dataset streaming - better support for compression:
- Fix streaming zip files #2798 (@albertvillanova)
- Support streaming tar files #2800 (@albertvillanova)
- Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
- Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
- Add url prefix convention for many compression formats #2822 (@lhoestq)
- Support streaming datasets that use pathlib #2874 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
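
As a rough illustration of what streaming a compressed file means, the sketch below decompresses lazily while iterating, without extracting the file to disk first. It uses only the Python standard library and is not how the library implements streaming; `OPENERS` and `stream_lines` are hypothetical names, and choosing the codec from the file extension is an assumption made for the example. lz4 and zst are left out to keep the sketch dependent only on long-standing standard-library modules.

```python
import bz2
import gzip
import io
import lzma
from typing import BinaryIO, Iterator

# Hypothetical mapping from file extension to a stdlib decompressing opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def stream_lines(fileobj: BinaryIO, filename: str) -> Iterator[str]:
    """Yield decoded lines one at a time, decompressing on the fly if needed."""
    for ext, opener in OPENERS.items():
        if filename.endswith(ext):
            # Wrap the raw byte stream; data is decompressed as it is read.
            fileobj = opener(fileobj, "rb")
            break
    for raw in fileobj:
        yield raw.decode("utf-8").rstrip("\n")


# An in-memory buffer stands in for a remote file object.
payload = gzip.compress(b"first\nsecond\n")
lines = list(stream_lines(io.BytesIO(payload), "data.txt.gz"))
```

Because `stream_lines` is a generator, a consumer can stop after the first few lines without reading the rest of the file.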
Metrics changes
- Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
- Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)
Dataset cards
- Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
- Update ELI5 README.md #2848 (@odellus)
General improvements and bug fixes
- Update release instructions #2740 (@albertvillanova)
- Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
- Allow PyArrow from source #2769 (@patrickvonplaten)
- fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
- Fix typo in test_dataset_common #2790 (@nateraw)
- Fix type hint for data_files #2793 (@albertvillanova)
- Bump tqdm version #2814 (@mariosasko)
- Use packaging to handle versions #2777 (@albertvillanova)
- Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
- Rename The Pile subsets #2817 (@lhoestq)
- Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
- Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
- Fix extraction protocol inference from urls with params #2843 (@lhoestq)
- Fix caching when moving script #2854 (@lhoestq)
- Fix windows CI CondaError #2855 (@lhoestq)
- fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
- Update column_names showed as :func: in exploring.st #2851 (@ClementRomac)
- Fix s3fs version in CI #2858 (@lhoestq)
- Fix three typos in two files for documentation #2870 (@leny-mi)
- Move checks from _map_single to map #2660 (@mariosasko)
- fix regex to accept negative timezone #2847 (@jadermcs)
- Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
- Fix null sequence encoding #2900 (@lhoestq)
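
The behavior described in #2856 above (remove a URL's query string only if it is ?dl=1) can be sketched with the standard library as follows. This is an illustration under assumptions, not the code from that pull request: `strip_dl_query` is a hypothetical helper name and the example.com URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_dl_query(url: str) -> str:
    """Drop the query string only when it is exactly "dl=1"; leave other URLs untouched."""
    parts = urlsplit(url)
    if parts.query == "dl=1":
        parts = parts._replace(query="")
    return urlunsplit(parts)


cleaned = strip_dl_query("https://example.com/data.zip?dl=1")
untouched = strip_dl_query("https://example.com/data.zip?version=2")
```

Keeping every other query string intact matters when the parameters are part of the resource address, such as a version or an access token.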