Dataset changes
- New: CaSiNo #2867 (@kushalchawla)
- New: Mostly Basic Python Problems #2893 (@lvwerra)
- New: OpenAI's HumanEval #2897 (@lvwerra)
- New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
- New: SEDE #2942 (@Hazoom)
- New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
- New: AMI #2853 (@cahya-wirawan)
- New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
- New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
- New: KanHope #2985 (@adeepH)
- New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
- New: SwedMedNER #2940 (@bwang482)
- New: SberQuAD #3039 (@Alenush)
- New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
- New: Greek Legal Code #2966 (@christospi)
- New: Story Cloze Test #3067 (@zaidalyafeai)
- Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
- Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
- Update: TriviaQA - add web and wiki config #2949 (@shirte)
- Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
- Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
- Update: Biosses - fix column names #3054 (@bwang482)
- Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
- Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
- Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
- Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
- Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
- Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)
Metric changes
- Update: meteor - update from nltk update #2946 (@lhoestq)
- Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
- Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
- Fix meteor metric for version >= 3.6.4 #3056 (@albertvillanova)
Dataset features
- Use with TensorFlow:
- Adding `to_tf_dataset` method #2731 #2931 #2951 #2974 (@Rocketknight1)
- Better support for ZIP files:
- Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
- Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
- Streaming improvements:
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Add `remove_columns` to `IterableDataset` #3030 (@cccntu)
- All the above ZIP features also work in streaming mode
- New utilities:
- Replace script_version with revision #2933 (@albertvillanova)
- The `script_version` parameter in `load_dataset` is now deprecated in favor of `revision`
- Experimental - Create Audio feature type #2324 (@albertvillanova):
- It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
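
For the `script_version` deprecation listed under "New utilities", callers that still build keyword arguments with the old name may want a small shim while migrating. The following is only a sketch of that idea, not the library's own implementation: `migrate_script_version` is a hypothetical helper, and the dataset name `"squad"` and the value `"1.12.1"` are placeholder assumptions.

```python
import warnings


def migrate_script_version(kwargs):
    """Hypothetical helper: rename a deprecated `script_version` keyword to `revision`."""
    migrated = dict(kwargs)  # copy so the caller's dict is left untouched
    if "script_version" in migrated:
        if "revision" in migrated:
            raise ValueError("Pass only `revision`, not both `revision` and `script_version`.")
        warnings.warn(
            "`script_version` is deprecated; use `revision` instead.",
            FutureWarning,
        )
        migrated["revision"] = migrated.pop("script_version")
    return migrated


old_call = {"script_version": "1.12.1"}  # assumed placeholder value
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    new_call = migrate_script_version(old_call)

print(new_call)
# The migrated keywords could then be forwarded (requires `datasets`), e.g.:
#   from datasets import load_dataset
#   dataset = load_dataset("squad", **new_call)  # "squad" is an assumed example
```

New code can skip the shim entirely and pass `revision=` to `load_dataset` directly.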
Dataset cards
- Add arxiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)
Documentation
General improvements and bug fixes
- Fix filter leaking #3019 (@lhoestq)
- calling `filter` several times in a row was not returning the right results in 1.12.0 and 1.12.1
- Update BibTeX entry #2928 (@albertvillanova)
- Fix exception chaining #2911 (@albertvillanova)
- Add regression test for null Sequence #2929 (@albertvillanova)
- Don't use old, incompatible cache for the new `filter` #2947 (@lhoestq)
- Fix fn kwargs in filter #2950 (@lhoestq)
- Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
- Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
- Fix missing conda deps #2952 (@lhoestq)
- Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
- Support pandas 1.3 new `read_csv` parameters #2960 (@SBrandeis)
- Fix CI doc build #2961 (@albertvillanova)
- Run tests in parallel #2954 (@albertvillanova)
- Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
- Take namespace into account in caching #2938 (@lhoestq)
- Make Dataset.map accept list of np.array #2990 (@albertvillanova)
- Fix loading compressed CSV without streaming #2994 (@albertvillanova)
- Fix json loader when conversion not implemented #3000 (@lhoestq)
- Remove all query parameters when extracting protocol #2996 (@albertvillanova)
- Correct a typo #3007 (@Yann21)
- Fix Windows test suite #3025 (@albertvillanova)
- Remove unused parameter in xdirname #3017 (@albertvillanova)
- Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
- Fix typo #3023 (@qqaatw)
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
- Use cache folder for lockfile #2887 (@Dref360)
- Fix streaming: catch Timeout error #3050 (@borisdayma)
- Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
- Fix task reloading from cache #3059 (@lhoestq)
- Fix test command after refac #3065 (@lhoestq)
- Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
- Update summary on PyPI beyond NLP #3062 (@thomwolf)
- Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
- feat: increase streaming retry config #3068 (@borisdayma)
- Fix pathlib patches for streaming #3072 (@lhoestq)
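
The filter leaking fix at the top of this list concerns consecutive `filter` calls. A quick way to check an installation might look like the sketch below, which compares two chained filters against a plain-Python reference. The function names and sample values are illustrative assumptions; `Dataset.from_dict` and `Dataset.filter` are the only library calls used, and they are imported lazily so the reference part runs without `datasets` installed.

```python
def expected_after_two_filters(values):
    """Plain-Python reference: keep even values, then keep values greater than 4."""
    evens = [v for v in values if v % 2 == 0]
    return [v for v in evens if v > 4]


def chained_filter_with_datasets(values):
    """Apply the same two predicates with consecutive Dataset.filter calls."""
    from datasets import Dataset  # lazy import: only needed for this check

    ds = Dataset.from_dict({"x": values})
    ds = ds.filter(lambda example: example["x"] % 2 == 0)
    ds = ds.filter(lambda example: example["x"] > 4)
    return list(ds["x"])


values = list(range(10))
reference = expected_after_two_filters(values)
print(reference)
# With `datasets` installed, a release that includes the fix should satisfy:
#   chained_filter_with_datasets(values) == reference
```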
Breaking changes:
- Due to the big refactoring in #2986, the `prepare_module` function no longer supports the `return_resolved_file_path` and `return_associated_base_path` parameters. As an alternative, you may use `dataset_module_factory` instead.