multilingual coref!
- Added models which cover several different languages: one for combined Germanic and Romance languages, and one for the Slavic languages available in UDCoref #1406
new features
- streamlit visualizer for semgrex/ssurgeon #1396
- updates to the constituency parser ensemble #1387
- accuracy improvements to the IN_ORDER oracle #1391
- Split-only MWT model: it can only split a token, so it cannot hallucinate text, as sometimes happens for OOV words. Currently available for EN and HE #1417 #1419
- download_method=None now turns off HF downloads as well, for use on machines with no internet access #1408 #1399
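
A minimal sketch of an offline setup using the download_method option above, assuming the models for the language are already in the local Stanza resources directory. The helper names `offline_pipeline_kwargs` and `build_offline_pipeline` are hypothetical and exist only for illustration; the default processor list is also an assumption.

```python
def offline_pipeline_kwargs(lang, processors="tokenize,mwt"):
    """Keyword arguments for a stanza.Pipeline that should not download anything."""
    return {
        "lang": lang,
        "processors": processors,
        # None turns off resource downloads, and HF downloads as well per the note above
        "download_method": None,
    }


def build_offline_pipeline(lang="en"):
    # Imported lazily so the helper above can be used without stanza installed.
    import stanza

    # Assumes the models for `lang` were downloaded ahead of time.
    return stanza.Pipeline(**offline_pipeline_kwargs(lang))
```

Calling `build_offline_pipeline("he")` on a machine with no network should then either load the local models or fail without attempting a download.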
new models
- Spanish combined models #1395
- Add IACLT knesset to the HE combined models
- NER based on IACLT
- XCL (Classical Armenian) models with word vectors from Caval
bugfixes
- update tqdm usage to remove some duplicate code: #1413 3de69ca
- Added a long list of incorrectly tokenized Spanish words directly to the combined Spanish training data to improve their tokenization: #1410
- Occasionally train the tokenizer with the sentence-final punctuation of a batch removed. This helps the tokenizer avoid learning to always split off the last character, whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0
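
The punctuation-dropping idea in the last item can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function names, the token-list batch format, and the `drop_prob` parameter are all assumptions made for the example.

```python
import random
import unicodedata


def is_punct(token):
    """True if every character of the token is Unicode punctuation."""
    return bool(token) and all(unicodedata.category(ch).startswith("P") for ch in token)


def maybe_drop_final_punct(batch, drop_prob, rng):
    """Occasionally return a copy of the batch without its final punctuation token.

    batch is a list of sentences, each a list of token strings (hypothetical format).
    """
    if not batch or not batch[-1] or rng.random() >= drop_prob:
        return batch
    last = batch[-1]
    # Only drop when something other than the punctuation remains in the sentence.
    if len(last) > 1 and is_punct(last[-1]):
        return batch[:-1] + [last[:-1]]
    return batch
```

With a small `drop_prob`, most batches still end in punctuation, but the model also sees some batches where the last character belongs to an ordinary word and should not be split off.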