In this release, we rebuild all of the models with UD 2.15, adding support for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish. We also add an Albanian model composed of the two available UD treebanks, and an Old English model based on a prototype dataset not yet published in UD.
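To try one of the newly supported languages, download its models and build a pipeline as usual. A minimal sketch, assuming the standard ISO code `ka` for Georgian (the example text and printed fields are illustrative):

```python
import stanza

# Fetch the Georgian models added in this release ("ka" is the
# ISO 639-1 code for Georgian) and run the default pipeline.
stanza.download("ka")
nlp = stanza.Pipeline("ka")

doc = nlp("გამარჯობა")  # any Georgian text
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos, word.lemma)
```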
Other notable changes:
- Include a contextual lemmatizer in English for `'s` -> `be` or `have` in the `default_accurate` package; see the usage sketch after this list. A Hindi (HI) model was also built, with others potentially to follow. #1422
- Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold ad1f938
- PyTorch compatibility: set `weights_only=True` when loading models; see the loading sketch after this list. #1430 #1429
- Augment MWT tokenization to accommodate unexpected `'` characters, including `’` as used in `’s`. #1437 #1436
- When training the lemmatizer, take advantage of `CorrectForm` annotations in the UD treebanks. dbdf429
- Add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: 99f7038
- Add the VLSP 2023 constituency dataset: 1159d0d
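A minimal sketch of the contextual `'s` lemmatization mentioned above, assuming the `default_accurate` package is available for English; the example sentence and expected output are illustrative, not taken from the release:

```python
import stanza

# The default_accurate package includes the contextual lemmatizer,
# which picks "be" or "have" for an ambiguous 's clitic from context.
nlp = stanza.Pipeline("en", package="default_accurate")

doc = nlp("She's finished the report, and he's a reviewer.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma)
# Illustrative expectation: the 's in "She's finished" lemmatizes to
# "have", while the 's in "he's a reviewer" lemmatizes to "be".
```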
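The PyTorch compatibility change follows the pattern below; the filename is a placeholder. Passing `weights_only=True` restricts `torch.load` to plain tensors and basic containers rather than arbitrary pickled objects:

```python
import torch

# Newer PyTorch versions warn against (and eventually default away from)
# full unpickling; weights_only=True loads only tensors and simple
# containers, avoiding the arbitrary-code-execution path.
checkpoint = torch.load("stanza_model.pt", map_location="cpu",
                        weights_only=True)
```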
Bugfixes:
- Call `raise_for_status` earlier when failing to download something, so that the proper error gets displayed; see the sketch after this list. Thank you @pattersam #1432
- Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: 53081c2
- Reset the start/end character annotations on tokens that are predicted to be MWT by the tokenizer but not processed as such by the MWT processor: 1a36efb #1436
- Similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: 215c69e
- Missing text for a Document no longer causes the NER model to crash: 0732628 #1428
- Tokenize URLs with unexpected TLDs as single tokens rather than splitting them up: f59ccd8 #1423
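The download fix follows the usual `requests` pattern, sketched below with a hypothetical helper; the point is to call `raise_for_status()` before consuming the body, so an HTTP error (404, 500, ...) surfaces with its real cause instead of a confusing downstream failure:

```python
import requests

def download_file(url, path):
    # Hypothetical helper: fail fast on HTTP error codes before any
    # bytes are written, so the underlying cause is what gets reported.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(path, "wb") as fout:
            for chunk in resp.iter_content(chunk_size=1 << 16):
                fout.write(chunk)
```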