stanza v1.11.0 (PyPI)


Training upgrades

  • It should now be possible to train all annotators on Windows: stanfordnlp/stanza-train#20 #1439 The issue was twofold: a shell call to a perl script (perl could actually be installed, but this was annoying for non-perl users) and an overreliance on temp files, which can be held open and reopened on Unix but not on Windows. Fixed in 2677e77 d5c7b7f #1514
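The temp-file pitfall above is a general one: on Unix a NamedTemporaryFile can be reopened by name while still open, but Windows keeps the handle locked. A minimal sketch of a portable pattern (the helper name is ours, not Stanza's):

```python
import os
import tempfile

def write_then_read(data: str) -> str:
    """Portable temp-file use: close the handle before reopening by name.

    Reopening a NamedTemporaryFile while it is still open works on Unix
    but raises PermissionError on Windows, so use delete=False, close
    the handle first, and clean up manually.
    """
    tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # release the handle so Windows allows a second open
        with open(tmp.name) as fin:
            return fin.read()
    finally:
        os.unlink(tmp.name)

print(write_then_read("hello"))  # prints "hello"
```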

Model upgrades

  • The tokenizer can now use a pretrained charlm. This significantly improves MWT performance on Hebrew, for example. #1511

  • Building tokenizers with a pretrained charlm exposed a possible issue where the tokenizer included spaces when an MWT was split across two words. The effect occurred in Hebrew; an English analogue would be wo n't tokenized as a single token with an embedded space. Augmenting the training data to enforce word splits across those spaces fixed the issue. 52cea78

  • Use PackedSequence in the tokenizer. This is slower, but results are now stable when using inputs of different lengths: 4433e83 #1472

  • If a tokenizer training set consistently has spaces between the ends of words and punctuation, the resulting model may not properly recognize the same text without those spaces: this is a test . vs. this is a test. Reported in #1504 Fixed for VI by 6878d8e

  • Coref now includes a zeros predictor, which predicts when a mention in certain datasets (such as Spanish) is a pro-drop mention. This is represented by adding an empty node to the sentence. It can be disabled with the coref_use_zeros=False flag to the Pipeline. #1502
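For reference, a sketch of disabling the zeros predictor. Only the coref_use_zeros flag comes from the notes above; the language code, processor list, and helper name are illustrative assumptions:

```python
# Hypothetical helper collecting Pipeline keyword arguments.  Only the
# coref_use_zeros flag is from the release notes; the language code and
# processor list are assumptions for illustration.
def coref_pipeline_kwargs(lang="es", use_zeros=False):
    return {
        "lang": lang,
        "processors": "tokenize,mwt,pos,lemma,depparse,coref",
        "coref_use_zeros": use_zeros,
    }

# With stanza installed and models downloaded, this would build the pipeline:
# import stanza
# nlp = stanza.Pipeline(**coref_pipeline_kwargs())
```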

Model improvements

  • Sindhi pipeline based on the ISRA UD dataset, published at SyntaxFest 2025, with annotation support from MLtwist: https://aclanthology.org/2025.udw-1.11/

  • Tamil coreference model from KBC

  • update English lemmatizer with more verbs and ADJ from Prof. Lapalme

  • French lemmatizer changes, also with corrections from Prof. Lapalme

  • create a German lemmatizer using GSD data and a set of ADJ from Wiktionary

  • add GRC models trained on data mixed with a diacritics-stripped copy. Because those work worse on GRC text with diacritics, the originals are still the default: 5beca58

  • add a Thai TUD dataset from https://github.com/nlp-chula/TUD (not yet included in UD): bca078c

  • NER model for ANG: 68a56aa https://github.com/dmetola/Old_English-OEDT/tree/main

Other interface improvements

  • Fix conparser SyntaxWarning: #1513 thanks to @orenl

  • improve efficiency of reading conllu documents: f15f0bc

  • sort CoNLLU features when outputting a doc, as is standard: aa20fbb

  • semgrex interface improvements: search all files, only output failed matches, process all documents at once

  • turn coref max_train_len into a parameter: 1f98d8f #1465

  • allow for combined depparse models with multiple training files in a zip file (easier to mix training data): be94ac6

  • lemmatizer can skip blank lemmas (useful when training using partially complete lemma data): 7c34714

  • if using pretokenized text in the NER, try to use the token text to extract the text (previously would crash): ab249f6

  • don't retokenize pretokenized sentences: #1466 #1464

  • remove stray test output files: 2e4735a thanks to @otakutyrant
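On the pretokenized-input fixes above: Stanza accepts pretokenized text as a list of sentences, each a list of token strings, via tokenize_pretokenized=True. A hedged sketch (the processor list is an assumption):

```python
# Pretokenized input: one list of token strings per sentence.
pretokenized = [
    ["This", "is", "a", "test", "."],
    ["Stanza", "will", "not", "retokenize", "these", "."],
]

# With stanza installed and models downloaded:
# import stanza
# nlp = stanza.Pipeline("en", processors="tokenize,ner",
#                       tokenize_pretokenized=True)
# doc = nlp(pretokenized)  # tokens are used as-is, not re-split

# Sanity check on the input shape the pipeline expects:
assert all(isinstance(sent, list) for sent in pretokenized)
assert all(isinstance(tok, str) for sent in pretokenized for tok in sent)
```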

Package dependency updates

  • remove verbose from ReduceLROnPlateau: 1015b6b thanks to @otakutyrant

  • update usage of xml.etree.ElementTree to match updated python interface: 7ca8750 thanks to @otakutyrant

  • suppress a jieba warning; the package has not been updated in many years and is not likely to fix deprecation errors any time soon. 0afdb61 thanks to @otakutyrant

  • drop support for Python 3.8: 6420c3d thanks to @otakutyrant

  • update tomli version requirement, #1444 thanks to @BLKSerene
