stanza v1.11.0 (PyPI)


Training upgrades

  • It should now be possible to train all annotators on Windows: stanfordnlp/stanza-train#20 #1439 The issue was twofold: a shell call to a perl script (perl could actually be installed, but this was annoying for non-perl users) and an overreliance on temp files, which can be held open and reopened on Unix but not on Windows. Fixed in 2677e77 d5c7b7f #1514
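The temp-file pitfall above is a general one: on Unix a NamedTemporaryFile can be reopened by name while still open, but Windows keeps the handle locked. A minimal sketch of a portable pattern (the helper name is ours, not Stanza's):

```python
import os
import tempfile

def write_then_read(data: str) -> str:
    """Portable temp-file use: close the handle before reopening by name.

    Reopening a NamedTemporaryFile while it is still open works on Unix
    but raises PermissionError on Windows, so use delete=False, close
    the handle first, and clean up manually.
    """
    tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # release the handle so Windows allows a second open
        with open(tmp.name) as fin:
            return fin.read()
    finally:
        os.unlink(tmp.name)

print(write_then_read("hello"))  # prints "hello"
```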

Model upgrades

  • The tokenizer can now use a pretrained charlm. This significantly improves MWT performance on Hebrew, for example. #1511

  • Building tokenizers with a pretrained charlm exposed a possible issue where the tokenizer included spaces when an MWT was split across two words. The effect occurred in Hebrew; an English analogue would be wo n't tokenized as a single token with an embedded space. Augmenting the training data to enforce word splits across those spaces fixed the issue. 52cea78

  • Use PackedSequence in the tokenizer. This is slower, but results are now stable when using inputs of different lengths: 4433e83 #1472

  • If a tokenizer training set consistently has spaces between the ends of words and punctuation, the resulting model may not properly recognize the same text without those spaces: this is a test . vs. this is a test. Reported in #1504 Fixed for VI by 6878d8e

  • Coref now includes a zeros predictor, which predicts when a mention in certain datasets (such as Spanish) is a pro-drop mention. This is represented by adding an empty node to the sentence. It can be disabled with the coref_use_zeros=False flag to the Pipeline. #1502
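For reference, a sketch of disabling the zeros predictor. Only the coref_use_zeros flag comes from the notes above; the language code, processor list, and helper name are illustrative assumptions:

```python
# Hypothetical helper collecting Pipeline keyword arguments.  Only the
# coref_use_zeros flag is from the release notes; the language code and
# processor list are assumptions for illustration.
def coref_pipeline_kwargs(lang="es", use_zeros=False):
    return {
        "lang": lang,
        "processors": "tokenize,mwt,pos,lemma,depparse,coref",
        "coref_use_zeros": use_zeros,
    }

# With stanza installed and models downloaded, this would build the pipeline:
# import stanza
# nlp = stanza.Pipeline(**coref_pipeline_kwargs())
```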

Model improvements

  • Sindhi pipeline based on the ISRA UD dataset, published at SyntaxFest 2025, with annotation support from MLtwist: https://aclanthology.org/2025.udw-1.11/

  • Tamil coreference model from KBC

  • update English lemmatizer with more verbs and ADJ from Prof. Lapalme

  • French lemmatizer changes, also with corrections from Prof. Lapalme

  • create a German lemmatizer using GSD data and a set of ADJ from Wiktionary

  • add GRC models trained on data mixed with a diacritics-stripped copy. Because those work worse on GRC text with diacritics, the originals are still the default: 5beca58

  • add a Thai TUD dataset from https://github.com/nlp-chula/TUD (not yet included in UD): bca078c

  • NER model for ANG: 68a56aa https://github.com/dmetola/Old_English-OEDT/tree/main

Other interface improvements

  • Fix conparser SyntaxWarning: #1513 thanks to @orenl

  • improve efficiency of reading conllu documents: f15f0bc

  • sort CoNLLU features when outputting a doc, as is standard: aa20fbb

  • semgrex interface improvements: search all files, only output failed matches, process all documents at once

  • turn coref max_train_len into a parameter: 1f98d8f #1465

  • allow for combined depparse models with multiple training files in a zip file (easier to mix training data): be94ac6

  • lemmatizer can skip blank lemmas (useful when training using partially complete lemma data): 7c34714

  • if using pretokenized text in the NER, try to use the token text to extract the text (previously would crash): ab249f6

  • don't retokenize pretokenized sentences: #1466 #1464

  • remove stray test output files: 2e4735a thanks to @otakutyrant
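On the pretokenized-input fixes above: Stanza accepts pretokenized text as a list of sentences, each a list of token strings, via tokenize_pretokenized=True. A hedged sketch (the processor list is an assumption):

```python
# Pretokenized input: one list of token strings per sentence.
pretokenized = [
    ["This", "is", "a", "test", "."],
    ["Stanza", "will", "not", "retokenize", "these", "."],
]

# With stanza installed and models downloaded:
# import stanza
# nlp = stanza.Pipeline("en", processors="tokenize,ner",
#                       tokenize_pretokenized=True)
# doc = nlp(pretokenized)  # tokens are used as-is, not re-split

# Sanity check on the input shape the pipeline expects:
assert all(isinstance(sent, list) for sent in pretokenized)
assert all(isinstance(tok, str) for sent in pretokenized for tok in sent)
```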

Package dependency updates

  • remove verbose from ReduceLROnPlateau: 1015b6b thanks to @otakutyrant

  • update usage of xml.etree.ElementTree to match updated python interface: 7ca8750 thanks to @otakutyrant

  • suppress a jieba warning; the package has not been updated in many years and is not likely to fix deprecation errors any time soon. 0afdb61 thanks to @otakutyrant

  • drop support for Python 3.8: 6420c3d thanks to @otakutyrant

  • update tomli version requirement, #1444 thanks to @BLKSerene
