stanza 1.6.1: Multiple default models and a combined EN NER model, on Python PyPI

V1.6.1 is a patch release which fixes a bug in the Arabic POS tagger.

We also mark Python 3.11 as supported in the setup.py classifiers. This will be the last release that supports Python 3.6.

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, a package dictionary is now provided for each UD dataset, covering the default versions of the models for that dataset. We do not further break these down into -fast and -accurate versions for each UD dataset.
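As a rough illustration of choosing between the three levels, the sketch below passes a level name through to the package parameter of the Pipeline. The helper names pipeline_kwargs and build_pipeline are hypothetical. The underscore spellings default_fast and default_accurate are an assumption about how the installed version names the levels written above with a hyphen, so check the resources list for your version if a name is not found.

```python
def pipeline_kwargs(lang, level="default"):
    """Build keyword arguments for stanza.Pipeline for one model level."""
    # Assumed spellings of the three levels; adjust if your version differs.
    known_levels = ("default", "default_fast", "default_accurate")
    if level not in known_levels:
        raise ValueError(f"unknown level {level!r}, expected one of {known_levels}")
    return {"lang": lang, "package": level}


def build_pipeline(lang, level="default"):
    """Create the Pipeline for the requested level (may download models)."""
    # Imported lazily so pipeline_kwargs can be used without stanza installed.
    import stanza

    return stanza.Pipeline(**pipeline_kwargs(lang, level))


def demo():
    """Tag one sentence with the faster, non-charlm POS and depparse models."""
    nlp = build_pipeline("en", "default_fast")
    doc = nlp("Stanza now ships several default model levels.")
    for word in doc.sentences[0].words:
        print(word.text, word.upos)
```

Calling demo() requires stanza and the English models. Which level is the better tradeoff depends on whether CPU speed or accuracy matters more for the workload.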

PR: #1287

addresses #1259 and #1284

Multiple output heads for one NER model

The NER models now can learn multiple output layers at once.

#1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data has only 8 classes.
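The idea of several output layers sharing one encoder can be illustrated with a toy sketch. This is not the Stanza implementation: the class MultiHeadTagger, the dimensions, and the random weights are all made up for illustration, and each head scores bare classes, whereas a real tagger would typically score position-prefixed tags. The point is only that the hidden representation is computed once and then fed to an 18-class head and an 8-class head.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return an out_dim x in_dim weight matrix with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vector):
    """Multiply a weight matrix by a vector, giving one score per output row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class MultiHeadTagger:
    """Toy model: one shared encoder layer feeding several output heads."""

    def __init__(self, input_dim, hidden_dim, head_sizes, seed=0):
        rng = random.Random(seed)
        self.encoder = make_linear(input_dim, hidden_dim, rng)
        # One output layer per tagset, all reading the same hidden vector.
        self.heads = {
            name: make_linear(hidden_dim, size, rng)
            for name, size in head_sizes.items()
        }

    def score(self, token_features):
        # The encoder runs once, no matter how many heads there are.
        hidden = apply_linear(self.encoder, token_features)
        return {name: apply_linear(w, hidden) for name, w in self.heads.items()}


tagger = MultiHeadTagger(
    input_dim=4, hidden_dim=6, head_sizes={"ontonotes": 18, "worldwide": 8}
)
scores = tagger.score([1.0, 0.5, -0.5, 0.0])
```

During training, a loss on either head would update the shared encoder, which is one plausible way for data annotated with the smaller tagset to influence predictions made with the larger one.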

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

model                  OntoNotes  WorldWide
original ontonotes         88.71      69.29
simplify-separate          88.24      75.75
simplify-connected         88.32      75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.
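A minimal sketch of selecting one of the combined models follows. The package names come from the paragraph above. The helper names, the short variant keys, and the exact shape of the Pipeline arguments (a processors dictionary with package=None) are assumptions to verify against the documentation for your installed version.

```python
# Package names as listed above, keyed by a short (hypothetical) variant label.
COMBINED_NER_PACKAGES = {
    "nocharlm": "ontonotes-combined_nocharlm",
    "charlm": "ontonotes-combined_charlm",
    "electra-large": "ontonotes-combined_electra-large",
}


def ner_pipeline_kwargs(variant="charlm"):
    """Build keyword arguments for an English tokenize + NER Pipeline."""
    if variant not in COMBINED_NER_PACKAGES:
        raise ValueError(f"unknown variant {variant!r}")
    return {
        "lang": "en",
        # Assumed argument shape: name the model to use for each processor.
        "processors": {"tokenize": "default", "ner": COMBINED_NER_PACKAGES[variant]},
        "package": None,
    }


def print_entities(text, variant="charlm"):
    """Run NER on text and print each entity with its type."""
    # Imported lazily so the helper above can be used without stanza installed.
    import stanza

    nlp = stanza.Pipeline(**ner_pipeline_kwargs(variant))
    for ent in nlp(text).ents:
        print(ent.text, ent.type)
```

The nocharlm variant is likely the cheapest to run and the Electra variant the most expensive, so the choice mirrors the speed and accuracy tradeoff of the default package levels.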

Future plans include using multiple NER datasets for other models as well.

Other features

Postprocessing of the proposed tokenization is now possible with dependency injection on the Pipeline (ty @Jemoka). When creating a Pipeline, you can provide a callable via the tokenize_postprocessor parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the Pipeline. #1290
Finetuning for transformers in the NER models, although we have not yet found helpful settings: 45ef544
SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code #1279 88cd0df
charlm for PT (improves accuracy on non-transformer models): c10763d
build models with transformers for a few additional languages: MR, AR, PT, JA 45b3875 0f3761e c55472a c10763d
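To make the tokenize_postprocessor feature listed above more concrete, here is a hedged sketch of a callable that splits hyphenated tokens. The input and output shape is an assumption: a list of sentences, each a list of tokens that are either plain strings or (text, flag) tuples, with tuples passed through untouched. The function name split_hyphens is hypothetical, so check the tokenization documentation for the exact contract before relying on it.

```python
import re


def split_hyphens(sentences):
    """Split tokens such as 'well-known' into 'well', '-', 'known'.

    Assumes `sentences` is a list of sentences, each a list of tokens.
    Only plain string tokens are split; anything else is kept as-is.
    """
    result = []
    for sentence in sentences:
        new_sentence = []
        for token in sentence:
            if isinstance(token, str) and "-" in token and len(token) > 1:
                # The capturing group keeps the hyphens as separate pieces;
                # empty pieces (from leading or doubled hyphens) are dropped.
                new_sentence.extend(p for p in re.split(r"(-)", token) if p)
            else:
                new_sentence.append(token)
        result.append(new_sentence)
    return result


def build_pipeline_with_postprocessor():
    """Create a tokenize-only Pipeline that applies split_hyphens."""
    # Imported lazily so split_hyphens can be tested without stanza installed.
    import stanza

    return stanza.Pipeline(
        "en", processors="tokenize", tokenize_postprocessor=split_hyphens
    )
```

Splitting was chosen over merging for the example because the pieces still cover exactly the same characters as the original token, which keeps the result aligned with the input text.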

Bugfixes

V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: b56f442
Scenegraph CoreNLP connection needed to be checked before sending messages: stanfordnlp/CoreNLP#1346 (comment) c71bf3f
run_ete.py was not correctly processing the charlm, meaning the whole thing wouldn't actually run 16f29f3
Chinese NER model was pointing to the wrong pretrain #1285 82a0215
