stanza 1.8.1
PEFT Integration (with bugfixes)

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the defaults in the default_accurate package.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.
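
These models can be loaded through the standard package mechanism. A minimal sketch, assuming the English models (other languages work the same way):

```python
import stanza

# Download and load the default_accurate package, which now ships
# PEFT-finetuned transformers for several annotators
stanza.download("en", package="default_accurate")
nlp = stanza.Pipeline("en", package="default_accurate")
doc = nlp("The finetuned models are much smaller thanks to PEFT.")
```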

Model improvements

  • POS tagger trained with a split optimizer, using separate settings for the transformer and non-transformer parameters. Unfortunately, we did not find settings which consistently improved results. #1320
  • Sentiment classifier trained with PEFT on the transformer: noticeably improves results for each model. SST scores go from 68 F1 with the charlm, to 70 F1 with a frozen transformer, to 74-75 F1 with a fully finetuned or PEFT-finetuned transformer. #1335 (a general LoRA sketch appears after this list)
  • NER also trained with PEFT: unfortunately, no consistent improvements to scores. #1336
  • depparse includes PEFT: no consistent improvements yet. #1337 #1344
  • Dynamic oracle for the top-down constituency parser transition scheme: noticeable improvement in the scores for the top-down parser. #1341
  • Constituency parser uses PEFT: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituency models. Example improvement: 87.01 to 88.11 F1 on the ID_ICON dataset. #1347
  • Scripts to build a silver dataset for the constituency parser, filtering sentences by model agreement among the sub-models of the ensembles used. Preliminary work indicates an improvement in the benefit of the silver trees, with more work needed to find the optimal parameters for building the silver dataset. #1348
  • Lemmatizer ignores goeswith words when training: this eliminates tokens which are a single word with a single lemma but are split into two words in the UD training data. A typical example is split email addresses in the EWT training set. #1346 #1345
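
For readers unfamiliar with PEFT, the sketch below shows the general pattern of wrapping a transformer with LoRA adapters via the peft library. The model name and hyperparameter values are illustrative, not Stanza's actual training configuration:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

encoder = AutoModel.from_pretrained("bert-base-cased")  # illustrative choice
lora_config = LoraConfig(
    r=8,                                # adapter rank (hypothetical value)
    lora_alpha=16,                      # scaling factor (hypothetical value)
    target_modules=["query", "value"],  # BERT attention projections
    lora_dropout=0.1,
)
encoder = get_peft_model(encoder, lora_config)
# Only the small adapter matrices are trainable; the base weights stay
# frozen, so the saved model is a fraction of the full transformer's size
encoder.print_trainable_parameters()
```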

Features

  • Include SpacesAfter annotations on words in the CoNLL output of documents (see the example after this list): #1315 #1322
  • Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. #1331 #1330
  • wandb support for coref #1338
  • Coref annotator breaks length ties using POS if available #1326 c4c3de5
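
A quick sketch of producing CoNLL output, where the SpacesAfter annotations now appear; the text and filename here are arbitrary:

```python
import stanza
from stanza.utils.conll import CoNLL

nlp = stanza.Pipeline("en", processors="tokenize")
doc = nlp("Hello,  world!")  # note the double space after the comma
print("{:C}".format(doc))    # CoNLL output, with SpacesAfter in MISC
CoNLL.write_doc2conll(doc, "doc.conllu")  # or write the same output to a file
```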

Bugfixes

  • Using a proxy with download_resources_json was broken: #1318 #1317 Thank you @ider-zh
  • Fix deprecation warnings for escape sequences: #1321 #1293 Thank you @sterliakov
  • Coref training rounding error #1342
  • Top-down constituency models were broken for datasets which did not use ROOT as the top-level bracket; in practice this only affected DA_Arboretum. #1354
  • First version of chopping longer texts into shorter pieces so the transformers can work around their length limits. It is not yet clear whether this actually produces reasonable results for words after the token limit. #1350 #1294
  • Coref prediction had an off-by-one error for short sentences which falsely threw an exception at sentence breaks: #1333 #1339 f1fbaaa
  • Clarify error when a language is only partially handled: da01644 #1310

Additional 1.8.1 Bugfixes

  • Older POS models were not loaded correctly; fixed by using .get(). 13ee3d5 #1357
  • Debug logging for the constituency retag pipeline, to better support ongoing work on Icelandic. 6e2520f #1356
  • The device arg in MultilingualPipeline would crash if a device was also passed for an individual Pipeline (see the sketch below): 44058a0
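
A minimal sketch of the fixed behavior, assuming the device keyword shown here and an available CUDA device; the language configs are illustrative:

```python
from stanza.pipeline.multilingual import MultilingualPipeline

# device can now be set both for the MultilingualPipeline as a whole and
# inside the config of an individual language's Pipeline without crashing
lang_configs = {"en": {"processors": "tokenize,pos", "device": "cuda:0"}}
nlp = MultilingualPipeline(lang_configs=lang_configs, device="cuda:0")
docs = nlp(["This is an English sentence.", "C'est une phrase française."])
```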
