Ssurgeon interface

The headline feature of this release is the initial version of Ssurgeon, a rule-based dependency graph editing tool. Together with the existing Semgrex integration with CoreNLP, Ssurgeon allows dependencies to be rewritten, for example in the UD datasets. More information is in the GURT 2023 paper, https://aclanthology.org/2023.tlt-1.7/

Beyond Ssurgeon, there are two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments. The results of those experiments ranged from "ineffective" to "small improvements", and all of them are available for people to experiment with.

CoreNLP integration:

Ssurgeon interface! New interface allows for editing of dependency graphs using Semgrex patterns and Ssurgeon rules. #1205 https://aclanthology.org/2023.tlt-1.7/
English Morphology class (deterministic English lemmatizer) 6aed177
English constituency -> dependency converter 0987794

Bugfixes:

Bugfix for older versions of torch: 376d7ea
Bugfix for training (integration with new scoring script) #1167 9c39636
Demo was showing constituency parser along with dependency parsing, even with conparse off: cbc13b0
Replace absurdly long characters with UNK (thank you @khughitt) #1137 #1140
Package all relevant pretrains into default.zip - otherwise pretrains used by NER models which are not the default pretrain were being missed. 435685f
stanza-train NER training bugfix (wrong pretrain): 2757cb4
Pass around device everywhere instead of calling cuda(). This should fix models occasionally being split over multiple devices. It would also allow for use of MPS, but the current torch implementation for MPS is buggy #1209 #1159
Fix error in preparing tokenizer datasets (thanks @dvzubarev): #1161
Fix unnecessary slowness in preparing tokenizer datasets (again, thanks @dvzubarev): #1162
Fix using the correct pretrain when rebuilding POS tags for a Depparse dataset (again, thanks @dvzubarev): #1170
When using the tregex interface to CoreNLP, add parse if it isn't already there (again, depparse was being confused with parse): b118473
Update use of emoji to match latest releases: #1195 ea345a8

Features:

Mechanism for resplitting tokens into MWT #95 8fac17f
CLI for tokenizing text into one paragraph per line, whitespace separated (useful for Glove, for example) cfd44d1
detach().cpu() speeds things up significantly in some cases ccfbc56
Potentially use a constituency model as a classifier - WIP research project #1190
Add an output format "{:C}" for document objects which prints out documents as CoNLL: #1169
If a constituency tree is available, include it when outputting CoNLL format for documents: #1171
Same with sentiment: abb5819
Additional language code coverage (thank you @juanro49) 5802b10 f06bf86 32f83fa 3450575
Allow loading a pipeline for new languages (useful when developing a new suite of models) e7fcd26
Script to count the work done by annotators on an AWS SageMaker private workforce: #1186
Streaming interface which batch processes items in the stream: 2c9fe3d #550
Can pass a defaultdict to MultilingualPipeline, useful for specifying the processors for each language at once: 70fd2fd #1199
Transformer at bottom layer of POS - currently only available in English as the en_combined_bert model, others to come #1132
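
To illustrate the defaultdict item above, here is a minimal sketch of building per-language settings in one place. The processor lists are illustrative choices, and `build_pipeline` is a hypothetical helper. The `lang_configs` keyword and the `processors` key are assumed to behave as in a regular Pipeline configuration. The stanza import is kept inside the helper so that the configuration logic can be inspected without downloading any models.

```python
from collections import defaultdict

# Any language without an explicit entry falls back to this default.
# The processor lists here are examples, not recommended settings.
lang_configs = defaultdict(lambda: {"processors": "tokenize,pos"})

# Override a single language while leaving the default in place.
lang_configs["en"] = {"processors": "tokenize,pos,lemma"}


def build_pipeline(configs):
    """Hypothetical helper: needs stanza >= 1.5.0 and downloaded models."""
    from stanza.pipeline.multilingual import MultilingualPipeline

    # Assumption: lang_configs accepts the defaultdict built above.
    return MultilingualPipeline(lang_configs=configs)
```

Looking up a language that was never configured, such as `lang_configs["fr"]`, returns the default processors, which is what makes a defaultdict convenient here.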

New models:

Armenian NER model using an NER labeling of armtdp (thanks to @ShakeHakobyan): https://github.com/myavrum/ArmTDP-NER #1206 #1212
Sindhi tokenization from ISRA #1117
Sindhi NER from SiNER: 2a8ded4
Erzya from UD 2.11 0344ac3

Conparser experiments:

Transformer stack (initial implementation did not help) https://arxiv.org/abs/2010.10669 110031e
TREE_LSTM constituent composition method (didn't beat MAX) 2f722c8
Learned weighting between bert layers (this did help a little) 2d0c69e
Silver trees: train 10 models, use those models to vote on good trees, then use those trees to train new models. Helps smaller treebanks such as IT and VI, but has no effect on EN #1148
New in_order_compound transition scheme: no improvement f560b08
Multistage training with madgrad or adamw: definite improvement. madgrad included as optional dependency 2706c4b f500936
Report the scores of tags when retagging (does not affect the conparser training) 7663419
FocalLoss on the transitions using optional dependency: didn't help https://arxiv.org/abs/1708.02002 90a8337
LargeMarginSoftmax: didn't help https://github.com/tk1980/LargeMarginInSoftmax 5edd724
Maxout layer: didn't help https://arxiv.org/abs/1302.4389 c708ce7
Reverse parsing: not expected to help, potentially can be useful when building silver treebanks. May also be useful as a two step parser in the future. 4954845
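
The voting step of the silver trees experiment can be pictured with a small sketch. This is not the code used in Stanza. It assumes that trees are compared as exact bracketed strings and that a tree is kept when enough models agree on it. The function name `vote_on_trees` and the `min_votes` threshold are hypothetical.

```python
from collections import Counter


def vote_on_trees(predictions, min_votes):
    """Keep the most common tree per sentence if enough models agree.

    predictions: one list per sentence, holding one bracketed tree
    string per model. Exact string match is an assumed agreement test.
    """
    silver = []
    for trees in predictions:
        tree, votes = Counter(trees).most_common(1)[0]
        if votes >= min_votes:
            silver.append(tree)
    return silver


# Four hypothetical models parsing two sentences.
predictions = [
    ["(S (NP a) (VP b))", "(S (NP a) (VP b))", "(S (NP a) (VP b))", "(S (X a) (VP b))"],
    ["(S (A c))", "(S (B c))", "(S (C c))", "(S (D c))"],
]
silver_trees = vote_on_trees(predictions, min_votes=3)
```

Only the first sentence ends up in `silver_trees`, because the models disagree entirely on the second one.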

Stanza v1.5.0 on PyPI