Stanza v1.4.0: Transformer integration to NER and conparse
Overview
As part of the new Stanza release, we integrate transformer inputs into the NER and constituency parsing (conparse) modules. In addition, we now support several additional languages for NER and conparse.
Pipeline interface improvements
- Download resources.json and models into temp dirs first, to avoid race conditions between multiple processors (#213, #1001)
- Download models for Pipelines automatically, without needing to call `stanza.download(...)` (#486, #943) (example below)
- Add the ability to turn off downloads (68455d8)
- Add a new interface where both processors and package can be set (#917, f370429)
- When using pretokenized tokens, get character offsets from the text if it is available (#967, #975) (example below)
- If Bert or other transformers are used, cache the models rather than loading them multiple times (#980)
- Allow disabling processors on individual runs of a pipeline (#945, #947)
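For reference, a minimal sketch of the new download behavior when constructing a Pipeline. The `download_method=None` setting for turning downloads off is an assumption based on the items above; the exact argument name and accepted values may differ:

```python
import stanza

# Models required by the pipeline are now fetched automatically,
# so a separate stanza.download("en") call is no longer needed.
nlp = stanza.Pipeline("en", processors="tokenize,pos,ner")

# Assumption: downloads can be disabled entirely (e.g. for offline use)
# through the download_method argument.
offline_nlp = stanza.Pipeline("en", processors="tokenize,pos",
                              download_method=None)

doc = nlp("Stanza 1.4.0 adds transformer inputs to NER and conparse.")
print(doc.ents)
```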
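And a sketch of the pretokenized behavior: with `tokenize_pretokenized=True`, whitespace-separated input is taken as the tokenization, and since the raw text is available the character offsets can now be recovered for each token. `Token.start_char` / `Token.end_char` are the existing offset attributes:

```python
import stanza

# Whitespace marks token boundaries and newlines mark sentence breaks;
# character offsets are computed from the original string.
nlp = stanza.Pipeline("en", processors="tokenize", tokenize_pretokenized=True)

doc = nlp("Stanza now recovers character offsets for pretokenized text .")
for token in doc.sentences[0].tokens:
    print(token.text, token.start_char, token.end_char)
```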
Other general improvements
- Upgrades to the English, Italian, and Indonesian models (#1003, #1008); the Italian improvements were made with the help of @attardi and @msimi
- Fix improper tokenization of Chinese text with leading whitespace (#920, #924) (example below)
- Check whether a CoreNLP model exists before downloading it (thank you @Internull) (#965)
- Convert the run_charlm script to Python (#942)
- stanza-train examples are now compatible with the Python training scripts (#896)
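As an illustration of the Chinese tokenization fix, a small sketch assuming a standard `zh` tokenize pipeline; leading whitespace should no longer misalign the resulting tokens:

```python
import stanza

nlp = stanza.Pipeline("zh", processors="tokenize")

# Text with leading whitespace, which previously caused improper tokenization.
doc = nlp("  斯坦福大学位于加利福尼亚州。")
for token in doc.sentences[0].tokens:
    print(token.text, token.start_char, token.end_char)
```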
NER features
- Swedish model (thank you @EmilStenstrom) (#912, #857)
- Persian model (#797)
- Danish model (3783cc4)
- Norwegian model (both NB and NN) (31fa23e)
- Myanmar model (thank you UCSY) (#845)
- Fix inconsistencies in B/S/I/E tags (#928 (comment), #961)
- Add an option to run multiple NER models at the same time and merge the results (#928, #955) (example below)
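A sketch of the new NER options. The Swedish pipeline only relies on the standard processor names; the merged-model call assumes that multiple NER packages can be requested through the `package` argument, and the package names used here (`ontonotes`, `ncbi_disease`) are illustrative:

```python
import stanza

# NER for one of the newly added languages (Swedish).
sv_nlp = stanza.Pipeline("sv", processors="tokenize,ner")
doc = sv_nlp("Greta Thunberg växte upp i Stockholm.")
for ent in doc.ents:
    print(ent.text, ent.type)

# Assumption: two English NER models run at the same time,
# with their predictions merged into a single set of entities.
multi_nlp = stanza.Pipeline("en", processors="tokenize,ner",
                            package={"ner": ["ontonotes", "ncbi_disease"]})
merged = multi_nlp("He was treated for asthma at Stanford Hospital.")
print(merged.ents)
```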
Constituency parser
- Dynamic oracle (improves accuracy a bit) (#866)
- Bugfix: `(` and `)` were not escaped when output in a tree (eaf134c)
- charlm integration is now enabled by default (#799)
- Bert integration, not used in the default model (thank you @vythaihn and @hungbui0411) (05a0b04, 0bbe8d1)
- Preemptive bugfix for incompatible devices, from @zhaochaocs (#989, #1002)
- New models (usage sketch below):
  - DA, based on Arboretum
  - IT, based on the Turin treebank
  - JA, based on ALT
  - PT, based on Cintil
  - TR, based on Starlang
  - ZH, based on CTB7
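For completeness, a minimal sketch of running the constituency parser through a pipeline. The charlm-backed model is what loads by default; selecting the Bert-integrated model mentioned above would require a non-default package, which is not shown here:

```python
import stanza

# The constituency processor runs on top of tokenize and pos.
nlp = stanza.Pipeline("en", processors="tokenize,pos,constituency")

doc = nlp("Stanza now ships constituency parsers for several new languages.")
for sentence in doc.sentences:
    print(sentence.constituency)
```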