Stanza v1.4.0: Transformer integration to NER and conparse
Overview
As part of the new Stanza release, we integrate transformer inputs into the NER and constituency parsing (conparse) modules. In addition, we now support several additional languages for NER and conparse.
Pipeline interface improvements
- Download resources.json and models into temp dirs first, to avoid race conditions between multiple processors (#213, #1001)
- Download models for Pipelines automatically, without needing to call `stanza.download(...)` (#486, #943) (example below)
- Add the ability to turn off downloads (68455d8)
- Add a new interface where both processors and package can be set (#917, f370429)
- When using pretokenized tokens, get character offsets from the text if it is available (#967, #975) (example below)
- If Bert or other transformers are used, cache the models rather than loading them multiple times (#980)
- Allow disabling processors on individual runs of a pipeline (#945, #947)
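For reference, a minimal sketch of the new download behavior when constructing a Pipeline. The `download_method=None` setting for turning downloads off is an assumption based on the items above; the exact argument name and accepted values may differ:

```python
import stanza

# Models required by the pipeline are now fetched automatically,
# so a separate stanza.download("en") call is no longer needed.
nlp = stanza.Pipeline("en", processors="tokenize,pos,ner")

# Assumption: downloads can be disabled entirely (e.g. for offline use)
# through the download_method argument.
offline_nlp = stanza.Pipeline("en", processors="tokenize,pos",
                              download_method=None)

doc = nlp("Stanza 1.4.0 adds transformer inputs to NER and conparse.")
print(doc.ents)
```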
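And a sketch of the pretokenized behavior: with `tokenize_pretokenized=True`, whitespace-separated input is taken as the tokenization, and since the raw text is available the character offsets can now be recovered for each token. `Token.start_char` / `Token.end_char` are the existing offset attributes:

```python
import stanza

# Whitespace marks token boundaries and newlines mark sentence breaks;
# character offsets are computed from the original string.
nlp = stanza.Pipeline("en", processors="tokenize", tokenize_pretokenized=True)

doc = nlp("Stanza now recovers character offsets for pretokenized text .")
for token in doc.sentences[0].tokens:
    print(token.text, token.start_char, token.end_char)
```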
Other general improvements
- Upgrades to the English, Italian, and Indonesian models (#1003, #1008); the Italian improvements were made with the help of @attardi and @msimi
- Fix improper tokenization of Chinese text with leading whitespace (#920, #924) (example below)
- Check whether a CoreNLP model exists before downloading it (thank you @Internull) (#965)
- Convert the run_charlm script to Python (#942)
- stanza-train examples are now compatible with the Python training scripts (#896)
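As an illustration of the Chinese tokenization fix, a small sketch assuming a standard `zh` tokenize pipeline; leading whitespace should no longer misalign the resulting tokens:

```python
import stanza

nlp = stanza.Pipeline("zh", processors="tokenize")

# Text with leading whitespace, which previously caused improper tokenization.
doc = nlp("  斯坦福大学位于加利福尼亚州。")
for token in doc.sentences[0].tokens:
    print(token.text, token.start_char, token.end_char)
```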
NER features
- Swedish model (thank you @EmilStenstrom) (#912, #857)
- Persian model (#797)
- Danish model (3783cc4)
- Norwegian model (both NB and NN) (31fa23e)
- Myanmar model (thank you UCSY) (#845)
- Fix inconsistencies in B/S/I/E tags (#928 (comment), #961)
- Add an option to run multiple NER models at the same time and merge the results (#928, #955) (example below)
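A sketch of the new NER options. The Swedish pipeline only relies on the standard processor names; the merged-model call assumes that multiple NER packages can be requested through the `package` argument, and the package names used here (`ontonotes`, `ncbi_disease`) are illustrative:

```python
import stanza

# NER for one of the newly added languages (Swedish).
sv_nlp = stanza.Pipeline("sv", processors="tokenize,ner")
doc = sv_nlp("Greta Thunberg växte upp i Stockholm.")
for ent in doc.ents:
    print(ent.text, ent.type)

# Assumption: two English NER models run at the same time,
# with their predictions merged into a single set of entities.
multi_nlp = stanza.Pipeline("en", processors="tokenize,ner",
                            package={"ner": ["ontonotes", "ncbi_disease"]})
merged = multi_nlp("He was treated for asthma at Stanford Hospital.")
print(merged.ents)
```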
Constituency parser
- Dynamic oracle (improves accuracy a bit) (#866)
- Bugfix: `(` and `)` were not escaped when output in a tree (eaf134c)
- charlm integration is now enabled by default (#799)
- Bert integration, not used in the default model (thank you @vythaihn and @hungbui0411) (05a0b04, 0bbe8d1)
- Preemptive bugfix for incompatible devices, from @zhaochaocs (#989, #1002)
- New models (usage sketch below):
  - DA, based on Arboretum
  - IT, based on the Turin treebank
  - JA, based on ALT
  - PT, based on Cintil
  - TR, based on Starlang
  - ZH, based on CTB7
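For completeness, a minimal sketch of running the constituency parser through a pipeline. The charlm-backed model is what loads by default; selecting the Bert-integrated model mentioned above would require a non-default package, which is not shown here:

```python
import stanza

# The constituency processor runs on top of tokenize and pos.
nlp = stanza.Pipeline("en", processors="tokenize,pos,constituency")

doc = nlp("Stanza now ships constituency parsers for several new languages.")
for sentence in doc.sentences:
    print(sentence.constituency)
```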