Stanza v1.4.1: Improvements to POS, conparse, and sentiment models, Jupyter visualization, and wider language coverage
Overview
We improve the quality of the POS, constituency, and sentiment models, add an integration with displaCy, and add new models for a variety of languages.
New NER models
- New Polish NER model based on NKJP, from Karol Saputa and ryszardtuora (#1070, #1110)
- Make GermEval2014 the default German NER model, including an optional Bert version (#1018, #1022)
- Japanese conversion of GSD by Megagon (#1038)
- Marathi NER dataset from L3Cube; includes a sentiment model as well (#1043)
- Thai conversion of LST20 (555fc03)
- Kazakh conversion of KazNERD (de6cd25)
Other new models
- Sentiment conversion of Tass2020 for Spanish (#1104)
- VIT constituency dataset for Italian (149f144 and many subsequent updates)
- For UD models with a small train dataset and a larger test dataset, flip the datasets: UD_Buryat-BDT, UD_Kazakh-KTB, UD_Kurmanji-MG, UD_Ligurian-GLT, UD_Upper_Sorbian-UFAL (#1030, 9618d60)
- Spanish conparse model built from multiple sources: AnCora, LDC-NW, LDC-DF (47740c6)
Model improvements
- Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost (#1086)
- Pretrained charlm integrated into Sentiment. Improves English; other languages, not so much (#1025)
- LSTM and 2d maxpool as optional pieces of the Sentiment model, from the paper "Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling" (#1098)
- In conparse training, first learn with AdaDelta, then with another optimizer. Very helpful (b1d10d3)
- Gradient clipping in conparse training (365066a)
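The last two conparse changes above (warming up with AdaDelta before switching optimizers, plus gradient clipping) can be sketched in miniature. This is a hedged, self-contained toy on a 1-D quadratic, not Stanza's actual training loop; all names and hyperparameters here are illustrative.

```python
import math

def clip_grad(g, max_norm):
    """Scale the gradient down if its magnitude exceeds max_norm
    (the same idea as torch.nn.utils.clip_grad_norm_)."""
    norm = abs(g)
    return g * (max_norm / norm) if norm > max_norm else g

def train(x, steps_stage1=50, steps_stage2=100):
    """Minimize f(x) = (x - 3)^2: stage 1 uses AdaDelta-style updates
    (robust but slow early on), stage 2 switches to plain SGD,
    standing in for the 'other optimizer'."""
    eg2, edx2, rho, eps = 0.0, 0.0, 0.95, 1e-6  # AdaDelta running averages
    for _ in range(steps_stage1):
        g = clip_grad(2 * (x - 3), max_norm=5.0)
        eg2 = rho * eg2 + (1 - rho) * g * g
        dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * g
        edx2 = rho * edx2 + (1 - rho) * dx * dx
        x += dx
    for _ in range(steps_stage2):  # stage 2: SGD with a fixed learning rate
        g = clip_grad(2 * (x - 3), max_norm=5.0)
        x -= 0.2 * g
    return x

x_final = train(0.0)
```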
Pipeline interface improvements
- GPU memory savings: charlm reused between different processors in the same pipeline (#1028)
- Word vectors no longer saved in the NER models, saving bandwidth and disk space (#1033)
- Functions to return the tagsets of NER and conparse models (#1066, #1073, 36b84db, 2db43c8)
- displaCy integration with NER and dependency trees (2071413)
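One way the displaCy integration can work is through displaCy's "manual" mode, which renders plain dicts. The sketch below builds such a dict from entity spans; the tuples stand in for Stanza entity objects (the character-offset attribute names are an assumption here), and the actual rendering requires spaCy to be installed.

```python
def to_displacy_ents(text, entities):
    """Build the dict spaCy's displacy.render accepts when manual=True
    with style='ent'.  entities: (start_char, end_char, label) tuples,
    standing in for Stanza NER spans (assumed shape)."""
    return {
        "text": text,
        "ents": [{"start": s, "end": e, "label": label}
                 for (s, e, label) in sorted(entities)],
        "title": None,
    }

text = "Chris Manning teaches at Stanford University."
entities = [(0, 13, "PERSON"), (25, 44, "ORG")]
payload = to_displacy_ents(text, entities)
# With spaCy installed, this renders in a notebook:
# from spacy import displacy
# displacy.render(payload, style="ent", manual=True)
```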
Bugfixes
- Fix tokenization taking forever on a single long token (catastrophic backtracking in a regex). Thanks to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook) (#1056)
- Starting a new CoreNLP client without a server no longer waits for the server to be available. Thanks to Mariano Crosetti (#1059, #1061)
- Read raw GloVe word vectors (they have no header information) (#1074)
- Ensure that illegal languages are not chosen by the LangID model (#1076, #1077)
- Fix loading of previously unseen languages in the Multilingual pipeline (#1101, e551ebe)
- Fix conparse occasionally training to NaN early in training (c4d7857)
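The raw-GloVe fix above comes down to format detection: word2vec text files start with a "count dimension" header line, while raw GloVe files begin directly with vectors. A minimal sketch of that detection (not Stanza's actual loader):

```python
def load_word_vectors(lines):
    """Parse word-vector text lines, skipping a word2vec-style
    '<count> <dim>' header if present; raw GloVe files have none."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    first = lines[0].split()
    if len(first) == 2 and all(tok.isdigit() for tok in first):
        lines = lines[1:]  # word2vec header: vocabulary size and dimension
    vectors = {}
    for line in lines:
        word, *nums = line.split(" ")
        vectors[word] = [float(v) for v in nums]
    return vectors

glove_lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]          # no header
w2v_lines = ["2 3", "the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]     # with header
```

Both inputs parse to the same vectors; the header check never misfires on a real vector line, since a word plus its values always splits into more than two fields.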
Improved training tools
- W&B integration for all models: can be activated with the --wandb flag in the training scripts (#1040)
- New web pages on building charlm, NER, and Sentiment models:
  https://stanfordnlp.github.io/stanza/new_language_charlm.html
  https://stanfordnlp.github.io/stanza/new_language_ner.html
  https://stanfordnlp.github.io/stanza/new_language_sentiment.html
- Script to download Oscar 2019 data for charlm from HuggingFace (requires the datasets module) (#1014)
- Unify sentiment training into a Python script, replacing the old shell script (#1021, #1023)
- Convert sentiment to use .json inputs. In particular, this helps with languages that have spaces in words, such as Vietnamese (#1024)
- Slightly faster charlm training (#1026)
- Data conversion of WikiNER generalized for retraining / adding new WikiNER models (#1039)
- XPOS factory now determined at the start of POS training, making it easier to add new languages (#1082)
- Checkpointing and continued training for charlm, conparse, and sentiment (#1090, 0e6de80, e5793c9)
- Option to write the results of a NER model to a file (#1108)
- Add fake dependencies to a CoNLL-U formatted dataset for better integration with evaluation tools (6544ef3)
- Convert an AMT NER result to Stanza .json (cfa7e49)
- Add many language codes, including 3-letter codes for languages we generally treat as 2-letter (5a5e918, b32a98e, and others)