Stanza v1.4.1: Improvements to POS, conparse, and sentiment models, Jupyter visualization, and wider language coverage
Overview
We improve the quality of the POS, constituency, and sentiment models, add an integration with displaCy, and add new models for a variety of languages.
New NER models
- New Polish NER model based on NKJP, from Karol Saputa and ryszardtuora (#1070, #1110)
- Make GermEval2014 the default German NER model, including an optional Bert version (#1018, #1022)
- Japanese conversion of GSD by Megagon (#1038)
- Marathi NER dataset from L3Cube; includes a sentiment model as well (#1043)
- Thai conversion of LST20 (555fc03)
- Kazakh conversion of KazNERD (de6cd25)
Other new models
- Sentiment conversion of Tass2020 for Spanish (#1104)
- VIT constituency dataset for Italian (149f144 and many subsequent updates)
- For UD models with a small train dataset and a larger test dataset, flip the datasets: UD_Buryat-BDT, UD_Kazakh-KTB, UD_Kurmanji-MG, UD_Ligurian-GLT, UD_Upper_Sorbian-UFAL (#1030, 9618d60)
- Spanish conparse model built from multiple sources: AnCora, LDC-NW, LDC-DF (47740c6)
Model improvements
- Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost (#1086)
- Pretrained charlm integrated into Sentiment. Improves English; other languages, not so much (#1025)
- LSTM and 2d maxpool as optional pieces of the Sentiment model, from the paper "Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling" (#1098)
- In conparse training, first learn with AdaDelta, then with another optimizer. Very helpful (b1d10d3)
- Gradient clipping in conparse training (365066a)
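The last two conparse changes above (warming up with AdaDelta before switching optimizers, plus gradient clipping) can be sketched in miniature. This is a hedged, self-contained toy on a 1-D quadratic, not Stanza's actual training loop; all names and hyperparameters here are illustrative.

```python
import math

def clip_grad(g, max_norm):
    """Scale the gradient down if its magnitude exceeds max_norm
    (the same idea as torch.nn.utils.clip_grad_norm_)."""
    norm = abs(g)
    return g * (max_norm / norm) if norm > max_norm else g

def train(x, steps_stage1=50, steps_stage2=100):
    """Minimize f(x) = (x - 3)^2: stage 1 uses AdaDelta-style updates
    (robust but slow early on), stage 2 switches to plain SGD,
    standing in for the 'other optimizer'."""
    eg2, edx2, rho, eps = 0.0, 0.0, 0.95, 1e-6  # AdaDelta running averages
    for _ in range(steps_stage1):
        g = clip_grad(2 * (x - 3), max_norm=5.0)
        eg2 = rho * eg2 + (1 - rho) * g * g
        dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * g
        edx2 = rho * edx2 + (1 - rho) * dx * dx
        x += dx
    for _ in range(steps_stage2):  # stage 2: SGD with a fixed learning rate
        g = clip_grad(2 * (x - 3), max_norm=5.0)
        x -= 0.2 * g
    return x

x_final = train(0.0)
```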
Pipeline interface improvements
- GPU memory savings: charlm reused between different processors in the same pipeline (#1028)
- Word vectors no longer saved in the NER models, saving bandwidth and disk space (#1033)
- Functions to return the tagsets of NER and conparse models (#1066, #1073, 36b84db, 2db43c8)
- displaCy integration with NER and dependency trees (2071413)
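One way the displaCy integration can work is through displaCy's "manual" mode, which renders plain dicts. The sketch below builds such a dict from entity spans; the tuples stand in for Stanza entity objects (the character-offset attribute names are an assumption here), and the actual rendering requires spaCy to be installed.

```python
def to_displacy_ents(text, entities):
    """Build the dict spaCy's displacy.render accepts when manual=True
    with style='ent'.  entities: (start_char, end_char, label) tuples,
    standing in for Stanza NER spans (assumed shape)."""
    return {
        "text": text,
        "ents": [{"start": s, "end": e, "label": label}
                 for (s, e, label) in sorted(entities)],
        "title": None,
    }

text = "Chris Manning teaches at Stanford University."
entities = [(0, 13, "PERSON"), (25, 44, "ORG")]
payload = to_displacy_ents(text, entities)
# With spaCy installed, this renders in a notebook:
# from spacy import displacy
# displacy.render(payload, style="ent", manual=True)
```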
Bugfixes
- Fix tokenization taking forever on a single long token (catastrophic backtracking in a regex). Thanks to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook) (#1056)
- Starting a new CoreNLP client without a server no longer waits for the server to be available. Thanks to Mariano Crosetti (#1059, #1061)
- Read raw GloVe word vectors (they have no header information) (#1074)
- Ensure that illegal languages are not chosen by the LangID model (#1076, #1077)
- Fix loading of previously unseen languages in the Multilingual pipeline (#1101, e551ebe)
- Fix conparse occasionally training to NaN early in training (c4d7857)
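The raw-GloVe fix above comes down to format detection: word2vec text files start with a "count dimension" header line, while raw GloVe files begin directly with vectors. A minimal sketch of that detection (not Stanza's actual loader):

```python
def load_word_vectors(lines):
    """Parse word-vector text lines, skipping a word2vec-style
    '<count> <dim>' header if present; raw GloVe files have none."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    first = lines[0].split()
    if len(first) == 2 and all(tok.isdigit() for tok in first):
        lines = lines[1:]  # word2vec header: vocabulary size and dimension
    vectors = {}
    for line in lines:
        word, *nums = line.split(" ")
        vectors[word] = [float(v) for v in nums]
    return vectors

glove_lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]          # no header
w2v_lines = ["2 3", "the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]     # with header
```

Both inputs parse to the same vectors; the header check never misfires on a real vector line, since a word plus its values always splits into more than two fields.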
Improved training tools
- W&B integration for all models: can be activated with the --wandb flag in the training scripts (#1040)
- New web pages on building charlm, NER, and Sentiment models:
  https://stanfordnlp.github.io/stanza/new_language_charlm.html
  https://stanfordnlp.github.io/stanza/new_language_ner.html
  https://stanfordnlp.github.io/stanza/new_language_sentiment.html
- Script to download Oscar 2019 data for charlm from HuggingFace (requires the datasets module) (#1014)
- Unify sentiment training into a Python script, replacing the old shell script (#1021, #1023)
- Convert sentiment to use .json inputs. In particular, this helps with languages that have spaces in words, such as Vietnamese (#1024)
- Slightly faster charlm training (#1026)
- Data conversion of WikiNER generalized for retraining / adding new WikiNER models (#1039)
- XPOS factory now determined at the start of POS training, making it easier to add new languages (#1082)
- Checkpointing and continued training for charlm, conparse, and sentiment (#1090, 0e6de80, e5793c9)
- Option to write the results of a NER model to a file (#1108)
- Add fake dependencies to a CoNLL-U formatted dataset for better integration with evaluation tools (6544ef3)
- Convert an AMT NER result to Stanza .json (cfa7e49)
- Add many language codes, including 3-letter codes for languages we generally treat as 2-letter (5a5e918, b32a98e, and others)