Overview
Stanza 1.3.0 introduces a language id model, a constituency parser, a dictionary in the tokenizer, and some additional features and bugfixes.
New features
- Langid model and multilingual pipeline
  Based on "A reproduction of Apple's bi-directional LSTM models for language identification in short strings." by Toftrup et al. (2021)
  (154b0e8)
- Constituency parser
  Based on "In-Order Transition-based Constituent Parsing" by Jiangming Liu and Yue Zhang. Currently an en_wsj model is available, with more to come.
  (9031802)
- Evalb interface to CoreNLP
  Useful for evaluating the parser. Requires CoreNLP 4.3.0 or later.
- Dictionary tokenizer feature
  Noticeably improved performance for ZH, VI, TH
  (#776)
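As a rough illustration of the first two features listed above, the following is a minimal sketch of how the constituency parser and the multilingual pipeline might be called from Python. It assumes Stanza 1.3.0 or later is installed, that the models can be downloaded, and that the default English package includes the constituency model. The helper names `parse_constituency` and `identify_languages` are hypothetical and exist only for this example.

```python
def parse_constituency(text, lang="en"):
    """Return one bracketed parse tree string per sentence in `text`."""
    # Imported lazily so this module can be loaded without Stanza installed.
    import stanza

    # Download the default models for `lang` if they are not present yet.
    stanza.download(lang)
    # The constituency processor runs after tokenization and POS tagging.
    nlp = stanza.Pipeline(lang=lang, processors="tokenize,pos,constituency")
    doc = nlp(text)
    return [str(sentence.constituency) for sentence in doc.sentences]


def identify_languages(texts):
    """Return the language code detected for each string in `texts`."""
    import stanza
    from stanza.pipeline.multilingual import MultilingualPipeline

    # The langid model is distributed under the "multilingual" language code.
    stanza.download(lang="multilingual")
    # Assumption: models for every language that gets detected are also
    # available locally, since each text is routed to a per-language pipeline.
    nlp = MultilingualPipeline()
    docs = nlp(texts)
    return [doc.lang for doc in docs]
```

Calling `parse_constituency("This is a test.")` should return a list with one bracketed tree for the single sentence, and `identify_languages(["Hello world.", "C'est une phrase française."])` should return one language code per input string. Exact trees and codes depend on the models installed.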