Overview
This is a maintenance release of Stanza. It features new support for `jieba` as a Chinese tokenizer, a faster lemmatizer implementation, improved compatibility with CoreNLP v4.0.0, and more!
Enhancements
- **Support for the `jieba` library as a Chinese tokenizer.** The Stanza (simplified and traditional) Chinese pipelines now support using the `jieba` Chinese word segmentation library as the tokenizer. Turn on this feature in a pipeline with `nlp = stanza.Pipeline('zh', processors={'tokenize': 'jieba'})`, or by specifying the argument `tokenize_with_jieba=True`.
- **Setting the resource directory with an environment variable.** You can now override the default model location `$HOME/stanza_resources` by setting the environment variable `STANZA_RESOURCES_DIR` (#227). The new directory will then be used to store and look up model files. Thanks to @dhpollack for implementing this feature.
- **Faster lemmatizer implementation.** The lemmatizer implementation has been improved to be about 3x faster on CPU and 5x faster on GPU (#249). Thanks to @mahdiman for identifying the original issue.
- **Improved compatibility with CoreNLP 4.0.0.** The client is now fully compatible with the latest v4.0.0 release of the CoreNLP package.
Bugfixes
- **Correct character offsets in NER outputs from pre-tokenized text.** We fixed an issue where the NER outputs from pre-tokenized text may be off by one (#229). Thanks to @RyanElliott10 for reporting the issue.
- **Correct Vietnamese tokenization on sentences beginning with punctuation.** We fixed an issue where the Vietnamese tokenizer may throw an `AssertionError` on sentences that begin with punctuation (#217). Thanks to @aryamccarthy for reporting this issue.
- **Correct PyTorch version requirement.** Stanza now requires `pytorch>=1.3.0` to avoid a runtime error raised by PyTorch (#231). Thanks to @Vodkazy for reporting this.
Known Model Issues & Solutions
- **Default Korean Kaist tokenizer failing on punctuation.** The default Korean Kaist model is reported to have issues separating punctuation during tokenization (#276). Switching to the Korean `GSD` model may solve this issue.
- **Default Polish LFG POS tagger incorrectly labeling the last word in a sentence as `PUNCT`.** The default Polish model trained on the `LFG` treebank may incorrectly tag the last word in a sentence as `PUNCT` (#220). This issue may be solved by switching to the Polish `PDB` model.