Overview
This is a maintenance release of Stanza. It features new support for `jieba` as a Chinese tokenizer, a faster lemmatizer implementation, improved compatibility with CoreNLP v4.0.0, and more!
Enhancements
- **Support for the `jieba` library as a Chinese tokenizer.** The Stanza (simplified and traditional) Chinese pipelines now support using the `jieba` Chinese word segmentation library as the tokenizer. Turn on this feature in a pipeline with `nlp = stanza.Pipeline('zh', processors={'tokenize': 'jieba'})`, or by specifying the argument `tokenize_with_jieba=True`.
- **Setting the resource directory with an environment variable.** You can now override the default model location `$HOME/stanza_resources` by setting the environment variable `STANZA_RESOURCES_DIR` (#227). The new directory will then be used to store and look up model files. Thanks to @dhpollack for implementing this feature.
- **Faster lemmatizer implementation.** The lemmatizer implementation has been improved to be about 3x faster on CPU and 5x faster on GPU (#249). Thanks to @mahdiman for identifying the original issue.
- **Improved compatibility with CoreNLP 4.0.0.** The client is now fully compatible with the latest v4.0.0 release of the CoreNLP package.
Bugfixes
- **Correct character offsets in NER outputs from pre-tokenized text.** We fixed an issue where the NER outputs from pre-tokenized text may be off by one (#229). Thanks to @RyanElliott10 for reporting the issue.
- **Correct Vietnamese tokenization on sentences beginning with punctuation.** We fixed an issue where the Vietnamese tokenizer may throw an `AssertionError` on sentences that begin with punctuation (#217). Thanks to @aryamccarthy for reporting this issue.
- **Correct PyTorch version requirement.** Stanza now requires `pytorch>=1.3.0` to avoid a runtime error raised by PyTorch (#231). Thanks to @Vodkazy for reporting this.
Known Model Issues & Solutions
- **Default Korean Kaist tokenizer failing on punctuation.** The default Korean Kaist model is reported to have issues separating punctuation during tokenization (#276). Switching to the Korean `GSD` model may solve this issue.
- **Default Polish LFG POS tagger incorrectly labeling the last word in a sentence as `PUNCT`.** The default Polish model trained on the `LFG` treebank may incorrectly tag the last word in a sentence as `PUNCT` (#220). This issue may be solved by switching to the Polish `PDB` model.