Stanza v1.0.1 Release Notes

Release Date: 2020-04-27
  • Overview

    This is a maintenance release of Stanza. It features new support for jieba as a Chinese tokenizer, a faster lemmatizer implementation, improved compatibility with CoreNLP v4.0.0, and more.

    Enhancements

    Supporting jieba library as Chinese tokenizer. The Stanza Chinese pipelines (simplified and traditional) now support the jieba Chinese word segmentation library as a tokenizer. Turn on this feature in a pipeline with nlp = stanza.Pipeline('zh', processors={'tokenize': 'jieba'}), or by specifying the argument tokenize_with_jieba=True.
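
    The call above can be sketched as follows. The helper names jieba_pipeline_kwargs and segment are hypothetical, and the snippet assumes that the jieba package is installed and that the Chinese models have already been fetched (for example with stanza.download('zh')).

```python
def jieba_pipeline_kwargs(lang="zh"):
    """Keyword arguments that select jieba as the tokenizer, as described above."""
    return {"lang": lang, "processors": {"tokenize": "jieba"}}


def segment(text, lang="zh"):
    """Tokenize `text` with a jieba-backed pipeline; return the tokens of each sentence."""
    # Imported lazily so the kwargs helper can be used without stanza installed.
    import stanza

    nlp = stanza.Pipeline(**jieba_pipeline_kwargs(lang))
    doc = nlp(text)
    return [[token.text for token in sentence.tokens] for sentence in doc.sentences]
```

    Calling segment("这是一个测试句子。") would then return one list of token strings per sentence.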

    Setting resource directory with environment variable. You can now override the default model location $HOME/stanza_resources by setting the environment variable STANZA_RESOURCES_DIR (#227). The new directory will then be used to store and look up model files. Thanks to @dhpollack for implementing this feature.
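
    A minimal sketch of the override from Python. The helper resolve_resources_dir is hypothetical and only mirrors the lookup described above; it is not Stanza's own code. The path /data/stanza_models is a placeholder. Setting the variable before importing stanza is the safe order, on the assumption that the default location is read when the package is loaded.

```python
import os


def resolve_resources_dir(environ=None):
    """Return the STANZA_RESOURCES_DIR override if set, else the default location."""
    environ = os.environ if environ is None else environ
    default_dir = os.path.join(os.path.expanduser("~"), "stanza_resources")
    return environ.get("STANZA_RESOURCES_DIR", default_dir)


# Set the override before importing stanza (placeholder path).
os.environ["STANZA_RESOURCES_DIR"] = "/data/stanza_models"
print(resolve_resources_dir())
```

    The same effect can be had from a shell by exporting STANZA_RESOURCES_DIR before starting Python.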

    Faster lemmatizer implementation. The lemmatizer is now about 3x faster on CPU and 5x faster on GPU (#249). Thanks to @mahdiman for identifying the original issue.

    Improved compatibility with CoreNLP 4.0.0. The client is now fully compatible with the latest v4.0.0 release of the CoreNLP package.

    Bugfixes

    Correct character offsets in NER outputs from pre-tokenized text. We fixed an issue where character offsets in NER outputs from pre-tokenized text could be off by one (#229). Thanks to @RyanElliott10 for reporting the issue.
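
    To illustrate what correct offsets look like, here is a small sketch that is not Stanza's implementation. It assumes pre-tokenized input is joined with single spaces, and the function name token_offsets is hypothetical. The property to check is that slicing the joined text with each (start, end) pair gives back the token.

```python
def token_offsets(sentences):
    """Return (token, start_char, end_char) triples for pre-tokenized sentences.

    Assumes tokens and sentences are each separated by exactly one space;
    end_char is exclusive.
    """
    offsets = []
    position = 0
    for sentence in sentences:
        for token in sentence:
            start = position
            end = start + len(token)
            offsets.append((token, start, end))
            position = end + 1  # skip the single separating space
    return offsets


sentences = [["Chris", "Manning", "teaches", "at", "Stanford"]]
text = " ".join(" ".join(sentence) for sentence in sentences)
for token, start, end in token_offsets(sentences):
    assert text[start:end] == token
```

    Forgetting the separator, or counting it twice, shifts every later span by one character, which is the kind of symptom described above.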

    Correct Vietnamese tokenization on sentences beginning with punctuation. We fixed an issue where the Vietnamese tokenizer could throw an AssertionError on sentences that begin with a punctuation mark (#217). Thanks to @aryamccarthy for reporting this issue.

    Correct pytorch version requirement. Stanza now requires pytorch>=1.3.0 to avoid a runtime error raised by pytorch (#231). Thanks to @Vodkazy for reporting this.
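
    To check an installed version against this requirement, a sketch along the following lines may help. The helpers version_tuple and meets_minimum are hypothetical and compare only the leading numeric components, so local suffixes such as +cpu are ignored.

```python
import re


def version_tuple(version):
    """Parse the leading numeric components, e.g. '1.4.0+cpu' -> (1, 4, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(version, minimum="1.3.0"):
    """Return True if `version` is at least `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)
```

    With pytorch installed, meets_minimum(torch.__version__) reports whether the requirement is satisfied.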

    Known Model Issues & Solutions

    Default Korean Kaist tokenizer failing on punctuation. The default Korean Kaist model is reported to have issues separating punctuation during tokenization (#276). Switching to the Korean GSD model may solve this issue.

    Default Polish LFG POS tagger incorrectly labeling last word in sentence as PUNCT. The default Polish model trained on the LFG treebank may incorrectly tag the last word in a sentence as PUNCT (#220). This issue may be solved by switching to the Polish PDB model.
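
    Both workarounds amount to selecting a non-default package when building the pipeline. A minimal sketch follows, assuming the package names gsd and pdb correspond to the Korean GSD and Polish PDB models named above. The mapping ALTERNATIVE_PACKAGES and the helper names are hypothetical.

```python
# Alternative treebank packages suggested above, keyed by language code.
ALTERNATIVE_PACKAGES = {"ko": "gsd", "pl": "pdb"}


def alternative_pipeline_kwargs(lang):
    """Keyword arguments that select the suggested non-default package for `lang`."""
    return {"lang": lang, "package": ALTERNATIVE_PACKAGES[lang]}


def build_alternative_pipeline(lang):
    """Download the suggested package if needed and build a pipeline that uses it."""
    # Imported lazily so the kwargs helper can be used without stanza installed.
    import stanza

    kwargs = alternative_pipeline_kwargs(lang)
    stanza.download(**kwargs)
    return stanza.Pipeline(**kwargs)
```

    For example, build_alternative_pipeline("ko") would load the Korean GSD models instead of the default Kaist ones.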