Stanza v1.0.1 release notes (2020-04-27)

Release Date: 2020-04-27

Overview

This is a maintenance release of Stanza. It adds support for jieba as a Chinese tokenizer, a faster lemmatizer implementation, improved compatibility with CoreNLP v4.0.0, and several bugfixes.

✨ Enhancements
Supporting the jieba library as Chinese tokenizer. The Stanza (simplified and traditional) Chinese pipelines now support using the jieba Chinese word segmentation library as the tokenizer. Turn on this feature in a pipeline with: nlp = stanza.Pipeline('zh', processors={'tokenize': 'jieba'}), or by specifying the argument tokenize_with_jieba=True.
Setting resource directory with environment variable. You can now override the default model location $HOME/stanza_resources by setting the environment variable STANZA_RESOURCES_DIR (#227). The new directory will then be used to store and look up model files. Thanks to @dhpollack for implementing this feature.
Faster lemmatizer implementation. The lemmatizer implementation has been improved to be about 3x faster on CPU and 5x faster on GPU (#249). Thanks to @mahdiman for identifying the original issue.
Improved compatibility with CoreNLP 4.0.0. The client is now fully compatible with the latest v4.0.0 release of the CoreNLP package.
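
The jieba and resource-directory items above can be sketched in Python as follows. This is a minimal illustration, not Stanza's own code: the helper names resolve_resources_dir, jieba_pipeline_kwargs and build_jieba_pipeline are hypothetical, and the directory lookup only mirrors the behavior described in the note (use STANZA_RESOURCES_DIR if set, otherwise $HOME/stanza_resources). Building the pipeline assumes that Stanza, the jieba package and the 'zh' models are installed.

```python
import os
from pathlib import Path


def resolve_resources_dir(env=None):
    """Illustrative lookup only (hypothetical helper, not Stanza's code)."""
    env = os.environ if env is None else env
    default_dir = Path.home() / "stanza_resources"
    # Use the override when present, otherwise fall back to the default.
    return Path(env.get("STANZA_RESOURCES_DIR", str(default_dir)))


def jieba_pipeline_kwargs():
    # Keyword arguments taken from the release note for the Chinese pipeline.
    return {"lang": "zh", "processors": {"tokenize": "jieba"}}


def build_jieba_pipeline():
    # Imported lazily so the sketch can be loaded without Stanza installed.
    import stanza

    # Assumes the 'zh' models are downloaded and jieba is installed.
    return stanza.Pipeline(**jieba_pipeline_kwargs())
```

If the override is used, exporting STANZA_RESOURCES_DIR in the shell before starting Python is probably the safest choice, since the variable may be read when the library is first imported.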

🛠 Bugfixes
Correct character offsets in NER outputs from pre-tokenized text. We fixed an issue where the character offsets in NER outputs from pre-tokenized text could be off by one (#229). Thanks to @RyanElliott10 for reporting the issue.
Correct Vietnamese tokenization on sentences beginning with punctuation. We fixed an issue where the Vietnamese tokenizer could throw an AssertionError on sentences that begin with punctuation (#217). Thanks to @aryamccarthy for reporting this issue.
Correct pytorch version requirement. Stanza now requires pytorch>=1.3.0 to avoid a runtime error raised by pytorch (#231). Thanks to @Vodkazy for reporting this.
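
To check whether an installed pytorch satisfies the pytorch>=1.3.0 requirement mentioned above, a small comparison along these lines may help. It is a sketch only: parse_version and meets_minimum are hypothetical helpers, and the parser assumes version strings that start with major.minor[.patch], ignoring any suffix such as a local build tag.

```python
import re

# Minimum version stated in the release note.
MINIMUM_TORCH = (1, 3, 0)


def parse_version(version):
    """Extract (major, minor, patch) from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version, minimum=MINIMUM_TORCH):
    # Tuples compare element by element, so (1, 10, 0) > (1, 3, 0).
    return parse_version(version) >= minimum
```

With pytorch installed, the check could be run as meets_minimum(torch.__version__) after import torch.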

Known Model Issues & Solutions
Default Korean Kaist tokenizer failing on punctuation. The default Korean Kaist model is reported to have issues with separating punctuation during tokenization (#276). Switching to the Korean GSD model may solve this issue.
Default Polish LFG POS tagger incorrectly labeling last word in sentence as PUNCT. The default Polish model trained on the LFG treebank may incorrectly tag the last word in a sentence as PUNCT (#220). Switching to the Polish PDB model may solve this issue.
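
Switching to the alternative models suggested above might look like the following sketch. It assumes that the package identifiers are the lowercased treebank names ('gsd' for Korean, 'pdb' for Polish) and that 'ko' and 'pl' are the language codes; ALTERNATIVE_PACKAGES, alternative_pipeline_kwargs and build_alternative_pipeline are hypothetical names introduced for illustration.

```python
# Treebank packages suggested in the notes above, keyed by language code.
# Assumption: package identifiers are the lowercased treebank names.
ALTERNATIVE_PACKAGES = {"ko": "gsd", "pl": "pdb"}


def alternative_pipeline_kwargs(lang):
    # Raises KeyError for languages without a suggested alternative.
    return {"lang": lang, "package": ALTERNATIVE_PACKAGES[lang]}


def build_alternative_pipeline(lang):
    # Imported lazily so the sketch can be loaded without Stanza installed.
    import stanza

    kwargs = alternative_pipeline_kwargs(lang)
    # Fetch the non-default models before building the pipeline.
    stanza.download(lang, package=kwargs["package"])
    return stanza.Pipeline(**kwargs)
```

For example, build_alternative_pipeline("ko") would download and load the Korean GSD models instead of the default Kaist ones, provided network access is available.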
