Stanza v1.0.0 Release Notes

Release Date: 2020-03-17
  • Overview

    This is the first major release of Stanza (previously known as StanfordNLP), a software package to process many human languages. The main features of this release are:

    • Multi-lingual named entity recognition support. Stanza supports named entity recognition in 8 languages (and 12 datasets): Arabic, Chinese, Dutch, English, French, German, Russian, and Spanish. The most comprehensive NER model for each language is now part of that language's default model download, along with the other models trained on the largest dataset available.
    • Accurate neural network models. Stanza features highly accurate data-driven neural network models for a wide collection of natural language processing tasks, including tokenization, sentence segmentation, part-of-speech tagging, morphological feature tagging, dependency parsing, and named entity recognition.
    • State-of-the-art pretrained models freely available. Stanza features a few hundred pretrained models for 60+ languages, all freely available and easily downloadable from native Python code. Most of these models achieve state-of-the-art (or competitive) performance on these tasks.
    • Expanded language support. Stanza now supports more than 60 human languages, representing a wide range of language families.
    • Easy-to-use native Python interface. We've improved the usability of the interface to maximize transparency. Now intermediate processing results are more easily viewed and accessed as native Python objects.
    • Anaconda support. Stanza now officially supports installation from Anaconda. You can install Stanza through Stanford NLP Group's Anaconda channel by running conda install -c stanfordnlp stanza.
    • Improved documentation. We have improved our documentation to include comprehensive coverage of the basic and advanced functionality supported by Stanza.
    • Improved CoreNLP support in Python. We have improved the robustness and efficiency of the CoreNLPClient to access the Java CoreNLP software from Python code. It is also forward compatible with the next major release of CoreNLP.
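
    As a rough illustration of the Python interface and NER support described above, the sketch below runs a tokenization-plus-NER pipeline and collects the recognized entities. The helper names extract_entities and format_entities are hypothetical, introduced only for this example, and the Stanza calls assume the package is installed and that the default English models can be downloaded.

```python
def format_entities(entities):
    """Render (text, type) pairs as one tab-separated line per entity."""
    return "\n".join(f"{text}\t{etype}" for text, etype in entities)


def extract_entities(text, lang="en"):
    """Run tokenization and NER with Stanza and return (text, type) pairs.

    Assumes Stanza is installed and that models for `lang` can be downloaded.
    """
    # Imported lazily so format_entities stays usable without Stanza installed.
    import stanza

    stanza.download(lang)  # fetch the default models for this language
    nlp = stanza.Pipeline(lang, processors="tokenize,ner")
    doc = nlp(text)
    # Each entity span exposes its surface text and its entity type.
    return [(ent.text, ent.type) for ent in doc.ents]
```

    Calling extract_entities on an English sentence should return one (text, type) pair per recognized entity. The entity types you see depend on the dataset the NER model was trained on.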

    Enhancements and Bugfixes

    This release also contains many enhancements and bugfixes:

    • [Enhancement] Improved lemmatization support with proper conditioning on POS tags (#143). Thanks to @nljubesi for the report!
    • [Enhancement] Get the text corresponding to sentences in the document. Access it through sentence.text. (#80)
    • [Enhancement] Improved logging. Stanza now uses Python's logging for all procedural logging, which can be controlled globally either through logging_level or a verbose shortcut. See this page for more information. (#81)
    • [Enhancement] Allow the user to use the Stanza tokenizer with their own sentence split, which might be useful for applications like machine translation. Simply set tokenize_no_ssplit to True at pipeline instantiation. (#108)
    • [Enhancement] Support running only the dependency parser on data that is already tokenized, sentence-segmented, and tagged with POS and morphological features. Simply set depparse_pretagged to True at pipeline instantiation. (#141) Thanks @mrapacz for the contribution!
    • [Enhancement] Added spaCy as an option for tokenizing (and sentence segmenting) English text for efficiency. See this documentation page for a quick example.
    • [Enhancement] Add character offsets to tokens, sentences, and spans.
    • [Bugfix] Correctly decide whether to load pretrained embedding files given training flags. (#120)
    • [Bugfix] Fix Google protocol buffer errors reported for long input when using the CoreNLPClient. (#154)
    • [Bugfix] Remove deprecation warnings from newer versions of PyTorch. (#162)
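
    Several of the enhancements above are switched on at pipeline instantiation. The sketch below shows one way to combine them: tokenize_no_ssplit, logging_level, and sentence.text come from the notes above, while the helper names pipeline_options and sentence_texts are hypothetical. It assumes Stanza and the relevant models are already installed, and that with tokenize_no_ssplit enabled the input separates sentences with a blank line.

```python
def pipeline_options(presplit=False, quiet=False):
    """Collect keyword arguments for stanza.Pipeline from two boolean choices."""
    options = {}
    if presplit:
        # Keep the caller's own sentence split instead of Stanza's.
        options["tokenize_no_ssplit"] = True
    if quiet:
        # Only show warnings and errors from Stanza's logger.
        options["logging_level"] = "WARN"
    return options


def sentence_texts(text, lang="en", **choices):
    """Tokenize `text` with Stanza and return the text of each sentence.

    Assumes Stanza and the models for `lang` are already installed.
    """
    # Imported lazily so pipeline_options stays usable without Stanza installed.
    import stanza

    nlp = stanza.Pipeline(lang, processors="tokenize", **pipeline_options(**choices))
    doc = nlp(text)
    return [sentence.text for sentence in doc.sentences]
```

    For example, sentence_texts("First sentence here\n\nSecond sentence here", presplit=True, quiet=True) would tokenize each pre-split sentence separately while suppressing informational log output.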

    Breaking Changes

    Note that if your code was developed on a previous version of the package, there are potentially many breaking changes in this release. The most notable changes are in the Document objects, which contain all the annotations for the raw text or document fed into the Stanza pipeline. The underlying implementation of Document and all related data objects no longer uses the CoNLL-U format as its internal representation, which allows more flexible and efficient access to their attributes. The objects remain compatible with CoNLL-U, so conversion between the two is still easy. Moreover, many properties have been renamed for clarity and sometimes aliased for ease of access. Please see our documentation page about these data objects for more information.