Changelog History

  • v1.1.1 Changes

    August 13, 2020

    Overview

    This release features support for extending the capability of the Stanza pipeline with customized processors, a new sentiment analysis tool, improvements to the CoreNLPClient functionality, new models for a few languages (including Thai, which is supported for the first time in Stanza), new biomedical and clinical English packages, alternative servers for downloading resource files, and various improvements and bugfixes.

    New Features and Enhancements

    New Sentiment Analysis Models for English, German, Chinese : The default Stanza pipelines for English, German and Chinese now include sentiment analysis models. The released models are based on a convolutional neural network architecture, and predict three-way sentiment labels (negative/neutral/positive). For more information and details on the datasets used to train these models and their performance, please visit the Stanza website.
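
    As a rough sketch of how the three-way labels might be consumed: the Stanza documentation describes the prediction as an integer on sentence.sentiment (0, 1, 2 for negative, neutral, positive). That mapping, the processor list, and the helper names below are assumptions for illustration, not something this changelog states.

```python
from types import SimpleNamespace

# Assumed mapping from the integer stored on sentence.sentiment to a label.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def label_sentences(doc):
    """Return (text, label) pairs for every sentence in an annotated document."""
    return [(s.text, SENTIMENT_LABELS[s.sentiment]) for s in doc.sentences]


def run_pipeline(text):
    """Annotate text with an English pipeline (needs stanza and downloaded models)."""
    import stanza  # imported lazily so label_sentences works without stanza

    nlp = stanza.Pipeline("en", processors="tokenize,sentiment")
    return label_sentences(nlp(text))


# Stand-in document, used only to show the shape label_sentences expects.
demo = SimpleNamespace(sentences=[
    SimpleNamespace(text="I loved it.", sentiment=2),
    SimpleNamespace(text="It was fine.", sentiment=1),
])
print(label_sentences(demo))
```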

    New Biomedical and Clinical English Model Packages : Stanza now features syntactic analysis and named entity recognition functionality for English biomedical literature text and clinical notes. These newly introduced packages include: 2 individual biomedical syntactic analysis pipelines, 8 biomedical NER models, 1 clinical syntactic pipeline and 2 clinical NER models. For detailed information on how to download and use these pipelines, please visit Stanza's biomedical models page.

    Support for Adding User Customized Processors via Python Decorators : Stanza now supports adding customized processors or processor variants (i.e., alternatives to existing processors) into existing pipelines. The name and implementation of the added customized processors or processor variants can be specified via the @register_processor or @register_processor_variant decorators. See the Stanza website for more information and examples (see custom Processors and Processor variants). (PR #322)
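
    A minimal sketch of a custom processor, loosely following the lowercasing example in the Stanza documentation. The import path stanza.pipeline.processor, the _requires/_provides attributes, and the helper names are assumptions based on that documentation; the constructor signature has varied across versions, so the sketch accepts and ignores its arguments.

```python
from types import SimpleNamespace


def lowercase_doc(doc):
    """Lowercase the document text and every token and word in place."""
    doc.text = doc.text.lower()
    for sentence in doc.sentences:
        for token in sentence.tokens:
            token.text = token.text.lower()
        for word in sentence.words:
            word.text = word.text.lower()
    return doc


def register_lowercase_processor():
    """Register the processor under the name "lowercase" (requires stanza)."""
    from stanza.pipeline.processor import Processor, register_processor

    @register_processor("lowercase")
    class LowercaseProcessor(Processor):
        _requires = {"tokenize"}   # must run after tokenization
        _provides = {"lowercase"}

        def __init__(self, *args, **kwargs):
            # Constructor arguments differ between Stanza versions,
            # so this sketch accepts and ignores them.
            pass

        def _set_up_model(self, *args):
            pass  # nothing to load for this processor

        def process(self, doc):
            return lowercase_doc(doc)

    return LowercaseProcessor


# Stand-in objects, only to show what lowercase_doc does to a document.
word = SimpleNamespace(text="Hello")
token = SimpleNamespace(text="Hello")
demo = SimpleNamespace(
    text="Hello",
    sentences=[SimpleNamespace(tokens=[token], words=[word])],
)
lowercase_doc(demo)
print(demo.text, token.text, word.text)
```

    After calling register_lowercase_processor(), the processor would presumably be requested by name, for example stanza.Pipeline('en', processors='tokenize,lowercase').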

    Support for Editable Properties For Data Objects : We have made it easier to extend the functionality of the Stanza neural pipeline by adding new annotations to Stanza's data objects (e.g., Document, Sentence, Token, etc). Aside from the annotations they already support, additional annotations can be easily attached through data_object.add_property(). See our documentation for more information and examples. (PR #323)
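
    A small sketch of attaching a derived annotation to Word objects. The import path stanza.models.common.doc and the getter keyword are assumptions taken from the Stanza documentation; the property name upos_xpos and both helper names are hypothetical.

```python
from types import SimpleNamespace


def upos_xpos(word):
    """Combine the universal and treebank-specific POS tags into one string."""
    return f"UPOS-{word.upos}|XPOS-{word.xpos}"


def install_upos_xpos_property():
    """Expose upos_xpos as a read-only property on every Word (requires stanza)."""
    from stanza.models.common.doc import Word

    # Assumed signature: add_property(name, default=None, getter=None, setter=None)
    Word.add_property("upos_xpos", getter=upos_xpos)


# Stand-in word, only to show what the getter computes.
print(upos_xpos(SimpleNamespace(upos="NOUN", xpos="NN")))
```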

    Support for Automated CoreNLP Installation and CoreNLP Model Download : CoreNLP can now be easily downloaded in Stanza with stanza.install_corenlp(dir='path/to/corenlp/installation'); CoreNLP models can now be downloaded with stanza.download_corenlp_models(model='english', version='4.1.0', dir='path/to/corenlp/installation'). For more details please see the Stanza website. (PR #363)

    Japanese Pipeline Supports SudachiPy as External Tokenizer : You can now use the SudachiPy library as the tokenizer in a Stanza Japanese pipeline. Turn this on when building a pipeline with nlp = stanza.Pipeline('ja', processors={'tokenize': 'sudachipy'}). Note that this requires a separate installation of the SudachiPy library via pip. (PR #365)

    New Alternative Server for Stable Download of Resource Files : Users in areas of the world that do not have stable access to GitHub servers can now download models from an alternative Stanford server by specifying a new resources_url argument. For example, stanza.download(lang='en', resources_url='stanford') will now download the resource file and English pipeline from Stanford servers. (Issue #331, PR #356)

    CoreNLPClient Supports New Multiprocessing-friendly Mechanism to Start the CoreNLP Server : The CoreNLPClient now supports new Enum values with clearer semantics for its start_server argument, giving finer-grained control over how the server is launched. This includes a new option, StartServer.TRY_START, that launches the CoreNLP server if one isn't running already but doesn't fail if one has already been launched. This option makes it easier to use CoreNLPClient in a multiprocessing environment. Boolean values are still supported for backward compatibility, but we recommend StartServer.FORCE_START and StartServer.DONT_START for better readability. (PR #302)
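
    A sketch of how several worker processes might each open a client against one shared server. The endpoint URL, annotator list, and function name are illustrative assumptions; only start_server and the StartServer values come from the note above.

```python
def open_shared_client(endpoint="http://localhost:9000"):
    """Create a client that starts the server only if none is running yet.

    Requires stanza and a local CoreNLP installation.
    """
    from stanza.server import CoreNLPClient, StartServer

    return CoreNLPClient(
        start_server=StartServer.TRY_START,  # reuse an already-running server
        endpoint=endpoint,                    # assumed address of the shared server
        annotators="tokenize,ssplit",         # illustrative annotator list
    )
```

    Each worker would call open_shared_client() independently; with TRY_START, whichever worker gets there first is expected to launch the server and the others connect to it.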

    New Semgrex Interface in CoreNLP Client for Dependency Parses of Arbitrary Languages : Stanford CoreNLP has a module which allows searches over dependency graphs using a regex-like language. Previously, this was only usable for languages for which CoreNLP already supported dependency trees. This release expands it to dependency graphs for any language. (Issue #399, PR #392)

    New Tokenizer for Thai Language : The available UD data for Thai is quite small. The authors of pythainlp helped by providing us with two tokenization datasets, Orchid and Inter-BEST. Future work will include POS, NER, and Sentiment. (Issue #148)

    Support for Serialization of Document Objects : Now you can serialize and deserialize the entire document by running serialized_string = doc.to_serialized() and doc = Document.from_serialized(serialized_string). The serialized string can be decoded into Python objects by running objs = pickle.loads(serialized_string). (Issue #361, PR #366)
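
    A minimal sketch of persisting a serialized document to disk. It assumes the serialized value is a bytes object (suggested by the pickle.loads call above) and that Document lives in stanza.models.common.doc; the helper names and the FakeDoc stand-in are hypothetical.

```python
import os
import pickle
import tempfile


def save_document(doc, path):
    """Write the serialized form of an annotated document to a file."""
    with open(path, "wb") as handle:
        handle.write(doc.to_serialized())


def load_document(path):
    """Rebuild a Document from a file written by save_document (requires stanza)."""
    from stanza.models.common.doc import Document

    with open(path, "rb") as handle:
        return Document.from_serialized(handle.read())


class FakeDoc:
    """Stand-in with the same to_serialized() shape, for illustration only."""

    def to_serialized(self):
        return pickle.dumps({"text": "Hello world."})


demo_path = os.path.join(tempfile.mkdtemp(), "doc.pkl")
save_document(FakeDoc(), demo_path)
with open(demo_path, "rb") as handle:
    print(pickle.loads(handle.read()))
```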

    Improved Tokenization Speed : Previously, the tokenizer was the slowest member of the neural pipeline, several times slower than any of the other processors. This release brings it in line with the others. The speedup is from improving the text processing before the data is passed to the GPU. (Relevant commits: 546ed13, 8e2076c, 7f5be82, etc.)

    User-provided Ukrainian NER model : We now have a model built from the lang-uk NER dataset, provided by a user for redistribution.

    Breaking Interface Changes

    Token.id is Tuple and Word.id is Integer : The id attribute for a token will now return a tuple of integers to represent the indices of the token (or a singleton tuple in the case of a single-word token), and the id for a word will now return an integer to represent the word index. Previously, both attributes were encoded as strings and required manual conversion for downstream processing. This change makes these attributes more convenient to handle. (Issue: #211, PR: #357)
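
    A small sketch of how downstream code might consume the new id types. The helper names word_range and is_multiword are hypothetical; only the tuple and integer shapes come from the description above.

```python
def word_range(token_id):
    """Return the (first, last) word index covered by a token id tuple."""
    return token_id[0], token_id[-1]


def is_multiword(token_id):
    """A token spanning more than one word has more than one index in its id."""
    return len(token_id) > 1


# A single-word token carries a singleton tuple; a multi-word token carries
# the indices of the words it expands to.
for token_id in [(3,), (4, 5)]:
    print(token_id, word_range(token_id), is_multiword(token_id))
```

    With a real pipeline the same helpers would apply to each token.id in sentence.tokens, while word.id can be used directly as an integer.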

    Changed Default Pipeline Packages for Several Languages for Improved Robustness : Languages that have changed default packages include: Polish (default is now PDB model, from previous LFG, #220), Korean (default is now GSD, from previous Kaist, #276), Lithuanian (default is now ALKSNIS, from previous HSE, #415).

    CoreNLP 4.1.0 is required : CoreNLPClient requires CoreNLP 4.1.0 or a later version. The client expects recent modifications that were made to the CoreNLP server.

    Properties Cache removed from CoreNLP client : The properties_cache has been removed from CoreNLPClient and the CoreNLPClient's annotate() method no longer has a properties_key argument. Python dictionaries with custom request properties should be directly supplied to annotate() via the properties argument.
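
    A sketch of passing request properties directly as a dictionary. The specific property keys and the wrapper name are illustrative assumptions; the only thing taken from the note above is that annotate() accepts a properties argument. The recording stub stands in for a real CoreNLPClient.

```python
def annotate_with_overrides(client, text, annotators="tokenize,ssplit,pos"):
    """Send text to the server with per-request properties as a plain dict."""
    request_properties = {
        "annotators": annotators,
        "ssplit.eolonly": "true",  # illustrative: split sentences on newlines only
    }
    return client.annotate(text, properties=request_properties)


class RecordingClient:
    """Stand-in that records calls instead of contacting a CoreNLP server."""

    def __init__(self):
        self.calls = []

    def annotate(self, text, properties=None):
        self.calls.append((text, properties))
        return "ok"


stub = RecordingClient()
result = annotate_with_overrides(stub, "Hello world.")
print(result, stub.calls)
```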

    Bugfixes and Other Improvements

    Fixed Logging Behavior : This fixes an issue where Stanza would override the global logging setting in Python and affect downstream logging behavior. (Issue #278, PR #290)

    Compatibility Fix for PyTorch v1.6.0 : We've updated several processors to adapt to new API changes in PyTorch v1.6.0. (Issues #412 #417, PR #406)

    Improved Batching for Long Sentences in Dependency Parser : This mainly fixes an issue where long sentences could cause the dependency parser to run out of GPU memory. (Issue #387)

    Improved Neural Tokenizer Robustness to Whitespace : The neural tokenizer is now more robust to the presence of multiple consecutive whitespace characters. (PR #380)

    Resolved properties issue when switching languages with requests to CoreNLP server : An issue with default properties has been resolved. Users can now switch between CoreNLP-supported languages and get the expected properties for each language by default.

  • v1.0.1 Changes

    April 27, 2020

    Overview

    This is a maintenance release of Stanza. It features new support for jieba as Chinese tokenizer, faster lemmatizer implementation, improved compatibility with CoreNLP v4.0.0, and many more!

    ✨ Enhancements

    Supporting jieba library as Chinese tokenizer. The Stanza (simplified and traditional) Chinese pipelines now support using the jieba Chinese word segmentation library as tokenizer. Turn this feature on in a pipeline with nlp = stanza.Pipeline('zh', processors={'tokenize': 'jieba'}), or by specifying the argument tokenize_with_jieba=True.

    Setting resource directory with environment variable. You can now override the default model location $HOME/stanza_resources by setting an environmental variable STANZA_RESOURCES_DIR (#227). The new directory will then be used to store and look up model files. Thanks to @dhpollack for implementing this feature.
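
    A sketch of the lookup described above. The helper resolve_resources_dir is hypothetical and only mirrors the documented behavior; it is not Stanza's own code. To be safe, the variable is best set before stanza is imported, since the default location may be read at import time.

```python
import os


def resolve_resources_dir(environ=os.environ):
    """Return the override from STANZA_RESOURCES_DIR, else the documented default."""
    default = os.path.join(os.path.expanduser("~"), "stanza_resources")
    return environ.get("STANZA_RESOURCES_DIR", default)


# The override wins when present; otherwise the home-directory default is used.
print(resolve_resources_dir({"STANZA_RESOURCES_DIR": "/data/stanza_models"}))
print(resolve_resources_dir({}))
```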

    Faster lemmatizer implementation. The lemmatizer implementation has been improved to be about 3x faster on CPU and 5x faster on GPU (#249). Thanks to @mahdiman for identifying the original issue.

    Improved compatibility with CoreNLP 4.0.0. The client is now fully compatible with the latest v4.0.0 release of the CoreNLP package.

    πŸ›  Bugfixes

    Correct character offsets in NER outputs from pre-tokenized text. We fixed an issue where the NER outputs from pre-tokenized text may be off-by-one (#229). Thanks to @RyanElliott10 for reporting the issue.

    Correct Vietnamese tokenization on sentences beginning with punctuation. We fixed an issue where the Vietnamese tokenizer may throw an AssertionError on sentences that begin with a punctuation (#217). Thanks to @aryamccarthy for reporting this issue.

    Correct pytorch version requirement. Stanza now requires pytorch>=1.3.0 to avoid a runtime error raised by pytorch (#231). Thanks to @Vodkazy for reporting this.

    Known Model Issues & Solutions

    Default Korean Kaist tokenizer failing on punctuation. The default Korean Kaist model is reported to have issues with separating punctuation during tokenization (#276). Switching to the Korean GSD model may solve this issue.

    Default Polish LFG POS tagger incorrectly labeling last word in sentence as PUNCT. The default Polish model trained on the LFG treebank may incorrectly tag the last word in a sentence as PUNCT (#220). This issue may be solved by switching to the Polish PDB model.

  • v1.0.0 Changes

    March 17, 2020

    Overview

    This is the first major release of Stanza (previously known as StanfordNLP), a software package to process many human languages. The main features of this release are

    • Multi-lingual named entity recognition support. Stanza supports named entity recognition in 8 languages (and 12 datasets): Arabic, Chinese, Dutch, English, French, German, Russian, and Spanish. The most comprehensive NER model for each language is now part of the default model download for that language, along with other models trained on the largest dataset available.
    • Accurate neural network models. Stanza features highly accurate data-driven neural network models for a wide collection of natural language processing tasks, including tokenization, sentence segmentation, part-of-speech tagging, morphological feature tagging, dependency parsing, and named entity recognition.
    • State-of-the-art pretrained models freely available. Stanza features a few hundred pretrained models for 60+ languages, all freely available and easily downloadable from native Python code. Most of these models achieve state-of-the-art (or competitive) performance on these tasks.
    • Expanded language support. Stanza now supports more than 60 human languages, representing a wide range of language families.
    • Easy-to-use native Python interface. We've improved the usability of the interface to maximize transparency. Now intermediate processing results are more easily viewed and accessed as native Python objects.
    • Anaconda support. Stanza now officially supports installation from Anaconda. You can install Stanza through Stanford NLP Group's Anaconda channel: conda install -c stanfordnlp stanza.
    • Improved documentation. We have improved our documentation to include comprehensive coverage of the basic and advanced functionalities supported by Stanza.
    • Improved CoreNLP support in Python. We have improved the robustness and efficiency of the CoreNLPClient to access the Java CoreNLP software from Python code. It is also forward compatible with the next major release of CoreNLP.

    ✨ Enhancements and Bugfixes

    This release also contains many enhancements and bugfixes:

    • [Enhancement] Improved lemmatization support with proper conditioning on POS tags (#143). Thanks to @nljubesi for the report!
    • [Enhancement] Get the text corresponding to sentences in the document. Access it through sentence.text. (#80)
    • [Enhancement] Improved logging. Stanza now uses Python's logging for all procedural logging, which can be controlled globally either through logging_level or a verbose shortcut. See this page for more information. (#81)
    • [Enhancement] Allow the user to use the Stanza tokenizer with their own sentence split, which might be useful for applications like machine translation. Simply set tokenize_no_ssplit to True at pipeline instantiation. (#108)
    • [Enhancement] Support running the dependency parser only given tokenized, sentence segmented, and POS/morphological feature tagged data. Simply set depparse_pretagged to True at pipeline instantiation. (#141) Thanks @mrapacz for the contribution!
    • [Enhancement] Added spaCy as an option for tokenizing (and sentence segmenting) English text for efficiency. See this documentation page for a quick example.
    • [Enhancement] Add character offsets to tokens, sentences, and spans.
    • [Bugfix] Correctly decide whether to load pretrained embedding files given training flags. (#120)
    • [Bugfix] Google proto buffers reporting errors for long input when using the CoreNLPClient. (#154)
    • [Bugfix] Remove deprecation warnings from newer versions of PyTorch. (#162)
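
    The tokenize_no_ssplit enhancement in the list above can be sketched as follows. The convention that pre-split sentences are separated by a blank line (two newlines) is taken from the Stanza documentation rather than this changelog, and both helper names are hypothetical.

```python
def join_presplit(sentences):
    """Join already-split sentences with blank lines between them."""
    return "\n\n".join(sentence.strip() for sentence in sentences)


def tokenize_presplit(sentences, lang="en"):
    """Tokenize while keeping the caller's sentence boundaries (requires stanza)."""
    import stanza

    nlp = stanza.Pipeline(lang, processors="tokenize", tokenize_no_ssplit=True)
    return nlp(join_presplit(sentences))


print(repr(join_presplit(["This is one sentence.", " This is another. "])))
```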

    Breaking Changes

    Note that if your code was developed on a previous version of the package, there are potentially many breaking changes in this release. The most notable changes are in the Document objects, which contain all the annotations for the raw text or document fed into the Stanza pipeline. The underlying implementation of Document and all related data objects has broken away from using the CoNLL-U format as its internal representation, for more flexibility and efficiency in accessing their attributes, although it is still compatible with CoNLL-U to maintain ease of conversion between the two. Moreover, many properties have been renamed for clarity and sometimes aliased for ease of access. Please see our documentation page about these data objects for more information.

  • v0.2.0 Changes

    May 16, 2019

    This release features major improvements to the memory efficiency and speed of the neural network pipeline in stanfordnlp, along with various bugfixes. These features include:

    The downloadable pretrained neural network models are now substantially smaller in size (due to the use of smaller pretrained vocabularies) with comparable performance. Notably, the default English model is now ~9x smaller in size, German ~11x, French ~6x and Chinese ~4x. As a result, the memory efficiency of the neural pipelines for most languages is substantially improved.

    Substantial speedup of the neural lemmatizer via reduced neural sequence-to-sequence operations.

    The neural network pipeline can now take in a Python list of strings representing pre-tokenized text. (https://github.com/stanfordnlp/stanfordnlp/issues/58)

    A requirements checking framework is now added in the neural pipeline, ensuring the proper processors are specified for a given pipeline configuration. The pipeline will now raise an exception when a requirement is not satisfied. (https://github.com/stanfordnlp/stanfordnlp/issues/42)

    Bugfix related to alignment between tokens and words after the multi-word expansion processor. (https://github.com/stanfordnlp/stanfordnlp/issues/71)

    More options are added for customizing the Stanford CoreNLP server at start time, including specifying properties for the default pipeline, and setting all server options such as username/password. For more details on different options, please check out the client documentation page.

    CoreNLPClient instance can now be created with CoreNLP default language properties as:

    client = CoreNLPClient(properties='chinese')
    
    • Alternatively, a properties file can now be used during the creation of a CoreNLPClient:

      client = CoreNLPClient(properties='/path/to/corenlp.props')

    • All specified CoreNLP annotators are now preloaded by default when a CoreNLPClient instance is created. (https://github.com/stanfordnlp/stanfordnlp/issues/56)

  • v0.1.2 Changes

    February 26, 2019

    This is a maintenance release of stanfordnlp. This release features:

    • Allowing the tokenizer to treat the incoming document as pretokenized with space separated words in newline separated sentences. Set tokenize_pretokenized to True when building the pipeline to skip the neural tokenizer, and run all downstream components with your own tokenized text. (#24, #34)
    • Speedup in the POS/Feats tagger in evaluation (up to 2 orders of magnitude). (#18)
    • Various minor fixes and documentation improvements
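
    The pretokenized input format from the list above (space separated words, newline separated sentences) can be sketched as follows. The helper names are hypothetical, and the sketch imports the package under its later name stanza; at the time of this release it was distributed as stanfordnlp.

```python
def to_pretokenized_text(sentences):
    """Render lists of tokens as one sentence per line, tokens split by spaces."""
    return "\n".join(" ".join(tokens) for tokens in sentences)


def annotate_pretokenized(sentences, lang="en"):
    """Run downstream processors on caller-supplied tokens (requires stanza)."""
    import stanza

    nlp = stanza.Pipeline(lang, tokenize_pretokenized=True)
    return nlp(to_pretokenized_text(sentences))


print(to_pretokenized_text([["This", "is", "one", "."], ["And", "two", "."]]))
```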

    We would also like to thank the following community members for their contribution:
    Code improvements: @lwolfsonkin
    Documentation improvements: @0xflotus
    And thanks to everyone that raised issues and helped improve stanfordnlp!

  • v0.1.0 Changes

    January 30, 2019

    The initial release of StanfordNLP. StanfordNLP is the combination of the software package used by the Stanford team in the CoNLL 2018 Shared Task on Universal Dependency Parsing, and the group's official Python interface to the Stanford CoreNLP software. This package is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data. The modules are built on top of PyTorch (v1.0.0).

    StanfordNLP features:

    • Native Python implementation requiring minimal effort to set up;
    • Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological features tagging and dependency parsing;
    • Pretrained neural models supporting 53 (human) languages featured in 73 treebanks;
    • A stable, officially maintained Python interface to CoreNLP.