textacy v0.9.0 Release Notes

Release Date: 2019-09-03
  • Note: textacy is now PY3-only! Specifically, support for PY2.7 has been dropped, and the minimum PY3 version has been bumped to 3.6 (PR #261). See below for related changes.

    New:

    • Added augmentation subpackage for basic text data augmentation (PR #268, #269)
      • implemented several transformer functions for substituting, inserting, swapping, and deleting elements of text at both the word- and character-level
      • implemented an Augmenter class for combining multiple transforms and applying them to spaCy Docs in a randomized but configurable manner
      • Note: This API is provisional, and subject to change in future releases.
    • Added resources subpackage for standardized access to linguistic resources (PR #265)
      • DepecheMood++: high-coverage emotion lexicons for understanding the emotions evoked by a text. Updated from a previous version, and now features better English data and Italian data with expanded, consistent functionality.
      • removed lexicon_methods.py module with previous implementation
      • ConceptNet: multilingual knowledge base for representing relationships between words, similar to WordNet. Currently supports getting word antonyms, hyponyms, meronyms, and synonyms in dozens of languages.
    • Added UDHR dataset, a collection of translations of the Universal Declaration of Human Rights (PR #271)
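
To make the augmentation idea above more concrete, here is a minimal plain-Python sketch of word-level transforms combined by a small randomized applier. It does not use textacy's API: the names `swap_words`, `delete_words`, and `apply_transforms` are hypothetical, and the functions operate on plain token lists rather than spaCy Docs.

```python
import random


def swap_words(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    out = list(tokens)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def delete_words(tokens, rng):
    """Delete one randomly chosen token, always keeping at least one."""
    out = list(tokens)
    if len(out) < 2:
        return out
    del out[rng.randrange(len(out))]
    return out


def apply_transforms(tokens, transforms, num, seed=None):
    """Apply `num` randomly selected transforms, one after another."""
    rng = random.Random(seed)
    out = list(tokens)
    for transform in rng.sample(transforms, k=min(num, len(transforms))):
        out = transform(out, rng)
    return out


# Hypothetical usage: both transforms are applied in a random order.
tokens = "the quick brown fox jumps".split()
augmented = apply_transforms(tokens, [swap_words, delete_words], num=2, seed=0)
print(" ".join(augmented))
```

Passing a seed keeps the randomized behavior reproducible, which is useful when comparing augmented datasets across runs.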

    Changed:

    • Updated and extended functionality previously blocked by PY2 compatibility while reducing code bloat / complexity
      • made many args keyword-only, to prevent user error
      • args accepting strings for directory / file paths now also accept pathlib.Path objects, with pathlib adopted widely under the hood
      • increased minimum versions and/or uncapped maximum versions of several dependencies, including jellyfish, networkx, and numpy
    • Added a Portuguese-specific formulation of Flesch Reading Ease score to text_stats (PR #263)
    • Reorganized the package to group related functionality together
      • moved core functionality for loading spaCy langs and making spaCy docs into spacier.core, out of cache.py and doc.py
      • moved some general-purpose functionality from dataset.utils to io.utils and utils.py
      • moved function for loading "hyphenator" out of cache.py and into text_stats.py, where it's used
    • Re-trained and released language identification pipelines using a better mix of training data, for slightly improved performance; also added the script used to train the pipeline
    • Changed API Reference docs to show items in source code rather than alphabetical order, which should make the ordering more human-friendly
    • Updated repo README and PyPI metadata to be more consistent and representative of current functionality
    • Removed previously deprecated textacy.io.split_record_fields() function
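
The keyword-only and pathlib changes noted above follow a common PY3 pattern, sketched below. This is an illustration under stated assumptions, not textacy's actual code: `write_lines` and its parameters are hypothetical. A bare `*` in the signature forces the remaining args to be passed by keyword, and wrapping the path in `pathlib.Path` lets callers pass either a string or a `Path`.

```python
import pathlib
import tempfile


def write_lines(lines, filepath, *, mode="wt", encoding="utf-8", make_dirs=False):
    """Hypothetical helper: every arg after `*` must be passed by keyword."""
    filepath = pathlib.Path(filepath)  # accepts str or pathlib.Path
    if make_dirs:
        filepath.parent.mkdir(parents=True, exist_ok=True)
    with filepath.open(mode=mode, encoding=encoding) as f:
        for line in lines:
            f.write(line + "\n")
    return filepath


with tempfile.TemporaryDirectory() as tmp:
    # A plain string path and a Path object are both accepted.
    p1 = write_lines(["a", "b"], str(pathlib.Path(tmp) / "sub" / "one.txt"), make_dirs=True)
    p2 = write_lines(["c"], pathlib.Path(tmp) / "two.txt")
    contents = (p1.read_text(encoding="utf-8"), p2.read_text(encoding="utf-8"))

    # Passing `mode` positionally is rejected, which prevents silent arg mix-ups.
    try:
        write_lines(["d"], pathlib.Path(tmp) / "three.txt", "wt")
    except TypeError:
        rejected_positional = True
    else:
        rejected_positional = False

print(contents, rejected_positional)
```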

    Fixed:

    • Fixed a regex for cleaning up crufty terms to prevent catastrophic backtracking in certain edge cases (true story: this bug was encountered in production code, and ruined my day)
    • Fixed bad handling of edge cases in sCAKE keyterm extraction (Issue #270)
    • Changed order in which URL regexes are applied in preprocessing.replace_urls() to properly handle certain edge case URLs (Issue #267)
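
For readers unfamiliar with catastrophic backtracking, the sketch below shows the general shape of the problem and one way to avoid it. The patterns are illustrative assumptions and are not the regex from textacy's source. In `RISKY`, a quantified group contains another quantifier plus an optional separator, so a backtracking engine can split a run of word characters in exponentially many ways before giving up on a non-matching string. `SAFE` accepts the same strings but leaves only one way to split them.

```python
import re

# Illustrative patterns only; NOT the regex that was fixed in textacy.
# RISKY nests a quantifier inside a quantified group with an optional separator.
RISKY = re.compile(r"^(\w+\s?)+$")
# SAFE matches the same strings, but each word boundary is unambiguous.
SAFE = re.compile(r"^\w+(?:\s\w+)*\s?$")

# On short inputs, both patterns agree on what matches.
samples = ["key term", "keyterm", "key term ", "key  term", "key term!", ""]
agreement = [bool(RISKY.match(s)) == bool(SAFE.match(s)) for s in samples]
print(all(agreement))

# A long run of word characters followed by a non-matching character is the
# kind of input that can make RISKY take exponential time, so only SAFE is
# run against it here.
pathological = "a" * 5000 + "!"
print(SAFE.match(pathological) is None)
```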

    Contributors:

    Thanks much to @hugoabonizio for the contribution.