textacy v0.11.0 Release Notes

Release Date: 2021-04-12
    • ♻️ Refactored, standardized, and extended several areas of functionality
      • text preprocessing (textacy.preprocessing)
        • Added functions for normalizing bullet points in lists (normalize.bullet_points()), removing HTML tags (remove.html_tags()), and removing bracketed contents such as in-line citations (remove.brackets()).
        • Added a make_pipeline() function for combining multiple preprocessors into a single callable that applies them sequentially to input text; see the sketch after this list.
        • Renamed functions for flexibility and clarity of use; in most cases, this entails replacing an underscore with a period, e.g. preprocessing.normalize_whitespace() => preprocessing.normalize.whitespace().
        • Renamed and standardized some funcs' args; for example, all "replace" functions had their (optional) second argument renamed from replace_with => repl, and remove.punctuation(text, marks=".?!") => remove.punctuation(text, only=[".", "?", "!"]).
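
        A minimal sketch of the renamed functions and make_pipeline() in action (the example snippet is made up):

        ```python
        from textacy import preprocessing

        # chain several preprocessors into one callable, applied in order
        preproc = preprocessing.make_pipeline(
            preprocessing.remove.html_tags,
            preprocessing.normalize.bullet_points,
            preprocessing.normalize.whitespace,
        )
        preproc("<div>- first item<br>- second   item</div>")
        ```
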
      • structured information extraction (textacy.extract)
        • Consolidated and restructured functionality previously spread across the extract.py and text_utils.py modules and the ke subpackage. For the latter two, imports have changed:
          • from textacy import ke; ke.textrank() => from textacy import extract; extract.keyterms.textrank()
          • from textacy import text_utils; text_utils.keywords_in_context() => from textacy import extract; extract.keywords_in_context()
        • Added new extraction functions (see the sketch after this list):
          • extract.regex_matches(): For matching regex patterns in a document's text that cross spaCy token boundaries, with various options for aligning matches back to tokens.
          • extract.acronyms(): For extracting acronym-like tokens, without looking around for related definitions.
          • extract.terms(): For flexibly combining n-grams, entities, and noun chunks into a single collection, with optional deduplication.
        • Improved the generality and quality of extracted "triples" such as Subject-Verb-Objects, and changed the structure of returned objects accordingly. Previously, only contiguous spans were permitted for each element, but this was overly restrictive: a sentence like "I did not really like the movie." would produce an SVO of ("I", "like", "movie"), which is... misleading. The new approach uses lists of tokens that need not be adjacent; in this case, it produces (["I"], ["did", "not", "like"], ["movie"]). For convenience, triple results are all named tuples, so elements may be accessed by name or index (e.g. svo.subject == svo[0]).
        • Changed extract.keywords_in_context() to always yield results, with optional padding of contexts, leaving printing of contexts up to users; also extended it to accept Doc or str objects as input.
        • Removed the deprecated extract.pos_regex_matches() function, which is superseded by the more powerful extract.token_matches().
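
        A minimal sketch of a few of these functions, assuming an installed en_core_web_sm pipeline (outputs are indicative, not exact):

        ```python
        import spacy
        from textacy import extract

        nlp = spacy.load("en_core_web_sm")
        doc = nlp("I did not really like the movie.")

        # triples are named tuples of token lists; access elements by name or index
        svo = next(extract.subject_verb_object_triples(doc))
        assert svo.subject == svo[0]

        # acronym-like tokens, with no definition-hunting
        list(extract.acronyms(nlp("NASA and ESA collaborate on missions.")))

        # keyterm extraction now lives under extract.keyterms
        extract.keyterms.textrank(nlp("Natural language processing with spaCy and textacy."))
        ```
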
      • string and sequence similarity metrics (textacy.similarity)
        • Refactored the top-level similarity.py module into a subpackage, with metrics split out into categories: edit-, token-, and sequence-based approaches, as well as hybrid metrics.
        • Added several similarity metrics (see the sketch after this list):
          • edit-based Jaro (similarity.jaro())
          • token-based Cosine (similarity.cosine()), Bag (similarity.bag()), and Tversky (similarity.tversky())
          • sequence-based Matching Subsequences Ratio (similarity.matching_subsequences_ratio())
          • hybrid Monge-Elkan (similarity.monge_elkan())
        • Removed a couple of similarity metrics: Word Mover's Distance relied on a troublesome external dependency, and Word2Vec+Cosine is available in spaCy via Doc.similarity.
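
        A minimal sketch of a few of the new metrics, with edit-based metrics comparing strings and token-based metrics comparing sequences of strings (the inputs are made up):

        ```python
        from textacy import similarity

        similarity.jaro("language", "languish")                            # edit-based
        similarity.cosine(["natural", "language"], ["language", "model"])  # token-based
        similarity.monge_elkan(["hello", "world"], ["hallo", "world"])     # hybrid
        ```
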
      • network- and vector-based document representations (textacy.representations)
        • Consolidated and reworked networks functionality in the representations.network module
          • Added a build_cooccurrence_network() function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes for each unique string and edges to other strings that co-occurred.
          • Added a build_similarity_network() function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes as top-level elements and edges to all others weighted by pairwise similarity.
          • Removed the obsolete network.py module and the duplicative extract.keyterms.graph_base.py module.
        • Refined vectorizer initialization, and moved the vectorizers from the vsm.vectorizers to the representations.vectorizers module.
          • For both Vectorizer and GroupVectorizer, applying global inverse document frequency weights is now handled by a single arg, idf_type: Optional[str], rather than a combination of apply_idf: bool, idf_type: str; similarly, applying document-length weight normalizations is handled by dl_type: Optional[str] instead of apply_dl: bool, dl_type: str.
        • Added a representations.sparse_vec module for higher-level access to document vectorization via build_doc_term_matrix() and build_grp_term_matrix() functions, for cases when a single fit+transform is all you need (see the sketch below).
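
        A minimal sketch of the reworked pieces; the tokenized docs are made up, and the "smooth" idf_type value plus the (matrix, vocabulary) return shape of build_doc_term_matrix() are assumptions based on the arg descriptions above:

        ```python
        from textacy.representations.network import build_cooccurrence_network
        from textacy.representations.sparse_vec import build_doc_term_matrix
        from textacy.representations.vectorizers import Vectorizer

        tokenized_docs = [["brief", "cat", "text"], ["brief", "dog", "text"]]

        # graph with a node per unique string and edges between co-occurring strings
        graph = build_cooccurrence_network(tokenized_docs)

        # idf and dl weighting are each toggled by a single optional-str arg
        vectorizer = Vectorizer(tf_type="linear", idf_type="smooth", dl_type=None)

        # one-shot fit+transform into a sparse doc-term matrix, plus fitted vocabulary
        doc_term_matrix, vocab = build_doc_term_matrix(tokenized_docs, idf_type="smooth")
        ```
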
      • automatic language identification (textacy.lang_id)
        • Moved functionality from the lang_utils.py module into a subpackage, and added the primary user interface (identify_lang() and identify_topn_langs()) as package-level imports (see the sketch below).
        • Implemented and trained a more accurate thinc-based language identification model that's closer to the original CLD3 inspiration, replacing the simpler sklearn-based pipeline.
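
        A minimal sketch of that interface (the snippet, scores, and the topn arg name are illustrative):

        ```python
        from textacy import lang_id

        lang_id.identify_lang("Ceci n'est pas une pipe.")                # e.g. "fr"
        lang_id.identify_topn_langs("Ceci n'est pas une pipe.", topn=3)  # e.g. [("fr", 0.98), ...]
        ```
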
    • ⚡️ Updated interface with spaCy for v3, and better leveraged the new functionality
      • Restricted textacy.load_spacy_lang() to only accept full spaCy language pipeline names or paths, in accordance with v3's removal of pipeline aliases and general tightening-up on this front. Unfortunately, textacy can no longer play fast and loose with automatic language identification => pipeline loading...
      • Extended textacy.make_spacy_doc() to accept a chunk_size arg that splits input text into chunks, processes each individually, then joins them into a single Doc; supersedes spacier.utils.make_doc_from_text_chunks(), which is now deprecated.
      • Moved core Doc extensions into a top-level extensions.py module, and improved/streamlined the collection
      • Refactored and improved performance of Doc._.to_bag_of_words() and Doc._.to_bag_of_terms(), leveraging related functionality in extract.words() and extract.terms()
      • Removed redundant/awkward extensions:
        • Doc._.lang => use Doc.lang_
        • Doc._.tokens => use iter(Doc)
        • Doc._.n_tokens => use len(Doc)
        • Doc._.to_terms_list() => extract.terms(doc) or Doc._.extract_terms()
        • Doc._.to_tagged_text() => NA, this was an old holdover that's not used in practice anymore
        • Doc._.to_semantic_network() => NA, use a function in textacy.representations.networks
      • Added Doc extensions for textacy.extract functions (see above for details), with most functions having direct analogues; for example, to extract acronyms, use either textacy.extract.acronyms(doc) or doc._.extract_acronyms(). Keyterm extraction functions share a single extension: textacy.extract.keyterms.textrank(doc) <> doc._.extract_keyterms(method="textrank"). See the sketch after this list.
      • Leveraged spaCy's new DocBin for efficiently saving/loading Docs in binary format, with corresponding arg changes in io.write_spacy_docs() and Corpus.save()+.load()
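
      A minimal sketch pulling a few of these changes together, assuming an installed en_core_web_sm pipeline (the chunk_size value is arbitrary, and the Doc extensions are assumed to be registered on import, as in prior releases):

      ```python
      import textacy

      # full pipeline names or paths only, per spaCy v3
      nlp = textacy.load_spacy_lang("en_core_web_sm")

      # long inputs can be processed in chunks, then joined into a single Doc
      doc = textacy.make_spacy_doc("One sentence. " * 50_000, lang=nlp, chunk_size=100_000)

      # extensions mirror textacy.extract functions
      doc._.extract_keyterms(method="textrank")
      ```
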
    • 👌 Improved package documentation, tests, dependencies, and type annotations
      • Added two beginner-oriented tutorials to documentation, showing how to use various aspects of the package in the context of specific tasks.
      • Reorganized API reference docs to put like functionality together and more consistently provide summary tables up top
      • Updated the dependencies list and package versions
        • Removed: pyemd and srsly
        • Un-capped max versions: numpy and scikit-learn
        • Bumped min versions: cytoolz, jellyfish, matplotlib, pyphen, and spacy (v3.0+ only!)
      • Bumped min Python version from 3.6 => 3.7, and added PY3.9 support
      • Removed the textacy.export module, which had functions for exporting spaCy docs into other external formats; this entailed soft dependencies on gensim and CoNLL-U support that weren't enforced or guaranteed, so better to remove.
      • Added types.py module for shared types, and used them everywhere. Also added/fixed type annotations throughout the code base.
      • Improved, added, and parametrized literally hundreds of tests.

    Contributors

    🔀 Many thanks to @timgates42, @datanizing, @8W9aG, @0x2b3bfa0, and @gryBox for submitting PRs, either merged or used as inspiration for my own rework-in-progress.


Previous changes from v0.10.1

  • 🆕 New and Changed:

    • ♻️ Expanded text statistics and refactored into a sub-package (PR #307)
      • Refactored text_stats module into a sub-package with the same name and top-level API, but restructured under the hood for better consistency
      • Improved performance, API, and documentation on the main TextStats class, and improved documentation on many of the individual stats functions
      • Added new readability tests for texts in Arabic (Automated Arabic Readability Index), Spanish (µ-legibility and perspicuity index), and Turkish (a lang-specific formulation of Flesch Reading Ease)
      • Breaking change: Removed the TextStats.basic_counts and TextStats.readability_stats attributes, since typically only one or a couple of the stats are needed for a given use case; also, some of the readability tests are language-specific, which meant bad results could get mixed in with good ones
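
      Individual stats are now accessed one at a time; a minimal sketch, with attribute names assumed from this release's text_stats API:

      ```python
      import spacy
      from textacy import text_stats

      doc = spacy.load("en_core_web_sm")("Many years later, he remembered that distant afternoon.")
      ts = text_stats.TextStats(doc)
      ts.n_words              # individual basic counts...
      ts.flesch_reading_ease  # ...and readability stats, accessed directly
      ```
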
    • 👌 Improved and standardized some code quality and performance (PR #305, #306)
      • Standardized error messages via top-level errors.py module
      • Replaced str.format() with f-strings (almost) everywhere, for performance and readability
      • Fixed a whole mess of linting errors, significantly improving code quality and consistency
    • 👌 Improved package configuration, and maintenance (PRs #298, #305, #306)
      • Added automated GitHub workflows for building and testing the package, linting and formatting, publishing new releases to PyPI, and building documentation (and ripped out Travis CI)
      • Added a makefile with common commands for dev work, plus instructions
      • Adopted the new pyproject.toml package configuration standard; updated and streamlined setup.py and setup.cfg accordingly; and removed requirements.txt
      • Moved all source code into a /src directory, for technical reasons
      • Added mypy-specific config file to reduce output noisiness when type-checking
    • 👌 Improved and moved package documentation (PR #309)
      • Moved the docs site back to ReadTheDocs (https://textacy.readthedocs.io)! Pardon the years-long detour into GitHub Pages...
      • Enabled markdown-based documentation using recommonmark instead of m2r, and migrated all "narrative" docs from .rst to equivalent .md files
      • Added auto-generated summary tables to many sections of the API Reference, to help users get an overview of functionality and better find what they're looking for; also added auto-generated section heading references
      • Tidied up and further standardized docstrings throughout the code
    • Kept up with the Python ecosystem
      • Trained a v1.1 language identifier model using scikit-learn==0.23.0, and bumped the upper bound on that dependency's version accordingly
      • Updated and parametrized many tests using modern pytest functionality (PR #306)
      • Got textacy versions 0.9.1 and 0.10.0 up on conda-forge (Issue #294)
      • Added spectral seriation as a term-ordering technique when making a "Termite" visualization by taking advantage of pandas.DataFrame functionality, and otherwise tidied up the defaults for nicer-looking plots (PR #295)

    🛠 Fixed:

    • 📄 Corrected an incorrect and misleading reference in the quickstart docs (Issue #300, PR #302)
    • 🛠 Fixed a bug in the delete_words() augmentation transform (Issue #308)

    Contributors:

    🍱 Special thanks to @tbsexton, @marius-mather, and @rmax for their contributions! 💐