textacy v0.12.0 Release Notes
Release Date: 2021-12-06
Refactored and extended text statistics functionality (PR #350)
- Added functions for computing measures of lexical diversity, such as the classic Type-Token Ratio and the modern Hypergeometric Distribution Diversity
- Added functions for counting token-level attributes, including morphological features and parts-of-speech, in a convenient form
- Refactored all text stats functions to accept a `Doc` as their first positional arg, suitable for use as custom doc extensions (see below)
- Deprecated the `TextStats` class, since other methods for accessing the underlying functionality were made more accessible and convenient, and there's no longer need for a third method.
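
For readers unfamiliar with the Type-Token Ratio mentioned above: it is the number of distinct tokens divided by the total number of tokens. The following is a minimal plain-Python sketch of that definition only. `type_token_ratio` is a hypothetical helper written for illustration, not textacy's implementation or API, and it takes a list of strings rather than a spaCy `Doc`.

```python
def type_token_ratio(tokens):
    """Return the count of unique tokens (types) divided by the total token count."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


# Illustrative input: whitespace-split words stand in for real tokenization.
words = "the cat sat on the mat".split()
print(type_token_ratio(words))  # 5 unique types over 6 tokens
```

A ratio close to 1 means few repeated words. The raw ratio tends to fall as texts get longer, which is one reason length-robust measures such as Hypergeometric Distribution Diversity exist.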
Standardized functionality for getting/setting/removing doc extensions (PR #352)
- Now, custom extensions are accessed by name, and users have more control over the process:

      >>> import textacy
      >>> from textacy import extract, text_stats
      >>> textacy.set_doc_extensions("extract")
      >>> textacy.set_doc_extensions("text_stats.readability")
      >>> textacy.remove_doc_extensions("extract.matches")
      >>> textacy.make_spacy_doc("This is a test.", "en_core_web_sm")._.flesch_reading_ease()
      118.17500000000001

- Moved top-level extensions into `spacier.core` and `extract.bags`
- Standardized `extract` and `text_stats` subpackage extensions to use the new setup, and made them more customizable
Improved package code, tests, and docs
- Fixed outdated code and comments in the "Quickstart" guide, then renamed it "Walkthrough" since it wasn't actually quick; added a new and, yes, quick "Quickstart" guide to fill the gap (PR #353)
- Added a `pytest` conftest file to improve maintainability and consistency of unit test suite (PR #353)
- Improved quality and consistency of type annotations, everywhere (PR #349)
- Note: Bumped Python version support from 3.7-3.9 to 3.8-3.10 in order to take advantage of new typing features in PY3.8 and formally support the current major version (PR #348)
- Modernized and streamlined package builds and configuration (PR #347)
  - Removed deprecated `setup.py` and switched from `setuptools` to `build` for builds
  - Consolidated tool configuration in `pyproject.toml`
  - Extended and tidied up dev-oriented `Makefile`
  - Addressed some CI/CD issues
Fixed
- Added missing import, args in `TextStats` docs (PR #331, Issue #334)
- Fixed normalization in YAKE keyword extraction (PR #332)
- Fixed text encoding issue when loading `ConceptNet` data on Windows systems (Issue #345)
Contributors
Thanks to @austinjp, @scarroll32, @MirkoLenz for their help!
Previous changes from v0.11.0
Refactored, standardized, and extended several areas of functionality
- text preprocessing (`textacy.preprocessing`)
  - Added functions for normalizing bullet points in lists (`normalize.bullet_points()`), removing HTML tags (`remove.html_tags()`), and removing bracketed contents such as in-line citations (`remove.brackets()`).
  - Added `make_pipeline()` function for combining multiple preprocessors applied sequentially to input text into a single callable.
  - Renamed functions for flexibility and clarity of use; in most cases, this entails replacing an underscore with a period, e.g. `preprocessing.normalize_whitespace()` => `preprocessing.normalize.whitespace()`.
  - Renamed and standardized some funcs' args; for example, all "replace" functions had their (optional) second argument renamed from `replace_with` => `repl`, and `remove.punctuation(text, marks=".?!")` => `remove.punctuation(text, only=[".", "?", "!"])`.
. - structured information extraction (
textacy.extract
) - Consolidated and restructured functionality previously spread across the
extract.py
andtext_utils.py
modules andke
subpackage. For the latter two, imports have changed:from textacy import ke; ke.textrank()
=>from textacy import extract; extract.keyterms.textrank()
from textacy import text_utils; text_utils.keywords_in_context()
=>from textacy import extract; extract.keywords_in_context()
- Added new extraction functions:
extract.regex_matches()
: For matching regex patterns in a document's text that cross spaCy token boundaries, with various options for aligning matches back to tokens.extract.acronyms()
: For extracting acronym-like tokens, without looking around for related definitions.extract.terms()
: For flexibly combining n-grams, entities, and noun chunks into a single collection, with optional deduplication.
- Improved the generality and quality of extracted "triples" such as Subject-Verb-Objects, and changed the structure of returned objects accordingly. Previously, only contiguous spans were permitted for each element, but this was overly restrictive: A sentence like "I did not really like the movie." would produce an SVO of
("I", "like", "movie")
which is... misleading. The new approach uses lists of tokens that need not be adjacent; in this case, it produces(["I"], ["did", "not", "like"], ["movie"])
. For convenience, triple results are all named tuples, so elements may be accessed by name or index (e.g.svo.subject
==svo[0]
). - Changed
extract.keywords_in_context()
to always yield results, with optional padding of contexts, leaving printing of contexts up to users; also extended it to acceptDoc
orstr
objects as input. - Removed deprecated
extract.pos_regex_matches()
function, which is superseded by the more powerfulextract.token_matches()
- string and sequence similarity metrics (`textacy.similarity`)
  - Refactored top-level `similarity.py` module into a subpackage, with metrics split out into categories: edit-, token-, and sequence-based approaches, as well as hybrid metrics.
  - Added several similarity metrics:
    - edit-based Jaro (`similarity.jaro()`)
    - token-based Cosine (`similarity.cosine()`), Bag (`similarity.bag()`), and Tversky (`similarity.tversky()`)
    - sequence-based Matching Subsequences Ratio (`similarity.matching_subsequences_ratio()`)
    - hybrid Monge-Elkan (`similarity.monge_elkan()`)
  - Removed a couple of similarity metrics: Word Movers Distance relied on a troublesome external dependency, and Word2Vec+Cosine is available in spaCy via `Doc.similarity`.
- network- and vector-based document representations (`textacy.representations`)
  - Consolidated and reworked networks functionality in `representations.network` module
    - Added `build_cooccurrence_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes for each unique string and edges to other strings that co-occurred.
    - Added `build_similarity_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes as top-level elements and edges to all others weighted by pairwise similarity.
    - Removed obsolete `network.py` module and duplicative `extract.keyterms.graph_base.py` module.
  - Refined vectorizer initialization, and moved from `vsm.vectorizers` to `representations.vectorizers` module.
    - For both `Vectorizer` and `GroupVectorizer`, applying global inverse document frequency weights is now handled by a single arg: `idf_type: Optional[str]`, rather than a combination of `apply_idf: bool, idf_type: str`; similarly, applying document-length weight normalizations is handled by `dl_type: Optional[str]` instead of `apply_dl: bool, dl_type: str`
  - Added `representations.sparse_vec` module for higher-level access to document vectorization via `build_doc_term_matrix()` and `build_grp_term_matrix()` functions, for cases when a single fit+transform is all you need.
- automatic language identification (`textacy.lang_id`)
  - Moved functionality from `lang_utils.py` module into a subpackage, and added the primary user interface (`identify_lang()` and `identify_topn_langs()`) as package-level imports.
  - Implemented and trained a more accurate `thinc`-based language identification model that's closer to the original CLD3 inspiration, replacing the simpler `sklearn`-based pipeline.
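
The `make_pipeline()` function listed under text preprocessing above combines several preprocessors into one callable that applies them in order. The sketch below illustrates that idea in plain Python, under stated assumptions: `make_pipeline_sketch`, `strip_brackets`, and `collapse_whitespace` are hypothetical stand-ins written for this example, not textacy's functions or its implementation.

```python
import re
from functools import reduce


def make_pipeline_sketch(*funcs):
    """Combine single-argument text functions into one callable, applied left to right."""
    def pipeline(text):
        # Feed the output of each function into the next one.
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline


def strip_brackets(text):
    # Hypothetical stand-in preprocessor: drop square-bracketed spans such as "[1]".
    return re.sub(r"\[[^\]]*\]", "", text)


def collapse_whitespace(text):
    # Hypothetical stand-in preprocessor: squeeze whitespace runs into single spaces.
    return re.sub(r"\s+", " ", text).strip()


preproc = make_pipeline_sketch(strip_brackets, collapse_whitespace)
print(preproc("Hello   world [1]  again."))  # prints "Hello world again."
```

Order matters here: removing the bracketed span first leaves extra spaces behind, which the whitespace step then cleans up.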
Updated interface with spaCy for v3, and better leveraged the new functionality
- Restricted `textacy.load_spacy_lang()` to only accept full spaCy language pipeline names or paths, in accordance with v3's removal of pipeline aliases and general tightening-up on this front. Unfortunately, `textacy` can no longer play fast and loose with automatic language identification => pipeline loading...
- Extended `textacy.make_spacy_doc()` to accept a `chunk_size` arg that splits input text into chunks, processes each individually, then joins them into a single `Doc`; supersedes `spacier.utils.make_doc_from_text_chunks()`, which is now deprecated.
- Moved core `Doc` extensions into a top-level `extensions.py` module, and improved/streamlined the collection
  - Refactored and improved performance of `Doc._.to_bag_of_words()` and `Doc._.to_bag_of_terms()`, leveraging related functionality in `extract.words()` and `extract.terms()`
  - Removed redundant/awkward extensions:
    - `Doc._.lang` => use `Doc.lang_`
    - `Doc._.tokens` => use `iter(Doc)`
    - `Doc._.n_tokens` => `len(Doc)`
    - `Doc._.to_terms_list()` => `extract.terms(doc)` or `Doc._.extract_terms()`
    - `Doc._.to_tagged_text()` => NA, this was an old holdover that's not used in practice anymore
    - `Doc._.to_semantic_network()` => NA, use a function in `textacy.representations.networks`
- Added `Doc` extensions for `textacy.extract` functions (see above for details), with most functions having direct analogues; for example, to extract acronyms, use either `textacy.extract.acronyms(doc)` or `doc._.extract_acronyms()`. Keyterm extraction functions share a single extension: `textacy.extract.keyterms.textrank(doc)` <> `doc._.extract_keyterms(method="textrank")`
- Leveraged spaCy's new `DocBin` for efficiently saving/loading `Doc`s in binary format, with corresponding arg changes in `io.write_spacy_docs()` and `Corpus.save()` + `.load()`
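
The `chunk_size` behaviour described above for `textacy.make_spacy_doc()` follows a split, process, join pattern. The sketch below shows only the splitting and rejoining steps on plain strings. It assumes chunks are cut by character count, and `iter_text_chunks` is a hypothetical helper written for illustration; no spaCy processing happens here and this is not textacy's code.

```python
def iter_text_chunks(text, chunk_size):
    """Yield consecutive slices of ``text``, each at most ``chunk_size`` characters long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(text), chunk_size):
        yield text[start : start + chunk_size]


text = "The quick brown fox jumps over the lazy dog."
chunks = list(iter_text_chunks(text, 10))

# In the real workflow each chunk would be processed on its own before joining;
# here we only confirm that the split loses no characters.
assert "".join(chunks) == text
print(chunks[0])  # first ten characters of the text
```

Cutting purely by character count can split a word or sentence across two chunks, so a generous chunk size is the safer choice when trying this pattern.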
Improved package documentation, tests, dependencies, and type annotations
- Added two beginner-oriented tutorials to documentation, showing how to use various aspects of the package in the context of specific tasks.
- Reorganized API reference docs to put like functionality together and more consistently provide summary tables up top
- Updated dependencies list and package versions
  - Removed: `pyemd` and `srsly`
  - Un-capped max versions: `numpy` and `scikit-learn`
  - Bumped min versions: `cytoolz`, `jellyfish`, `matplotlib`, `pyphen`, and `spacy` (v3.0+ only!)
- Bumped min Python version from 3.6 => 3.7, and added PY3.9 support
- Removed `textacy.export` module, which had functions for exporting spaCy docs into other external formats; this was a soft dependency on `gensim` and CONLL-U that wasn't enforced or guaranteed, so better to remove.
- Added `types.py` module for shared types, and used them everywhere. Also added/fixed type annotations throughout the code base.
- Improved, added, and parametrized literally hundreds of tests.
Contributors
Many thanks to @timgates42, @datanizing, @8W9aG, @0x2b3bfa0, and @gryBox for submitting PRs, either merged or used as inspiration for my own rework-in-progress.