Changelog History
v0.12.0 Changes
December 06, 2021

- Refactored and extended text statistics functionality (PR #350)
  - Added functions for computing measures of lexical diversity, such as the classic Type-Token Ratio and the modern Hypergeometric Distribution Diversity (see the sketch below)
  - Added functions for counting token-level attributes, including morphological features and parts-of-speech, in a convenient form
  - Refactored all text stats functions to accept a `Doc` as their first positional arg, suitable for use as custom doc extensions (see below)
  - Deprecated the `TextStats` class, since other methods for accessing the underlying functionality were made more accessible and convenient, and there's no longer a need for a third method
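  A minimal sketch of the new `Doc`-first call style (the specific function names under `text_stats.diversity` are assumptions for illustration, not confirmed by these notes):

  ```python
  import textacy
  from textacy import text_stats

  doc = textacy.make_spacy_doc("The quick brown fox jumps over the lazy dog.", "en_core_web_sm")
  # text stats functions now take a Doc as their first positional arg
  text_stats.diversity.ttr(doc)  # classic Type-Token Ratio
  text_stats.diversity.hdd(doc)  # Hypergeometric Distribution Diversity
  ```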
- Standardized functionality for getting/setting/removing doc extensions (PR #352)
  - Now, custom extensions are accessed by name, and users have more control over the process:

    ```python
    >>> import textacy
    >>> from textacy import extract, text_stats
    >>> textacy.set_doc_extensions("extract")
    >>> textacy.set_doc_extensions("text_stats.readability")
    >>> textacy.remove_doc_extensions("extract.matches")
    >>> textacy.make_spacy_doc("This is a test.", "en_core_web_sm")._.flesch_reading_ease()
    118.17500000000001
    ```
  - Moved top-level extensions into `spacier.core` and `extract.bags`
  - Standardized `extract` and `text_stats` subpackage extensions to use the new setup, and made them more customizable
- Improved package code, tests, and docs
  - Fixed outdated code and comments in the "Quickstart" guide, then renamed it "Walkthrough" since it wasn't actually quick; added a new and, yes, quick "Quickstart" guide to fill the gap (PR #353)
  - Added a `pytest` conftest file to improve maintainability and consistency of the unit test suite (PR #353)
  - Improved quality and consistency of type annotations, everywhere (PR #349)
  - Note: Bumped Python version support from 3.7–3.9 to 3.8–3.10 in order to take advantage of new typing features in PY3.8 and formally support the current major version (PR #348)
  - Modernized and streamlined package builds and configuration (PR #347)
    - Removed the deprecated `setup.py` and switched from `setuptools` to `build` for builds
    - Consolidated tool configuration in `pyproject.toml`
    - Extended and tidied up the dev-oriented `Makefile`
    - Addressed some CI/CD issues
Fixed

- Added missing import, args in `TextStats` docs (PR #331, Issue #334)
- Fixed normalization in YAKE keyword extraction (PR #332)
- Fixed a text encoding issue when loading `ConceptNet` data on Windows systems (Issue #345)
Contributors
Thanks to @austinjp, @scarroll32, @MirkoLenz for their help!
v0.11.0 Changes
April 12, 2021

- Refactored, standardized, and extended several areas of functionality
  - text preprocessing (`textacy.preprocessing`)
    - Added functions for normalizing bullet points in lists (`normalize.bullet_points()`), removing HTML tags (`remove.html_tags()`), and removing bracketed contents such as in-line citations (`remove.brackets()`)
    - Added a `make_pipeline()` function for combining multiple preprocessors, applied sequentially to input text, into a single callable (see the sketch after this list)
    - Renamed functions for flexibility and clarity of use; in most cases, this entails replacing an underscore with a period, e.g. `preprocessing.normalize_whitespace()` => `preprocessing.normalize.whitespace()`
    - Renamed and standardized some funcs' args; for example, all "replace" functions had their (optional) second argument renamed from `replace_with` => `repl`, and `remove.punctuation(text, marks=".?!")` => `remove.punctuation(text, only=[".", "?", "!"])`
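    A minimal sketch of the new pipeline helper (the chosen preprocessors are illustrative):

    ```python
    from textacy import preprocessing

    # combine several preprocessors into a single callable, applied in order
    preproc = preprocessing.make_pipeline(
        preprocessing.remove.html_tags,
        preprocessing.remove.brackets,
        preprocessing.normalize.whitespace,
    )
    preproc("<p>Spock [1] lived  long and prospered.</p>")
    ```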
  - structured information extraction (`textacy.extract`)
    - Consolidated and restructured functionality previously spread across the `extract.py` and `text_utils.py` modules and the `ke` subpackage. For the latter two, imports have changed:
      - `from textacy import ke; ke.textrank()` => `from textacy import extract; extract.keyterms.textrank()`
      - `from textacy import text_utils; text_utils.keywords_in_context()` => `from textacy import extract; extract.keywords_in_context()`
    - Added new extraction functions (see the sketch after this list):
      - `extract.regex_matches()`: for matching regex patterns in a document's text that cross spaCy token boundaries, with various options for aligning matches back to tokens
      - `extract.acronyms()`: for extracting acronym-like tokens, without looking around for related definitions
      - `extract.terms()`: for flexibly combining n-grams, entities, and noun chunks into a single collection, with optional deduplication
    - Improved the generality and quality of extracted "triples" such as Subject-Verb-Objects, and changed the structure of returned objects accordingly. Previously, only contiguous spans were permitted for each element, but this was overly restrictive: a sentence like "I did not really like the movie." would produce an SVO of `("I", "like", "movie")`, which is... misleading. The new approach uses lists of tokens that need not be adjacent; in this case, it produces `(["I"], ["did", "not", "like"], ["movie"])`. For convenience, triple results are all named tuples, so elements may be accessed by name or index (e.g. `svo.subject` == `svo[0]`)
    - Changed `extract.keywords_in_context()` to always yield results, with optional padding of contexts, leaving printing of contexts up to users; also extended it to accept `Doc` or `str` objects as input
    - Removed the deprecated `extract.pos_regex_matches()` function, which is superseded by the more powerful `extract.token_matches()`
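    A short sketch of the reorganized entry points (the `ngs`/`ents` keyword args to `extract.terms()` are assumptions for illustration):

    ```python
    import textacy
    from textacy import extract

    doc = textacy.make_spacy_doc(
        "Facebook and Google dominate online advertising in the United States.",
        "en_core_web_sm",
    )
    extract.keyterms.textrank(doc)  # new home of the old ke.textrank()
    # flexibly combine bigrams and entities into one collection
    list(extract.terms(doc, ngs=2, ents=True))
    ```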
  - string and sequence similarity metrics (`textacy.similarity`)
    - Refactored the top-level `similarity.py` module into a subpackage, with metrics split out into categories: edit-, token-, and sequence-based approaches, as well as hybrid metrics
    - Added several similarity metrics (see the sketch after this list):
      - edit-based Jaro (`similarity.jaro()`)
      - token-based Cosine (`similarity.cosine()`), Bag (`similarity.bag()`), and Tversky (`similarity.tversky()`)
      - sequence-based Matching Subsequences Ratio (`similarity.matching_subsequences_ratio()`)
      - hybrid Monge-Elkan (`similarity.monge_elkan()`)
    - Removed a couple of similarity metrics: Word Movers Distance relied on a troublesome external dependency, and Word2Vec+Cosine is available in spaCy via `Doc.similarity`
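    A minimal sketch of the split between edit-based (string) and token-based (sequence) metrics:

    ```python
    from textacy import similarity

    # edit-based metrics compare two strings directly
    similarity.jaro("color", "colour")
    # token-based metrics compare two sequences of strings
    similarity.cosine(["big", "red", "dog"], ["big", "red", "cat"])
    ```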
  - network- and vector-based document representations (`textacy.representations`)
    - Consolidated and reworked networks functionality in the `representations.network` module
      - Added a `build_cooccurrence_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes for each unique string and edges to other strings that co-occurred
      - Added a `build_similarity_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes as top-level elements and edges to all others weighted by pairwise similarity
      - Removed the obsolete `network.py` module and the duplicative `extract.keyterms.graph_base.py` module
    - Refined vectorizer initialization, and moved it from the `vsm.vectorizers` to the `representations.vectorizers` module
      - For both `Vectorizer` and `GroupVectorizer`, applying global inverse document frequency weights is now handled by a single arg, `idf_type: Optional[str]`, rather than a combination of `apply_idf: bool, idf_type: str`; similarly, applying document-length weight normalizations is handled by `dl_type: Optional[str]` instead of `apply_dl: bool, dl_type: str`
    - Added a `representations.sparse_vec` module for higher-level access to document vectorization via `build_doc_term_matrix()` and `build_grp_term_matrix()` functions, for cases when a single fit+transform is all you need (see the sketch after this list)
  - automatic language identification (`textacy.lang_id`)
    - Moved functionality from the `lang_utils.py` module into a subpackage, and added the primary user interface (`identify_lang()` and `identify_topn_langs()`) as package-level imports
    - Implemented and trained a more accurate `thinc`-based language identification model that's closer to the original CLD3 inspiration, replacing the simpler `sklearn`-based pipeline
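  A minimal sketch of the one-shot vectorization mentioned above (the return value is assumed to be the sparse matrix plus a term-to-column vocabulary):

  ```python
  from textacy.representations import sparse_vec

  tokenized_docs = [
      ["one", "fish", "two", "fish"],
      ["red", "fish", "blue", "fish"],
  ]
  # single fit+transform, no Vectorizer instance needed
  dtm, vocab = sparse_vec.build_doc_term_matrix(
      tokenized_docs, tf_type="linear", idf_type="smooth"
  )
  ```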
- Updated the interface with spaCy for v3, and better leveraged the new functionality
  - Restricted `textacy.load_spacy_lang()` to only accept full spaCy language pipeline names or paths, in accordance with v3's removal of pipeline aliases and general tightening-up on this front. Unfortunately, `textacy` can no longer play fast and loose with automatic language identification => pipeline loading...
  - Extended `textacy.make_spacy_doc()` to accept a `chunk_size` arg that splits input text into chunks, processes each individually, then joins them into a single `Doc`; supersedes `spacier.utils.make_doc_from_text_chunks()`, which is now deprecated (see the sketch after this list)
  - Moved core `Doc` extensions into a top-level `extensions.py` module, and improved/streamlined the collection
  - Refactored and improved performance of `Doc._.to_bag_of_words()` and `Doc._.to_bag_of_terms()`, leveraging related functionality in `extract.words()` and `extract.terms()`
  - Removed redundant/awkward extensions:
    - `Doc._.lang` => use `Doc.lang_`
    - `Doc._.tokens` => use `iter(Doc)`
    - `Doc._.n_tokens` => use `len(Doc)`
    - `Doc._.to_terms_list()` => use `extract.terms(doc)` or `Doc._.extract_terms()`
    - `Doc._.to_tagged_text()` => NA; this was an old holdover that's not used in practice anymore
    - `Doc._.to_semantic_network()` => NA; use a function in `textacy.representations.networks`
  - Added `Doc` extensions for `textacy.extract` functions (see above for details), with most functions having direct analogues; for example, to extract acronyms, use either `textacy.extract.acronyms(doc)` or `doc._.extract_acronyms()`. Keyterm extraction functions share a single extension: `textacy.extract.keyterms.textrank(doc)` <> `doc._.extract_keyterms(method="textrank")`
  - Leveraged spaCy's new `DocBin` for efficiently saving/loading `Doc`s in binary format, with corresponding arg changes in `io.write_spacy_docs()` and `Corpus.save()` + `Corpus.load()`
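  A usage sketch of chunked processing (treating `chunk_size` as a character count is an assumption here):

  ```python
  import textacy

  long_text = "Lots and lots of text. " * 10_000
  # split input into chunks, process each, then join into a single Doc
  doc = textacy.make_spacy_doc(long_text, "en_core_web_sm", chunk_size=100_000)
  ```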
- Improved package documentation, tests, dependencies, and type annotations
  - Added two beginner-oriented tutorials to the documentation, showing how to use various aspects of the package in the context of specific tasks
  - Reorganized API reference docs to put like functionality together and more consistently provide summary tables up top
  - Updated dependencies list and package versions
    - Removed: `pyemd` and `srsly`
    - Un-capped max versions: `numpy` and `scikit-learn`
    - Bumped min versions: `cytoolz`, `jellyfish`, `matplotlib`, `pyphen`, and `spacy` (v3.0+ only!)
  - Bumped min Python version from 3.6 => 3.7, and added PY3.9 support
  - Removed the `textacy.export` module, which had functions for exporting spaCy docs into other external formats; this was a soft dependency on `gensim` and CONLL-U that wasn't enforced or guaranteed, so better to remove it
  - Added a `types.py` module for shared types, and used them everywhere; also added/fixed type annotations throughout the code base
  - Improved, added, and parametrized literally hundreds of tests
Contributors
Many thanks to @timgates42, @datanizing, @8W9aG, @0x2b3bfa0, and @gryBox for submitting PRs, either merged or used as inspiration for my own rework-in-progress.
v0.10.1 Changes
August 29, 2020

New and Changed:

- Expanded text statistics and refactored them into a sub-package (PR #307)
  - Refactored the `text_stats` module into a sub-package with the same name and top-level API, but restructured under the hood for better consistency
  - Improved performance, API, and documentation of the main `TextStats` class, and improved documentation of many of the individual stats functions
  - Added new readability tests for texts in Arabic (Automated Arabic Readability Index), Spanish (µ-legibility and perspicuity index), and Turkish (a lang-specific formulation of Flesch Reading Ease)
  - Breaking change: Removed the `TextStats.basic_counts` and `TextStats.readability_stats` attributes, since typically only one or a couple are needed for a given use case; also, some of the readability tests are language-specific, which meant bad results could get mixed in with good ones
- Improved and standardized some code quality and performance (PRs #305, #306)
  - Standardized error messages via a top-level `errors.py` module
  - Replaced `str.format()` with f-strings (almost) everywhere, for performance and readability
  - Fixed a whole mess of linting errors, significantly improving code quality and consistency
- Improved package configuration and maintenance (PRs #298, #305, #306)
  - Added automated GitHub workflows for building and testing the package, linting and formatting, publishing new releases to PyPI, and building documentation (and ripped out Travis CI)
  - Added a makefile with common commands for dev work, plus instructions
  - Adopted the new `pyproject.toml` package configuration standard; updated and streamlined `setup.py` and `setup.cfg` accordingly; and removed `requirements.txt`
  - Moved all source code into a `/src` directory, for technical reasons
  - Added a `mypy`-specific config file to reduce output noisiness when type-checking
- Improved and moved package documentation (PR #309)
  - Moved the docs site back to ReadTheDocs (https://textacy.readthedocs.io)! Pardon the years-long detour into GitHub Pages...
  - Enabled markdown-based documentation using `recommonmark` instead of `m2r`, and migrated all "narrative" docs from `.rst` to equivalent `.md` files
  - Added auto-generated summary tables to many sections of the API Reference, to help users get an overview of functionality and better find what they're looking for; also added auto-generated section heading references
  - Tidied up and further standardized docstrings throughout the code
- Kept up with the Python ecosystem
  - Trained a v1.1 language identifier model using `scikit-learn==0.23.0`, and bumped the upper bound on that dependency's version accordingly
  - Updated and parametrized many tests using modern `pytest` functionality (PR #306)
  - Got `textacy` versions 0.9.1 and 0.10.0 up on `conda-forge` (Issue #294)
  - Added spectral seriation as a term-ordering technique when making a "Termite" visualization, by taking advantage of `pandas.DataFrame` functionality, and otherwise tidied up the defaults for nice-looking plots (PR #295)
Fixed:

- Corrected an incorrect and misleading reference in the quickstart docs (Issue #300, PR #302)
- Fixed a bug in the `delete_words()` augmentation transform (Issue #308)

Contributors:

Special thanks to @tbsexton, @marius-mather, and @rmax for their contributions!
v0.10.0 Changes
March 01, 2020

New:

- Added a logo to textacy's documentation and social preview
- Added type hints throughout the code base, for more expressive type indicators in docstrings and for static type checkers used by developers to code more effectively (PR #289)
- Added a preprocessing function to normalize sequences of repeating characters (Issue #275)
Changed:

- Improved core `Corpus` functionality using recent additions to spaCy (PR #285); see the sketch after this list
  - Re-implemented `Corpus.save()` and `Corpus.load()` using spaCy's new `DocBin` class, which resolved a few bugs/issues (Issue #254)
  - Added an `n_process` arg to `Corpus.add()` to set the number of parallel processes used when adding many items to a corpus, following spaCy's updates to `nlp.pipe()` (Issue #277)
  - Bumped the minimum spaCy version from 2.0.12 => 2.2.0, accordingly
- Added handling for zero-width whitespaces to the `normalize_whitespace()` function (Issue #278)
- Improved a couple of rough spots in package administration:
  - Moved package setup information into a declarative configuration file, in an attempt to keep up with evolving best practices for Python packaging
  - Simplified the configuration and interoperability of sphinx + GitHub Pages for generating package documentation
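A quick sketch of parallelized corpus building (arg values are illustrative):

```python
import textacy

texts = ["This is a short text.", "This is another one."] * 1000
corpus = textacy.Corpus("en_core_web_sm")
# use multiple processes when adding many items, per spaCy's nlp.pipe()
corpus.add(texts, n_process=2)
```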
Fixed:
v0.9.1 Changes
September 03, 2019

Changed:

- Tweaked the `TopicModel` class to work with newer versions of `scikit-learn`, and updated version requirements accordingly, from `>=0.18.0,<0.21.0` to `>=0.19`

Fixed:

- Fixed residual bugs in the script for training language identification pipelines, then trained and released one using `scikit-learn==0.19` to prevent errors for users on that version
v0.9.0 Changes
September 03, 2019

Note: `textacy` is now PY3-only! Specifically, support for PY2.7 has been dropped, and the minimum PY3 version has been bumped to 3.6 (PR #261). See below for related changes.

New:
- Added an `augmentation` subpackage for basic text data augmentation (PR #268, #269); a usage sketch follows this list
  - Implemented several transformer functions for substituting, inserting, swapping, and deleting elements of text at both the word- and character-level
  - Implemented an `Augmenter` class for combining multiple transforms and applying them to spaCy `Doc`s in a randomized but configurable manner
  - Note: This API is provisional, and subject to change in future releases.
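  A minimal sketch under the provisional API (the transform names and `num` arg reflect my reading of this release; treat them as assumptions):

  ```python
  import textacy
  from textacy import augmentation

  doc = textacy.make_spacy_doc(
      "The quick brown fox jumps over the lazy dog.", "en_core_web_sm"
  )
  # combine two word-level transforms, each applied with probability 0.5
  augmenter = augmentation.Augmenter(
      [augmentation.transforms.swap_words, augmentation.transforms.delete_words],
      num=[0.5, 0.5],
  )
  augmenter.apply_transforms(doc)  # => a new, augmented Doc
  ```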
- Added a `resources` subpackage for standardized access to linguistic resources (PR #265)
  - DepecheMood++: high-coverage emotion lexicons for understanding the emotions evoked by a text. Updated from a previous version, and now features better English data and Italian data with expanded, consistent functionality.
    - Removed the `lexicon_methods.py` module with the previous implementation
  - ConceptNet: multilingual knowledge base for representing relationships between words, similar to WordNet. Currently supports getting word antonyms, hyponyms, meronyms, and synonyms in dozens of languages.
- Added a `UDHR` dataset, a collection of translations of the Universal Declaration of Human Rights (PR #271)
Changed:

- Updated and extended functionality previously blocked by PY2 compatibility, while reducing code bloat / complexity
  - Made many args keyword-only, to prevent user error
  - Args accepting strings for directory / file paths now also accept `pathlib.Path` objects, with `pathlib` adopted widely under the hood
  - Increased minimum versions and/or uncapped maximum versions of several dependencies, including `jellyfish`, `networkx`, and `numpy`
- Added a Portuguese-specific formulation of the Flesch Reading Ease score to `text_stats` (PR #263)
- Reorganized and grouped together some like functionality
  - Moved core functionality for loading spaCy langs and making spaCy docs into `spacier.core`, out of `cache.py` and `doc.py`
  - Moved some general-purpose functionality from `dataset.utils` to `io.utils` and `utils.py`
  - Moved the function for loading a "hyphenator" out of `cache.py` and into `text_stats.py`, where it's used
- Re-trained and released language identification pipelines using a better mix of training data, for slightly improved performance; also added the script used to train the pipeline
- Changed API Reference docs to show items in source-code order rather than alphabetical order, which should make the ordering more human-friendly
- Updated the repo README and PyPI metadata to be more consistent and representative of current functionality
- Removed the previously deprecated `textacy.io.split_record_fields()` function
Fixed:

- Fixed a regex for cleaning up crufty terms to prevent catastrophic backtracking in certain edge cases (true story: this bug was encountered in production code, and ruined my day)
- Fixed bad handling of edge cases in sCAKE keyterm extraction (Issue #270)
- Changed the order in which URL regexes are applied in `preprocessing.replace_urls()` to properly handle certain edge-case URLs (Issue #267)

Contributors:

Thanks much to @hugoabonizio for the contribution!
v0.8.0 Changes
July 14, 2019

New and Changed:

- Refactored and expanded text preprocessing functionality (PR #253)
  - Moved code from a top-level `preprocess` module into a `preprocessing` sub-package, and reorganized it in the process
  - Added new functions:
    - `replace_hashtags()` to replace hashtags like `#FollowFriday` or `#spacyIRL2019` with `_TAG_`
    - `replace_user_handles()` to replace user handles like `@bjdewilde` or `@spacy_io` with `_USER_`
    - `replace_emojis()` to replace emoji symbols with `_EMOJI_`
    - `normalize_hyphenated_words()` to join hyphenated words back together, like `antici- pation` => `anticipation`
    - `normalize_quotation_marks()` to replace "fancy" quotation marks with simple ascii equivalents, like `“the god particle”` => `"the god particle"`
  - Changed a couple functions for clarity and consistency:
    - `replace_currency_symbols()` now replaces all dedicated ascii and unicode currency symbols with `_CUR_`, rather than just a subset thereof, and no longer provides for replacement with the corresponding currency code (like `$` => `USD`)
    - `remove_punct()` now has a `fast (bool)` kwarg rather than `method (str)`
  - Removed the `normalize_contractions()`, `preprocess_text()`, and `fix_bad_unicode()` functions, since they were bad/awkward and more trouble than they were worth
- Refactored and expanded keyterm extraction functionality (PR #257)
  - Moved code from a top-level `keyterms` module into a `ke` sub-package, and cleaned it up / standardized arg names / better shared functionality in the process
  - Added new unsupervised keyterm extraction algorithms: YAKE (`ke.yake()`), sCAKE (`ke.scake()`), and PositionRank (`ke.textrank()`, with non-default parameter values)
  - Added new methods for selecting candidate keyterms: longest matching subsequence candidates (`ke.utils.get_longest_subsequence_candidates()`) and pattern-matching candidates (`ke.utils.get_pattern_matching_candidates()`)
  - Improved the speed of the SGRank implementation, and generally optimized much of the code
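  A quick sketch of the new keyterm-extraction algorithms (the `topn` arg is an assumption for illustration):

  ```python
  import textacy
  from textacy import ke

  doc = textacy.make_spacy_doc(
      "Natural language processing helps computers understand human language.",
      "en_core_web_sm",
  )
  ke.yake(doc, topn=5)   # new YAKE implementation
  ke.scake(doc, topn=5)  # new sCAKE implementation
  ```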
- Improved document similarity functionality (PR #256)
  - Added a character ngram-based similarity measure (`similarity.character_ngrams()`), for something that's useful in different contexts than the other measures
  - Removed the Jaro-Winkler string similarity measure (`similarity.jaro_winkler()`), since it didn't add much beyond other measures
  - Improved the speed of the Token Sort Ratio implementation
  - Replaced the `python-levenshtein` dependency with `jellyfish`, for its active development, better documentation, and actually-compliant license
- Added customizability to certain functionality
  - Added options to `Doc._.to_bag_of_words()` and `Corpus.word_counts()` for filtering out stop words, punctuation, and/or numbers (PR #249)
  - Allowed for objects that look like `sklearn`-style topic modeling classes to be passed into `tm.TopicModel()` (PR #248)
  - Added options to customize rc params used by `matplotlib` when drawing a "termite" plot in `viz.draw_termite_plot()` (PR #248)
- Removed deprecated functions with direct replacements: `io.utils.get_filenames()` and `spacier.components.merge_entities()`
Contributors:
v0.7.1 Changes
June 25, 2019

New:

- Added a default, built-in language identification classifier that's moderately fast, moderately accurate, and covers a relatively large number of languages [PR #247]
  - Implemented a Google CLD3-inspired model in `scikit-learn` and trained it on ~1.5M texts in ~130 different languages spanning a wide variety of subject matter and stylistic formality; overall, speed and performance compare favorably to other open-source options (`langid`, `langdetect`, `cld2-cffi`, and `cld3`)
  - Dropped the `cld2-cffi` dependency [Issue #246]
- Added an `extract.matches()` function to extract spans from a document matching one or more patterns of per-token (attribute, value) pairs, with optional quantity qualifiers; this is a convenient interface to spaCy's rule-based `Matcher` and a more powerful replacement for textacy's existing (now deprecated) `extract.pos_regex_matches()`
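  A short sketch of pattern matching (the exact pattern-string syntax shown is as I recall it; treat it as an assumption):

  ```python
  import textacy
  from textacy import extract

  doc = textacy.make_spacy_doc("The cats chased the laser dot.", "en_core_web_sm")
  # match a determiner followed by one or more nouns; ":+" is a quantity qualifier
  list(extract.matches(doc, "POS:DET POS:NOUN:+"))
  ```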
- Added a `preprocess.normalize_unicode()` function to transform unicode characters into their canonical forms; this is a less-intensive consolation prize for the previously-removed `fix_unicode()` function
Changed:

- Enabled loading blank spaCy `Language` pipelines (tokenization only -- no model-based tagging, parsing, etc.) via `load_spacy_lang(name, allow_blank=True)`, for use cases that don't rely on annotations; disabled by default to avoid unwelcome surprises
- Changed inclusion/exclusion and de-duplication of entities and ngrams in `to_terms_list()` [Issues #169, #179]
  - `entities = True` => include entities, and drop exact duplicate ngrams
  - `entities = False` => don't include entities, and also drop exact duplicate ngrams
  - `entities = None` => use ngrams as-is, without checking against entities
- Moved the `to_collection()` function from the `datasets.utils` module to the top-level `utils` module, for use throughout the code base
- Added a `quoting` option to `io.read_csv()` and `io.write_csv()`, for problematic cases
- Deprecated the `spacier.components.merge_entities()` pipeline component, an implementation of which has since been added into spaCy itself
- Updated documentation for developer convenience and reader clarity
  - Split API reference docs into related chunks, rather than having them all together in one long page, and tidied up headers
  - Fixed errors / inconsistencies in various docstrings (a never-ending struggle...)
  - Ported the package readme and changelog from `.rst` to `.md` format
Fixed:

- The `NotImplementedError` previously added to `preprocess.fix_unicode()` is now raised rather than returned [Issue #243]
v0.7.0 Changes
May 13, 2019

New and Changed:

- Removed `textacy.Doc`, and split its functionality into two parts
  - New: Added `textacy.make_spacy_doc()` as a convenient and flexible entry point for making spaCy `Doc`s from text or (text, metadata) pairs, with optional spaCy language pipeline specification. It's similar to `textacy.Doc.__init__`, with the exception that text and metadata are passed in together as a 2-tuple.
  - New: Added a variety of custom doc property and method extensions to the global `spacy.tokens.Doc` class, accessible via its `Doc._` "underscore" property. These are similar to the properties/methods on `textacy.Doc`; they just require an interstitial underscore. For example, `textacy.Doc.to_bag_of_words()` => `spacy.tokens.Doc._.to_bag_of_words()`.
  - New: Added functions for setting, getting, and removing these extensions. Note that they are set automatically when textacy is imported.
- Simplified and improved performance of `textacy.Corpus`
  - Documents are now added through a simpler API, either in `Corpus.__init__` or `Corpus.add()`; they may be one or a stream of texts, (text, metadata) pairs, or existing spaCy `Doc`s. When adding many documents, the spaCy language processing pipeline is used in a faster and more efficient way.
  - Saving / loading corpus data to disk is now more efficient and robust.
  - Note: `Corpus` is now a collection of spaCy `Doc`s rather than `textacy.Doc`s.
- Simplified, standardized, and added `Dataset` functionality
  - New: Added an `IMDB` dataset, built on the classic 2011 dataset commonly used to train sentiment analysis models.
  - New: Added a base `Wikimedia` dataset, from which a reworked `Wikipedia` dataset and a separate `Wikinews` dataset inherit. The underlying data source has changed, from XML db dumps of raw wiki markup to JSON db dumps of (relatively) clean text and metadata; now, the code is simpler, faster, and totally language-agnostic.
  - `Dataset.records()` now streams (text, metadata) pairs rather than a dict containing both text and metadata, so users don't need to know field names and split them into separate streams before creating `Doc` or `Corpus` objects from the data.
  - Filtering and limiting the number of texts/records produced is now clearer and more consistent between the `.texts()` and `.records()` methods on a given `Dataset` --- and more performant!
  - Downloading datasets now always shows progress bars and saves to the same file names. When appropriate, downloaded archive files' contents are automatically extracted for easy inspection.
  - Common functionality (such as validating filter values) is now standardized and consolidated in the `datasets.utils` module.
- Quality of life improvements
  - Reduced load time for `import textacy` from ~2-3 seconds to ~1 second, by lazy-loading expensive variables, deferring a couple heavy imports, and dropping a couple dependencies. Specifically:
    - `ftfy` was dropped, and a `NotImplementedError` is now raised in textacy's wrapper function, `textacy.preprocess.fix_bad_unicode()`. Users with bad unicode should now directly call `ftfy.fix_text()`.
    - `ijson` was dropped, and the behavior of `textacy.read_json()` is now simpler and consistent with other functions for line-delimited data.
    - `mwparserfromhell` was dropped, since the reworked `Wikipedia` dataset no longer requires complicated and slow parsing of wiki markup.
  - Renamed certain functions and variables for clarity, and for consistency with existing conventions:
    - `textacy.load_spacy()` => `textacy.load_spacy_lang()`
    - `textacy.extract.named_entities()` => `textacy.extract.entities()`
    - `textacy.data_dir` => `textacy.DEFAULT_DATA_DIR`
    - `filename` => `filepath` and `dirname` => `dirpath` when specifying full paths to files/dirs on disk, and `textacy.io.utils.get_filenames()` => `textacy.io.utils.get_filepaths()` accordingly
    - compiled regular expressions now consistently start with `RE_`
    - `SpacyDoc` => `Doc`, `SpacySpan` => `Span`, `SpacyToken` => `Token`, `SpacyLang` => `Language` as variables and in docs
  - Removed deprecated functionality
    - top-level `spacy_utils.py` and `spacy_pipelines.py` are gone; use equivalent functionality in the `spacier` subpackage instead
    - `math_utils.py` is gone; it was long neglected, and never actually used
  - Replaced `textacy.compat.bytes_to_unicode()` and `textacy.compat.unicode_to_bytes()` with `textacy.compat.to_unicode()` and `textacy.compat.to_bytes()`, which are safer and accept either binary or text strings as input.
  - Moved and renamed language detection functionality: `textacy.text_utils.detect_language()` => `textacy.lang_utils.detect_lang()`. The idea is to add more/better lang-related functionality here in the future.
  - Updated and cleaned up documentation throughout the code base.
  - Added and refactored many tests, for both new and old functionality, significantly increasing test coverage while significantly reducing run-time. Also, added a proper coverage report to CI builds. This should help prevent future errors and inspire better test-writing.
  - Bumped the minimum required spaCy version: `v2.0.0` => `v2.0.12`, for access to their full set of custom extension functionality.
Fixed:

- The progress bar during an HTTP download now always closes, preventing weird nesting issues if another bar is subsequently displayed.
- Filtering datasets by multiple values performed either a logical AND or OR over the values, which was confusing; now, a logical OR is always performed.
- The existence of files/directories on disk is now checked properly via `os.path.isfile()` or `os.path.isdir()`, rather than `os.path.exists()`.
- Fixed a variety of formatting errors raised by sphinx when generating HTML docs.
v0.6.3 Changes
March 23, 2019

New:

- Added a proper contributing guide and code of conduct, as well as separate GitHub issue templates for different user situations. This should help folks contribute to the project more effectively, and make maintaining it a bit easier, too. [Issue #212]
- Gave the documentation a new look, using a template popularized by `requests`. Added documentation on dealing with multi-lingual datasets. [Issue #233]
- Made some minor adjustments to package dependencies, the way they're specified, and the Travis CI setup, making for a faster and better development experience.
- Confirmed and enabled compatibility with v2.1+ of `spacy`.
. π«
π Changed:
- π Improved the
Wikipedia
dataset class in a variety of ways: it can now read
Wikinews db dumps; access records in namespaces other than the usual "0"
π (such as category pages in namespace "14"); parse and extract category pages
in several languages, including in the case of bad wiki markup; and filter out
section headings from the accompanying text via aninclude_headings
kwarg.
[PR #219, #220, #223, #224, #231] - β Removed the
transliterate_unicode()
preprocessing function that transliterated
non-ascii text into a reasonable ascii approximation, for technical and
π philosophical reasons. Also removed its GPL-licensedunidecode
dependency,
for legal-ish reasons. [Issue #203] - β Added convention-abiding
exclude
argument to the function that writes
πspacy
docs to disk, to limit which pipeline annotations are serialized.
Replaced the existing but non-standardinclude_tensor
arg. - Deprecated the
n_threads
argument inCorpus.add_texts()
, which had not
been working inspacy.pipe
for some time and, as of v2.1, is defunct. - β
Made many tests model- and python-version agnostic and thus less likely to break
π whenspacy
releases new and improved models. - Auto-formatted the entire code base using
black
; the results aren't always
more readable, but they are pleasingly consistent.
Fixed:

- Fixed bad behavior of `key_terms_from_semantic_network()`, where an error would be raised if no suitable key terms could be found; now, an empty list is returned instead. [Issue #211]
- Fixed a variable name typo so `GroupVectorizer.fit()` actually works. [Issue #215]
- Fixed a minor typo in the quick-start docs. [PR #217]
- Checked for and filtered out any named entities that are entirely whitespace, seemingly caused by an issue in `spacy`.
- Fixed an undefined variable error when merging spans. [Issue #225]
- Fixed a unicode/bytes issue in an experimental function for deserializing `spacy` docs in "binary" format. [Issue #228, PR #229]
Contributors:
Many thanks to @abevieiramota, @ckot, @Jude188, and @digest0r for their help!