textacy v0.7.0 Release Notes
Release Date: 2019-05-13

New and Changed:
- Removed `textacy.Doc`, and split its functionality into two parts
  - New: Added `textacy.make_spacy_doc()` as a convenient and flexible entry point for making spaCy `Doc`s from text or (text, metadata) pairs, with optional spaCy language pipeline specification. It's similar to `textacy.Doc.__init__`, with the exception that text and metadata are passed in together as a 2-tuple.
  - New: Added a variety of custom doc property and method extensions to the global `spacy.tokens.Doc` class, accessible via its `Doc._` "underscore" property. These are similar to the properties/methods on `textacy.Doc`; they just require an interstitial underscore. For example, `textacy.Doc.to_bag_of_words()` => `spacy.tokens.Doc._.to_bag_of_words()`.
  - New: Added functions for setting, getting, and removing these extensions. Note that they are set automatically when textacy is imported.
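To make the "underscore" mechanism concrete, here is a minimal pure-Python sketch of the pattern spaCy uses: extensions registered on a class become callable via a `._` proxy, keeping them out of the class's own namespace. This is illustrative only; all names below are hypothetical stand-ins, not spaCy's or textacy's actual code.

```python
class Underscore:
    """Proxy that exposes registered extensions, bound to its owner doc."""
    _extensions = {}

    def __init__(self, owner):
        self._owner = owner

    @classmethod
    def set_extension(cls, name, method):
        cls._extensions[name] = method

    def __getattr__(self, name):
        if name in self._extensions:
            method = self._extensions[name]
            # bind the owner doc as the first argument, like a method
            return lambda *args, **kwargs: method(self._owner, *args, **kwargs)
        raise AttributeError(name)


class Doc:
    def __init__(self, words):
        self.words = words

    @property
    def _(self):
        return Underscore(self)


# register a bag-of-words extension, analogous to Doc._.to_bag_of_words()
def to_bag_of_words(doc):
    bow = {}
    for word in doc.words:
        bow[word] = bow.get(word, 0) + 1
    return bow


Underscore.set_extension("to_bag_of_words", to_bag_of_words)

doc = Doc(["a", "rose", "is", "a", "rose"])
print(doc._.to_bag_of_words())  # {'a': 2, 'rose': 2, 'is': 1}
```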
- Simplified and improved performance of `textacy.Corpus`
  - Documents are now added through a simpler API, either in `Corpus.__init__` or `Corpus.add()`; they may be one or a stream of texts, (text, metadata) pairs, or existing spaCy `Doc`s. When adding many documents, the spaCy language processing pipeline is used in a faster and more efficient way.
  - Saving / loading corpus data to disk is now more efficient and robust.
  - Note: `Corpus` is now a collection of spaCy `Doc`s rather than `textacy.Doc`s.
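The "one or a stream of texts, pairs, or docs" API above can be sketched as a simple type dispatch. This is a minimal illustration of the pattern, with hypothetical class names; it is not textacy's actual implementation.

```python
class FakeDoc:
    """Stand-in for a processed spaCy Doc (hypothetical)."""
    def __init__(self, text, meta=None):
        self.text = text
        self.meta = meta or {}


class Corpus:
    def __init__(self, data=None):
        self.docs = []
        if data is not None:
            self.add(data)

    def add(self, data):
        if isinstance(data, str):            # a single text
            self.docs.append(FakeDoc(data))
        elif isinstance(data, tuple):        # a (text, metadata) pair
            text, meta = data
            self.docs.append(FakeDoc(text, meta))
        elif isinstance(data, FakeDoc):      # an existing doc
            self.docs.append(data)
        else:                                # a stream of any of the above
            for item in data:
                self.add(item)


corpus = Corpus()
corpus.add("one text")
corpus.add(("another text", {"title": "t"}))
corpus.add([FakeDoc("a third"), "a fourth"])
print(len(corpus.docs))  # 4
```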
- Simplified, standardized, and added `Dataset` functionality
  - New: Added an `IMDB` dataset, built on the classic 2011 dataset commonly used to train sentiment analysis models.
  - New: Added a base `Wikimedia` dataset, from which a reworked `Wikipedia` dataset and a separate `Wikinews` dataset inherit. The underlying data source has changed, from XML db dumps of raw wiki markup to JSON db dumps of (relatively) clean text and metadata; now, the code is simpler, faster, and totally language-agnostic.
  - `Dataset.records()` now streams (text, metadata) pairs rather than a dict containing both text and metadata, so users don't need to know field names and split them into separate streams before creating `Doc` or `Corpus` objects from the data.
  - Filtering and limiting the number of texts/records produced is now clearer and more consistent between the `.texts()` and `.records()` methods on a given `Dataset` -- and more performant!
  - Downloading datasets now always shows progress bars and saves to the same file names. When appropriate, downloaded archive files' contents are automatically extracted for easy inspection.
  - Common functionality (such as validating filter values) is now standardized and consolidated in the `datasets.utils` module.
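The records-as-pairs change above can be illustrated with a small generator: instead of yielding dicts that mix text and metadata, it yields (text, metadata) 2-tuples ready for doc/corpus construction. The field names below are hypothetical; this is a sketch, not textacy's actual code.

```python
RAW_ROWS = [
    {"text": "Great movie!", "rating": 10, "movie_id": 1},
    {"text": "Terrible.", "rating": 1, "movie_id": 2},
]


def records(rows):
    """Yield (text, metadata) pairs, splitting text out of each raw row."""
    for row in rows:
        meta = {k: v for k, v in row.items() if k != "text"}
        yield row["text"], meta


for text, meta in records(RAW_ROWS):
    print(text, meta)
```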
- Quality of life improvements
  - Reduced load time for `import textacy` from ~2-3 seconds to ~1 second, by lazy-loading expensive variables, deferring a couple of heavy imports, and dropping a couple of dependencies. Specifically:
    - `ftfy` was dropped, and a `NotImplementedError` is now raised in textacy's wrapper function, `textacy.preprocess.fix_bad_unicode()`. Users with bad unicode should now directly call `ftfy.fix_text()`.
    - `ijson` was dropped, and the behavior of `textacy.read_json()` is now simpler and consistent with other functions for line-delimited data.
    - `mwparserfromhell` was dropped, since the reworked `Wikipedia` dataset no longer requires complicated and slow parsing of wiki markup.
  - Renamed certain functions and variables for clarity, and for consistency with existing conventions:
    - `textacy.load_spacy()` => `textacy.load_spacy_lang()`
    - `textacy.extract.named_entities()` => `textacy.extract.entities()`
    - `textacy.data_dir` => `textacy.DEFAULT_DATA_DIR`
    - `filename` => `filepath` and `dirname` => `dirpath` when specifying full paths to files/dirs on disk, and `textacy.io.utils.get_filenames()` => `textacy.io.utils.get_filepaths()` accordingly
    - compiled regular expressions now consistently start with `RE_`
    - `SpacyDoc` => `Doc`, `SpacySpan` => `Span`, `SpacyToken` => `Token`, `SpacyLang` => `Language` as variables and in docs
  - Removed deprecated functionality
    - top-level `spacy_utils.py` and `spacy_pipelines.py` are gone; use equivalent functionality in the `spacier` subpackage instead
    - `math_utils.py` is gone; it was long neglected, and never actually used
  - Replaced `textacy.compat.bytes_to_unicode()` and `textacy.compat.unicode_to_bytes()` with `textacy.compat.to_unicode()` and `textacy.compat.to_bytes()`, which are safer and accept either binary or text strings as input.
  - Moved and renamed language detection functionality: `textacy.text_utils.detect_language()` => `textacy.lang_utils.detect_lang()`. The idea is to add more/better lang-related functionality here in the future.
  - Updated and cleaned up documentation throughout the code base.
  - Added and refactored many tests, for both new and old functionality, significantly increasing test coverage while significantly reducing run-time. Also, added a proper coverage report to CI builds. This should help prevent future errors and inspire better test-writing.
  - Bumped the minimum required spaCy version: v2.0.0 => v2.0.12, for access to their full set of custom extension functionality.
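The "accept either binary or text strings" behavior described for the new compat helpers can be sketched as follows. This is an illustrative stdlib reimplementation of that contract, not textacy's actual source.

```python
def to_unicode(s, encoding="utf-8"):
    """Coerce bytes or str input to str; raise on anything else."""
    if isinstance(s, bytes):
        return s.decode(encoding)
    if isinstance(s, str):
        return s
    raise TypeError(f"expected bytes or str, got {type(s)}")


def to_bytes(s, encoding="utf-8"):
    """Coerce bytes or str input to bytes; raise on anything else."""
    if isinstance(s, str):
        return s.encode(encoding)
    if isinstance(s, bytes):
        return s
    raise TypeError(f"expected bytes or str, got {type(s)}")


print(to_unicode(b"caf\xc3\xa9"))  # café
print(to_bytes("café"))            # b'caf\xc3\xa9'
```

Accepting either input type makes the helpers safe to call unconditionally, which is what makes them less error-prone than one-directional converters.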
Fixed:
- The progress bar during an HTTP download now always closes, preventing weird nesting issues if another bar is subsequently displayed.
- Filtering datasets by multiple values performed either a logical AND or OR over the values, which was confusing; now, a logical OR is always performed.
- The existence of files/directories on disk is now checked properly via `os.path.isfile()` or `os.path.isdir()`, rather than `os.path.exists()`.
- Fixed a variety of formatting errors raised by sphinx when generating HTML docs.
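The filesystem-check fix above is worth a quick demonstration: `os.path.exists()` returns True for both files and directories, so it cannot verify that a path is the *kind* of thing the caller expects.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirpath:
    filepath = os.path.join(dirpath, "data.txt")
    with open(filepath, "w") as f:
        f.write("hello")

    # exists() is True for both, so it can't distinguish the two cases
    print(os.path.exists(dirpath), os.path.exists(filepath))  # True True
    # isfile()/isdir() check what the path actually is
    print(os.path.isfile(dirpath), os.path.isfile(filepath))  # False True
    print(os.path.isdir(dirpath), os.path.isdir(filepath))    # True False
```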