textacy v0.7.1 Release Notes
Release Date: 2019-06-25
New:
- Added a default, built-in language identification classifier that's moderately fast, moderately accurate, and covers a relatively large number of languages [PR #247]
  - Implemented a Google CLD3-inspired model in `scikit-learn` and trained it on ~1.5M texts in ~130 different languages spanning a wide variety of subject matter and stylistic formality; overall, speed and performance compare favorably to other open-source options (`langid`, `langdetect`, `cld2-cffi`, and `cld3`)
  - Dropped `cld2-cffi` dependency [Issue #246]
- Added `extract.matches()` function to extract spans from a document matching one or more patterns of per-token (attribute, value) pairs, with optional quantity qualifiers; this is a convenient interface to spaCy's rule-based `Matcher` and a more powerful replacement for textacy's existing (now deprecated) `extract.pos_regex_matches()`
- Added `preprocess.normalize_unicode()` function to transform unicode characters into their canonical forms; this is a less-intensive consolation prize for the previously-removed `fix_unicode()` function
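
To illustrate what transforming characters into their canonical forms means, here is a minimal sketch built on the standard library's `unicodedata.normalize()`. It is not textacy's implementation: the helper name `to_canonical` and the default form `"NFC"` are assumptions made for the example.

```python
import unicodedata


def to_canonical(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalize text to the given Unicode form."""
    return unicodedata.normalize(form, text)


# The same visible word, encoded two different ways.
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent
composed = "caf\u00e9"     # single precomposed character

# The raw strings differ, so naive comparisons and lookups would miss.
assert decomposed != composed

# After normalization both spellings agree.
assert to_canonical(decomposed) == composed
assert to_canonical(composed, form="NFD") == decomposed
```

Normalizing both sides before comparing or counting strings avoids treating visually identical text as distinct tokens.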
Changed:
- Enabled loading blank spaCy `Language` pipelines (tokenization only -- no model-based tagging, parsing, etc.) via `load_spacy_lang(name, allow_blank=True)` for use cases that don't rely on annotations; disabled by default to avoid unwelcome surprises
- Changed inclusion/exclusion and de-duplication of entities and ngrams in `to_terms_list()` [Issues #169, #179]
  - `entities = True` => include entities, and drop exact duplicate ngrams
  - `entities = False` => don't include entities, and also drop exact duplicate ngrams
  - `entities = None` => use ngrams as-is without checking against entities
- Moved `to_collection()` function from the `datasets.utils` module to the top-level `utils` module, for use throughout the code base
- Added `quoting` option to `io.read_csv()` and `io.write_csv()`, for problematic cases
- Deprecated the `spacier.components.merge_entities()` pipeline component, an implementation of which has since been added into spaCy itself
- Updated documentation for developer convenience and reader clarity
  - Split API reference docs into related chunks, rather than having them all together in one long page, and tidied up headers
  - Fixed errors / inconsistencies in various docstrings (a never-ending struggle...)
  - Ported package readme and changelog from `.rst` to `.md` format
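
The three `entities` settings listed above for `to_terms_list()` can be sketched on plain strings. This is a simplified illustration, not textacy's code: `combine_terms` is a hypothetical helper, terms are compared by text, and "exact duplicate ngrams" is assumed to mean ngrams that exactly duplicate an entity.

```python
from typing import List, Optional


def combine_terms(
    ngrams: List[str], ents: List[str], entities: Optional[bool]
) -> List[str]:
    """Hypothetical helper mimicking the three `entities` settings."""
    if entities is None:
        # Use ngrams as-is, without checking them against entities.
        return list(ngrams)
    ent_set = set(ents)
    # Assumption: a "duplicate" ngram is one that exactly matches an entity.
    kept = [ng for ng in ngrams if ng not in ent_set]
    if entities:
        # Include entities, followed by the de-duplicated ngrams.
        return list(ents) + kept
    # Exclude entities, but still drop the ngrams that duplicate them.
    return kept


ngrams = ["big city", "New York"]
ents = ["New York"]

with_ents = combine_terms(ngrams, ents, True)      # ['New York', 'big city']
without_ents = combine_terms(ngrams, ents, False)  # ['big city']
as_is = combine_terms(ngrams, ents, None)          # ['big city', 'New York']
```

Under this reading, `False` differs from `None` in that entity text never reaches the output, not even by way of an ngram.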
Fixed:
- The `NotImplementedError` previously added to `preprocess.fix_unicode()` is now raised rather than returned [Issue #243]
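
The difference between returning and raising an exception is easy to miss, so here is a small sketch of the bug pattern and its fix. Both function names and the error message are hypothetical stand-ins, not textacy's actual code.

```python
def fix_unicode_returning(text: str):
    # Bug pattern: the exception object is handed back like a normal value,
    # so the caller sees no error at all.
    return NotImplementedError("fix_unicode() is no longer available")


def fix_unicode_raising(text: str):
    # Fixed pattern: the exception interrupts the caller immediately.
    raise NotImplementedError("fix_unicode() is no longer available")


# The returning version silently yields an exception instance.
result = fix_unicode_returning("abc")
silently_returned = isinstance(result, NotImplementedError)

# The raising version must be handled, or it stops the program.
try:
    fix_unicode_raising("abc")
    was_raised = False
except NotImplementedError:
    was_raised = True
```

With the returning version, a caller expecting a string would most likely fail later and somewhere else, which makes the real cause harder to trace.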