textacy v0.6.1 Release Notes
Release Date: 2018-04-12
Changes:
- Add a new spacier sub-package for spaCy-oriented functionality (#168, #187)
  - Thus far, this includes a components module with two custom spaCy
    pipeline components: one to compute text stats on parsed documents, and
    another to merge named entities into single tokens in an efficient manner.
    More to come!
  - Similar functionality in the top-level spacy_pipelines module has been
    deprecated; it will be removed in v0.7.0.
- Update the readme, usage, and API reference docs to be clearer and (I hope)
  more useful. (#186)
- Removing punctuation from a text via the preprocessing module now replaces
  punctuation marks with a single space rather than an empty string. This gives
  better behavior in many situations; for example, "won't" => "won t" rather than
  "wont", the latter of which is a valid word with a different meaning.
- Categories are now correctly extracted from non-English language Wikipedia
  datasets, starting with French and German and extendable to others. (#175)
- Log progress when adding documents to a corpus. At the debug level, every
  doc's addition is logged; at the info level, only one message per batch
  of documents is logged. (#183)
Bugfixes:
- Fix two breaking typos in extract.direct_quotations(). (issue #177)
- Prevent crashes when adding non-parsed documents to a Corpus. (#180)
- Fix bugs in keyterms.most_discriminating_terms() that used vsm
  functionality as it was before the changes in v0.6.0. (#189)
- Fix a breaking typo in vsm.matrix_utils.apply_idf_weighting(), and rename
  the problematic kwarg for consistency with related functions. (#190)
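
The punctuation change listed under Changes is easiest to see side by side. The following is a minimal standard-library sketch, not textacy's own code: the helper name replace_punct is hypothetical, and it only covers ASCII punctuation, which is an assumption made for brevity.

```python
import string


def replace_punct(text, replacement):
    """Replace each ASCII punctuation mark in `text` with `replacement`.

    Hypothetical helper for illustration only; it is not part of textacy.
    """
    # Map every punctuation code point to the replacement string.
    table = {ord(ch): replacement for ch in string.punctuation}
    return text.translate(table)


# Behavior before this release: marks are dropped, so words fuse together.
removed = replace_punct("won't", "")

# Behavior as of this release: marks become a single space.
spaced = replace_punct("won't", " ")

print(repr(removed))  # 'wont'
print(repr(spaced))   # 'won t'
```

With the empty-string replacement the contraction collapses into the unrelated word "wont"; with the single-space replacement the two fragments stay separate tokens.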
Contributors:
Big thanks to @sammous, @dixiekong (nice name!), and @SandyRogers for the pull
requests, and many more for pointing out various bugs and the rougher edges /
unsupported use cases of this package.
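
As a footnote to the progress-logging change (#183) listed under Changes, the two-level scheme can be sketched with Python's standard logging module. This is an illustration under stated assumptions rather than textacy's implementation: the logger name, the add_docs function, and the batch_size parameter are all hypothetical, and a plain list stands in for a corpus.

```python
import logging

# Hypothetical logger name; textacy's actual logger may be named differently.
LOGGER = logging.getLogger("corpus_demo")


def add_docs(corpus, docs, batch_size=1000):
    """Append `docs` to `corpus` (a plain list here), logging progress.

    One DEBUG message is emitted per document and one INFO message per batch.
    """
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        for doc in batch:
            corpus.append(doc)
            LOGGER.debug("added doc %d to corpus", len(corpus) - 1)
        LOGGER.info("added batch of %d docs to corpus", len(batch))
    return corpus
```

Setting the logger's level to logging.INFO yields one message per batch, while logging.DEBUG additionally yields one message per document.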