textacy v0.8.0 Release Notes
Release Date: 2019-07-14
New and Changed:
- Refactored and expanded text preprocessing functionality (PR #253)
  - Moved code from a top-level `preprocess` module into a `preprocessing` sub-package, and reorganized it in the process
  - Added new functions:
    - `replace_hashtags()` to replace hashtags like `#FollowFriday` or `#spacyIRL2019` with `_TAG_`
    - `replace_user_handles()` to replace user handles like `@bjdewilde` or `@spacy_io` with `_USER_`
    - `replace_emojis()` to replace emoji symbols with `_EMOJI_`
    - `normalize_hyphenated_words()` to join hyphenated words back together, like `antici- pation` => `anticipation`
    - `normalize_quotation_marks()` to replace "fancy" quotation marks with simple ASCII equivalents, like `“the god particle”` => `"the god particle"`
  - Changed a couple of functions for clarity and consistency:
    - `replace_currency_symbols()` now replaces all dedicated ASCII and Unicode currency symbols with `_CUR_`, rather than just a subset of them, and no longer supports replacement with the corresponding currency code (like `$` => `USD`)
    - `remove_punct()` now has a `fast (bool)` kwarg rather than `method (str)`
  - Removed the `normalize_contractions()`, `preprocess_text()`, and `fix_bad_unicode()` functions, since they were awkward and more trouble than they were worth
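As a rough illustration, the replacement-style preprocessors amount to regex substitutions. The sketch below is a simplified stand-in for the new functions, not textacy's actual implementation (its patterns are more thorough, e.g. Unicode-aware handle matching):

```python
import re

# Simplified patterns -- illustrative assumptions, not textacy's own regexes.
_HASHTAG_RE = re.compile(r"#\w+")
_USER_HANDLE_RE = re.compile(r"@\w+")
_HYPHENATED_RE = re.compile(r"(\w+)-\s+(\w+)")

def replace_hashtags(text, replace_with="_TAG_"):
    """Replace hashtags like #FollowFriday with a placeholder token."""
    return _HASHTAG_RE.sub(replace_with, text)

def replace_user_handles(text, replace_with="_USER_"):
    """Replace user handles like @spacy_io with a placeholder token."""
    return _USER_HANDLE_RE.sub(replace_with, text)

def normalize_hyphenated_words(text):
    """Rejoin words split across a line break: 'antici- pation' => 'anticipation'."""
    return _HYPHENATED_RE.sub(r"\1\2", text)
```

For example, `replace_hashtags("#FollowFriday is fun")` yields `"_TAG_ is fun"`, and `normalize_hyphenated_words("antici- pation")` yields `"anticipation"`.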
- Refactored and expanded keyterm extraction functionality (PR #257)
  - Moved code from a top-level `keyterms` module into a `ke` sub-package, and cleaned it up / standardized arg names / improved shared functionality in the process
  - Added new unsupervised keyterm extraction algorithms: YAKE (`ke.yake()`), sCAKE (`ke.scake()`), and PositionRank (`ke.textrank()`, with non-default parameter values)
  - Added new methods for selecting candidate keyterms: longest matching subsequence candidates (`ke.utils.get_longest_subsequence_candidates()`) and pattern-matching candidates (`ke.utils.get_pattern_matching_candidates()`)
  - Improved the speed of the SGRank implementation, and generally optimized much of the code
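For context, TextRank-family algorithms score candidate terms by ranking nodes in a word co-occurrence graph. The toy sketch below shows that core idea with a PageRank-style power iteration; it is illustrative only -- textacy's `ke.textrank()` operates on spaCy `Doc` objects and adds POS filtering, normalization, and candidate selection on top:

```python
import re

def simple_textrank(text, window=2, damping=0.85, iters=50):
    """Toy TextRank-style keyword scorer: build a co-occurrence graph over
    lowercase words within a sliding window, then rank nodes by a
    PageRank-like power iteration. Returns (word, score) pairs, best first."""
    words = re.findall(r"[a-z]+", text.lower())
    # undirected co-occurrence edges within the window
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n])
                            for n in neighbors[w] if neighbors[n])
            for w in neighbors
        }
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Words that co-occur with many distinct neighbors accumulate the highest rank, which is why frequent, well-connected terms surface as keyterms.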
- Improved document similarity functionality (PR #256)
  - Added a character ngram-based similarity measure (`similarity.character_ngrams()`), which is useful in different contexts than the other measures
  - Removed the Jaro-Winkler string similarity measure (`similarity.jaro_winkler()`), since it didn't add much beyond the other measures
  - Improved the speed of the Token Sort Ratio implementation
  - Replaced the `python-levenshtein` dependency with `jellyfish`, for its active development, better documentation, and actually-compliant license
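Character ngram measures compare texts by their overlapping character substrings rather than whole tokens, which makes them robust to misspellings and inflection. The sketch below uses Jaccard overlap of character trigrams to convey the idea; textacy's `similarity.character_ngrams()` uses its own formulation, so treat this as an assumption-laden stand-in:

```python
def char_ngram_similarity(s1, s2, n=3):
    """Jaccard similarity over character n-grams: |A & B| / |A | B|.
    Returns a float in [0.0, 1.0]; 1.0 means identical n-gram sets."""
    grams1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    grams2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    if not grams1 and not grams2:
        return 1.0
    return len(grams1 & grams2) / len(grams1 | grams2)
```

For instance, "colour" and "color" share several trigrams and score well above zero, while unrelated strings score 0.0.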
- Added customizability to certain functionality
  - Added options to `Doc._.to_bag_of_words()` and `Corpus.word_counts()` for filtering out stop words, punctuation, and/or numbers (PR #249)
  - Allowed objects that look like `sklearn`-style topic modeling classes to be passed into `tm.TopicModel()` (PR #248)
  - Added options to customize the rc params used by `matplotlib` when drawing a "termite" plot in `viz.draw_termite_plot()` (PR #248)
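Accepting objects that merely "look like" sklearn-style topic models is a duck-typing pattern: require the methods you actually call, not a specific base class. The sketch below shows that pattern with a hypothetical helper and dummy class -- textacy's own validation inside `tm.TopicModel()` may differ:

```python
class TinyTopicModel:
    """A minimal object with the sklearn-style surface (fit/transform)
    that a duck-typing check would look for. Hypothetical example only."""
    def fit(self, doc_term_matrix):
        return self

    def transform(self, doc_term_matrix):
        return doc_term_matrix

def looks_like_sklearn_topic_model(model):
    """Accept any object exposing callable fit() and transform() methods --
    the spirit of the change, not textacy's exact validation logic."""
    return all(callable(getattr(model, name, None)) for name in ("fit", "transform"))
```

This lets custom or third-party topic models plug in without inheriting from any sklearn class.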
- Removed deprecated functions with direct replacements: `io.utils.get_filenames()` and `spacier.components.merge_entities()`
Contributors: