textacy v0.4.0 Release Notes
Release Date: 2017-06-21
New and Changed:
- Refactored and expanded built-in `corpora`, now called `datasets` (PR #112)
  - The various classes in the old `corpora` subpackage had a similar but frustratingly not-identical API. Also, some fetched the corresponding dataset automatically, while others required users to do it themselves. Ugh.
  - These classes have been ported over to a new `datasets` subpackage; they now have a consistent API, consistent features, and consistent documentation. They also have some new functionality, including pain-free downloading of the data and saving it to disk in a stream (so as not to use all your RAM).
  - Also, there's a new dataset: a collection of 2.7k Creative Commons texts from the Oxford Text Archive, which rounds out the included datasets with English-language, 16th-20th century literary works. (h/t @JonathanReeve)
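The "save to disk in a stream" behavior described above can be sketched in plain Python. This is a hedged illustration of the technique, not textacy's actual `datasets` code; `save_stream` is a hypothetical helper name:

```python
import shutil

def save_stream(fobj, out_path, chunk_size=64 * 1024):
    # Copy fixed-size chunks from a file-like object (e.g. an open
    # HTTP response) to disk, so the whole payload never has to sit
    # in RAM at once.
    with open(out_path, "wb") as f:
        shutil.copyfileobj(fobj, f, chunk_size)
```

Because `shutil.copyfileobj()` reads and writes one chunk at a time, peak memory use is bounded by `chunk_size` regardless of how large the downloaded dataset is.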
- A `Vectorizer` class to convert tokenized texts into variously weighted document-term matrices (Issue #69, PR #113)
  - This class uses the familiar `scikit-learn` API (which is also consistent with the `textacy.tm.TopicModel` class) to convert one or more documents in the form of "term lists" into weighted vectors. An initial set of documents is used to build up the matrix vocabulary (via `.fit()`), which can then be applied to new documents (via `.transform()`).
  - It's similar in concept and usage to sklearn's `CountVectorizer` or `TfidfVectorizer`, but it doesn't bundle in the tokenization step as they do, which gives users more flexibility in deciding which terms to vectorize. This class outright replaces the `textacy.vsm.doc_term_matrix()` function.
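The fit-then-transform pattern described above can be illustrated with a minimal pure-Python sketch. This is not textacy's `Vectorizer` implementation, just the underlying idea: build a vocabulary from term lists via `.fit()`, then map new term lists onto it via `.transform()`:

```python
from collections import Counter

class MiniVectorizer:
    """Toy doc-term matrix builder illustrating the fit/transform
    pattern on pre-tokenized "term lists" (count-weighted only)."""

    def fit(self, tokenized_docs):
        # Build the matrix vocabulary from an initial set of term lists.
        vocab = sorted({term for doc in tokenized_docs for term in doc})
        self.vocabulary_ = {term: i for i, term in enumerate(vocab)}
        return self

    def transform(self, tokenized_docs):
        # Map new documents onto the fitted vocabulary; terms not seen
        # during .fit() are silently ignored.
        rows = []
        for doc in tokenized_docs:
            counts = Counter(t for t in doc if t in self.vocabulary_)
            row = [0] * len(self.vocabulary_)
            for term, n in counts.items():
                row[self.vocabulary_[term]] = n
            rows.append(row)
        return rows

docs = [["cats", "like", "naps"], ["dogs", "like", "walks"]]
vec = MiniVectorizer().fit(docs)
matrix = vec.transform([["cats", "like", "cats"]])
```

Because the input is already tokenized, the caller decides what counts as a "term" (unigrams, entities, n-grams, ...), which is exactly the flexibility noted above.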
- Customizable automatic language detection for `Doc`s
  - Although `cld2-cffi` is fast and accurate, its installation is problematic for some users. Since other language detection libraries are available (e.g. `langdetect` and `langid`), it makes sense to let users choose, as needed or desired.
  - First, `cld2-cffi` is now an optional dependency, i.e. it is not installed by default. To install it, do `pip install textacy[lang]` or (for it and all other optional deps) do `pip install textacy[all]`. (PR #86)
  - Second, the `lang` param used to instantiate `Doc` objects may now be a callable that accepts a unicode string and returns a standard 2-letter language code. This could be a function that uses `langdetect` under the hood, or a function that always returns "de" -- it's up to users. Note that the default value is now `textacy.text_utils.detect_language()`, which uses `cld2-cffi`, so the default behavior is unchanged.
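Any callable satisfying that contract (unicode string in, 2-letter code out) will do. A toy sketch, where `pick_lang` is a hypothetical stand-in for a real detector like `langdetect`:

```python
def pick_lang(text):
    # Hypothetical detector: a real function would analyze the text
    # statistically; here we just check for two German stopwords.
    padded = " {} ".format(text)
    return "de" if (" der " in padded or " und " in padded) else "en"

# Usage with textacy would then look like (assuming textacy is installed):
#   doc = textacy.Doc("Die Katze und der Hund.", lang=pick_lang)
```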
- Customizable punctuation removal in the `preprocessing` module (Issue #91)
  - Users can now specify which punctuation marks they wish to remove, rather than always removing all marks.
  - In the case that all marks are removed, however, performance is now 5-10x faster, thanks to using Python's built-in `str.translate()` method instead of a regular expression.
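The `str.translate()` approach can be sketched as follows. This is a hedged illustration of the technique, not textacy's actual function (the name `remove_punct` and its `marks` parameter are assumptions here):

```python
import string

def remove_punct(text, marks=None):
    # Drop only the given marks; when marks is None, drop all ASCII
    # punctuation via the fast str.translate() path.
    if marks is None:
        marks = string.punctuation
    table = {ord(ch): None for ch in marks}
    return text.translate(table)

remove_punct("Hello, world!!!")             # remove every mark
remove_punct("Hello, world!!!", marks="!")  # remove only "!"
```

`str.translate()` does a single pass over the string with a per-character lookup table, which is why it beats a regular expression for this job.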
- `textacy`, installable via `conda` (PR #100)
  - The package has been added to conda-forge, and installation instructions have been added to the docs. Hurray!
- `textacy`, now with helpful badges
  - Builds are now automatically tested via Travis CI, and there's a badge in the docs showing whether the build passed or not. The days of my ignoring broken tests in `master` are (probably) over...
  - There are also badges showing the latest releases on GitHub, PyPI, and conda-forge (see above).
Fixed:
- Fixed the check for overlap between named entities and unigrams in the `Doc.to_terms_list()` method (PR #111)
- `Corpus.add_texts()` now uses CPU_COUNT - 1 threads by default, rather than always assuming that 4 cores are available (Issue #89)
- Added a missing coding declaration to a test file, without which tests failed for Python 2 (PR #99)
- `readability_stats()` now catches an exception raised on empty documents and logs a message, rather than barfing with an unhelpful `ZeroDivisionError` (Issue #88)
- Added a check for an empty terms list in `terms_to_semantic_network` (Issue #105)
- Added and standardized module-specific loggers throughout the code base; not a bug per se, but certainly some much-needed housecleaning
- Added a note to the docs about expectations for bytes vs. unicode text (PR #103)
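The CPU_COUNT - 1 default mentioned above is a common pattern: leave one core free for the rest of the system, but never drop below a single thread. A minimal sketch (`default_n_threads` is a hypothetical helper, not textacy's actual code):

```python
import os

def default_n_threads():
    # os.cpu_count() can return None, hence the "or 1" fallback;
    # max(1, ...) guards the single-core case.
    return max(1, (os.cpu_count() or 1) - 1)
```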
Contributors:
Thanks to @henridwyer, @rolando, @pavlin99th, and @kyocum for their contributions! :raised_hands: