textacy v0.6.0 Release Notes
Release Date: 2018-02-25
Changes:
Rename, refactor, and extend I/O functionality (PR #151)
- Related read/write functions were moved from read.py and write.py into
  format-specific modules, and similar functions were consolidated into one
  with the addition of an arg. For example, write.write_json() and
  write.write_json_lines() => json.write_json(lines=True|False).
- Useful functionality was added to a few readers/writers. For example,
  write_json() now automatically handles python dates/datetimes, writing
  them to disk as ISO-formatted strings rather than raising a TypeError
  ("datetime is not JSON serializable", ugh). CSVs can now be written to /
  read from disk when each row is a dict rather than a list. Reading/writing
  HTTP streams now allows for basic authentication.
- Several things were renamed to improve clarity and consistency from a user's
  perspective, most notably the subpackage name: fileio => io. Others:
  read_file() and write_file() => read_text() and write_text();
  split_record_fields() => split_records(), although I kept an alias
  to the old function for folks; auto_make_dirs boolean kwarg => make_dirs.
- io.open_sesame() now handles zip files (provided they contain only 1 file)
  as it already does for gzip, bz2, and lzma files. On a related note, Python 2
  users can now open lzma (.xz) files if they've installed backports.lzma.
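
The date handling and the lines=True|False switch described above can be illustrated with the standard library alone. This is a minimal sketch of the behavior, not textacy's implementation: write_json_sketch() and _default() are hypothetical names, and only the stdlib json module is used.

```python
import datetime
import io
import json


def _default(obj):
    """Fallback for values json can't serialize: dates/datetimes become ISO strings."""
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    raise TypeError("{!r} is not JSON serializable".format(obj))


def write_json_sketch(data, stream, lines=False):
    """Hypothetical helper: write `data` as one JSON document, or one item per line."""
    if lines:
        # JSON Lines layout: each item is serialized on its own line
        for item in data:
            stream.write(json.dumps(item, default=_default) + "\n")
    else:
        # a single JSON document holding everything
        json.dump(data, stream, default=_default)


records = [
    {"title": "release", "when": datetime.date(2018, 2, 25)},
    {"title": "build", "when": datetime.datetime(2018, 2, 25, 12, 30)},
]
buffer = io.StringIO()
write_json_sketch(records, buffer, lines=True)
output = buffer.getvalue()
```

Passing a default callable keeps the serialization logic in one place, so the single-document and one-item-per-line paths treat dates identically.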
Improve, refactor, and extend vector space model functionality (PRs #156 and #167)
BM25 term weighting and document-length normalization were implemented, and
users can now flexibly add and customize individual components of an
overall weighting scheme (local scaling + global scaling + doc-wise normalization).
For API sanity, several additions and changes to the Vectorizer init
params were required (sorry bout it!).

Given all the new weighting possibilities, a Vectorizer.weighting attribute
was added for curious users, to give a mathematical representation of how
values in a doc-term matrix are being calculated. Here's a simple and a
not-so-simple case:

    >>> Vectorizer(apply_idf=True, idf_type='smooth').weighting
    'tf * log((n_docs + 1) / (df + 1)) + 1'
    >>> Vectorizer(tf_type='bm25', apply_idf=True, idf_type='bm25', apply_dl=True).weighting
    '(tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg(lengths))) * log((n_docs - df + 0.5) / (df + 0.5))'
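
To make the two weighting strings concrete, here is a rough sketch that evaluates them for a single term in plain Python. Several things are assumptions: the first string is read as tf * (log(...) + 1), which is how smoothed idf is conventionally defined; the function names are hypothetical; and the k and b defaults are values commonly used for BM25, not necessarily the library's.

```python
import math


def smooth_tfidf(tf, df, n_docs):
    """Term frequency scaled by smoothed idf: tf * (log((n_docs + 1) / (df + 1)) + 1)."""
    idf = math.log((n_docs + 1) / (df + 1)) + 1
    return tf * idf


def bm25_weight(tf, df, n_docs, length, avg_length, k=1.2, b=0.75):
    """BM25-style weight; k and b are assumed defaults, not taken from textacy."""
    # local scaling: saturating term frequency, adjusted for document length
    tf_part = (tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg_length)))
    # global scaling: BM25 inverse document frequency
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part


# one term occurring twice in a document, present in 9 of 99 documents
w_smooth = smooth_tfidf(tf=2, df=9, n_docs=99)
# one term occurring once in an average-length document, present in 10 of 100
w_bm25 = bm25_weight(tf=1, df=10, n_docs=100, length=50, avg_length=50)
```

Splitting the weight into a local (tf) part and a global (idf) part mirrors the "local scaling + global scaling + doc-wise normalization" breakdown described above.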
Terms are now sorted alphabetically after fitting, so you'll have a consistent
and interpretable ordering in your vocabulary and doc-term-matrix.

A GroupVectorizer class was added, as a child of Vectorizer and
an extension of typical document-term matrix vectorization, in which each
row vector corresponds to the weighted terms co-occurring in a single document.
This allows for customized grouping, such as by a shared author or publication year,
that may span multiple documents, without forcing users to merge / concatenate
those documents themselves.

Lastly, the vsm.py module was refactored into a vsm subpackage with
two modules. Imports should stay the same, but the code structure is now
more amenable to future additions.

Miscellaneous additions and improvements
- Flesch Reading Ease in the textstats module is now multi-lingual!
  Language-specific formulations for German, Spanish, French, Italian, Dutch,
  and Russian were added, in addition to (the default) English.
  (PR #158, prompted by Issue #155)
- Runtime performance, as well as docs and error messages, of functions for
  generating semantic networks from lists of terms or sentences were improved.
  (PR #163)
- Labels on named entities from which determiners have been dropped are now
  preserved. There's still a minor gotcha, but it's explained in the docs.
- The size of textacy's data cache can now be set via an environment
  variable, TEXTACY_MAX_CACHE_SIZE, in case the default 2GB cache doesn't
  meet your needs.
- Docstrings were improved in many ways, large and small, throughout the code.
  May they guide you even more effectively than before!
- The package version is now set from a single source. This isn't for you so
  much as me, but it does prevent confusing version mismatches b/w code, pypi,
  and docs.
- All tests have been converted from unittest to pytest style. They
  run faster, they're more informative in failure, and they're easier to extend.
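
For the multi-lingual Flesch Reading Ease item above, a language-specific formulation amounts to swapping the coefficients applied to words-per-sentence and syllables-per-word. The sketch below uses the commonly cited English (Flesch) and German (Amstad) coefficients. Whether textacy uses exactly these values is an assumption, and flesch_reading_ease_sketch() is a hypothetical name, not the library's function.

```python
def flesch_reading_ease_sketch(n_words, n_sents, n_syllables, lang="en"):
    """Reading-ease score from raw counts, for the two formulations sketched here."""
    words_per_sent = n_words / n_sents
    syllables_per_word = n_syllables / n_words
    if lang == "en":
        # classic Flesch formula for English
        return 206.835 - 1.015 * words_per_sent - 84.6 * syllables_per_word
    if lang == "de":
        # Amstad's adaptation for German
        return 180.0 - words_per_sent - 58.5 * syllables_per_word
    raise ValueError("no formulation sketched for lang={!r}".format(lang))


# the same counts scored under both formulations
score_en = flesch_reading_ease_sketch(n_words=100, n_sents=5, n_syllables=150, lang="en")
score_de = flesch_reading_ease_sketch(n_words=100, n_sents=5, n_syllables=150, lang="de")
```

The same counts yield different scores per language because the coefficients are calibrated to each language's typical word and sentence lengths.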
Bugfixes:
- Fixed an issue where existing metadata associated with a spacy Doc was being
  overwritten with an empty dict when using it to initialize a textacy Doc.
  Users can still overwrite existing metadata, but only if they pass in new data.
- Added a missing import to the README's usage example. (#149)
- The intersphinx mapping to numpy got fixed (and items for scipy and
  matplotlib were added, too). Taking advantage of that, a bunch of broken
  object links scattered throughout the docs got fixed.
- Fixed broken formatting of old entries in the changelog, for your reading pleasure.