textacy v0.6.0 release notes (2018-02-25)

Release Date: 2018-02-25

Changes:
Rename, refactor, and extend I/O functionality (PR #151)
- Related read/write functions were moved from read.py and write.py into
  format-specific modules, and similar functions were consolidated into one
  with the addition of an arg. For example, write.write_json() and
  write.write_json_lines() => json.write_json(lines=True|False).
- Useful functionality was added to a few readers/writers. For example,
  write_json() now automatically handles python dates/datetimes, writing
  them to disk as ISO-formatted strings rather than raising a TypeError
  ("datetime is not JSON serializable", ugh). CSVs can now be written to /
  read from disk when each row is a dict rather than a list. Reading/writing
  HTTP streams now allows for basic authentication.
- Several things were renamed to improve clarity and consistency from a user's
  perspective, most notably the subpackage name: fileio => io. Others:
  read_file() and write_file() => read_text() and write_text();
  split_record_fields() => split_records(), although I kept an alias
  to the old function for folks; auto_make_dirs boolean kwarg => make_dirs.
- io.open_sesame() now handles zip files (provided they contain only 1 file)
  as it already does for gzip, bz2, and lzma files. On a related note, Python 2
  users can now open lzma (.xz) files if they've installed backports.lzma.
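To make the date handling and the lines=True|False switch concrete, here's a
minimal sketch using only the standard library. It is not textacy's
implementation: iso_default() and dump_json() are hypothetical names made up
for this example, and the real json.write_json() may differ in its details.

```python
import datetime
import io
import json


def iso_default(obj):
    """Fallback serializer: render dates/datetimes as ISO-formatted strings."""
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


def dump_json(data, stream, lines=False):
    """Hypothetical writer: one JSON document, or one JSON item per line."""
    if lines:
        for item in data:
            stream.write(json.dumps(item, default=iso_default) + "\n")
    else:
        json.dump(data, stream, default=iso_default)


records = [
    {"title": "v0.6.0", "released": datetime.date(2018, 2, 25)},
    {"title": "example", "released": datetime.datetime(2018, 2, 25, 12, 30)},
]

# Write to an in-memory buffer so the sketch doesn't touch the disk.
buffer = io.StringIO()
dump_json(records, buffer, lines=True)
output = buffer.getvalue()
print(output)
```

Without the default= hook, json.dumps() raises the TypeError mentioned above
as soon as it meets a date or datetime value.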
Improve, refactor, and extend vector space model functionality (PRs #156 and #167)
- BM25 term weighting and document-length normalization were implemented, and
  users can now flexibly add and customize individual components of an
  overall weighting scheme (local scaling + global scaling + doc-wise normalization).
  For API sanity, several additions and changes to the Vectorizer init
  params were required. Sorry bout it!
- Given all the new weighting possibilities, a Vectorizer.weighting attribute
  was added for curious users, to give a mathematical representation of how
  values in a doc-term matrix are being calculated. Here's a simple and a
  not-so-simple case:
  ```
  >>> Vectorizer(apply_idf=True, idf_type='smooth').weighting
  'tf * log((n_docs + 1) / (df + 1)) + 1'
  >>> Vectorizer(tf_type='bm25', apply_idf=True, idf_type='bm25', apply_dl=True).weighting
  '(tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg(lengths))) * log((n_docs - df + 0.5) / (df + 0.5))'
  ```
- Terms are now sorted alphabetically after fitting, so you'll have a consistent
  and interpretable ordering in your vocabulary and doc-term-matrix.
- A GroupVectorizer class was added, as a child of Vectorizer and
  an extension of typical document-term matrix vectorization, in which each
  row vector corresponds to the weighted terms co-occurring in a single document.
  GroupVectorizer instead allows for customized grouping, such as by a shared
  author or publication year, that may span multiple documents, without forcing
  users to merge / concatenate those documents themselves.
- Lastly, the vsm.py module was refactored into a vsm subpackage with
  two modules. Imports should stay the same, but the code structure is now
  more amenable to future additions.
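The two formula strings reported by Vectorizer.weighting can be worked through
by hand. The sketch below does that in plain Python, under a few loudly stated
assumptions: the trailing "+ 1" in the smooth case is read as part of the idf
factor (the same convention scikit-learn uses for smooth idf), the k and b
values are illustrative rather than textacy's defaults, and the function names
are hypothetical, not part of textacy.

```python
import math


def smooth_idf(n_docs, df):
    # Assumption: the trailing "+ 1" belongs to the idf factor, not to tf * idf.
    return math.log((n_docs + 1) / (df + 1)) + 1


def bm25_tf(tf, length, avg_length, k=1.5, b=0.75):
    # Local scaling with document-length normalization.
    # k and b are illustrative values, not necessarily textacy's defaults.
    return (tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg_length)))


def bm25_idf(n_docs, df):
    # Global scaling for the BM25 case.
    return math.log((n_docs - df + 0.5) / (df + 0.5))


# A term occurring twice in a 120-token doc, and appearing in 2 docs of a
# 10-doc corpus whose average doc length is 100 tokens.
tf, df, n_docs, length, avg_length = 2, 2, 10, 120, 100

smooth_weight = tf * smooth_idf(n_docs, df)
bm25_weight = bm25_tf(tf, length, avg_length) * bm25_idf(n_docs, df)
print(round(smooth_weight, 3), round(bm25_weight, 3))
```

Note how the BM25 local term shrinks as a doc gets longer than average, which
is the document-length normalization at work.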
Miscellaneous additions and improvements
- Flesch Reading Ease in the textstats module is now multi-lingual!
  Language-specific formulations for German, Spanish, French, Italian, Dutch,
  and Russian were added, in addition to (the default) English.
  (PR #158, prompted by Issue #155)
- Functions for generating semantic networks from lists of terms or sentences
  got better runtime performance, docs, and error messages. (PR #163)
- Labels on named entities from which determiners have been dropped are now
  preserved. There's still a minor gotcha, but it's explained in the docs.
- The size of textacy's data cache can now be set via an environment
  variable, TEXTACY_MAX_CACHE_SIZE, in case the default 2GB cache doesn't
  meet your needs.
- Docstrings were improved in many ways, large and small, throughout the code.
  May they guide you even more effectively than before!
- The package version is now set from a single source. This isn't for you so
  much as me, but it does prevent confusing version mismatches between code,
  pypi, and docs.
- All tests have been converted from unittest to pytest style. They
  run faster, they're more informative in failure, and they're easier to extend.
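As a rough illustration of what a language-specific Flesch Reading Ease
formulation looks like, here's a small sketch covering two languages. The
coefficients are the commonly published ones (the original English formula and
Amstad's German adaptation); that textacy uses exactly these values is an
assumption, and FLESCH_COEFFICIENTS and flesch_reading_ease() are hypothetical
names, not the textstats API.

```python
FLESCH_COEFFICIENTS = {
    # lang: (base, weight on words per sentence, weight on syllables per word)
    "en": (206.835, 1.015, 84.6),
    "de": (180.0, 1.0, 58.5),
}


def flesch_reading_ease(n_words, n_sentences, n_syllables, lang="en"):
    """Score = base - a * (words / sentences) - b * (syllables / words)."""
    base, sentence_weight, syllable_weight = FLESCH_COEFFICIENTS[lang]
    return (
        base
        - sentence_weight * (n_words / n_sentences)
        - syllable_weight * (n_syllables / n_words)
    )


# Same raw counts, scored under each language's coefficients.
english_score = flesch_reading_ease(100, 5, 150, lang="en")
german_score = flesch_reading_ease(100, 5, 150, lang="de")
print(english_score, german_score)
```

The same counts yield different scores per language because average word and
sentence lengths differ across languages, which is why a single English
formula doesn't transfer well.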
Bugfixes:
- Fixed an issue where existing metadata associated with a spacy Doc was being
  overwritten with an empty dict when using it to initialize a textacy Doc.
  Users can still overwrite existing metadata, but only if they pass in new data.
- Added a missing import to the README's usage example. (#149)
- The intersphinx mapping to numpy got fixed (and items for scipy and
  matplotlib were added, too). Taking advantage of that, a bunch of broken
  object links scattered throughout the docs got fixed.
- Fixed broken formatting of old entries in the changelog, for your reading pleasure.
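
The metadata fix above boils down to a familiar pattern: treat "nothing passed"
differently from "new data passed". Here's a minimal sketch of that pattern,
assuming None is the sentinel for "nothing passed"; resolve_metadata() is a
hypothetical helper for illustration, not textacy's actual code.

```python
def resolve_metadata(existing, new=None):
    """Hypothetical helper: keep `existing` unless the caller passes new data."""
    if new is None:
        # Nothing was passed in, so the existing metadata is preserved
        # rather than being replaced by an empty dict.
        return existing
    return new


kept = resolve_metadata({"author": "someone"})
replaced = resolve_metadata({"author": "someone"}, {"author": "someone else"})
print(kept, replaced)
```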
