
Changelog History

  • v0.6.2 Changes

    July 19, 2018

    Changes:

    • Add a spacier.util module, and add / reorganize relevant functionality
      • move (most) spacy_util functions here, and add a deprecation warning to
        the spacy_util module
      • rename normalized_str() => get_normalized_text(), for consistency and clarity
      • add a function to split long texts up into chunks but combine them into
        a single Doc (see the sketch after this list). This is a workaround for a
        current limitation of spaCy's neural models, whose RAM usage scales with
        the length of input text.
    • Add experimental support for reading and writing spaCy docs in binary format,
      where multiple docs are contained in a single file. This functionality was
      supported by spaCy v1 but is not in spaCy v2; I've implemented a workaround
      that should work well in most situations, but YMMV.
    • Package documentation is now "officially" hosted on GitHub pages. The docs
      are automatically built on and deployed from Travis via doctr, so they
      stay up-to-date with the master branch on GitHub. Maybe someday I'll get
      ReadTheDocs to successfully build textacy once again...
      • Minor improvements/updates to documentation
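
    To make the chunking idea concrete, here is a minimal sketch of the approach,
    not textacy's actual API: the helper name, chunk size, and use of spaCy v3's
    Doc.from_docs() are all assumptions (the v0.6.2-era code had to merge docs
    manually under spaCy v2).

      import spacy
      from spacy.tokens import Doc

      def doc_from_long_text(text, nlp, chunk_size=100_000):
          # Hypothetical helper: parse chunk by chunk so the neural models never
          # see more than `chunk_size` characters at once, since their RAM usage
          # scales with input length. A real implementation would split on
          # sentence or whitespace boundaries rather than fixed offsets.
          docs = [nlp(text[i:i + chunk_size]) for i in range(0, len(text), chunk_size)]
          # Combine the parsed chunks into a single Doc (spaCy v3+ API).
          return Doc.from_docs(docs)

      nlp = spacy.load("en_core_web_sm")
      doc = doc_from_long_text("A very long text. " * 20_000, nlp)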

    Bugfixes:

    • Add missing return statement in deprecated text_stats.flesch_readability_ease()
      function (Issue #191)
    • Catch an empty-graph error in bestcoverage-style keyterm ranking (Issue #196)
    • Fix mishandling when specifying a single named entity type to in/exclude in
      extract.named_entities (Issue #202)
    • Make networkx usage in the keyterms module compatible with v1.11+ (Issue #199)
  • v0.6.1 Changes

    April 12, 2018

    Changes:

    Add a new spacier sub-package for spaCy-oriented functionality (#168, #187)

    • Thus far, this includes a components module with two custom spaCy
      pipeline components: one to compute text stats on parsed documents, and
      another to merge named entities into single tokens in an efficient manner.
      More to come!
    • Similar functionality in the top-level spacy_pipelines module has been
      deprecated; it will be removed in v0.7.0.

    โšก๏ธ Update the readme, usage, and API reference docs to be clearer and (I hope)
    more useful. (#186)

    Removing punctuation from a text via the preprocessing module now replaces
    punctuation marks with a single space rather than an empty string. This gives
    ๐Ÿ‘ better behavior in many situations; for example, "won't" => "won t" rather than
    "wont", the latter of which is a valid word with a different meaning.

    Categories are now correctly extracted from non-English language Wikipedia
    datasets, starting with French and German and extendable to others. (#175)

    Log progress when adding documents to a corpus. At the debug level, every
    doc's addition is logged; at the info level, only one message per batch
    of documents is logged. (#183)

    Bugfixes:

    • Fix two breaking typos in extract.direct_quotations(). (Issue #177)
    • Prevent crashes when adding non-parsed documents to a Corpus. (#180)
    • Fix bugs in keyterms.most_discriminating_terms() that used vsm
      functionality as it was before the changes in v0.6.0. (#189)
    • Fix a breaking typo in vsm.matrix_utils.apply_idf_weighting(), and rename
      the problematic kwarg for consistency with related functions. (#190)

    Contributors:

    Big thanks to @sammous, @dixiekong (nice name!), and @SandyRogers for the pull
    requests, and many more for pointing out various bugs and the rougher edges /
    unsupported use cases of this package.

  • v0.6.0 Changes

    February 25, 2018

    Changes:

    Rename, refactor, and extend I/O functionality (PR #151)

    • Related read/write functions were moved from read.py and write.py into
      format-specific modules, and similar functions were consolidated into one
      with the addition of an arg. For example, write.write_json() and
      write.write_json_lines() => json.write_json(lines=True|False). (See the
      sketch after this list.)
    • Useful functionality was added to a few readers/writers. For example,
      write_json() now automatically handles python dates/datetimes, writing
      them to disk as ISO-formatted strings rather than raising a TypeError
      ("datetime is not JSON serializable", ugh). CSVs can now be written to /
      read from disk when each row is a dict rather than a list. Reading/writing
      HTTP streams now allows for basic authentication.
    • Several things were renamed to improve clarity and consistency from a user's
      perspective, most notably the subpackage name: fileio => io. Others:
      read_file() and write_file() => read_text() and write_text();
      split_record_fields() => split_records(), although I kept an alias
      to the old function for folks; auto_make_dirs boolean kwarg => make_dirs.
    • io.open_sesame() now handles zip files (provided they contain only 1 file)
      as it already does for gzip, bz2, and lzma files. On a related note, Python 2
      users can now open lzma (.xz) files if they've installed backports.lzma.
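
    As a quick illustration of the consolidated API, here's a sketch of
    round-tripping JSON-lines data, assuming the read/write functions are exposed
    at textacy.io as the renames above suggest; check the API reference for exact
    signatures.

      import textacy.io

      records = [{"title": "Doc 1", "text": "Hello, world."},
                 {"title": "Doc 2", "text": "Goodbye, world."}]

      # one function with a `lines` arg replaces write_json() / write_json_lines()
      textacy.io.write_json(records, "records.jsonl", lines=True, make_dirs=True)

      # read the records back one at a time; under the hood, open_sesame()
      # would handle .gz / .bz2 / .xz / .zip filepaths transparently
      for record in textacy.io.read_json("records.jsonl", lines=True):
          print(record["title"])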

    Improve, refactor, and extend vector space model functionality (PRs #156 and #167)

    BM25 term weighting and document-length normalization were implemented, and
    users can now flexibly add and customize individual components of an
    overall weighting scheme (local scaling + global scaling + doc-wise normalization).
    For API sanity, several additions and changes to the Vectorizer init
    params were required --- sorry 'bout it!

    Given all the new weighting possibilities, a Vectorizer.weighting attribute
    was added for curious users, to give a mathematical representation of how
    values in a doc-term matrix are being calculated. Here's a simple and a
    not-so-simple case:

      >>> Vectorizer(apply_idf=True, idf_type='smooth').weighting
      'tf * log((n_docs + 1) / (df + 1)) + 1'
      >>> Vectorizer(tf_type='bm25', apply_idf=True, idf_type='bm25', apply_dl=True).weighting
      '(tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg(lengths)))) * log((n_docs - df + 0.5) / (df + 0.5))'
    

    Terms are now sorted alphabetically after fitting, so you'll have a consistent
    and interpretable ordering in your vocabulary and doc-term-matrix.

    A GroupVectorizer class was added, as a child of Vectorizer and
    an extension of typical document-term matrix vectorization, in which each
    row vector corresponds to the weighted terms co-occurring in a single document;
    here, each row instead corresponds to a group of documents. This allows for
    customized grouping, such as by a shared author or publication year, that may
    span multiple documents, without forcing users to merge/concatenate those
    documents themselves.
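
    A sketch of the intended usage, with illustrative data; the init params mirror
    the Vectorizer examples above, and the exact signature of the grouping arg is
    an assumption.

      from textacy.vsm import GroupVectorizer

      # one term list per document, plus a parallel sequence of group labels
      tokenized_docs = [["speech", "of", "citizens"],
                        ["rights", "of", "citizens"],
                        ["rights", "of", "humans"]]
      authors = ["alice", "alice", "bob"]

      vectorizer = GroupVectorizer(apply_idf=True, idf_type='smooth')
      grp_term_matrix = vectorizer.fit_transform(tokenized_docs, authors)

      # rows correspond to the groups ("alice", "bob"), not the three documents
      print(grp_term_matrix.shape)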

    Lastly, the vsm.py module was refactored into a vsm subpackage with
    two modules. Imports should stay the same, but the code structure is now
    more amenable to future additions.

    Miscellaneous additions and improvements

    • Flesch Reading Ease in the textstats module is now multi-lingual! Language-
      specific formulations for German, Spanish, French, Italian, Dutch, and Russian
      were added, in addition to (the default) English. (PR #158, prompted by Issue #155)
    • Runtime performance, as well as docs and error messages, of functions for
      generating semantic networks from lists of terms or sentences were improved. (PR #163)
    • Labels on named entities from which determiners have been dropped are now
      preserved. There's still a minor gotcha, but it's explained in the docs.
    • The size of textacy's data cache can now be set via an environment
      variable, TEXTACY_MAX_CACHE_SIZE, in case the default 2GB cache doesn't
      meet your needs. (See the sketch after this list.)
    • Docstrings were improved in many ways, large and small, throughout the code.
      May they guide you even more effectively than before!
    • The package version is now set from a single source. This isn't for you so
      much as me, but it does prevent confusing version mismatches between code,
      pypi, and docs.
    • All tests have been converted from unittest to pytest style. They
      run faster, they're more informative in failure, and they're easier to extend.
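
    For example, a sketch of raising the cache limit; the variable is presumably
    read when textacy's cache module loads, so set it before importing, and the
    bytes-denominated value here is an assumption.

      import os

      # set before importing textacy, so the cache sees it at load time
      os.environ["TEXTACY_MAX_CACHE_SIZE"] = str(4 * 1024**3)  # ~4GB, assumed bytes

      import textacy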

    Bugfixes:

    • Fixed an issue where existing metadata associated with a spacy Doc was being
      overwritten with an empty dict when using it to initialize a textacy Doc.
      Users can still overwrite existing metadata, but only if they pass in new data.
    • Added a missing import to the README's usage example. (#149)
    • The intersphinx mapping to numpy got fixed (and items for scipy and
      matplotlib were added, too). Taking advantage of that, a bunch of broken
      object links scattered throughout the docs got fixed.
    • Fixed broken formatting of old entries in the changelog, for your reading pleasure.
  • v0.5.0 Changes

    December 04, 2017

    Changes:

    โฌ†๏ธ Bumped version requirement for spaCy from < 2.0 to >= 2.0 --- textacy no longer
    โฌ†๏ธ works with spaCy 1.x! It's worth the upgrade, though. v2.0's new features and
    API enabled (or required) a few changes on textacy's end

    • textacy.load_spacy() takes the same inputs as the new spacy.load(),
      i.e. a package name string and an optional list of pipes to disable
    • textacy's Doc metadata and language string are now stored in user_data
      directly on the spaCy Doc object; although the API from a user's perspective
      is unchanged, this made the next change possible
    • Doc and Corpus classes are now de/serialized via pickle into a single
      file --- no more side-car JSON files for metadata! Accordingly, the .save()
      and .load() methods on both classes have a simpler API: they take
      a single string specifying the file on disk where data is stored.
      (See the sketch after this list.)
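
    A sketch of the simplified round trip; the filepath and corpus contents are
    illustrative, and the exact Corpus init signature should be checked against
    the v0.5.0 API reference.

      import textacy

      corpus = textacy.Corpus("en", texts=["One small step.", "One giant leap."])

      # a single file on disk now holds the docs and their metadata, via pickle
      corpus.save("corpus.pkl")

      # loading mirrors saving: one path in, one Corpus back out
      corpus = textacy.Corpus.load("corpus.pkl")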

    Cleaned up docs, imports, and tests throughout the entire code base.

    • docstrings and https://textacy.readthedocs.io's API reference are easier to
      read, with better cross-referencing and far fewer broken web links
    • namespaces are less cluttered, and textacy's source code is easier to follow
    • import textacy takes less than half the time from before
    • the full test suite also runs about twice as fast, and most tests are now
      more robust to changes in the performance of spaCy's models

    • consistent adherence to conventions eases users' cognitive load :)

    The module responsible for caching loaded data in memory was cleaned up,
    improved, and renamed: from data.py to cache.py, which is more descriptive
    of its purpose. Otherwise, you shouldn't notice much of a difference besides
    things working correctly.

    • All loaded data (e.g. spacy language pipelines) is now cached together in a
      single LRU cache whose max size is set to 2GB, and the size of each element
      in the cache is now accurately computed. (tl;dr: sys.getsizeof does not
      work on non-built-in objects like, say, a spacy.tokens.Doc.)
    • Loading and downloading of the DepecheMood resource is now less hacky and
      weird, and much closer to how users already deal with textacy's various
      Datasets. In fact, it can be downloaded in exactly the same way as the
      datasets via textacy's new CLI: $ python -m textacy download depechemood.
      P.S. A brief guide for using the CLI got added to the README.

    Several function/method arguments marked for deprecation have been removed.
    If you've been ignoring the warnings that print out when you use lemmatize=True
    instead of normalize='lemma' (etc.), now is the time to update your calls!
    (A migration sketch follows below.)

    • Of particular note: The readability_stats() function has been removed;
      use TextStats(doc).readability_stats instead.
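
    A before/after sketch of the two migrations called out above; the method names
    follow the deprecation warnings, though exact kwargs may vary by function.

      import textacy

      doc = textacy.Doc("The quick brown fox jumps over the lazy dog.", lang="en")

      # before (deprecated, now removed):
      #   doc.to_terms_list(lemmatize=True)
      #   text_stats.readability_stats(doc)

      # after:
      terms = list(doc.to_terms_list(normalize='lemma'))
      stats = textacy.TextStats(doc).readability_stats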

    Bugfixes:

    • In certain situations, the text of a spaCy span was being returned without
      whitespace between tokens; that has been avoided in textacy, and the source bug
      in spaCy got fixed (by yours truly! explosion/spaCy#1621).
    • When adding already-parsed Docs to a Corpus, including metadata
      now correctly overwrites any existing metadata on those docs.
    • Fixed a couple of related issues involving the assignment of a 2-letter language
      string to the .lang attribute of Doc and Corpus objects.
    • textacy's CLI wasn't correctly handling certain dataset kwargs in all cases;
      now, all kwargs get to their intended destinations.
  • v0.4.2 Changes

    November 29, 2017

    Changes:

    • Added a CLI for downloading textacy-related data, inspired by the spaCy
      equivalent. It's temporarily undocumented, but to see available commands and
      options, just pass the usual flag: $ python -m textacy --help. Expect more
      functionality (and docs!) to be added soonish. (#144)
      • Note: The existing Dataset.download() methods work as before, and in fact,
        they are being called under the hood from the command line.
    • Made usage of networkx v2.0-compatible, and therefore dropped the <2.0
      version requirement on that dependency. Upgrade as you please! (#131)
    • Improved the regex for identifying phone numbers so that it's easier to view
      and interpret its matches. (#128)

    Bugfixes:

    • Fixed caching of counts on textacy.Doc to make it instance-specific, rather than
      shared by all instances of the class. Oops.
    • Fixed the currency symbols regex, so as not to replace all instances of the letter "z"
      when a custom string is passed into replace_currency_symbols(). (#137)
    • Fixed the README usage example, which skipped downloading of dataset data. Btw,
      see above for another way! (#124)
    • Fixed a typo in the API reference, which included the SupremeCourt dataset twice
      and omitted the RedditComments dataset. (#129)
    • Fixed a typo in RedditComments.download() that prevented it from downloading
      any data. (#143)

    Contributors:

    Many thanks to @asifm, @harryhoch, and @mdlynch37 for submitting PRs!

  • v0.4.1 Changes

    July 27, 2017

    Changes:

    • Added key classes to the top-level textacy imports, for convenience:
      • textacy.text_stats.TextStats => textacy.TextStats
      • textacy.vsm.Vectorizer => textacy.Vectorizer
      • textacy.tm.TopicModel => textacy.TopicModel
    • Added tests for textacy.Doc and updated the README's usage example

    Bugfixes:

    • Added explicit encoding when opening Wikipedia database files in text mode, to
      resolve an issue that occurred on Windows when no encoding was specified (PR #118)
    • Fixed keyterms.most_discriminating_terms to use the vsm.Vectorizer class
      rather than the vsm.doc_term_matrix function that it replaced (PR #120)
    • Fixed mishandling of a couple optional args in Doc.to_terms_list

    Contributors:

    Thanks to @minketeer and @Gregory-Howard for the fixes!

  • v0.4.0 Changes

    June 21, 2017

    New and Changed:

    • Refactored and expanded built-in corpora, now called datasets (PR #112)
      • The various classes in the old corpora subpackage had a similar but frustratingly not-identical API. Also, some fetched the corresponding dataset automatically, while others required users to do it themselves. Ugh.
      • These classes have been ported over to a new datasets subpackage; they now have a consistent API, consistent features, and consistent documentation. They also have some new functionality, including pain-free downloading of the data and saving it to disk in a stream (so as not to use all your RAM).
      • Also, there's a new dataset: A collection of 2.7k Creative Commons texts from the Oxford Text Archive, which rounds out the included datasets with English-language, 16th-20th century literary works. (h/t @JonathanReeve)
    • A Vectorizer class to convert tokenized texts into variously weighted document-term matrices (Issue #69, PR #113)
      • This class uses the familiar scikit-learn API (which is also consistent with the textacy.tm.TopicModel class) to convert one or more documents in the form of "term lists" into weighted vectors. An initial set of documents is used to build up the matrix vocabulary (via .fit()), which can then be applied to new documents (via .transform()).
      • It's similar in concept and usage to sklearn's CountVectorizer or TfidfVectorizer, but doesn't bundle the tokenization step into vectorization as they do. This means users have more flexibility in deciding which terms to vectorize. This class outright replaces the textacy.vsm.doc_term_matrix() function. (See the sketch after this list.)
    • Customizable automatic language detection for Docs
      • Although cld2-cffi is fast and accurate, its installation is problematic for some users. Since other language detection libraries are available (e.g. langdetect and langid), it makes sense to let users choose, as needed or desired.
      • First, cld2-cffi is now an optional dependency, i.e. is not installed by default. To install it, do pip install textacy[lang] or (for it and all other optional deps) do pip install textacy[all]. (PR #86)
      • Second, the lang param used to instantiate Doc objects may now be a callable that accepts a unicode string and returns a standard 2-letter language code. This could be a function that uses langdetect under the hood, or a function that always returns "de" -- it's up to users. Note that the default value is now textacy.text_utils.detect_language(), which uses cld2-cffi, so the default behavior is unchanged.
    • Customizable punctuation removal in the preprocessing module (Issue #91)
      • Users can now specify which punctuation marks they wish to remove, rather than always removing all marks.
      • In the case that all marks are removed, however, performance is now 5-10x faster by using Python's built-in str.translate() method instead of a regular expression.
    • textacy, installable via conda (PR #100)
      • The package has been added to Conda-Forge, and installation instructions have been added to the docs. Hurray!
    • textacy, now with helpful badges
      • Builds are now automatically tested via Travis CI, and there's a badge in the docs showing whether the build passed or not. The days of my ignoring broken tests in master are (probably) over...
      • There are also badges showing the latest releases on GitHub, pypi, and conda-forge (see above).
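
    A sketch of the fit/transform flow described above, with illustrative term
    lists; the weighting init param reflects this pre-v0.6.0 era and changed in
    later releases, so treat it as an assumption.

      from textacy.vsm import Vectorizer

      # each document is just a list of terms -- tokenize however you like
      train_docs = [["speech", "of", "citizens"], ["rights", "of", "citizens"]]
      new_docs = [["citizens", "rights"]]

      vectorizer = Vectorizer(weighting='tfidf')
      doc_term_matrix = vectorizer.fit_transform(train_docs)  # builds the vocabulary
      new_matrix = vectorizer.transform(new_docs)  # reuses the fitted vocabulary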

    Fixed:

    • Fixed the check for overlap between named entities and unigrams in the Doc.to_terms_list() method (PR #111)
    • Corpus.add_texts() uses CPU_COUNT - 1 threads by default, rather than always assuming that 4 cores are available (Issue #89)
    • Added a missing coding declaration to a test file, without which tests failed for Python 2 (PR #99)
    • readability_stats() now catches an exception raised on empty documents and logs a message, rather than barfing with an unhelpful ZeroDivisionError. (Issue #88)
    • Added a check for empty terms list in terms_to_semantic_network (Issue #105)
    • Added and standardized module-specific loggers throughout the code base; not a bug per se, but certainly some much-needed housecleaning
    • Added a note to the docs about expectations for bytes vs. unicode text (PR #103)

    Contributors:

    Thanks to @henridwyer, @rolando, @pavlin99th, and @kyocum for their contributions! :raised_hands:

  • v0.3.4 Changes

    April 17, 2017

    New and Changed:

    • Improved and expanded calculation of basic counts and readability statistics in the text_stats module.
      • Added a TextStats() class for more convenient, granular access to individual values. See usage docs for more info. When calculating, say, just one readability statistic, performance with this class should be slightly better; if calculating all statistics, performance is worse owing to unavoidable, added overhead in Python for variable lookups. The legacy function text_stats.readability_stats() still exists and behaves as before, but a deprecation warning is displayed.
      • Added functions for calculating Wiener Sachtextformel (PR #77), LIX, and GULPease readability statistics.
      • Added number of long words and number of monosyllabic words to basic counts.
    • Clarified the need for having spacy models installed for most use cases of textacy, in addition to just the spacy package.
      • README updated with comments on this, including links to more extensive spacy documentation. (Issues #66 and #68)
      • Added a function, compat.get_config(), that includes information about which (if any) spacy models are installed.
      • Recent changes to spacy, including a warning message, will also make model problems more apparent.
    • Added an ngrams parameter to keyterms.sgrank(), allowing for more flexibility in specifying valid keyterm candidates for the algorithm. (PR #75)
    • Dropped the dependency on the fuzzywuzzy package, replacing usage of fuzz.token_sort_ratio() with a textacy equivalent in order to avoid license incompatibilities. As a bonus, the new code seems to perform faster! (Issue #62)
      • Note: Outputs are now floats in [0.0, 1.0], consistent with other similarity functions, whereas before outputs were ints in [0, 100]. This has implications for match_threshold values passed to similarity.jaccard(); a warning is displayed and the conversion is performed automatically, for now.
    • A MANIFEST.in file was added to include docs, tests, and distribution files in the source distribution. This is just good practice. (PR #65)

    Fixed:

    • Known acronym-definition pairs are now properly handled in extract.acronyms_and_definitions() (Issue #61)
    • WikiReader no longer crashes on null page element content while parsing (PR #64)
    • Fixed a rare but perfectly legal edge case exception in keyterms.sgrank(), and added a window width sanity check. (Issue #72)
    • Fixed assignment of 2-letter language codes to Doc and Corpus objects when the lang parameter is specified as a full spacy model name.
    • Replaced several leftover print statements with proper logging functions.

    Contributors:

    Big thanks to @oroszgy, @rolando, @covuworie, and @RolandColored for the pull requests!

  • v0.3.3 Changes

    February 10, 2017

    New and Changed:

    • Added a consistent normalize param to functions and methods that require token/span text normalization. Typically, it takes one of the following values: 'lemma' to lemmatize tokens, 'lower' to lowercase tokens, a falsy value to not normalize tokens, or a function that converts a spacy token or span into a string, in whatever way the user prefers (e.g. spacy_utils.normalized_str()). (See the sketch after this list.)
      • Functions modified to use this param: Doc.to_bag_of_terms(), Doc.to_bag_of_words(), Doc.to_terms_list(), Doc.to_semantic_network(), Corpus.word_freqs(), Corpus.word_doc_freqs(), keyterms.sgrank(), keyterms.textrank(), keyterms.singlerank(), keyterms.key_terms_from_semantic_network(), network.terms_to_semantic_network(), network.sents_to_semantic_network()
    • Tweaked keyterms.sgrank() for higher quality results and improved internal performance.
    • When getting both n-grams and named entities with Doc.to_terms_list(), filtering out numeric spans for only one is automatically extended to the other. This prevents unexpected behavior, such as passing filter_nums=True but getting numeric named entities back in the terms list.
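
    A sketch of the normalize options on one affected method; the document text is
    illustrative and the other to_terms_list() kwargs are assumptions.

      import textacy

      doc = textacy.Doc("The cats were chasing the mice.", lang="en")

      list(doc.to_terms_list(ngrams=1, named_entities=False, normalize='lemma'))  # lemmas
      list(doc.to_terms_list(ngrams=1, named_entities=False, normalize='lower'))  # lowercased
      list(doc.to_terms_list(ngrams=1, named_entities=False, normalize=False))    # raw text

      # any callable that maps a spacy Token or Span to a string also works
      list(doc.to_terms_list(ngrams=1, named_entities=False,
                             normalize=lambda tok: tok.text.upper()))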

    Fixed:

    • keyterms.sgrank() no longer crashes if a term is missing from idfs mapping. (@jeremybmerrill, issue #53)
    • Proper nouns are no longer excluded from consideration as keyterms in keyterms.sgrank() and keyterms.textrank(). (@jeremybmerrill, issue #53)
    • Empty strings are now excluded from consideration as keyterms, a bug inherited from spaCy. (@mlehl88, issue #58)
  • v0.3.2 Changes

    November 15, 2016

    New and Changed:

    • Preliminary inclusion of custom spaCy pipelines
      • updated load_spacy() to include explicit path and create_pipeline kwargs, and removed the already-deprecated load_spacy_pipeline() function to avoid confusion around spaCy languages and pipelines
      • added spacy_pipelines module to hold implementations of custom spaCy pipelines, including a basic one that merges entities into single tokens
      • note: necessarily bumped minimum spaCy version to 1.1.0+
      • see the announcement here: https://explosion.ai/blog/spacy-deep-learning-keras
    • To reduce code bloat, made the matplotlib dependency optional and dropped the gensim dependency
      • to install matplotlib at the same time as textacy, do $ pip install textacy[viz]
      • bonus: backports.csv is now only installed for Py2 users
      • thanks to @mbatchkarov for the request
    • Improved performance of textacy.corpora.WikiReader().texts(); results should stream faster and have cleaner plaintext content than when they were produced by gensim. This should also fix a bug reported in Issue #51 by @baisk
    • Added a Corpus.vectors property that returns a matrix of shape (# documents, vector dim) containing the average word2vec-style vector representation of constituent tokens for all Docs