Changelog History

  • v0.9.1

    September 03, 2019

    πŸ”„ Changed:

    • ⚑️ Tweaked TopicModel class to work with newer versions of scikit-learn, and updated version requirements accordingly from >=0.18.0,<0.21.0 to >=0.19

    πŸ›  Fixed:

    • πŸ›  Fixed residual bugs in the script for training language identification pipelines, then trained and released one using scikit-learn==0.19 to prevent errors for users on that version
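
    For reference, this pipeline backs textacy's language detection entry point (textacy.lang_utils.detect_lang(), per the v0.7.0 notes below); a minimal usage sketch, assuming a standard install:

      from textacy import lang_utils

      print(lang_utils.detect_lang("Ceci n'est pas une pipe."))  # -> "fr"
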
  • v0.9.0

    September 03, 2019

    πŸ‘€ Note: textacy is now PY3-only! πŸŽ‰ Specifically, support for PY2.7 has been dropped, and the minimum PY3 version has been bumped to 3.6 (PR #261). See below for related changes.

    πŸ†• New:

    • βž• Added augmentation subpackage for basic text data augmentation (PR #268, #269)
      • implemented several transformer functions for substituting, inserting, swapping, and deleting elements of text at both the word- and character-level
      • implemented an Augmenter class for combining multiple transforms and applying them to spaCy Docs in a randomized but configurable manner
      • Note: This API is provisional, and subject to change in future releases.
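
    A minimal sketch of this provisional API; Augmenter is named above, but the transform names and num semantics shown here are assumptions based on the operations listed:

      import textacy
      from textacy import augmentation

      doc = textacy.make_spacy_doc("The quick brown fox jumps over the lazy dog.", lang="en")

      # Combine word-level transforms; some (e.g. synonym substitution) may first
      # require downloading a supporting linguistic resource.
      augmenter = augmentation.Augmenter(
          transforms=[
              augmentation.transforms.substitute_word_synonyms,  # assumed name
              augmentation.transforms.delete_words,              # assumed name
          ],
          num=1,  # apply one randomly chosen transform per call (assumed semantics)
      )
      augmented_doc = augmenter.apply_transforms(doc)
      print(augmented_doc.text)
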
    • βž• Added resources subpackage for standardized access to linguistic resources (PR #265)
      • DepecheMood++: high-coverage emotion lexicons for understanding the emotions evoked by a text. Updated from the previous version, now featuring better English data and Italian data, with expanded, consistent functionality.
      • removed lexicon_methods.py module with previous implementation
      • ConceptNet: multilingual knowledge base for representing relationships between words, similar to WordNet. Currently supports getting word antonyms, hyponyms, meronyms, and synonyms in dozens of languages.
    • βž• Added UDHR dataset, a collection of translations of the Universal Declaration of Human Rights (PR #271)
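
    And a sketch of the standardized resources access pattern; the method names mirror the relations listed above (synonyms, antonyms, etc.) but should be treated as assumptions:

      import textacy.resources

      rs = textacy.resources.ConceptNet()
      rs.download()  # fetch the underlying data on first use

      print(rs.get_synonyms("happy", lang="en"))  # assumed method name
      print(rs.get_antonyms("happy", lang="en"))  # assumed method name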

    πŸ”„ Changed:

    • ⚑️ Updated and extended functionality previously blocked by PY2 compatibility while reducing code bloat / complexity
      • made many args keyword-only, to prevent user error
      • args accepting strings for directory / file paths now also accept pathlib.Path objects, with pathlib adopted widely under the hood
      • increased minimum versions and/or uncapped maximum versions of several dependencies, including jellyfish, networkx, and numpy
    • βž• Added a Portuguese-specific formulation of Flesch Reading Ease score to text_stats (PR #263)
    • Reorganized and grouped together some like functionality
      • moved core functionality for loading spaCy langs and making spaCy docs into spacier.core, out of cache.py and doc.py
      • moved some general-purpose functionality from dataset.utils to io.utils and utils.py
      • moved function for loading "hyphenator" out of cache.py and into text_stats.py, where it's used
    • πŸš€ Re-trained and released language identification pipelines using a better mix of training data, for slightly improved performance; also added the script used to train the pipeline
    • πŸ”„ Changed API Reference docs to show items in source code rather than alphabetical order, which should make the ordering more human-friendly
    • πŸ“‡ Updated repo README and PyPi metadata to be more consistent and representative of current functionality
    • Removed previously deprecated textacy.io.split_record_fields() function
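
    As an example of the pathlib support noted above, str and pathlib.Path arguments should now be interchangeable; a minimal sketch, assuming textacy.io.write_text() / read_text() keep their documented signatures:

      import pathlib
      import textacy.io

      fpath = pathlib.Path("data") / "example.txt"
      textacy.io.write_text("Hello, textacy!", fpath, make_dirs=True)
      for text in textacy.io.read_text(fpath):
          print(text)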

    πŸ›  Fixed:

    • Fixed a regex for cleaning up crufty terms to prevent catastrophic backtracking in certain edge cases (true story: this bug was encountered in production code, and ruined my day)
    • πŸ›  Fixed bad handling of edge cases in sCAKE keyterm extraction (Issue #270)
    • πŸ”„ Changed order in which URL regexes are applied in preprocessing.replace_urls() to properly handle certain edge case URLs (Issue #267)
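
    For context on that last fix, a minimal usage sketch of the function in question (expected output shown as a comment):

      from textacy import preprocessing

      text = "Read more at https://en.wikipedia.org/wiki/Higgs_boson or bit.ly/xyz"
      print(preprocessing.replace_urls(text, replace_with="_URL_"))
      # -> "Read more at _URL_ or _URL_"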

    Contributors:

    🍱 Thanks much to @hugoabonizio for the contribution. 🀝

  • v0.8.0

    July 14, 2019

    πŸ†• New and Changed:

    • ♻️ Refactored and expanded text preprocessing functionality (PR #253)
      • Moved code from a top-level preprocess module into a preprocessing sub-package, and reorganized it in the process
      • Added new functions:
        • replace_hashtags() to replace hashtags like #FollowFriday or #spacyIRL2019 with _TAG_
        • replace_user_handles() to replace user handles like @bjdewilde or @spacy_io with _USER_
        • replace_emojis() to replace emoji symbols like πŸ˜‰ or πŸš€ with _EMOJI_
        • normalize_hyphenated_words() to join hyphenated words back together, like antici- pation => anticipation
        • normalize_quotation_marks() to replace "fancy" quotation marks with simple ascii equivalents, like β€œthe god particle” => "the god particle"
      • Changed a couple functions for clarity and consistency:
        • replace_currency_symbols() now replaces all dedicated ascii and unicode currency symbols with _CUR_, rather than just a subset thereof, and no longer provides for replacement with the corresponding currency code (like $ => USD)
        • remove_punct() now has a fast (bool) kwarg rather than method (str)
      • Removed normalize_contractions(), preprocess_text(), and fix_bad_unicode() functions, since they were bad/awkward and more trouble than they were worth
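
    A quick sketch of the reorganized preprocessing functions described above, using the default replacement tokens from this release:

      from textacy import preprocessing

      print(preprocessing.replace_hashtags("Join us for #spacyIRL2019!"))   # -> "Join us for _TAG_!"
      print(preprocessing.replace_user_handles("cc: @spacy_io"))            # -> "cc: _USER_"
      print(preprocessing.replace_emojis("nice work πŸš€"))                   # -> "nice work _EMOJI_"
      print(preprocessing.normalize_hyphenated_words("antici- pation"))     # -> "anticipation"
      print(preprocessing.normalize_quotation_marks("β€œthe god particle”"))  # -> '"the god particle"'
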
    • ♻️ Refactored and expanded keyterm extraction functionality (PR #257)
      • Moved code from a top-level keyterms module into a ke sub-package, and cleaned it up / standardized arg names / better shared functionality in the process
      • Added new unsupervised keyterm extraction algorithms: YAKE (ke.yake()), sCAKE (ke.scake()), and PositionRank (ke.textrank(), with non-default parameter values)
      • Added new methods for selecting candidate keyterms: longest matching subsequence candidates (ke.utils.get_longest_subsequence_candidates()) and pattern-matching candidates (ke.utils.get_pattern_matching_candidates())
      • Improved speed of SGRank implementation, and generally optimized much of the code
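
    And a sketch of the new ke sub-package in action; the topn kwarg shown here is an assumption:

      import textacy
      import textacy.ke

      text = "Keyterm extraction pulls salient terms and phrases out of a document."
      doc = textacy.make_spacy_doc(text, lang="en")
      print(textacy.ke.textrank(doc, topn=5))  # [(term, score), ...]
      print(textacy.ke.yake(doc, topn=5))
      print(textacy.ke.scake(doc, topn=5))
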
    • πŸ‘Œ Improved document similarity functionality (PR #256)
      • Added a character ngram-based similarity measure (similarity.character_ngrams()), for something that's useful in different contexts than the other measures
      • Removed Jaro-Winkler string similarity measure (similarity.jaro_winkler()), since it didn't add much beyond other measures
      • Improved speed of Token Sort Ratio implementation
      • Replaced python-levenshtein dependency with jellyfish, for its active development, better documentation, and actually-compliant license
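
    For example, the character-ngram measure compares raw strings directly; a minimal sketch:

      from textacy import similarity

      print(similarity.character_ngrams("coffee", "caffeine"))         # float in [0.0, 1.0]
      print(similarity.token_sort_ratio("the cat sat", "sat the cat"))
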
    • βž• Added customizability to certain functionality
      • Added options to Doc._.to_bag_of_words() and Corpus.word_counts() for filtering out stop words, punctuation, and/or numbers (PR #249)
      • Allowed for objects that look like sklearn-style topic modeling classes to be passed into tm.TopicModel() (PR #248)
      • Added options to customize rc params used by matplotlib when drawing a "termite" plot in viz.draw_termite_plot() (PR #248)
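
    A sketch of the new filtering options on Doc._.to_bag_of_words(); the exact kwarg names (filter_stops, filter_punct, filter_nums) are assumptions based on the description above:

      import textacy

      doc = textacy.make_spacy_doc("The 2 cats sat on the mat!", lang="en")
      bow = doc._.to_bag_of_words(
          as_strings=True,
          filter_stops=True,  # drop stop words like "the"
          filter_punct=True,  # drop "!"
          filter_nums=True,   # drop "2"
      )
      print(bow)  # e.g. {"cat": 1, "sit": 1, "mat": 1}
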
    • πŸ”€ Removed deprecated functions with direct replacements: io.utils.get_filenames() and spacier.components.merge_entities()

    Contributors:

    🍱 Huge thanks to @kjoshi and @zf109 for the PRs! πŸ™Œ

  • v0.7.1

    June 25, 2019

    πŸ†• New:

    • βž• Added a default, built-in language identification classifier that's moderately fast, moderately accurate, and covers a relatively large number of languages [PR #247]
      • Implemented a Google CLD3-inspired model in scikit-learn and trained it on ~1.5M texts in ~130 different languages spanning a wide variety of subject matter and stylistic formality; overall, speed and performance compare favorably to other open-source options (langid, langdetect, cld2-cffi, and cld3)
      • Dropped cld2-cffi dependency [Issue #246]
    • Added extract.matches() function to extract spans from a document matching one or more patterns of per-token (attribute, value) pairs, with optional quantity qualifiers; this is a convenient interface to spaCy's rule-based Matcher and a more powerful replacement for textacy's existing (now deprecated) extract.pos_regex_matches()
    • 🚚 Added preprocess.normalize_unicode() function to transform unicode characters into their canonical forms; this is a less-intensive consolation prize for the previously-removed fix_unicode() function
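
    Two quick sketches of the additions above; the token-pattern format follows spaCy's rule-based Matcher, and the form kwarg on normalize_unicode() is an assumption:

      import textacy
      from textacy import extract, preprocess

      doc = textacy.make_spacy_doc("The big black cat sleeps.", lang="en")
      # One pattern of per-token (attribute, value) pairs, with a quantity qualifier.
      pattern = [{"POS": "ADJ", "OP": "+"}, {"POS": "NOUN"}]
      for span in extract.matches(doc, pattern):
          print(span.text)  # e.g. "big black cat"

      # Transform unicode characters into their canonical (NFC) forms.
      print(preprocess.normalize_unicode("cafe\u0301", form="NFC"))  # -> "cafΓ©"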

    πŸ”„ Changed:

    • Enabled loading blank spaCy Language pipelines (tokenization only -- no model-based tagging, parsing, etc.) via load_spacy_lang(name, allow_blank=True) for use cases that don't rely on annotations; disabled by default to avoid unwelcome surprises
    • Changed inclusion/exclusion and de-duplication of entities and ngrams in to_terms_list() [Issues #169, #179]
      • entities = True => include entities, and drop exact duplicate ngrams
      • entities = False => don't include entities, and also drop exact duplicate ngrams
      • entities = None => use ngrams as-is without checking against entities
    • 🚚 Moved to_collection() function from the datasets.utils module to the top-level utils module, for use throughout the code base
    • Added quoting option to io.read_csv() and io.write_csv(), for problematic cases
    • πŸ”€ Deprecated the spacier.components.merge_entities() pipeline component, an implementation of which has since been added into spaCy itself
    • πŸ“š Updated documentation for developer convenience and reader clarity
      • Split API reference docs into related chunks, rather than having them all together in one long page, and tidied up headers
      • Fixed errors / inconsistencies in various docstrings (a never-ending struggle...)
      • Ported package readme and changelog from .rst to .md format
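
    A minimal sketch of the blank-pipeline loading described at the top of this list:

      import textacy

      # falls back to a tokenization-only pipeline, rather than raising an error
      nlp = textacy.load_spacy_lang("en", allow_blank=True)
      doc = nlp("No tagger, parser, or NER needed here.")
      print([tok.text for tok in doc])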

    πŸ›  Fixed:

    • The NotImplementedError previously added to preprocess.fix_unicode() is now raised rather than returned [Issue #243]
  • v0.7.0

    May 13, 2019

    πŸ†• New and Changed:

    • βœ‚ Removed textacy.Doc, and split its functionality into two parts
      • New: Added textacy.make_spacy_doc() as a convenient and flexible entry point for making spaCy Docs from text or (text, metadata) pairs, with optional spaCy language pipeline specification. It's similar to textacy.Doc.__init__, except that text and metadata are passed in together as a 2-tuple.
      • New: Added a variety of custom doc property and method extensions to the global spacy.tokens.Doc class, accessible via its Doc._ "underscore" property. These are similar to the properties/methods on textacy.Doc; they just require an interstitial underscore. For example, textacy.Doc.to_bag_of_words() => spacy.tokens.Doc._.to_bag_of_words().
      • New: Added functions for setting, getting, and removing these extensions. Note that they are set automatically when textacy is imported.
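
    A minimal sketch of the new entry point and the underscore extensions; the metadata accessor shown (doc._.meta) is an assumption:

      import textacy

      record = ("The year was 2087.", {"author": "B. DeWilde"})
      doc = textacy.make_spacy_doc(record, lang="en")

      print(doc._.meta)  # -> {"author": "B. DeWilde"}  (assumed accessor)
      print(doc._.to_bag_of_words(as_strings=True))
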
    • 🐎 Simplified and improved performance of textacy.Corpus
      • Documents are now added through a simpler API, either in Corpus.__init__ or Corpus.add(); they may be one or a stream of texts, (text, metadata) pairs, or existing spaCy Docs. When adding many documents, the spaCy language processing pipeline is used in a faster and more efficient way.
      • Saving / loading corpus data to disk is now more efficient and robust.
      • Note: Corpus is now a collection of spaCy Docs rather than textacy.Doc objects.
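
    A sketch of the simpler Corpus-building API described above:

      import textacy

      records = [
          ("Area man wins the lottery.", {"year": 2017}),
          ("Local cat sleeps all day.", {"year": 2018}),
      ]
      corpus = textacy.Corpus("en", data=records)
      corpus.add("One more plain-text document.")
      print(len(corpus))  # -> 3
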
    • Simplified, standardized, and added Dataset functionality
      • New: Added an IMDB dataset, built on the classic 2011 dataset commonly used to train sentiment analysis models.
      • New: Added a base Wikimedia dataset, from which a reworked Wikipedia dataset and a separate Wikinews dataset inherit. The underlying data source has changed, from XML db dumps of raw wiki markup to JSON db dumps of (relatively) clean text and metadata; now, the code is simpler, faster, and totally language-agnostic.
      • Dataset.records() now streams (text, metadata) pairs rather than a dict containing both text and metadata, so users don't need to know field names and split them into separate streams before creating Doc or Corpus objects from the data.
      • Filtering and limiting the number of texts/records produced is now clearer and more consistent between .texts() and .records() methods on a given Dataset --- and more performant!
      • Downloading datasets now always shows progress bars and saves to the same file names. When appropriate, downloaded archive files' contents are automatically extracted for easy inspection.
      • Common functionality (such as validating filter values) is now standardized and consolidated in the datasets.utils module.
    • Quality of life improvements
      • Reduced load time for import textacy from ~2-3 seconds to ~1 second, by lazy-loading expensive variables, deferring a couple heavy imports, and dropping a couple dependencies. Specifically:
        • ftfy was dropped, and a NotImplementedError is now raised in textacy's wrapper function, textacy.preprocess.fix_bad_unicode(). Users with bad unicode should now directly call ftfy.fix_text().
        • ijson was dropped, and the behavior of textacy.read_json() is now simpler and consistent with other functions for line-delimited data.
        • mwparserfromhell was dropped, since the reworked Wikipedia dataset no longer requires complicated and slow parsing of wiki markup.
      • Renamed certain functions and variables for clarity, and for consistency with existing conventions:
        • textacy.load_spacy() => textacy.load_spacy_lang()
        • textacy.extract.named_entities() => textacy.extract.entities()
        • textacy.data_dir => textacy.DEFAULT_DATA_DIR
        • filename => filepath and dirname => dirpath when specifying full paths to files/dirs on disk, and textacy.io.utils.get_filenames() => textacy.io.utils.get_filepaths() accordingly
        • compiled regular expressions now consistently start with RE_
        • SpacyDoc => Doc, SpacySpan => Span, SpacyToken => Token, SpacyLang => Language as variables and in docs
      • Removed deprecated functionality
        • top-level spacy_utils.py and spacy_pipelines.py are gone; use equivalent functionality in the spacier subpackage instead
        • math_utils.py is gone; it was long neglected, and never actually used
      • Replaced textacy.compat.bytes_to_unicode() and textacy.compat.unicode_to_bytes() with textacy.compat.to_unicode() and textacy.compat.to_bytes(), which are safer and accept either binary or text strings as input.
      • Moved and renamed language detection functionality, textacy.text_utils.detect_language() => textacy.lang_utils.detect_lang(). The idea is to add more/better lang-related functionality here in the future.
      • Updated and cleaned up documentation throughout the code base.
      • Added and refactored many tests, for both new and old functionality, significantly increasing test coverage while significantly reducing run-time. Also, added a proper coverage report to CI builds. This should help prevent future errors and inspire better test-writing.
      • Bumped the minimum required spaCy version: v2.0.0 => v2.0.12, for access to their full set of custom extension functionality.

    πŸ›  Fixed:

    • The progress bar during an HTTP download now always closes, preventing weird nesting issues if another bar is subsequently displayed.
    • Filtering datasets by multiple values performed either a logical AND or OR over the values, which was confusing; now, a logical OR is always performed.
    • The existence of files/directories on disk is now checked properly via os.path.isfile() or os.path.isdir(), rather than os.path.exists().
    • πŸ›  Fixed a variety of formatting errors raised by sphinx when generating HTML docs.
  • v0.6.3

    March 23, 2019

    πŸ†• New:

    • βž• Added a proper contributing guide and code of conduct, as well as separate
      GitHub issue templates for different user situations. This should help folks
      contribute to the project more effectively, and make maintaining it a bit easier,
      too. [Issue #212]
    • πŸ“š Gave the documentation a new look, using a template popularized by requests.
    • βž• Added documentation on dealing with multi-lingual datasets. [Issue #233]
    • πŸ“¦ Made some minor adjustments to package dependencies, the way they're specified,
      and the Travis CI setup, making for a faster and better development experience.
    • 🍱 Confirmed and enabled compatibility with v2.1+ of spacy. πŸ’«

    πŸ”„ Changed:

    • πŸ‘Œ Improved the Wikipedia dataset class in a variety of ways: it can now read
      Wikinews db dumps; access records in namespaces other than the usual "0"
      (such as category pages in namespace "14"); parse and extract category pages
      in several languages, including in the case of bad wiki markup; and filter out
      section headings from the accompanying text via an include_headings kwarg.
      [PR #219, #220, #223, #224, #231]
    • βœ‚ Removed the transliterate_unicode() preprocessing function that transliterated
      non-ascii text into a reasonable ascii approximation, for technical and
      philosophical reasons. Also removed its GPL-licensed unidecode dependency,
      for legal-ish reasons. [Issue #203]
    • βž• Added convention-abiding exclude argument to the function that writes
      spacy docs to disk, to limit which pipeline annotations are serialized.
      Replaced the existing but non-standard include_tensor arg.
    • Deprecated the n_threads argument in Corpus.add_texts(), which had not
      been working in spacy.pipe for some time and, as of v2.1, is defunct.
    • βœ… Made many tests model- and python-version agnostic and thus less likely to break
      when spacy releases new and improved models.
    • Auto-formatted the entire code base using black; the results aren't always
      more readable, but they are pleasingly consistent.

    πŸ›  Fixed:

    • Fixed bad behavior of key_terms_from_semantic_network(), where an error
      would be raised if no suitable key terms could be found; now, an empty list
      is returned instead. [Issue #211]
    • πŸ›  Fixed variable name typo so GroupVectorizer.fit() actually works. [Issue #215]
    • πŸ›  Fixed a minor typo in the quick-start docs. [PR #217]
    • Check for and filter out any named entities that are entirely whitespace,
      seemingly caused by an issue in spacy.
    • πŸ›  Fixed an undefined variable error when merging spans. [Issue #225]
    • πŸ›  Fixed a unicode/bytes issue in an experimental function for deserializing spacy
      docs in "binary" format. [Issue #228, PR #229]

    Contributors:

    Many thanks to @abevieiramota, @ckot, @Jude188, and @digest0r for their help!

  • v0.6.2

    July 19, 2018

    πŸ”„ Changes:

    • βž• Add a spacier.util module, and add / reorganize relevant functionality
      • move (most) spacy_util functions here, and add a deprecation warning to
        the spacy_util module
      • rename normalized_str() => get_normalized_text(), for consistency and clarity
      • add a function to split long texts up into chunks but combine them into
        a single Doc. This is a workaround for a current limitation of spaCy's
        neural models, whose RAM usage scales with the length of input text
        (see the sketch after this list).
    • βž• Add experimental support for reading and writing spaCy docs in binary format,
      where multiple docs are contained in a single file. This functionality was
      supported by spaCy v1, but is not in spaCy v2; I've implemented a workaround
      that should work well in most situations, but YMMV.
    • πŸ“š Package documentation is now "officially" hosted on GitHub pages. The docs
      are automatically built on and deployed from Travis via doctr, so they
      stay up-to-date with the master branch on GitHub. Maybe someday I'll get
      ReadTheDocs to successfully build textacy once again...
      • Minor improvements/updates to documentation
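
    The chunking helper mentioned above might be used like so; the module path and function name here are assumptions:

      from textacy.spacier.util import make_doc_from_text_chunks  # assumed name

      long_text = "This is a sentence. " * 200000
      doc = make_doc_from_text_chunks(long_text, lang="en", chunk_size=100000)
      print(len(doc))  # one combined Doc, parsed chunk by chunk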

    πŸ›  Bugfixes:

    • Add missing return statement in deprecated text_stats.flesch_readability_ease()
      function (Issue #191)
    • πŸ’… Catch an empty graph error in bestcoverage-style keyterm ranking (Issue #196)
    • πŸ›  Fix mishandling when specifying a single named entity type to include or
      exclude in extract.named_entities (Issue #202)
    • πŸ‘‰ Make networkx usage in keyterms module compatible with v1.11+ (Issue #199)
  • v0.6.1

    April 12, 2018

    πŸ”„ Changes:

    βž• Add a new spacier sub-package for spaCy-oriented functionality (#168, #187)

    • Thus far, this includes a components module with two custom spaCy
      pipeline components: one to compute text stats on parsed documents, and
      another to merge named entities into single tokens in an efficient manner.
      More to come!
    • Similar functionality in the top-level spacy_pipelines module has been
      deprecated; it will be removed in v0.7.0.

    ⚑️ Update the readme, usage, and API reference docs to be clearer and (I hope)
    more useful. (#186)

    Removing punctuation from a text via the preprocessing module now replaces
    punctuation marks with a single space rather than an empty string. This gives
    better behavior in many situations; for example, "won't" => "won t" rather than
    "wont", the latter of which is a valid word with a different meaning.

    Categories are now correctly extracted from non-English language Wikipedia
    datasets, starting with French and German and extendable to others. (#175)

    🌲 Log progress when adding documents to a corpus. At the debug level, every
    doc's addition is logged; at the info level, only one message per batch
    of documents is logged. (#183)

    πŸ›  Bugfixes:

    • πŸ›  Fix two breaking typos in extract.direct_quotations(). (issue #177)
    • πŸ“œ Prevent crashes when adding non-parsed documents to a Corpus. (#180)
    • Fix bugs in keyterms.most_discriminating_terms() that used vsm
      functionality as it was before the changes in v0.6.0. (#189)
    • Fix a breaking typo in vsm.matrix_utils.apply_idf_weighting(), and rename
      the problematic kwarg for consistency with related functions. (#190)

    Contributors:

    Big thanks to @sammous, @dixiekong (nice name!), and @SandyRogers for the pull
    requests, and many more for pointing out various bugs and the rougher edges /
    unsupported use cases of this package.

  • v0.6.0

    February 25, 2018

    πŸ”„ Changes:

    ♻️ Rename, refactor, and extend I/O functionality (PR #151)

    • Related read/write functions were moved from read.py and write.py into
      format-specific modules, and similar functions were consolidated into one
      with the addition of an arg. For example, write.write_json() and
      write.write_json_lines() => json.write_json(lines=True|False).
    • Useful functionality was added to a few readers/writers. For example,
      write_json() now automatically handles python dates/datetimes, writing
      them to disk as ISO-formatted strings rather than raising a TypeError
      ("datetime is not JSON serializable", ugh). CSVs can now be written to /
      read from disk when each row is a dict rather than a list. Reading/writing
      HTTP streams now allows for basic authentication.
    • Several things were renamed to improve clarity and consistency from a user's
      perspective, most notably the subpackage name: fileio => io. Others:
      read_file() and write_file() => read_text() and write_text();
      split_record_fields() => split_records(), although I kept an alias
      to the old function for folks; auto_make_dirs boolean kwarg => make_dirs.
    • io.open_sesame() now handles zip files (provided they contain only 1 file)
      as it already does for gzip, bz2, and lzma files. On a related note, Python 2
      users can now open lzma (.xz) files if they've installed backports.lzma.
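
    A sketch of the consolidated I/O described above, assuming the json module keeps the signatures implied here:

      import datetime
      import textacy.io

      records = [{"title": "one", "dt": datetime.datetime(2018, 2, 25)}]
      # datetimes are written as ISO-formatted strings automatically
      textacy.io.json.write_json(records, "records.json", lines=True, make_dirs=True)
      for record in textacy.io.json.read_json("records.json", lines=True):
          print(record)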

    πŸ‘Œ Improve, refactor, and extend vector space model functionality (PRs #156 and #167)

    BM25 term weighting and document-length normalization were implemented, and
    users can now flexibly add and customize individual components of an
    overall weighting scheme (local scaling + global scaling + doc-wise normalization).
    For API sanity, several additions and changes to the Vectorizer init
    params were required --- sorry bout it!

    Given all the new weighting possibilities, a Vectorizer.weighting attribute
    was added for curious users, to give a mathematical representation of how
    values in a doc-term matrix are being calculated. Here's a simple and a
    not-so-simple case:

     >>> Vectorizer(apply_idf=True, idf_type='smooth').weighting
     'tf * log((n_docs + 1) / (df + 1)) + 1'
     >>> Vectorizer(tf_type='bm25', apply_idf=True, idf_type='bm25', apply_dl=True).weighting
     '(tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg(lengths)))) * log((n_docs - df + 0.5) / (df + 0.5))'

    Terms are now sorted alphabetically after fitting, so you'll have a consistent
    and interpretable ordering in your vocabulary and doc-term-matrix.

    A GroupVectorizer class was added, as a child of Vectorizer and
    an extension of typical document-term matrix vectorization, in which each
    row vector corresponds to the weighted terms co-occurring in a single document.
    This allows for customized grouping, such as by a shared author or publication year,
    that may span multiple documents, without forcing users to merge / concatenate
    those documents themselves.
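
    A sketch of grouped vectorization, assuming fit_transform() takes per-document terms plus a parallel sequence of group labels:

      from textacy.vsm import GroupVectorizer

      tokenized_docs = [
          ["cat", "sat", "mat"],
          ["dog", "sat", "log"],
          ["cat", "dog", "friend"],
      ]
      years = [2017, 2017, 2018]  # one group label per document

      vectorizer = GroupVectorizer(apply_idf=True)
      grp_term_matrix = vectorizer.fit_transform(tokenized_docs, years)
      print(grp_term_matrix.shape)  # (n_groups, n_terms), here 2 rows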

    ♻️ Lastly, the vsm.py module was refactored into a vsm subpackage with
    two modules. Imports should stay the same, but the code structure is now
    more amenable to future additions.

    Miscellaneous additions and improvements

    • Flesch Reading Ease in the textstats module is now multi-lingual! Language-
      specific formulations for German, Spanish, French, Italian, Dutch, and Russian
      were added, in addition to (the default) English. (PR #158, prompted by Issue #155)
    • Runtime performance, as well as docs and error messages, of functions for
      generating semantic networks from lists of terms or sentences were improved. (PR #163)
    • Labels on named entities from which determiners have been dropped are now
      preserved. There's still a minor gotcha, but it's explained in the docs.
    • The size of textacy's data cache can now be set via an environment
      variable, TEXTACY_MAX_CACHE_SIZE, in case the default 2GB cache doesn't
      meet your needs.
    • Docstrings were improved in many ways, large and small, throughout the code.
      May they guide you even more effectively than before!
    • The package version is now set from a single source. This isn't for you so
      much as me, but it does prevent confusing version mismatches b/w code, pypi,
      and docs.
    • All tests have been converted from unittest to pytest style. They
      run faster, they're more informative in failure, and they're easier to extend.

    πŸ›  Bugfixes:

    • πŸ›  Fixed an issue where existing metadata associated with a spacy Doc was being
      overwritten with an empty dict when using it to initialize a textacy Doc.
      Users can still overwrite existing metadata, but only if they pass in new data.
    • βž• Added a missing import to the README's usage example. (#149)
    • πŸ›  The intersphinx mapping to numpy got fixed (and items for scipy and
      matplotlib were added, too). Taking advantage of that, a bunch of broken
      object links scattered throughout the docs got fixed.
    • πŸ›  Fixed broken formatting of old entries in the changelog, for your reading pleasure.
  • v0.5.0

    December 04, 2017

    πŸ”„ Changes:

    ⬆️ Bumped version requirement for spaCy from < 2.0 to >= 2.0 --- textacy no longer
    works with spaCy 1.x! It's worth the upgrade, though. v2.0's new features and
    API enabled (or required) a few changes on textacy's end:

    • textacy.load_spacy() takes the same inputs as the new spacy.load(),
      i.e. a package name string and an optional list of pipes to disable
    • textacy's Doc metadata and language string are now stored in user_data
      directly on the spaCy Doc object; although the API from a user's perspective
      is unchanged, this made the next change possible
    • Doc and Corpus classes are now de/serialized via pickle into a single
      file --- no more side-car JSON files for metadata! Accordingly, the .save()
      and .load() methods on both classes have a simpler API: they take
      a single string specifying the file on disk where data is stored.

    βœ… Cleaned up docs, imports, and tests throughout the entire code base.

    • docstrings and https://textacy.readthedocs.io 's API reference are easier to
      read, with better cross-referencing and far fewer broken web links
    • namespaces are less cluttered, and textacy's source code is easier to follow
    • import textacy takes less than half the time from before
    • the full test suite also runs about twice as fast, and most tests are now
      more robust to changes in the performance of spaCy's models

    • consistent adherence to conventions eases users' cognitive load :)

    The module responsible for caching loaded data in memory was cleaned up and
    improved, as well as renamed: from data.py to cache.py, which is more
    descriptive of its purpose. Otherwise, you shouldn't notice much of a difference
    besides things working correctly.

    • All loaded data (e.g. spacy language pipelines) is now cached together in a
      single LRU cache whose max size is set to 2GB, and the size of each element
      in the cache is now accurately computed. (tl;dr: sys.getsizeof does not
      work on non-built-in objects like, say, a spacy.tokens.Doc.)
    • Loading and downloading of the DepecheMood resource is now less hacky and
      weird, and much closer to how users already deal with textacy's various
      Datasets. In fact, it can be downloaded in exactly the same way as the
      datasets via textacy's new CLI: $ python -m textacy download depechemood.
      P.S. A brief guide for using the CLI got added to the README.

    🚚 Several function/method arguments marked for deprecation have been removed.
    If you've been ignoring the warnings that print out when you use lemmatize=True
    instead of normalize='lemma' (etc.), now is the time to update your calls!

    • Of particular note: The readability_stats() function has been removed;
      use TextStats(doc).readability_stats instead.

    πŸ›  Bugfixes:

    • In certain situations, the text of a spaCy span was being returned without
      whitespace between tokens; that has been avoided in textacy, and the source bug
      in spaCy got fixed (by yours truly! explosion/spaCy#1621).
    • πŸ“‡ When adding already-parsed Docs to a Corpus, including metadata
      now correctly overwrites any existing metadata on those docs.
    • πŸ›  Fixed a couple related issues involving the assignment of a 2-letter language
      string to the .lang attribute of Doc and Corpus objects.
    • textacy's CLI wasn't correctly handling certain dataset kwargs in all cases;
      now, all kwargs get to their intended destinations.