textacy v0.6.3 Release Notes

Release Date: 2019-03-23
    New:

    • Added a proper contributing guide and code of conduct, as well as separate
      GitHub issue templates for different user situations. This should help folks
      contribute to the project more effectively, and make maintaining it a bit easier,
      too. [Issue #212]
    • Gave the documentation a new look, using a template popularized by requests.
      Added documentation on dealing with multi-lingual datasets. [Issue #233]
    • Made some minor adjustments to package dependencies, the way they're specified,
      and the Travis CI setup, making for a faster and better development experience.
    • Confirmed and enabled compatibility with v2.1+ of spacy.

    Changed:

    • Improved the Wikipedia dataset class in a variety of ways: it can now read
      Wikinews db dumps; access records in namespaces other than the usual "0"
      (such as category pages in namespace "14"); parse and extract category pages
      in several languages, including in the case of bad wiki markup; and filter out
      section headings from the accompanying text via an include_headings kwarg.
      [PR #219, #220, #223, #224, #231]
    • Removed the transliterate_unicode() preprocessing function that transliterated
      non-ascii text into a reasonable ascii approximation, for technical and
      philosophical reasons. Also removed its GPL-licensed unidecode dependency,
      for legal-ish reasons. [Issue #203]
    • Added a convention-abiding exclude argument to the function that writes
      spacy docs to disk, to limit which pipeline annotations are serialized.
      This replaces the existing but non-standard include_tensor arg.
    • Deprecated the n_threads argument in Corpus.add_texts(), which had not
      been working in spacy.pipe for some time and, as of v2.1, is defunct.
    • Made many tests model- and python-version agnostic and thus less likely to break
      when spacy releases new and improved models.
    • Auto-formatted the entire code base using black; the results aren't always
      more readable, but they are pleasingly consistent.
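
For code that relied on the removed transliterate_unicode(), a rough standard-library approximation is sketched below. This is not the removed implementation: the helper name to_ascii_approx is hypothetical, and it only strips diacritics through NFKD decomposition. Characters with no ASCII decomposition are dropped, not transliterated.

```python
import unicodedata


def to_ascii_approx(text: str) -> str:
    """Hypothetical helper: approximate non-ASCII text with ASCII.

    Decomposes characters (NFKD) so that accents become separate combining
    marks, then discards everything that cannot be encoded as ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = to_ascii_approx("café naïve señor")
```

Because unmappable characters are silently discarded, this sketch is only reasonable for Latin-script text with diacritics.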

    Fixed:

    • Fixed bad behavior of key_terms_from_semantic_network(), where an error
      would be raised if no suitable key terms could be found; now, an empty list
      is returned instead. [Issue #211]
    • Fixed variable name typo so GroupVectorizer.fit() actually works. [Issue #215]
    • Fixed a minor typo in the quick-start docs. [PR #217]
    • Added a check that filters out any named entities that are entirely whitespace,
      seemingly caused by an issue in spacy.
    • Fixed an undefined variable error when merging spans. [Issue #225]
    • Fixed a unicode/bytes issue in experimental function for deserializing spacy
      docs in "binary" format. [Issue #228, PR #229]
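
The whitespace-entity fix above can be illustrated with a minimal sketch. It assumes entities are represented as plain (text, label) tuples instead of spacy spans, and the function name drop_whitespace_entities is hypothetical; it shows the general idea, not textacy's actual code.

```python
from typing import Iterable, List, Tuple

# Assumed stand-in for entity spans: (text, label) pairs.
Entity = Tuple[str, str]


def drop_whitespace_entities(entities: Iterable[Entity]) -> List[Entity]:
    """Keep only entities whose text contains a non-whitespace character."""
    return [(text, label) for text, label in entities if text.strip()]


entities = [("Paris", "GPE"), ("  \n", "ORG"), ("", "PERSON"), ("New York", "GPE")]
kept = drop_whitespace_entities(entities)
```

With real spacy spans, the same test could presumably be applied to each span's text attribute.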

    Contributors:

    Many thanks to @abevieiramota, @ckot, @Jude188, and @digest0r for their help!