Changelog History

  • v3.1.6 Changes

    October 20, 2018
    • Improved handling of page encoding; see PR #92
    • Improved author and published date extraction; see PR #93. Thanks @timoilya!
    • Added additional schema extractors for the schema.org parser; see PR #89
    • Allow pulling more than the first og:type data for Open Graph; see PR #90
  • v3.1.5 Changes

    September 11, 2018
  • v3.1.4 Changes

    August 19, 2018
    • Fix IndexError when the title has only a title splitter or is the site name; see issue #59. Thanks @dlrobertson!
    • Retry the calculate_top_node function with the root node if the first pass failed to find an article, which may occur if one or more known article patterns are found but none contain content; see PR #66. Thanks @dlrobertson!
    • Add parsing of schema.org's ReportageNewsArticle tags; see PR #67. Thanks @dlrobertson!
    • Add additional parsing of Open Graph tags; see PR #64. Thanks @dlrobertson!
  • v3.1.3 Changes

    July 07, 2018
    • Parse headers and include them in cleaned_text
    • Additional configuration options:
      • Parse Headers: parse_headers
      • Parse Lists: parse_lists
      • Pretty Lists: pretty_lists
    • Catch a mismatch between the encoding meta tag and the document encoding; see pull request #53. Thanks @jeffquach!
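
The parse_headers, parse_lists, and pretty_lists options listed for v3.1.3 can be collected in a plain dict. This is a minimal sketch, not the library's documented recipe: `build_text_config` is a hypothetical helper, the values are illustrative, and passing a dict to `Goose` is assumed to work as it does in the goose3 README. The goose3 import is guarded so the block still runs where the package is not installed.

```python
def build_text_config(parse_headers=True, parse_lists=True, pretty_lists=True):
    """Return a dict holding the text-extraction options named in this changelog.

    Hypothetical helper: only the three option names come from the changelog.
    """
    return {
        "parse_headers": parse_headers,  # include headers in cleaned_text
        "parse_lists": parse_lists,      # parse lists found in the main article
        "pretty_lists": pretty_lists,    # make lists read like lists, not flat text
    }


if __name__ == "__main__":
    config = build_text_config(pretty_lists=False)
    print(config)

    try:
        from goose3 import Goose  # optional third-party dependency
    except ImportError:
        Goose = None

    if Goose is not None:
        # Assumption: Goose accepts a dict of configuration option names.
        with Goose(config) as g:
            print(type(g).__name__)
```

With goose3 installed, the same dict could be handed to `Goose(...)` before calling `extract`; whether each option changes the output depends on the installed version.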
  • v3.1.2 Changes

    June 02, 2018
    • Parse lists out if present in the main article
    • Added configuration option pretty_lists to specify whether a list should be represented as plain text or made to read like a list; default is True
  • v3.1.1 Changes

    May 29, 2018
    • Catch additional PIL exceptions when attempting to read images; see #42
    • Better meta processing of Open Graph tags for use as keys in returned data; see #45
  • v3.1.0 Changes

    April 03, 2018
    • Changed configuration to not pull images by default; see issue #31
    • Update get_encodings_from_content to return a string and remove trailing spaces; see PR #35
    • Remove infinite recursion on parser selection; see PR #39
    • Document video and image classes
    • Re-add remaining image tests
  • v3.0.9 Changes

    January 12, 2018
    • Add soup as a parser option to use lxml.html.soupparser; see issue #27
    • Fix an issue with passing the requests session object to the crawler
    • Pylint changes
      • Added pylintrc file
      • Updated variable and positional argument names to be more pythonic
      • Fixed line continuation issues
      • Updated variable names when ambiguous
      • Cleaned up class and static methods
  • v3.0.8 Changes

    December 09, 2017
    • Fix using a different requests session for each URL fetched
      • Added close method to the Goose object
    • Allow the Goose object to be a context manager:

          from goose3 import Goose

          with Goose() as g:
              g.extract(url='some-url-here')

      NOTE: No need to change existing code, as it will attempt to automatically close the connection on garbage collection
    • Configuration object changes
      • Better handling of the known_context_patterns configuration
      • Added http_headers configuration option to be passed to requests
      • Added http_proxies configuration option to be passed to requests
      • Added http_auth configuration option to be passed to requests
    • Fix base64 image parsing; see issue #7
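
The http_headers, http_proxies, and http_auth options added in v3.0.8 are described as being passed to requests. The sketch below assumes they are forwarded unchanged and therefore take the shapes requests expects: a header dict, a scheme-to-proxy dict, and a (user, password) tuple. `build_http_config` and every value shown are hypothetical. The goose3 import is guarded so the block still runs without the package.

```python
def build_http_config(user_agent, proxy_url=None, auth=None):
    """Build a dict of the request-related options named in the v3.0.8 notes.

    Hypothetical helper: the option names come from the changelog, while the
    value shapes follow what the requests library accepts.
    """
    config = {"http_headers": {"User-Agent": user_agent}}
    if proxy_url is not None:
        # requests maps a URL scheme to the proxy used for that scheme
        config["http_proxies"] = {"http": proxy_url, "https": proxy_url}
    if auth is not None:
        # requests accepts a (username, password) tuple for basic auth
        config["http_auth"] = tuple(auth)
    return config


if __name__ == "__main__":
    config = build_http_config(
        "example-agent/1.0",                      # illustrative value
        proxy_url="http://proxy.example:8080",    # illustrative value
        auth=("user", "secret"),                  # illustrative value
    )
    print(config)

    try:
        from goose3 import Goose  # optional third-party dependency
    except ImportError:
        Goose = None

    if Goose is not None:
        # Leaving the with-block closes the Goose object's session,
        # using the context-manager support noted for this release.
        with Goose(config) as g:
            print(type(g).__name__)
```

Building the dict separately keeps the request settings easy to inspect before any network call is made.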
  • v3.0.7 Changes

    November 23, 2017
    • Fix installation issue
      • Removed unused/broken regex
      • Include all necessary files
      • Fix failed tests (most)
    • Resolved relative URL issue; see issue #21
    • Resolved temporary files not being properly removed; see issue #18
    • Removed unused dependencies and code to support Python 2; see issue #16
    • Fix error when using the configuration object to configure goose; see issue #14