
Changelog History

  • v0.9.0 Changes

    • focused crawling functions including politeness rules
    • more efficient multi-threaded downloads, also usable as Python functions
    • documentation extended
    • bugs fixed: extraction and URL handling
    • removed support for Python 3.4
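    The changelog does not spell out the politeness rules introduced in v0.9.0. As a rough illustration of the general idea only, the sketch below uses Python's standard urllib.robotparser to honor robots.txt rules and a crawl delay. The names load_rules and polite_wait, the user agent string, the sample robots.txt and the default delay are assumptions made for this example, not part of the package's API.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_wait(parser: RobotFileParser,
                user_agent: str = "example-bot",
                default_delay: float = 2.0) -> float:
    """Return the number of seconds to wait between requests to one host."""
    delay = parser.crawl_delay(user_agent)
    # Fall back to an assumed default when the site states no Crawl-delay.
    return float(delay) if delay is not None else default_delay


rules = load_rules(ROBOTS_TXT)
```

    A crawler built along these lines would call rules.can_fetch(user_agent, url) before each request and sleep for polite_wait(rules) seconds between requests to the same host.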
  • v0.8.2 Changes

    • better handling of formatting, links and images; title type as attribute in XML formats
    • more robust sitemaps and feeds processing
    • more accurate extraction
    • further consolidation: code simplified and bugs fixed
  • v0.8.1 Changes

    • extraction trade-off: slightly better recall
    • code robustness: requests, configuration and navigation
    • bugfixes: image data extraction
  • v0.8.0 Changes

    • improved link discovery and handling
    • fixes in metadata extraction, feeds and sitemaps processing
    • breaking change: the extract function now reads the target format from the output_format argument only
    • new extraction option: preserve links; CLI options re-ordered
    • more opportunistic backup extraction
  • v0.7.0 Changes

    • customizable configuration file to parametrize extraction and downloads
    • better handling of feeds and sitemaps
    • additional CLI options: cryptographic hash for file name, use Internet Archive as backup
    • more precise extraction
    • faster downloads: requests replaced with bare urllib3 and custom decoding
    • consolidation: bug fixes and improvements, many thanks to the issue reporters!
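    The changelog does not say which hash algorithm the file-name option uses or what exactly is hashed. Purely to illustrate the idea of a content-derived file name, the following sketch hashes a document's text with SHA-256 from the standard library. The function name hashed_filename, the truncation length and the .txt extension are assumptions for this example.

```python
import hashlib


def hashed_filename(content: str, extension: str = ".txt", length: int = 16) -> str:
    """Derive a stable file name from the content itself.

    Identical content always maps to the same name, so re-running a
    batch job overwrites a file instead of creating a duplicate.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Truncating keeps names short; a longer prefix lowers collision risk.
    return digest[:length] + extension


name = hashed_filename("Some extracted article text")
```

    Compared with names derived from URLs or titles, hashed names avoid characters that are invalid on some file systems, at the cost of being unreadable to humans.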
  • v0.6.1 Changes

    December 02, 2020
    • added bare_extraction function returning Python variables
    • improved link discovery in feeds and sitemaps
    • option to preserve image info
    • fixes (many thanks to bug reporters!)
  • v0.6.0 Changes

    November 06, 2020
    • link discovery in sitemaps
    • compatibility with Python 3.9
    • extraction coverage improved
    • deduplication now optional
    • bug fixes
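    To give an idea of what link discovery in sitemaps involves, here is a minimal standard-library sketch that collects the loc entries of an XML sitemap following the sitemaps.org schema. It is not the package's own implementation; the function name sitemap_links and the sample document are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap used for the example.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc> https://example.org/post-1 </loc></url>
</urlset>"""


def sitemap_links(xml_text: str) -> list:
    """Return every <loc> URL found in a sitemap, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    links = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        # Skip empty elements and trim stray whitespace around the URL.
        if loc.text and loc.text.strip():
            links.append(loc.text.strip())
    return links


links = sitemap_links(SAMPLE)
```

    Because sitemap index files use the same namespace and also wrap their targets in loc elements, the same function returns the nested sitemap URLs when given an index, which a crawler could then fetch in turn.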
  • v0.5.2 Changes

    September 22, 2020
    • optional language detector changed: langid → pycld3
    • helper function bare_extraction()
    • optional deduplication off by default
    • better URL handling (courlan), more complete metadata
    • code consolidation (cleaner and shorter)
  • v0.5.1 Changes

    July 15, 2020
    • extended and more convenient command-line options
    • output in JSON format
    • bug fixes
  • v0.5.0 Changes

    June 02, 2020
    • faster and more robust text and metadata extraction
    • more efficient batch processing (parallel processing, URL queues)
    • extraction and processing of Atom/RSS feeds
    • complete command-line tool with corresponding options