All Versions: 24
Avg Release Cycle: 44 days

Changelog History

  • v0.9.1 Changes

    • 👌 improved author extraction (thanks @felipehertzer!)
    • 🐛 bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
    • 📄 docs updated and extended
    • 🗄 CLI: option names normalized (heed deprecation warnings), new option --explore
  • v0.9.0 Changes

    • focused crawling functions including politeness rules
    • more efficient multi-threaded downloads, now also usable as Python functions
    • 📚 documentation extended
    • 🐛 bugs fixed: extraction and URL handling
    • ✂ removed support for Python 3.4
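
The changelog does not spell out what the v0.9.0 "politeness rules" consist of. As a rough illustration of the general idea, and not the library's own implementation, the sketch below uses Python's standard `urllib.robotparser` to decide whether a URL may be fetched and how long to wait between requests. The robots.txt content, the user agent `example-bot`, and the URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without any network access."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)

# Check two illustrative URLs against the parsed rules.
allowed = rules.can_fetch("example-bot", "https://example.org/blog/post.html")
blocked = rules.can_fetch("example-bot", "https://example.org/private/data.html")

# Number of seconds the site asks crawlers to wait between requests, if any.
delay = rules.crawl_delay("example-bot")
```

A crawler following such rules would skip URLs for which `can_fetch` returns `False` and sleep for `delay` seconds between requests to the same host.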
  • v0.8.2 Changes

    • 👍 better handling of formatting, links and images, title type as attribute in XML formats
    • more robust sitemaps and feeds processing
    • more accurate extraction
    • 🛠 further consolidation: code simplified and bugs fixed
  • v0.8.1 Changes

    • 👍 extraction trade-off: slightly better recall
    • 🔧 code robustness: requests, configuration and navigation
    • 🛠 bugfixes: image data extraction
  • v0.8.0 Changes

    • 👌 improved link discovery and handling
    • 🛠 fixes in metadata extraction, feeds and sitemaps processing
    • 💥 breaking change: the extract function now reads target format from output_format argument only
    • 🆕 new extraction option: preserve links, CLI options re-ordered
    • more opportunistic backup extraction
  • v0.7.0 Changes

    • 🔧 customizable configuration file to parametrize extraction and downloads
    • 👍 better handling of feeds and sitemaps
    • ➕ additional CLI options: cryptographic hash for file name, use Internet Archive as backup
    • more precise extraction
    • faster downloads: requests replaced with bare urllib3 and custom decoding
    • 🛠 consolidation: bug fixes and improvements, many thanks to the issues reporters!
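
The entry on using a cryptographic hash for file names does not say which hash function or which input is used. A minimal sketch of the idea, assuming SHA-256 over the extracted text and a hypothetical helper named `hashed_filename`, could look like this:

```python
import hashlib


def hashed_filename(content: str, extension: str = ".txt", length: int = 16) -> str:
    """Derive a stable file name from the content (hypothetical helper)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Truncating keeps names short; shorter names make collisions more likely.
    return digest[:length] + extension


name_a = hashed_filename("First extracted document.")
name_b = hashed_filename("Second extracted document.")
```

Identical content always maps to the same name, so writing the same document twice overwrites the file instead of creating a duplicate.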
  • v0.6.1 Changes

    December 02, 2020
    • ➕ added bare_extraction function returning Python variables
    • 👌 improved link discovery in feeds and sitemaps
    • option to preserve image info
    • 🛠 fixes (many thanks to bug reporters!)
  • v0.6.0 Changes

    November 06, 2020
    • 🔗 link discovery in sitemaps
    • compatibility with Python 3.9
    • extraction coverage improved
    • deduplication now optional
    • 🐛 bug fixes
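
To make "link discovery in sitemaps" more concrete, here is a small standard-library sketch that collects the `<loc>` entries of an XML sitemap. It is an illustration under simple assumptions (a well-formed `urlset` in the standard sitemap namespace), not the library's own code; the sample sitemap and the function name `sitemap_links` are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap content; normally this would be downloaded.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/blog/first-post.html</loc></url>
</urlset>
"""


def sitemap_links(xml_text: str) -> List[str]:
    """Return all URLs listed in <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


links = sitemap_links(SITEMAP_XML)
```

Real-world sitemaps are often malformed, compressed, or nested (sitemap indexes), so production code needs more fallbacks than this sketch shows.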
  • v0.5.2 Changes

    September 22, 2020
    • optional language detector changed: langid → pycld3
    • helper function bare_extraction()
    • 0️⃣ optional deduplication off by default
    • 👍 better URL handling (courlan), more complete metadata
    • code consolidation (cleaner and shorter)
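
The changelog does not describe how deduplication works internally. The sketch below only illustrates what an optional, off-by-default switch could look like, assuming duplicates are repeated text segments compared after whitespace and case normalization. The function name `filter_segments` and its `deduplicate` parameter are hypothetical.

```python
from typing import Iterable, List


def filter_segments(segments: Iterable[str], deduplicate: bool = False) -> List[str]:
    """Return segments in order, dropping repeats only if deduplicate is True."""
    seen = set()
    kept = []
    for segment in segments:
        # Normalize whitespace and case so near-identical repeats match.
        key = " ".join(segment.split()).lower()
        if deduplicate and key in seen:
            continue
        seen.add(key)
        kept.append(segment)
    return kept


paragraphs = [
    "Subscribe to our newsletter.",
    "Actual article text.",
    "Subscribe to our  newsletter.",
]

unchanged = filter_segments(paragraphs)                  # default: keep everything
deduplicated = filter_segments(paragraphs, deduplicate=True)
```

Keeping the switch off by default means no text is silently dropped unless the caller asks for it.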
  • v0.5.1 Changes

    July 15, 2020
    • extended and more convenient command-line options
    • output in JSON format
    • 🐛 bug fixes