All Versions: 33
Avg Release Cycle: 44 days
-
Changelog History
Page 2
-
v0.9.0 Changes
- focused crawling functions including politeness rules
- more efficient multi-threaded downloads + use as Python functions
- documentation extended
- bugs fixed: extraction and URL handling
- removed support for Python 3.4
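The "politeness rules" mentioned for focused crawling can be illustrated with a minimal, standard-library sketch. It is not the library's own implementation: the robots.txt content, the helper names `build_rules`, `is_allowed` and `wait_time`, and the one-second default delay are all assumptions made for this example.

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed offline for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(rules, url, user_agent="*"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return rules.can_fetch(user_agent, url)


def wait_time(rules, user_agent="*", default=1.0):
    """Return the crawl delay in seconds, or an assumed default if none is set."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


rules = build_rules(ROBOTS_TXT)
```

A crawler following this approach would call `is_allowed` before each request and sleep for `wait_time` seconds between requests to the same host.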
-
v0.8.2 Changes
- better handling of formatting, links and images, title type as attribute in XML formats
- more robust sitemaps and feeds processing
- more accurate extraction
- further consolidation: code simplified and bugs fixed
-
v0.8.1 Changes
- extraction trade-off: slightly better recall
- code robustness: requests, configuration and navigation
- bugfixes: image data extraction
-
v0.8.0 Changes
- improved link discovery and handling
- fixes in metadata extraction, feeds and sitemaps processing
- breaking change: the `extract` function now reads the target format from the `output_format` argument only
- new extraction option: preserve links, CLI options re-ordered
- more opportunistic backup extraction
-
v0.7.0 Changes
- customizable configuration file to parametrize extraction and downloads
- better handling of feeds and sitemaps
- additional CLI options: cryptographic hash for file name, use Internet Archive as backup
- more precise extraction
- faster downloads: `requests` replaced with bare `urllib3` and custom decoding
- consolidation: bug fixes and improvements, many thanks to the issues reporters!
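As an illustration of the "cryptographic hash for file name" idea, the sketch below derives a stable file name from document content using only the standard library. The choice of SHA-256, the 16-character truncation, and the helper name `hashed_filename` are assumptions for this example, not a description of what the tool does internally.

```python
import hashlib


def hashed_filename(content, extension=".txt", length=16):
    """Build a file name from a truncated SHA-256 hex digest of the content."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return digest[:length] + extension


# The same content always maps to the same name; different content should differ.
name_a = hashed_filename("first document")
name_b = hashed_filename("second document")
```

Naming files this way avoids collisions between pages with identical titles and makes repeated runs over the same content overwrite rather than duplicate.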
-
v0.6.1 Changes
December 02, 2020
- added `bare_extraction` function returning Python variables
- improved link discovery in feeds and sitemaps
- option to preserve image info
- fixes (many thanks to bug reporters!)
-
v0.6.0 Changes
November 06, 2020
- link discovery in sitemaps
- compatibility with Python 3.9
- extraction coverage improved
- deduplication now optional
- bug fixes
-
v0.5.2 Changes
September 22, 2020
- optional language detector changed: `langid` → `pycld3`
- helper function `bare_extraction()`
- optional deduplication off by default
- better URL handling (`courlan`), more complete metadata
- code consolidation (cleaner and shorter)
-
v0.5.1 Changes
July 15, 2020
- extended and more convenient command-line options
- output in JSON format
- bug fixes
-
v0.5.0 Changes
June 02, 2020
- faster and more robust text and metadata extraction
- more efficient batch processing (parallel processing, URL queues)
- extraction and processing of ATOM/RSS feeds
- complete command-line tool with corresponding options
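To give a rough idea of what "extraction and processing of ATOM/RSS feeds" involves, here is a minimal standard-library sketch that collects entry links from either format. The sample feeds, the example.org URLs, and the helper name `feed_links` are invented for illustration; the library's actual feed handling is likely more thorough.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample feeds used only for this illustration.
RSS_SAMPLE = """<rss version="2.0"><channel>
<title>Example</title>
<item><title>One</title><link>https://example.org/one</link></item>
<item><title>Two</title><link> https://example.org/two </link></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
<title>Example</title>
<entry><title>Three</title><link href="https://example.org/three"/></entry>
</feed>"""


def feed_links(feed_xml):
    """Return the entry links found in an RSS 2.0 or Atom feed, in document order."""
    root = ET.fromstring(feed_xml)
    links = []
    # RSS 2.0 stores the URL as the text of <item><link>.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    # Atom stores the URL in the href attribute of <entry><link>.
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href:
                links.append(href)
    return links
```

The resulting list of URLs could then be fed into a download queue for batch processing.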