All Versions: 30
Avg Release Cycle: 44 days
Changelog History
Page 1
-
v1.2.1 Changes
- --precision and --recall arguments added to the CLI
- better text cleaning: paywalls and comments
- improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
- further bugs fixed: #189, #192 (with @felipehertzer), #200
- efficiency: faster module loading and improved RAM footprint
-
v1.2.0 Changes
- efficiency: replaced the readability-lxml module with a trimmed fork
- bugs fixed (#179, #180, #183, #184)
- improved baseline extraction
- cleaner metadata (with @felipehertzer)
-
v1.1.0 Changes
- encodings: better detection, output NFC-normalized Unicode
- maintenance and performance: more efficient code
- bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
- prepare compatibility with upcoming Python 3.11
- changed default settings
- extended documentation
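As an illustration of the NFC-normalized output mentioned in the v1.1.0 notes, the sketch below uses only Python's standard `unicodedata` module. The helper name `to_nfc` is hypothetical and this is not the library's own code; it only shows what NFC normalization does to a decomposed string.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by a combining acute accent: five code points in total
decomposed = "cafe\u0301"
# After NFC, the pair is merged into the single precomposed character U+00E9
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))  # 5 4
print(composed == "caf\u00e9")         # True
```

Normalizing to one form means that visually identical strings also compare equal, which matters for deduplication and search over extracted text.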
-
v1.0.0 Changes
- compress HTML backup files & seamlessly open .gz files
- support JSON web feeds
- graphical user interface integrated into main package
- faster downloads: reviewed backoff, compressed data
- optional modules: downloads with pycurl, language identification with py3langid
- bugs fixed (#111, #125, #132, #136, #140)
- minor optimizations and fixes by @vbarbaresi in #124 & #130
- fixed arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
- code base refactored with @sourcery-ai #121, improved and optimized for Python 3.6+
- dropped support for Python 3.5
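The first v1.0.0 item (compressed HTML backups that open transparently) can be sketched with the standard library alone. The function names `write_backup` and `read_html` and the constant `GZIP_MAGIC` are hypothetical, and the approach shown (checking the two-byte gzip signature) is one possible way to do it, not necessarily how the package implements it.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def write_backup(path: Path, html: str) -> None:
    """Store an HTML document as a gzip-compressed UTF-8 file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(html)


def read_html(path: Path) -> str:
    """Read an HTML file, decompressing it first if it is gzipped."""
    raw = path.read_bytes()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


# Round trip: write a compressed backup, then read it back as text
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "page.html.gz"
    write_backup(backup, "<html><body>Hello</body></html>")
    print(read_html(backup))
```

Checking the file signature rather than the extension lets the same reader handle both plain and compressed files, even when a file has been renamed.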
-
v0.9.3 Changes
- better, faster encoding detection: replaced chardet with charset_normalizer
- faster execution: updated justext to 3.0
- better extraction of sub-elements in tables (#78, #90)
- more robust web feed parsing
- further defined precision- and recall-oriented settings
- license extraction in footers (#118)
-
v0.9.2 Changes
- first precision- and recall-oriented presets defined
- improvements in authorship extraction (thanks @felipehertzer)
- requesting TXT output with formatting now results in Markdown format
- bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
- setting for cookies in request headers (thanks @muellermartin)
- better date extraction thanks to htmldate update
-
v0.9.1 Changes
- improved author extraction (thanks @felipehertzer!)
- bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
- docs updated and extended
- CLI: option names normalized (heed deprecation warnings), new option explore
-
v0.9.0 Changes
- focused crawling functions including politeness rules
- more efficient multi-threaded downloads + use as Python functions
- documentation extended
- bugs fixed: extraction and URL handling
- removed support for Python 3.4
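To make "politeness rules" in the v0.9.0 notes concrete, here is a minimal sketch based on the standard `urllib.robotparser` module. The robots.txt content, the URLs, and the helpers `build_rules` and `filter_polite` are all hypothetical examples; the package's own crawler may apply different or additional rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_polite(urls, rules, user_agent="*", default_delay=1.0):
    """Return the URLs the rules allow and the delay to wait between requests."""
    delay = rules.crawl_delay(user_agent) or default_delay
    allowed = [url for url in urls if rules.can_fetch(user_agent, url)]
    return allowed, delay


rules = build_rules(ROBOTS_TXT)
candidates = [
    "https://example.org/article",
    "https://example.org/private/draft",
]
allowed, delay = filter_polite(candidates, rules)
print(allowed)  # ['https://example.org/article']
print(delay)    # 2
```

A crawler would then sleep for `delay` seconds between requests to the same host, so that disallowed paths are skipped and the server is not overloaded.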
-
v0.8.2 Changes
- better handling of formatting, links and images, title type as attribute in XML formats
- more robust sitemaps and feeds processing
- more accurate extraction
- further consolidation: code simplified and bugs fixed
-
v0.8.1 Changes
- extraction trade-off: slightly better recall
- code robustness: requests, configuration and navigation
- bugfixes: image data extraction