All Versions: 33
Avg Release Cycle: 44 days
Changelog History
v1.4.0 Changes
Impact on extraction and output format:
- better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
- XML: preserve list type as attribute (#229)
- XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
- faster text cleaning and shorter code (#237 with @deedy5, #245)
- metadata: add language when detector is activated (#224)
- metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
- TXT: change markdown formatting of headers by @LaundroMat (#257)
Smaller changes in convenience functions:
- add function to clear caches (#219)
- CLI: change exit code if download fails (#223)
- settings: use "\n" for multiple user agents by @k-sareen (#241)
Updates:
- docs updated (and #244 by @dsgibbons)
- package dependencies updated
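The settings change above (one user agent per line, separated by "\n") can be illustrated with a minimal sketch using only the standard library. The section and key names (`DEFAULT`, `USER_AGENTS`) and the sample user-agent strings are assumptions made for this example, not taken from the release notes. It shows why a newline is a convenient separator: user-agent strings often contain commas themselves.

```python
from configparser import ConfigParser

# Hypothetical settings snippet: one user agent per (indented) line.
SETTINGS = """
[DEFAULT]
USER_AGENTS =
    Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36
"""

config = ConfigParser()
config.read_string(SETTINGS)

# The multi-line value is returned as a single string joined by "\n";
# splitting on "\n" keeps commas inside each user agent intact.
raw_value = config.get("DEFAULT", "USER_AGENTS")
user_agents = [ua.strip() for ua in raw_value.split("\n") if ua.strip()]

for ua in user_agents:
    print(ua)
```

Splitting on commas instead would cut the second entry apart at "KHTML, like Gecko", which is the kind of ambiguity a newline separator avoids.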
-
v1.3.0 Changes
- fast and robust `html2txt()` function added (#221)
- more robust parsing (#228)
- fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
- extraction about 10-20% faster, slightly better recall
- partial fixes for memory leaks (#216)
- docs extended and updated (#217, #225)
- prepared deprecation of old `process_record()` function
- more stable processing with updated dependencies
-
v1.2.2 Changes
- more efficient rules for extraction
- metadata: further attributes used (with @felipehertzer)
- better baseline extraction
- issues fixed: #202, #204, #205
- evaluation updated
-
v1.2.1 Changes
- `--precision` and `--recall` arguments added to the CLI
- better text cleaning: paywalls and comments
- improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
- further bugs fixed: #189, #192 (with @felipehertzer), #200
- efficiency: faster module loading and improved RAM footprint
-
v1.2.0 Changes
- efficiency: replaced module readability-lxml with a trimmed fork
- bugs fixed (#179, #180, #183, #184)
- improved baseline extraction
- cleaner metadata (with @felipehertzer)
-
v1.1.0 Changes
- encodings: better detection, output NFC-normalized Unicode
- maintenance and performance: more efficient code
- bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
- prepare compatibility with upcoming Python 3.11
- changed default settings
- extended documentation
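To illustrate what NFC-normalized output means in practice, here is a minimal sketch with Python's standard `unicodedata` module. The helper name `to_nfc` is hypothetical and this is not the library's own code: it only shows that a decomposed sequence (base letter plus combining accent) is folded into a single composed code point.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 (combining acute accent): 5 code points in total.
decomposed = "Cafe\u0301"
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))
print(composed == "Caf\u00e9")
```

Both strings render identically, but only the normalized one compares equal to text typed with a precomposed "é", which matters for deduplication and string matching downstream.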
-
v1.0.0 Changes
- compress HTML backup files & seamlessly open .gz files
- support JSON web feeds
- graphical user interface integrated into main package
- faster downloads: reviewed backoff, compressed data
- optional modules: downloads with `pycurl`, language identification with `py3langid`
- bugs fixed (#111, #125, #132, #136, #140)
- minor optimizations and fixes by @vbarbaresi in #124 & #130
- fixed handling of arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
- code base refactored with @sourcery-ai (#121), improved and optimized for Python 3.6+
- drop support for Python 3.5
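The first item above (compressed backups that open transparently) can be sketched as follows. This is an illustration using the standard library only, not the package's implementation; the function names `write_backup` and `read_html` and the magic-byte check are assumptions chosen for the example.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def write_backup(path, html: str) -> None:
    """Store an HTML document as a gzip-compressed backup file."""
    Path(path).write_bytes(gzip.compress(html.encode("utf-8")))


def read_html(path) -> str:
    """Read an HTML file, decompressing it first if it is gzip data."""
    raw = Path(path).read_bytes()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")


DOC = "<html><body>Hello</body></html>"

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "page.html"
    packed_path = Path(tmp) / "page.html.gz"
    plain_path.write_text(DOC, encoding="utf-8")
    write_backup(packed_path, DOC)
    # The same reader handles both files without the caller caring which is which.
    from_plain = read_html(plain_path)
    from_gz = read_html(packed_path)
```

Checking the leading bytes rather than the file extension means a renamed or extension-less backup is still read correctly.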
-
v0.9.3 Changes
- better, faster encoding detection: replaced `chardet` with `charset_normalizer`
- faster execution: updated `justext` to 3.0
- better extraction of sub-elements in tables (#78, #90)
- more robust web feed parsing
- further defined precision- and recall-oriented settings
- license extraction in footers (#118)
-
v0.9.2 Changes
- first precision- and recall-oriented presets defined
- improvements in authorship extraction (thanks @felipehertzer)
- requesting TXT output with formatting now results in Markdown format
- bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
- setting for cookies in request headers (thanks @muellermartin)
- better date extraction thanks to htmldate update
-
v0.9.1 Changes
- improved author extraction (thanks @felipehertzer!)
- bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
- docs updated and extended
- CLI: option names normalized (heed deprecation warnings), new option `explore`