Changelog History

  • v1.4.0 Changes

    Impact on extraction and output format:

    • better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
    • XML: preserve list type as attribute (#229)
    • XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
    • faster text cleaning and shorter code (#237 with @deedy5, #245)
    • metadata: add language when detector is activated (#224)
    • metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
    • TXT: change markdown formatting of headers by @LaundroMat (#257)

    Smaller changes in convenience functions:

    • add function to clear caches (#219)
    • CLI: change exit code if download fails (#223)
    • settings: use "\n" for multiple user agents by @k-sareen (#241)

    Updates:

    • docs updated (and #244 by @dsgibbons)
    • package dependencies updated
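
The settings change in #241 (one user agent per line) can be illustrated with a minimal sketch using only the standard library. The section name, the `USER_AGENTS` key, the example agent strings and the helper `load_user_agents` are assumptions made for illustration, not necessarily how the package reads its settings.

```python
import configparser

# Hypothetical settings text: one user agent per line, separated by "\n".
SETTINGS = """
[DEFAULT]
USER_AGENTS =
    ExampleBot/1.0 (+https://example.org/bot)
    ExampleBot/2.0 (+https://example.org/bot)
"""


def load_user_agents(cfg_text):
    """Return the list of user agents found in a settings string."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    raw = parser.get("DEFAULT", "USER_AGENTS", fallback="")
    # configparser joins continuation lines with "\n"; drop empty entries.
    return [line.strip() for line in raw.split("\n") if line.strip()]


agents = load_user_agents(SETTINGS)
```

Splitting on "\n" rather than on commas or spaces keeps user agent strings intact, since they commonly contain both.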
  • v1.3.0 Changes

    • fast and robust html2txt() function added (#221)
    • more robust parsing (#228)
    • fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
    • extraction about 10-20% faster, slightly better recall
    • partial fixes for memory leaks (#216)
    • docs extended and updated (#217, #225)
    • prepared deprecation of old process_record() function
    • more stable processing with updated dependencies
  • v1.2.2 Changes

    • more efficient rules for extraction
    • metadata: further attributes used (with @felipehertzer)
    • better baseline extraction
    • issues fixed: #202, #204, #205
    • evaluation updated
  • v1.2.1 Changes

    • --precision and --recall arguments added to the CLI
    • better text cleaning: paywalls and comments
    • improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
    • further bugs fixed: #189, #192 (with @felipehertzer), #200
    • efficiency: faster module loading and improved RAM footprint
  • v1.2.0 Changes

    • efficiency: replaced the readability-lxml module with a trimmed fork
    • bugs fixed (#179, #180, #183, #184)
    • improved baseline extraction
    • cleaner metadata (with @felipehertzer)
  • v1.1.0 Changes

    • encodings: better detection, output NFC-normalized Unicode
    • maintenance and performance: more efficient code
    • bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
    • prepare compatibility with upcoming Python 3.11
    • changed default settings
    • extended documentation
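
To show what NFC-normalized output means in practice, here is a minimal sketch using the standard library's `unicodedata` module. The helper name `to_nfc` is hypothetical, and this is not presented as the package's own code.

```python
import unicodedata


def to_nfc(text):
    """Compose characters into their canonical composed (NFC) form."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent (two code points) ...
decomposed = "Cafe\u0301"
# ... becomes the single precomposed code point U+00E9.
composed = to_nfc(decomposed)
```

Both strings render identically, but only the normalized one compares equal to text typed with the precomposed character, which matters for deduplication and search.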
  • v1.0.0 Changes

    • compress HTML backup files & seamlessly open .gz files
    • support JSON web feeds
    • graphical user interface integrated into main package
    • faster downloads: reviewed backoff, compressed data
    • optional modules: downloads with pycurl, language identification with py3langid
    • bugs fixed (#111, #125, #132, #136, #140)
    • minor optimizations and fixes by @vbarbaresi in #124 & #130
    • fixed handling of arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
    • code base refactored with @sourcery-ai (#121), improved and optimized for Python 3.6+
    • drop support for Python 3.5
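
One common way to open compressed and uncompressed HTML backups through the same code path is to check for the gzip magic bytes. The sketch below assumes UTF-8 content and uses a hypothetical helper `read_html`; the package's actual approach may differ.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def read_html(data):
    """Decode raw bytes to text, decompressing first if they are gzipped."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    # Assumption: the backup is UTF-8; undecodable bytes are replaced.
    return data.decode("utf-8", errors="replace")


html = "<html><body><p>Hello</p></body></html>"
from_plain = read_html(html.encode("utf-8"))
from_gzip = read_html(gzip.compress(html.encode("utf-8")))
```

Checking the content rather than the file extension also handles files that were compressed but not renamed.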
  • v0.9.3 Changes

    • better, faster encoding detection: replaced chardet with charset_normalizer
    • faster execution: updated justext to 3.0
    • better extraction of sub-elements in tables (#78, #90)
    • more robust web feed parsing
    • further defined precision- and recall-oriented settings
    • license extraction in footers (#118)
  • v0.9.2 Changes

    • first precision- and recall-oriented presets defined
    • improvements in authorship extraction (thanks @felipehertzer)
    • requesting TXT output with formatting now results in Markdown format
    • bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
    • setting for cookies in request headers (thanks @muellermartin)
    • better date extraction thanks to htmldate update
  • v0.9.1 Changes

    • improved author extraction (thanks @felipehertzer!)
    • bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
    • docs updated and extended
    • CLI: option names normalized (heed deprecation warnings), new option explore