trafilatura v1.4.0 Release Notes

    Impact on extraction and output format:

    • better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
    • XML: preserve list type as attribute (#229)
    • XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
    • faster text cleaning and shorter code (#237 with @deedy5, #245)
    • metadata: add language when detector is activated (#224)
    • metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
    • TXT: change markdown formatting of headers by @LaundroMat (#257)

    Smaller changes in convenience functions:

    • add function to clear caches (#219)
    • CLI: change exit code if download fails (#223)
    • settings: use "\n" for multiple user agents by @k-sareen (#241)
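
As a rough illustration of the user-agent setting above, the sketch below reads a newline-separated value with the standard library's configparser and splits it into a list. It is not taken from the trafilatura code base: the USER_AGENTS key under [DEFAULT], the helper name parse_user_agents and the sample agent strings are all assumptions made for this example.

```python
import configparser

# Hypothetical settings content. The USER_AGENTS key under [DEFAULT] and the
# agent strings are assumptions for illustration, one agent per indented line.
SETTINGS = """
[DEFAULT]
USER_AGENTS =
    Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
"""


def parse_user_agents(text):
    """Return the user agents listed one per line, or an empty list."""
    config = configparser.ConfigParser()
    config.read_string(text)
    raw = config.get("DEFAULT", "USER_AGENTS", fallback="")
    # Split on "\n" and drop blank entries and surrounding whitespace.
    return [line.strip() for line in raw.split("\n") if line.strip()]


agents = parse_user_agents(SETTINGS)
```

Splitting on "\n" rather than on commas or spaces keeps each user-agent string intact, since such strings routinely contain spaces, commas and semicolons themselves.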

    Updates:

    • docs updated (including #244 by @dsgibbons)
    • package dependencies updated