Description
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
Usage: html2text [(filename|url) [encoding]]
html2text alternatives and similar packages
Based on the "Web Content Extracting" category.
TWINT
An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
newspaper
News, full-text, and article metadata extraction in Python 3.
python-goose
HTML content / article extractor, web scraping library in Python.
sumy
Module for automatic summarization of text documents and HTML pages.
python-readability
Fast Python port of arc90's readability tool, updated to match the latest readability.js.
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments.
Goose3
A Python 3 compatible version of goose. http://goose3.readthedocs.io/en/latest/index.html
inscriptis
A Python-based HTML to text conversion library, command-line client and Web service.
htmldate
Fast and robust date extraction from web pages, with Python or on the command line.
Data Extractor
Combine XPath, CSS Selectors and JSONPath for Web data extracting.
README
html2text
Usage: html2text [filename [encoding]]
Option | Description |
---|---|
--version | Show program's version number and exit |
-h, --help | Show this help message and exit |
--ignore-links | Don't include any formatting for links |
--escape-all | Escape all special characters. Output is less readable, but avoids corner case formatting issues. |
--reference-links | Use reference links instead of inline links to create Markdown |
--mark-code | Mark preformatted and code blocks with [code]...[/code] |
For a complete list of options see the docs
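The command-line usage above can also be driven from a script. The following is a minimal sketch, assuming the `html2text` command is on your PATH (for example after installing the package with pip). `build_command` and `convert_file` are hypothetical helper names for illustration, not part of html2text; only the flags listed in the table above are used.

```python
import subprocess


def build_command(filename, encoding=None, ignore_links=False, mark_code=False):
    """Build an argv list following `html2text [filename [encoding]]`."""
    cmd = ["html2text"]
    # Flags are taken from the options table in the README.
    if ignore_links:
        cmd.append("--ignore-links")
    if mark_code:
        cmd.append("--mark-code")
    cmd.append(filename)
    # The encoding is an optional second positional argument.
    if encoding is not None:
        cmd.append(encoding)
    return cmd


def convert_file(filename, **options):
    """Run html2text on a file and return whatever it prints to stdout."""
    result = subprocess.run(
        build_command(filename, **options),
        capture_output=True,
        text=True,   # decode stdout using the locale's encoding
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Keeping the argument list in its own function makes it easy to inspect or log the exact command before running it, for instance `build_command("page.html", encoding="utf-8", ignore_links=True)`.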
Or you can use it from within Python:
>>> import html2text
>>>
>>> print(html2text.html2text("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
**Zed's** dead baby, _Zed's_ dead.
Or with some configuration options:
>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, world!
>>> # Don't ignore links anymore, I like links
>>> h.ignore_links = False
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, [world](https://www.google.com/earth/)!
Originally written by Aaron Swartz. This code is distributed under the GPLv3.
How to install
html2text is available on PyPI: https://pypi.org/project/html2text/
$ pip install html2text
How to run unit tests
$ tox
To see the coverage results:
$ coverage html
Then open the ./htmlcov/index.html file in your browser.
Documentation
Documentation lives here