Popularity

5.7

Growing

Activity

6.3

Stars 1,647

Watchers 26

Forks 258

Last Commit about 2 months ago

Description

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

Usage: html2text [(filename|url) [encoding]]

Code Quality Rank: L1

Programming language: Python

License: GNU General Public License v3.0 only

Tags: Web Content Extracting

Latest version: v2020.1.16

html2text alternatives and similar packages

Based on the "Web Content Extracting" category.

TWINT

9.4 0.0 html2text VS TWINT

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
newspaper

9.3 0.0 L3 html2text VS newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3.

python-goose

7.9 0.0 html2text VS python-goose

HTML Content / Article Extractor, web scraping lib in Python
textract

7.6 3.7 html2text VS textract

extract text from any document. no muss. no fuss.
sumy

7.4 6.7 L5 html2text VS sumy

Module for automatic summarization of text documents and HTML pages.
toapi

7.1 0.0 html2text VS toapi

Every web site provides APIs.
python-readability

6.7 3.4 html2text VS python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
trafilatura

6.5 8.4 html2text VS trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Goose3

4.2 6.6 html2text VS Goose3

A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
micawber

3.9 4.8 L5 html2text VS micawber

a small library for extracting rich content from urls
lassie

3.7 0.0 L4 html2text VS lassie

Web Content Retrieval for Humans™
opengraph

3.0 0.0 L5 html2text VS opengraph

A python module to parse the Open Graph Protocol
inscriptis -- HTML to text conversion library, command line client and Web service

2.6 8.5 html2text VS inscriptis -- HTML to text conversion library, command line client and Web service

A python based HTML to text conversion library, command line client and Web service.
Haul

2.5 0.0 L5 html2text VS Haul

An Extensible Image Crawler
htmldate

2.0 7.6 html2text VS htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
sanitize

1.5 0.0 L4 html2text VS sanitize

Bringing sanity to world of messed-up data
JSONPATH

1.0 5.7 html2text VS JSONPATH

A query expression for extracting data from JSON.
Data Extractor

0.9 6.0 html2text VS Data Extractor

Combine XPath, CSS Selectors and JSONPath for Web data extracting.

* Code Quality Rankings and insights are calculated and provided by Lumnify.
They vary from L1 to L5 with "L5" being the highest.

README

html2text

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

Usage: html2text [filename [encoding]]

Option	Description
`--version`	Show program's version number and exit
`-h`, `--help`	Show this help message and exit
`--ignore-links`	Don't include any formatting for links
`--escape-all`	Escape all special characters. Output is less readable, but avoids corner case formatting issues.
`--reference-links`	Use reference links instead of links to create markdown
`--mark-code`	Mark preformatted and code blocks with [code]...[/code]

For a complete list of options, see the docs.

Or you can use it from within Python:

>>> import html2text
>>>
>>> print(html2text.html2text("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
**Zed's** dead baby, _Zed's_ dead.

Or with some configuration options:

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, world!

>>> # Don't Ignore links anymore, I like links
>>> h.ignore_links = False
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, [world](https://www.google.com/earth/)!

Originally written by Aaron Swartz. This code is distributed under the GPLv3.

How to install

html2text is available on PyPI: https://pypi.org/project/html2text/

$ pip install html2text

How to run unit tests

tox

To see the coverage results:

coverage html

then open the ./htmlcov/index.html file in your browser.

Documentation

Documentation lives here

*Note that all licence references and agreements mentioned in the html2text README section above are relevant to that project's source code only.

html2text

Convert HTML to Markdown-formatted text.
