Popularity: 6.5 (Growing)

Activity: 8.7 (Growing)

Stars: 2,778

Watchers: 28

Forks: 210

Last Commit: 3 days ago

Description

Trafilatura is a Python package and command-line tool that downloads, parses, and scrapes web pages: it extracts metadata, main body text, and comments while preserving parts of the text formatting and page structure. The output can be converted to different formats.
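As a rough illustration of the download-then-extract workflow described above, the sketch below wraps Trafilatura's `fetch_url` and `extract` functions. The helper name `extract_page` is hypothetical, and the import sits inside the function only so that the snippet loads on machines where Trafilatura is not installed; treat it as a minimal sketch rather than the recommended usage.

```python
def extract_page(url, output_format="txt"):
    """Download a page and return its main content, or None on failure.

    Hypothetical helper: the name and structure are illustrative only.
    """
    # Imported lazily so this sketch can be loaded without Trafilatura installed.
    from trafilatura import fetch_url, extract

    # fetch_url returns the page's HTML as a string, or None if the download failed.
    downloaded = fetch_url(url)
    if downloaded is None:
        return None

    # include_comments controls whether the comment section is extracted;
    # output_format selects the serialization (for example "txt", "json" or "xml").
    return extract(downloaded, output_format=output_format, include_comments=True)
```

Calling `extract_page("https://example.org/")` (a placeholder URL) would return the extracted text, or `None` if nothing could be downloaded or extracted.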

Distinguishing between a whole page and the page's essential parts can help to alleviate many quality problems related to web text processing, by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.).
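To make the idea of separating a page's essential parts from recurring elements concrete, here is a toy example built only on Python's standard library. It simply skips text inside a fixed set of tags; the tag list, class name, and sample HTML are assumptions made for illustration, and this is not the algorithm Trafilatura uses.

```python
from html.parser import HTMLParser

# Elements treated as boilerplate in this toy example
# (an assumption for illustration, not Trafilatura's rule set).
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def main_text(html):
    """Return the non-boilerplate text of an HTML string, one chunk per line."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


SAMPLE = """
<html><body>
<header><a href="/">Home</a> | <a href="/about">About</a></header>
<article><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></article>
<footer>Copyright notice</footer>
</body></html>
"""

print(main_text(SAMPLE))
```

On the sample page, the navigation links and the footer are dropped and only the article's heading and paragraphs remain. Real pages rarely mark up boilerplate this cleanly, which is why dedicated extractors rely on more elaborate heuristics.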

The extractor aims to be precise enough that it neither misses text nor discards valid documents. It must also be robust and reasonably fast. With these objectives in mind, Trafilatura is designed to run in production on millions of web documents.

Programming language: Python

License: Apache License 2.0

Tags: Text Processing, Markdown, HTTP, Web Crawling, Web Content Extracting, Security, HTML, Scientific, Engineering, Information Analysis, Utilities, Internet, WWW, Markup, Linguistic, XML, Text Editors, Web Scraping, Scraping

Latest version: v1.4.0

trafilatura alternatives and similar packages

Based on the "Web Content Extracting" category.

TWINT

9.4 0.0 trafilatura VS TWINT

DISCONTINUED. An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
newspaper

9.3 0.0 L3 trafilatura VS newspaper

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.

python-goose

7.9 0.0 trafilatura VS python-goose

HTML Content / Article Extractor, web scraping lib in Python
textract

7.6 3.7 trafilatura VS textract

extract text from any document. no muss. no fuss.
sumy

7.4 6.7 L5 trafilatura VS sumy

Module for automatic summarization of text documents and HTML pages.
toapi

7.1 0.0 trafilatura VS toapi

Every web site provides APIs.
python-readability

6.7 3.4 trafilatura VS python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
html2text

5.7 6.1 L1 trafilatura VS html2text

Convert HTML to Markdown-formatted text.
Goose3

4.2 6.4 trafilatura VS Goose3

A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
micawber

3.9 4.8 L5 trafilatura VS micawber

a small library for extracting rich content from urls
lassie

3.7 0.0 L4 trafilatura VS lassie

Web Content Retrieval for Humans™
opengraph

3.0 0.0 L5 trafilatura VS opengraph

A python module to parse the Open Graph Protocol
inscriptis

2.6 8.5 trafilatura VS inscriptis

A python based HTML to text conversion library, command line client and Web service.
Haul

2.5 0.0 L5 trafilatura VS Haul

An Extensible Image Crawler
htmldate

2.0 7.6 trafilatura VS htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
sanitize

1.5 0.0 L4 trafilatura VS sanitize

Bringing sanity to world of messed-up data
JSONPATH

1.0 5.7 trafilatura VS JSONPATH

A query expression for extracting data from JSON.
Data Extractor

0.9 6.0 trafilatura VS Data Extractor

Combine XPath, CSS Selectors and JSONPath for Web data extracting.