Description
Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure. The output can be converted to different formats.
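As a rough illustration of the package side, the sketch below calls `trafilatura.fetch_url` and `trafilatura.extract`. The helper names `extract_main_text` and `download_and_extract`, the sample HTML, and the guarded import are assumptions made for this example, not part of the library; the package itself would need to be installed (for instance with `pip install trafilatura`).

```python
# Minimal usage sketch; helper names and the sample page are hypothetical.
try:
    import trafilatura
except ImportError:  # lets the sketch load even where the package is missing
    trafilatura = None

SAMPLE_HTML = """
<html><head><title>Example post</title></head><body>
<header>Site name | Home | About</header>
<article><h1>Example post</h1>
<p>This paragraph stands in for the main body text of a web page.</p>
<p>A second paragraph gives the extractor a little more content to keep.</p>
</article>
<footer>Copyright notice and footer links</footer>
</body></html>
"""

def extract_main_text(html, output_format="txt"):
    """Return the extracted main content, or None if nothing was extracted."""
    if trafilatura is None:
        return None
    # extract() returns a string in the requested format, or None on failure
    return trafilatura.extract(
        html, output_format=output_format, include_comments=False
    )

def download_and_extract(url):
    """Download a page, then extract its main text (requires network access)."""
    if trafilatura is None:
        return None
    downloaded = trafilatura.fetch_url(url)  # None if the download failed
    if downloaded is None:
        return None
    return extract_main_text(downloaded)

result = extract_main_text(SAMPLE_HTML)
```

Passing `output_format="xml"` instead of the default plain text is one way to keep more of the page structure in the output.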
Distinguishing between a whole page and the page's essential parts can help to alleviate many quality problems related to web text processing, by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.).
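The idea of separating a page's essential parts from recurring noise can be shown with a toy filter built on the standard library. This is a deliberately simplified sketch and not Trafilatura's actual algorithm: the tag list in `NOISE_TAGS`, the class `MainTextParser`, and the function `strip_boilerplate` are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical list of elements treated as recurring noise in this toy example
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class MainTextParser(HTMLParser):
    """Collect text that does not sit inside any of the noise elements."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)

def strip_boilerplate(html):
    """Return the text found outside header, footer, nav and similar elements."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

SAMPLE_PAGE = """
<html><body>
<header>Site name | Home | About</header>
<nav><a href="/">Home</a></nav>
<article><h1>Title of the post</h1><p>First paragraph of the article.</p></article>
<aside>Blogroll: link one, link two</aside>
<footer>Copyright notice</footer>
</body></html>
"""

main_text = strip_boilerplate(SAMPLE_PAGE)
```

Real pages rarely mark their noise this cleanly, which is why a dedicated extractor relies on more than tag names.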
The extractor aims to be precise enough not to miss text or discard valid documents. It must also be robust and reasonably fast. With these objectives in mind, Trafilatura is designed to run in production on millions of web documents.
trafilatura alternatives and similar packages
Based on the "Web Content Extracting" category.
TWINT
DISCONTINUED. An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
newspaper
newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
python-readability
A fast Python port of arc90's readability tool, updated to match the latest readability.js.
inscriptis
A Python-based HTML to text conversion library, command-line client, and web service.