Popularity

4.2

Growing

Activity

6.6

Declining

Stars 762

Watchers 17

Forks 97

Last Commit 3 months ago

Description

Goose was originally an article extractor written in Java; it was most recently (August 2011) converted to a Scala project.

This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.
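As a rough illustration of that aim, the sketch below pulls the title, meta description, body text and top image from a page with Goose3. It assumes the `goose3` package is installed; the helper names `summarize` and `extract_summary` and the `preview_chars` parameter are hypothetical and used only for this example. It is a minimal sketch, not the project's recommended usage.

```python
def summarize(article, preview_chars=200):
    """Collect a few fields of an extracted article into a plain dict.

    `article` is expected to look like a goose3 Article: it exposes
    `title`, `meta_description`, `cleaned_text` and `top_image`.
    """
    top_image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "description": article.meta_description,
        # cleaned_text holds the extracted main body; keep only a preview.
        "text_preview": article.cleaned_text[:preview_chars],
        # top_image is None when no image candidate was found.
        "top_image": top_image.src if top_image is not None else None,
    }


def extract_summary(url):
    """Fetch `url` with Goose3 and summarize the extracted article."""
    # Imported here so summarize() stays usable without goose3 installed.
    from goose3 import Goose

    # The context manager closes Goose's resources when done.
    with Goose() as g:
        return summarize(g.extract(url=url))
```

Calling `extract_summary("https://example.com/some-article")` (a placeholder URL) would return a dict with those four keys; what each field contains depends entirely on the page.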

Goose will try to extract the following information:

Programming language: HTML

License: Apache License 2.0

Tags: Text Processing Web Crawling Web Content Extracting Utilities Internet

Latest version: v3.1.12

Goose3 alternatives and similar packages

Based on the "Web Content Extracting" category.

TWINT

9.4 0.0 Goose3 VS TWINT

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
newspaper

9.3 0.0 L3 Goose3 VS newspaper

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.

python-goose

7.9 0.0 Goose3 VS python-goose

HTML content / article extractor, web scraping library in Python
textract

7.6 3.7 Goose3 VS textract

extract text from any document. no muss. no fuss.
sumy

7.4 6.7 L5 Goose3 VS sumy

Module for automatic summarization of text documents and HTML pages.
toapi

7.1 0.0 Goose3 VS toapi

Every web site provides APIs.
python-readability

6.7 3.4 Goose3 VS python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
trafilatura

6.5 8.4 Goose3 VS trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
html2text

5.7 6.3 L1 Goose3 VS html2text

Convert HTML to Markdown-formatted text.
micawber

3.9 4.8 L5 Goose3 VS micawber

a small library for extracting rich content from urls
lassie

3.7 0.0 L4 Goose3 VS lassie

Web Content Retrieval for Humans™
opengraph

3.0 0.0 L5 Goose3 VS opengraph

A python module to parse the Open Graph Protocol
inscriptis -- HTML to text conversion library, command line client and Web service

2.6 8.5 Goose3 VS inscriptis -- HTML to text conversion library, command line client and Web service

A Python-based HTML to text conversion library, command line client and Web service.
Haul

2.5 0.0 L5 Goose3 VS Haul

An Extensible Image Crawler
htmldate

2.0 7.6 Goose3 VS htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
sanitize

1.5 0.0 L4 Goose3 VS sanitize

Bringing sanity to the world of messed-up data
JSONPATH

1.0 5.7 Goose3 VS JSONPATH

A query expression for extracting data from JSON.
Data Extractor

0.9 6.0 Goose3 VS Data Extractor

Combine XPath, CSS Selectors and JSONPath for Web data extracting.