Popularity: 7.4 (Stable)

Activity: 6.7 (Growing)

Stars: 3,417

Watchers: 113

Forks: 523

Last Commit: 2 months ago

Description

Code Quality Rank: L5

Programming language: Python

License: Apache License 2.0

Tags: Text Processing Web Content Extracting HTML Scientific Engineering Information Analysis Internet Markup Linguistic Filters Education

Latest version: v0.10.0

sumy alternatives and similar packages

Based on the "Web Content Extracting" category.

TWINT

9.4 0.0 sumy VS TWINT

DISCONTINUED. An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
newspaper

9.3 0.0 L3 sumy VS newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

python-goose

7.9 0.0 sumy VS python-goose

HTML Content / Article Extractor, web scraping lib in Python
textract

7.6 3.7 sumy VS textract

extract text from any document. no muss. no fuss.
toapi

7.1 0.0 sumy VS toapi

Every web site provides APIs.
python-readability

6.7 3.4 sumy VS python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
trafilatura

6.5 8.7 sumy VS trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
html2text

5.7 6.1 L1 sumy VS html2text

Convert HTML to Markdown-formatted text.
Goose3

4.2 6.4 sumy VS Goose3

A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
micawber

3.9 4.8 L5 sumy VS micawber

a small library for extracting rich content from urls
lassie

3.7 0.0 L4 sumy VS lassie

Web Content Retrieval for Humans™
opengraph

3.0 0.0 L5 sumy VS opengraph

A python module to parse the Open Graph Protocol
inscriptis -- HTML to text conversion library, command line client and Web service

2.6 8.5 sumy VS inscriptis -- HTML to text conversion library, command line client and Web service

A python based HTML to text conversion library, command line client and Web service.
Haul

2.5 0.0 L5 sumy VS Haul

An Extensible Image Crawler
htmldate

2.0 7.6 sumy VS htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
sanitize

1.5 0.0 L4 sumy VS sanitize

Bringing sanity to world of messed-up data
JSONPATH

1.0 5.7 sumy VS JSONPATH

A query expression for extracting data from JSON.
Data Extractor

0.9 6.0 sumy VS Data Extractor

Combine XPath, CSS Selectors and JSONPath for Web data extracting.

* Code Quality Rankings and insights are calculated and provided by Lumnify.
They vary from L1 to L5 with "L5" being the highest.

README

Automatic text summarizer

A simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. The implemented summarization methods are described in the [documentation](docs/summarizators.md). I also maintain a list of [alternative implementations](docs/alternatives.md) of the summarizers in various programming languages.

Is my natural language supported?

There is a [good chance](docs/index.md#Tokenizer) it is. If not, it is [not too hard to add](docs/how-to-add-new-language.md).

Installation

Make sure you have Python 3.6+ and pip (Windows, Linux) installed, then run the following (the preferred way):

$ [sudo] pip install sumy
$ [sudo] pip install git+https://github.com/miso-belica/sumy.git  # for the fresh version

Usage

Sumy includes a command-line utility for quick summarization of documents.

$ sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy lex-rank --language=uk --length=30 --url=https://uk.wikipedia.org/wiki/Україна
$ sumy luhn --language=czech --url=https://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=https://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
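
If you want to drive the command-line utility from a script, a minimal sketch along these lines may help. It only uses the flags shown above (`--length`, `--language`, `--url`); the helper names `build_sumy_command` and `run_sumy` are made up for this illustration and are not part of sumy.

```python
import shlex
import subprocess


def build_sumy_command(method, url, length=None, language=None):
    """Build the argument list for one `sumy` CLI call (hypothetical helper)."""
    command = ["sumy", method]
    if language is not None:
        command.append(f"--language={language}")
    if length is not None:
        command.append(f"--length={length}")
    command.append(f"--url={url}")
    return command


def run_sumy(method, url, length=None, language=None):
    """Run the CLI and return its standard output (requires sumy to be installed)."""
    command = build_sumy_command(method, url, length, language)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    command = build_sumy_command(
        "lex-rank",
        "https://en.wikipedia.org/wiki/Automatic_summarization",
        length=10,
    )
    # Print the shell-quoted command instead of running it, so this works offline.
    print(shlex.join(command))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with URLs that contain non-ASCII or special characters.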

You can evaluate a summarization method against a reference summary with the commands below:

$ sumy_eval lex-rank reference_summary.txt --url=https://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=https://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=https://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
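
To give a rough idea of what comparing a summary against a reference can mean, here is a small sketch of sentence-level precision, recall, and F-score. It is an illustration written for this page under the assumption that both summaries are lists of sentences; it is not sumy's evaluation code, and the function names are chosen only for the example.

```python
def precision(evaluated, reference):
    """Fraction of evaluated sentences that also appear in the reference."""
    evaluated, reference = set(evaluated), set(reference)
    if not evaluated:
        raise ValueError("evaluated summary is empty")
    return len(evaluated & reference) / len(evaluated)


def recall(evaluated, reference):
    """Fraction of reference sentences that the evaluated summary recovered."""
    evaluated, reference = set(evaluated), set(reference)
    if not reference:
        raise ValueError("reference summary is empty")
    return len(evaluated & reference) / len(reference)


def f_score(evaluated, reference, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    p = precision(evaluated, reference)
    r = recall(evaluated, reference)
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    evaluated = ["Cats sleep a lot.", "Dogs bark."]
    reference = ["Cats sleep a lot.", "Birds sing.", "Fish swim."]
    print(precision(evaluated, reference))  # 1 of 2 evaluated sentences match
    print(recall(evaluated, reference))     # 1 of 3 reference sentences found
    print(f_score(evaluated, reference))
```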

If you would rather skip the installation, you can run it as a container.

$ docker run --rm misobelica/sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization

Python API

You can also use sumy as a library in your project. Create a file named sumy_example.py (don't name it sumy.py) with the code below to test it.

# Summarize a web page with the LSA summarizer (Python 3.6+).

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "english"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
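
For intuition about what "extracting a summary" means, the sketch below picks the sentences whose words are most frequent in the text. It is a deliberately crude, standard-library-only illustration and much simpler than the methods sumy implements; the stop-word list, the naive sentence splitter, and all function names are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"})


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lowercased word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, sentences_count):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    frequencies = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        tokens = words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(frequencies[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentences_count])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    for sentence in summarize("Cats chase mice. Dogs bark. Cats purr.", 2):
        print(sentence)
```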

Interesting projects using sumy

I found some interesting projects while browsing the internet, and sometimes people e-mailed me with questions and I was curious how they use sumy :)