gain alternatives and similar packages
Based on the "Web Crawling" category.
Alternatively, view gain alternatives based on common mentions on social networks and blogs.
- Scrapy: a fast, high-level web crawling and scraping framework for Python.
- MechanicalSoup: a Python library for automating interaction with websites.
- RoboBrowser: a simple, Pythonic library for browsing the web without a standalone web browser.
- spidy Web Crawler: a simple, easy-to-use command-line web crawler.
- Google Search Results in Python: Google search results via the SERP API pip package.
- Crawley: a Pythonic crawling/scraping framework based on non-blocking I/O operations.
- FastImage: a Python library that finds the size/type of an image given its URI by fetching as little data as needed.
- Mariner: this is a mirror of the GitLab repository; open your issues and pull requests there.
README
Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.
![architecture](img/architecture.png)
Requirements
- Python 3.5+
Installation
```shell
pip install gain
pip install uvloop  # Linux only
```
Usage
- Write spider.py:
```python
from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
```
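The two `Parser` patterns above are ordinary regular expressions matched against URLs the spider discovers: the first matches pagination pages (followed for more links), the second matches post permalinks (handed to `Post`). A standalone sketch of that routing, using only the stdlib `re` module (the `classify` helper is hypothetical, not part of gain):

```python
import re

# The same patterns used in the spider above, as raw strings to avoid
# invalid-escape warnings.
PAGE_PATTERN = re.compile(r'https://blog.scrapinghub.com/page/\d+/')
POST_PATTERN = re.compile(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/')

def classify(url):
    """Return which parser (if any) a URL would be routed to."""
    if POST_PATTERN.match(url):
        return 'post'   # parsed into a Post item and saved
    if PAGE_PATTERN.match(url):
        return 'page'   # crawled for more links, no item extracted
    return None         # ignored by the spider

print(classify('https://blog.scrapinghub.com/page/2/'))                      # page
print(classify('https://blog.scrapinghub.com/2017/05/10/some-post-title/'))  # post
```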
Or use XPathParser:
```python
from gain import Css, Item, Parser, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
    ]
    proxy = 'https://localhost:1234'


MySpider.run()
```
You can add a proxy setting to the spider, as shown above.
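`XPathParser` takes full XPath expressions (including attribute steps like `/@href`) that gain evaluates against each fetched page. To see the idea in isolation, here is a rough approximation using the stdlib `xml.etree.ElementTree`, which supports only a limited XPath subset (the sample markup is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up markup shaped like the category links targeted above.
PAGE = """<div>
  <span class="category-name"><a href="/europe-and-us-drama/">Drama</a></span>
  <span class="tag-name"><a href="/tags/2017/">2017</a></span>
</div>"""

root = ET.fromstring(PAGE)
# ElementTree has no /@href step, so select the <a> elements and
# read the attribute with .get() instead.
hrefs = [a.get('href') for a in root.findall('.//span[@class="category-name"]/a')]
print(hrefs)  # ['/europe-and-us-drama/']
```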
Run
```shell
python spider.py
```
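When the spider runs, the `concurrency` attribute from the examples above bounds how many requests are in flight at once. In plain asyncio terms that is essentially a semaphore around each fetch; a minimal sketch of the pattern (simplified, not gain's actual internals):

```python
import asyncio

async def fetch(url, sem):
    async with sem:             # at most `concurrency` coroutines pass at once
        await asyncio.sleep(0)  # stand-in for the real aiohttp request
        return url

async def crawl(urls, concurrency=5):
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

urls = [f'https://example.com/page/{i}' for i in range(10)]
results = asyncio.run(crawl(urls))
print(len(results))  # 10
```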
Result:
![sample](img/sample.png)
Example
The examples are in the /example/ directory.
Contribution
- Pull request.
- Open issue.