Popularity

6.1

Declining

Activity

0.0

Stable

Stars 2,031

Watchers 75

Forks 215

Last Commit almost 5 years ago

Programming language: Python

License: GNU General Public License v3.0 only

Tags: Web Frameworks Web Crawling Web Content Extracting

gain alternatives and similar packages

Based on the "Web Crawling" category.

Scrapy

9.9 9.7 L4 gain VS Scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
pyspider

9.5 0.0 L3 gain VS pyspider

A Powerful Spider(Web Crawler) System in Python.

requests-html

9.2 0.0 gain VS requests-html

Pythonic HTML Parsing for Humans™
portia

9.0 0.0 L2 gain VS portia

Visual scraping for Scrapy
MechanicalSoup

7.8 5.9 L4 gain VS MechanicalSoup

A Python library for automating interaction with websites.
RoboBrowser

7.4 0.0 L4 gain VS RoboBrowser

A simple, Pythonic library for browsing the web without a standalone web browser.
PSpider

6.5 0.0 gain VS PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Grab

6.5 3.0 L3 gain VS Grab

Web Scraping Framework
cola

6.4 0.0 L3 gain VS cola

A high-level distributed crawling framework.
Scrapely

6.3 0.0 gain VS Scrapely

A pure-python HTML screen-scraping library
feedparser

6.1 7.7 L3 gain VS feedparser

Parse feeds in Python
Sukhoi

4.2 0.0 gain VS Sukhoi

Minimalist and powerful Web Crawler.
MSpider

4.0 0.0 gain VS MSpider

Spider
Google Search Results in Python

3.7 4.5 gain VS Google Search Results in Python

Google Search Results via SERP API pip Python Package
spidy Web Crawler

3.2 0.0 gain VS spidy Web Crawler

The simple, easy to use command line web crawler.
reader

3.1 9.2 gain VS reader

A Python feed reader library.
brownant

2.6 0.0 gain VS brownant

Brownant is a web data extracting framework.
Crawley

2.6 0.0 gain VS Crawley

Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
Demiurge

2.0 0.0 L5 gain VS Demiurge

PyQuery-based scraping micro-framework.
Pomp

1.6 0.0 L5 gain VS Pomp

Screen scraping and web crawling framework
FastImage

1.0 0.0 L4 gain VS FastImage

Python library that finds the size / type of an image given its URI by fetching as little as needed
Mariner

0.4 0.0 gain VS Mariner

This is a mirror of the GitLab repository. Open your issues and pull requests there.

* Code Quality Rankings and insights are calculated and provided by Lumnify.
They vary from L1 to L5 with "L5" being the highest.

README

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

[](img/architecture.png)

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)
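
Since uvloop is a separate, platform-dependent install, a script may want to use it when present and fall back to the default asyncio loop otherwise. The following is a minimal sketch of that pattern, not gain's own startup code; the helper name `install_fast_loop` is hypothetical, and `asyncio.run` assumes Python 3.7 or newer.

```python
import asyncio


def install_fast_loop() -> bool:
    """Switch to uvloop's event loop policy if uvloop is importable.

    Returns True when uvloop was installed as the policy, False when the
    default asyncio loop stays in place. Hypothetical helper, not part of gain.
    """
    try:
        import uvloop
    except ImportError:
        return False
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    return True


async def main() -> str:
    # Stand-in coroutine; a real script would start its crawl here.
    await asyncio.sleep(0)
    return "done"


if __name__ == "__main__":
    print("uvloop" if install_fast_loop() else "default asyncio loop")
    print(asyncio.run(main()))
```

The same script then runs unchanged on platforms where uvloop is not installed.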

Usage

Write spider.py:

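from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    # Each CSS selector extracts one field from a matched page.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a local file using async file I/O.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep regex escapes such as \d intact.
    # The first pattern targets pagination URLs (no Item attached);
    # the second targets article URLs, which are parsed into Post.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()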
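
To see which URLs the two patterns in the spider above would select, they can be tried directly with the standard `re` module. This is only an illustrative sketch: the `classify` helper and the sample URLs are hypothetical, and matching from the start of the string with `re.match` is an assumption, since the README does not say how gain applies the patterns internally.

```python
import re

# Patterns copied from the spider above, written as raw strings.
PAGE_RE = re.compile(r'https://blog.scrapinghub.com/page/\d+/')
POST_RE = re.compile(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/')


def classify(url: str) -> str:
    """Label a URL as 'post', 'page', or 'other' (illustrative helper only)."""
    if POST_RE.match(url):
        return 'post'
    if PAGE_RE.match(url):
        return 'page'
    return 'other'


if __name__ == '__main__':
    # Sample URLs are made up for the demonstration.
    for url in ('https://blog.scrapinghub.com/page/2/',
                'https://blog.scrapinghub.com/2016/01/25/some-post/',
                'https://blog.scrapinghub.com/about/'):
        print(url, '->', classify(url))
```

Checking patterns this way before a crawl helps catch a rule that is too broad or that never matches.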

Or use XPathParser:

from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        # Read the extracted value from self.results, as in the first example.
        print(self.results['title'])


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
               XPathParser('//span[@class="category-name"]/a/@href'),
               XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
               XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
              ]
    proxy = 'https://localhost:1234'

MySpider.run()

You can add a proxy setting to the spider, as shown in the example above.
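
Both spiders also set `concurrency = 5`. One common way to cap the number of simultaneous requests in asyncio code is a semaphore. The sketch below shows that general technique under stated assumptions: it is not taken from gain's source, the `crawl` and `fetch` names are hypothetical, and `asyncio.sleep` stands in for a real HTTP request.

```python
import asyncio


async def crawl(urls, concurrency=5):
    """Process every URL with at most `concurrency` fetches in flight.

    Returns the results in input order plus the peak number of
    simultaneous fetches that was observed.
    """
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real HTTP request
            in_flight -= 1
            return url

    results = await asyncio.gather(*(fetch(u) for u in urls))
    return results, peak


if __name__ == '__main__':
    sample = ['https://example.com/{}'.format(i) for i in range(20)]
    fetched, peak = asyncio.run(crawl(sample, concurrency=5))
    print(len(fetched), 'URLs processed, peak concurrency', peak)
```

A lower limit is gentler on the target site; a higher one finishes sooner but risks rate limiting.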

Run python spider.py
Result:

[](img/sample.png)

Example

The examples are in the /example/ directory.

Contribution

Open a pull request.
Open an issue.


gain

Web crawling framework based on asyncio.