gain alternatives and similar packages
Based on the "Web Crawling" category.
Alternatively, view gain alternatives based on common mentions on social networks and blogs.
- Scrapy: a fast, high-level web crawling and scraping framework for Python.
- MechanicalSoup: a Python library for automating interaction with websites.
- RoboBrowser: a simple, Pythonic library for browsing the web without a standalone web browser.
- spidy Web Crawler: a simple, easy-to-use command-line web crawler.
- Google Search Results in Python: Google search results via the SERP API pip package.
- Crawley: a Pythonic crawling/scraping framework based on non-blocking I/O operations.
- FastImage: a Python library that finds the size/type of an image given its URI by fetching as little data as needed.
- Mariner: this is a mirror of the GitLab repository; open your issues and pull requests there.
README
Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.
![architecture](img/architecture.png)
Requirements
- Python 3.5+
Installation
```shell
pip install gain
pip install uvloop  # Linux only
```
Usage
- Write spider.py:
```python
from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
```
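The two `Parser` patterns above are ordinary regular expressions matched against URLs the spider discovers: the first matches pagination pages (followed for more links), the second matches post permalinks (handed to `Post`). A standalone sketch of that routing, using only the stdlib `re` module (the `classify` helper is hypothetical, not part of gain):

```python
import re

# The same patterns used in the spider above, as raw strings to avoid
# invalid-escape warnings.
PAGE_PATTERN = re.compile(r'https://blog.scrapinghub.com/page/\d+/')
POST_PATTERN = re.compile(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/')

def classify(url):
    """Return which parser (if any) a URL would be routed to."""
    if POST_PATTERN.match(url):
        return 'post'   # parsed into a Post item and saved
    if PAGE_PATTERN.match(url):
        return 'page'   # crawled for more links, no item extracted
    return None         # ignored by the spider

print(classify('https://blog.scrapinghub.com/page/2/'))                      # page
print(classify('https://blog.scrapinghub.com/2017/05/10/some-post-title/'))  # post
```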
Or use XPathParser:
```python
from gain import Css, Item, Parser, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
    ]
    proxy = 'https://localhost:1234'


MySpider.run()
```
You can add a proxy setting to the spider, as shown above.
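`XPathParser` takes full XPath expressions (including attribute steps like `/@href`) that gain evaluates against each fetched page. To see the idea in isolation, here is a rough approximation using the stdlib `xml.etree.ElementTree`, which supports only a limited XPath subset (the sample markup is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up markup shaped like the category links targeted above.
PAGE = """<div>
  <span class="category-name"><a href="/europe-and-us-drama/">Drama</a></span>
  <span class="tag-name"><a href="/tags/2017/">2017</a></span>
</div>"""

root = ET.fromstring(PAGE)
# ElementTree has no /@href step, so select the <a> elements and
# read the attribute with .get() instead.
hrefs = [a.get('href') for a in root.findall('.//span[@class="category-name"]/a')]
print(hrefs)  # ['/europe-and-us-drama/']
```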
Run
```shell
python spider.py
```
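When the spider runs, the `concurrency` attribute from the examples above bounds how many requests are in flight at once. In plain asyncio terms that is essentially a semaphore around each fetch; a minimal sketch of the pattern (simplified, not gain's actual internals):

```python
import asyncio

async def fetch(url, sem):
    async with sem:             # at most `concurrency` coroutines pass at once
        await asyncio.sleep(0)  # stand-in for the real aiohttp request
        return url

async def crawl(urls, concurrency=5):
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

urls = [f'https://example.com/page/{i}' for i in range(10)]
results = asyncio.run(crawl(urls))
print(len(results))  # 10
```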
Result:
![sample](img/sample.png)
Example
The examples are in the /example/ directory.
Contribution
- Pull request.
- Open issue.