pyspider alternatives and similar packages
Based on the "Web Crawling" category.
Alternatively, view pyspider alternatives based on common mentions on social networks and blogs.
-
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python. -
MechanicalSoup
A Python library for automating interaction with websites. -
RoboBrowser
A simple, Pythonic library for browsing the web without a standalone web browser. -
Google Search Results in Python
Google Search Results via SERP API pip Python Package -
spidy Web Crawler
The simple, easy to use command line web crawler. -
Crawley
Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations. -
FastImage
Python library that finds the size / type of an image given its URI by fetching as little as needed -
Mariner
This a is mirror of Gitlab repository. Open your issues and pull requests there.
WorkOS - The modern identity platform for B2B SaaS
* Code Quality Rankings and insights are calculated and provided by Lumnify.
They vary from L1 to L5 with "L5" being the highest.
Do you think we are missing an alternative of pyspider or a related project?
README
pyspider
A Powerful Spider(Web Crawler) System in Python.
- Write script in Python
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend
- RabbitMQ, Redis and Kombu as message queue
- Task priority, retry, periodical, recrawl by age, etc...
- Distributed architecture, Crawl Javascript pages, Python 2.{6,7}, 3.{3,4,5,6} support, etc...
Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases
Sample Code
from pyspider.libs.base_handler import *
class Handler(BaseHandler):
crawl_config = {
}
@every(minutes=24 * 60)
def on_start(self):
self.crawl('http://scrapy.org/', callback=self.index_page)
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
for each in response.doc('a[href^="http"]').items():
self.crawl(each.attr.href, callback=self.detail_page)
def detail_page(self, response):
return {
"url": response.url,
"title": response.doc('title').text(),
}
Installation
pip install pyspider
- run command
pyspider
, visit http://localhost:5000/
WARNING: WebUI is open to the public by default, it can be used to execute any command which may harm your system. Please use it in an internal network or enable need-auth
for webui.
Quickstart: http://docs.pyspider.org/en/latest/Quickstart/
Contribute
- Use It
- Open Issue, send PR
- User Group
- 中文问答
TODO
v0.4.0
- [ ] a visual scraping interface like portia
License
Licensed under the Apache License, Version 2.0
*Note that all licence references and agreements mentioned in the pyspider README section above
are relevant to that project's source code only.