Description
gazpacho is a web scraping library. It replaces requests and BeautifulSoup for most projects. gazpacho is small, simple, fast, and consistent. You should use it!
gazpacho alternatives and similar packages
Based on the "HTML Manipulation" category.
- xmltodict: Python module that makes working with XML feel like you are working with JSON.
- bleach: An allowed-list-based HTML sanitizing library that escapes or strips markup and attributes.
- html5lib: Standards-compliant library for parsing and serializing HTML documents and fragments in Python.
- selectolax: Python binding to the Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
- BeautifulSoup: Provides Pythonic idioms for iterating, searching, and modifying HTML or XML.
README
About
gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies.
Install
Install with pip at the command line:
pip install -U gazpacho
Quickstart
Give this a try:
from gazpacho import get, Soup

url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
books = soup.find('div', {'class': 'book-'}, partial=True)

def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price

[parse(book) for book in books]
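The price expression in parse is dense, so here is a minimal sketch of what it does to a string. The helper name parse_price and the sample strings are hypothetical; the actual text on the scraped page may be formatted differently.

```python
def parse_price(text):
    """Mirror the Quickstart expression: text[1:].split(' ')[0] as a float."""
    without_symbol = text[1:]                 # drop the first character (assumed currency symbol)
    first_token = without_symbol.split(' ')[0]  # keep everything before the first space
    return float(first_token)


print(parse_price('$12.50 CAD'))  # 12.5
```

If the text does not start with a single-character symbol, or the first token is not numeric, float raises a ValueError, so real scrapers may want a guard around this step.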
Tutorial
Import
Import gazpacho following the convention:
from gazpacho import get, Soup
get
Use the get function to download raw HTML:
url = 'https://scrape.world/soup'
html = get(url)
print(html[:50])
# '<!DOCTYPE html>\n<html lang="en">\n <head>\n <met'
Adjust get requests with optional params and headers:
get(
    url='https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'}
)
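To make the effect of params and headers concrete, here is a rough sketch of a comparable request built with the standard library's urllib. It only constructs the request and never sends it. The helper build_request is hypothetical, and whether gazpacho assembles its requests this way internally is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params=None, headers=None):
    """Append params as a query string and attach headers (nothing is sent)."""
    if params:
        url = f"{url}?{urlencode(params)}"
    return Request(url, headers=headers or {})


req = build_request(
    'https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'},
)
print(req.full_url)
# https://httpbin.org/anything?foo=bar&bar=baz
print(req.get_header('User-agent'))  # urllib stores header names capitalized
# gazpacho
```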
Soup
Use the Soup wrapper on raw html to enable parsing:
soup = Soup(html)
Soup objects can alternatively be initialized with the .get classmethod:
soup = Soup.get(url)
.find
Use the .find method to target and extract HTML tags:
h1 = soup.find('h1')
print(h1)
# <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>
attrs=
Use the attrs argument to isolate tags that contain specific HTML element attributes:
soup.find('div', attrs={'class': 'section-'})
partial=
Element attributes are partially matched by default. Turn this off by setting partial to False:
soup.find('div', {'class': 'soup'}, partial=False)
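The difference between partial and exact matching can be sketched in plain Python. This is an illustration of the semantics only, assuming that a partial match means the query value appears as a substring of the element's attribute value; the function attrs_match is hypothetical and is not gazpacho's implementation.

```python
def attrs_match(query, attrs, partial=True):
    """Return True if every queried attribute matches the element's attributes."""
    for key, wanted in query.items():
        actual = attrs.get(key)
        if actual is None:
            return False  # the element lacks the attribute entirely
        if partial:
            if wanted not in actual:  # assumed: substring match
                return False
        elif wanted != actual:        # exact match on the whole value
            return False
    return True


element = {'class': 'section-soup'}
print(attrs_match({'class': 'section-'}, element))                 # True
print(attrs_match({'class': 'section-'}, element, partial=False))  # False
```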
mode=
Override the mode argument {'auto', 'first', 'all'} to guarantee return behaviour:
print(soup.find('span', mode='first'))
# <span class="navbar-toggler-icon"></span>
len(soup.find('span', mode='all'))
# 8
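The three modes can be pictured as different ways of returning the same list of matches. The sketch below is an assumption inferred from the examples above ('first' yields one element, 'all' yields a list) plus a guess that 'auto' picks between them based on how many matches exist. The function select is hypothetical and not part of gazpacho.

```python
def select(found, mode='auto'):
    """Shape a list of matches according to the requested mode (assumed semantics)."""
    if mode == 'first':
        return found[0] if found else None
    if mode == 'all':
        return list(found)
    if mode == 'auto':
        if not found:
            return None
        return found[0] if len(found) == 1 else list(found)
    raise ValueError(f"unknown mode: {mode!r}")


matches = ['<span>a</span>', '<span>b</span>']
print(select(matches, mode='first'))  # <span>a</span>
print(len(select(matches, mode='all')))  # 2
print(select(matches[:1]))  # <span>a</span>
```

Pinning the mode is mainly useful when downstream code needs a predictable type, for example always iterating over a list.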
dir()
Soup objects have html, tag, attrs, and text attributes:
dir(h1)
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']
Use them accordingly:
print(h1.html)
# '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
print(h1.tag)
# h1
print(h1.attrs)
# {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(h1.text)
# Soup
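For a sense of where tag, attrs, and text come from, here is a small sketch that pulls the same three pieces out of the h1 markup using only the standard library's html.parser. The class FirstTag is hypothetical and is not gazpacho's implementation; it simply records the first element it sees.

```python
from html.parser import HTMLParser


class FirstTag(HTMLParser):
    """Record the tag name, attributes, and text of the first element seen."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.attrs = {}
        self.text = ''

    def handle_starttag(self, tag, attrs):
        if self.tag is None:
            self.tag = tag
            self.attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs

    def handle_data(self, data):
        self.text += data


markup = '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
parser = FirstTag()
parser.feed(markup)
parser.close()

print(parser.tag)    # h1
print(parser.attrs)  # {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(parser.text)   # Soup
```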
Support
If you use gazpacho, consider adding the badge to your project README.md:
[](https://github.com/maxhumber/gazpacho)
Contribute
For feature requests or bug reports, please use GitHub Issues.
For PRs, please read the CONTRIBUTING.md document.