Quick Web Scraping using Gazpacho

Rahul Shah Last Updated : 19 Apr, 2023


This article was published as a part of the Data Science Blogathon.

Web scraping is a fundamental technique for getting data from the web. It automates the extraction of data from a web page, which is quicker and less tedious than manually copying and pasting it. With a programming language, structuring and preprocessing the scraped data is also straightforward. Although scraping is easy to perform, it raises ethical questions. One should scrape a website only if the site allows it, which can be checked in the website’s robots.txt file. Even after fetching the data, it must not be used for commercial purposes without the owner’s consent.
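The robots.txt check mentioned above can be automated with Python’s standard library. The following is a minimal sketch, not a complete compliance check: the helper name can_scrape, the sample rules, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # accepts an iterable of robots.txt lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, standing in for a real robots.txt file
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

allowed = can_scrape(sample_rules, "https://example.com/products/laptops")
blocked = can_scrape(sample_rules, "https://example.com/private/data")
print(allowed, blocked)
```

For a live website, one would instead call parser.set_url() with the address of the site’s robots.txt file, followed by parser.read(), before calling can_fetch().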

In Python, web scraping is done predominantly using two libraries, requests and BeautifulSoup. gazpacho comes in handy when we want to fetch data quickly. In this article, we will learn how to scrape data using a single yet powerful library named gazpacho.

1. About gazpacho

2. How gazpacho works

3. Comparison of gazpacho with requests and BeautifulSoup

4. Conclusion

About Gazpacho

According to gazpacho’s documentation, gazpacho is a simple, fast, and modern web scraping library. It probably got its name from the Spanish dish. gazpacho combines the core capabilities of the requests and BeautifulSoup libraries, so the common scraping operations need only a few imports from a single package.

One can install the gazpacho library using the pip package manager:

pip install gazpacho

Although gazpacho offers several methods, we will cover only a few important ones. Refer to gazpacho’s documentation on PyPI to read about the others.

In this article, we will scrape a dummy laptop listing page from the test site provided by webscraper.io.

How Does Gazpacho Work?

To understand how gazpacho works, we will perform a basic set of operations on the webpage specified above.

Let’s first start with retrieving the webpage HTML data. Conventionally, we perform this operation using the .get() method of the requests library. To perform the get operation with gazpacho, we will import get from gazpacho.

from gazpacho import get

Now we will store the URL in a variable named URL.

URL = 'https://webscraper.io/test-sites/e-commerce/static/computers/laptops'

Next, we will retrieve the HTML data using the get() function and store it in another variable. gazpacho’s get() returns the page’s HTML as a string.

html = get(URL)

We will parse the retrieved HTML data using the Soup class of gazpacho. In the conventional approach, the same task is performed with the BeautifulSoup library, which requires installing and importing a second package.

from gazpacho import Soup

Let’s parse the HTML data into a Soup object that we can search.

soup = Soup(html)

Let’s find a few Laptop titles using the .find() method of the Soup object.

soup.find('p', {'class':'description'})

Since the page contains many matching elements, this returns a list of all the ‘p’ elements that have the HTML class description. (If only one element matches, .find() returns that element on its own rather than a list.) The first argument is the HTML tag we want to retrieve, passed as a string. Here, we want to retrieve the ‘p’ tag. The second argument is a dictionary of the attributes to match. Here we want to retrieve the class ‘description’.

If we check the type of one of the items in the list retrieved above, we get gazpacho.soup.Soup.

To get the text from the gazpacho Soup object, we have to use the .text attribute.

soup.find('p', {'class':'description'})[0].text

We can also find elements when we don’t know the exact name of the class. This is done using the ‘partial’ argument of the .find() method.

For example, let’s find the title of the laptops, which is in the ‘p’ HTML tag with class ‘description’. Suppose we don’t know the exact name of the class; we could write the partial name of the class in that tag and set the partial keyword to True.

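soup.find('p', {'class':'desc'}, partial = False)

With partial set to False, this looks for an exact match for the ‘desc’ class name on the ‘p’ HTML tag. Since there is no such class present, it retrieves nothing. In recent gazpacho releases, partial appears to default to True, so it is safer to pass partial = False explicitly when an exact match is required.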

We will retrieve the elements using the partial class name, setting the partial argument to True.

soup.find('p', {'class':'desc'}, partial = True)

This would retrieve a list of all the ‘p’ elements whose class name contains ‘desc’.

Comparison of Gazpacho with requests and BeautifulSoup

To get the same results with requests and BeautifulSoup, we first need to import both libraries.

import requests
from bs4 import BeautifulSoup

Now, let’s retrieve the webpage HTML data using requests first.

html = requests.get(URL).text

We have added the .text attribute to get the HTML as a string from the response object.

To parse the HTML data, we will use BeautifulSoup to create the soup object.

soup = BeautifulSoup(html, 'html.parser')

Here, in the BeautifulSoup constructor, we pass the HTML data as the first argument and ‘html.parser’ as the second argument to specify the parser we want.

Now, let’s find elements in the soup object. To get to the first laptop title, we start with:

soup.find('div', class_='caption')('p')

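This finds the first ‘div’ tag with the class ‘caption’ and then retrieves the list of all the ‘p’ elements inside it. Calling a BeautifulSoup tag like a function is a shortcut for its .find_all() method.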

To get the first element, index into the list with [0], and use the .text attribute at the end.

soup.find('div', class_='caption')('p')[0].text
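The BeautifulSoup steps can likewise be tried on a small hand-written snippet instead of the live page. This is a minimal sketch; the HTML and laptop names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely shaped like a product listing
sample_html = """
<div class="caption">
  <p class="description">Laptop A, 14 inch</p>
</div>
<div class="caption">
  <p class="description">Laptop B, 15 inch</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first <div class="caption">
first_caption = soup.find("div", class_="caption")

# Calling a tag is shorthand for first_caption.find_all("p")
paragraphs = first_caption("p")
first_text = paragraphs[0].text

# find_all() on the whole document collects every description
all_texts = [p.text for p in soup.find_all("p", class_="description")]

print(first_text)
print(all_texts)
```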

Conclusion

In this article, we learned about easy and quick web scraping using gazpacho. Its major advantage over the combination of requests and BeautifulSoup is that all the tasks can be done with a single library. As mentioned earlier, if you don’t have the necessary permissions, scraping must be limited to educational purposes. Also, don’t forget to check the ‘robots.txt’ file of any website for its permissions. This article was inspired by the website Calm Code. For troubleshooting and ideas, learn more about gazpacho from its official GitHub repository. One can also try retrieving nested HTML tags (one tag inside another) to get the required information.

About the Author

Connect with me on LinkedIn.

For any suggestions or article requests, you can email me here.

Check out my other Articles Here and on Medium.

You can provide your valuable feedback to me on LinkedIn.

Thanks for giving your time and reading my article on Web Scraping using Gazpacho.

Read more articles on Web Scraping here.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Rahul Shah

IT Engineering Graduate currently pursuing Post Graduate Diploma in Data Science.
