Web Scraping with LLMs and ScrapeGraphAI

By Adarsh Balan | Last Updated: 04 Jan, 2025
7 min read

Web scraping has become an essential tool for gathering useful information from websites. Among the available tools, ScrapeGraphAI stands out for combining graph-based scraping pipelines with large language models (LLMs), so you describe the data you want instead of coding how to find it. This article explores ScrapeGraphAI’s features, provides a step-by-step guide for implementation, and addresses common challenges. Whether you’re new to web scraping or an experienced user, this guide will equip you with the knowledge to use ScrapeGraphAI effectively.

Learning Objectives

  • Understand the key features and advantages of using ScrapeGraphAI for web scraping.
  • Learn how to set up and configure ScrapeGraphAI for your scraping projects.
  • Gain hands-on experience with a step-by-step implementation guide to scrape web data.
  • Recognize the challenges and considerations when using ScrapeGraphAI effectively.
  • Discover how to export scraped data to useful formats like Excel or CSV.

This article was published as a part of the Data Science Blogathon.

What is ScrapeGraphAI?

Scraping product listings from Amazon can be a daunting task. Normally, you might spend 200–300 lines of code setting up HTTP requests, parsing HTML with selectors or regex, dealing with pagination, handling anti-bot measures, and more. But with ScrapeGraphAI, you can instruct an AI model (backed by large language models) to extract exactly what you need—often in just a few lines of Python.

Disclaimer:

  • Amazon’s Terms of Service typically prohibit scraping or data extraction without explicit permission.
  • This article is purely a demonstration of ScrapeGraphAI’s capabilities on a single Amazon page for educational or personal use.
  • Large-scale or commercial scraping from Amazon can be legally and technically risky.

Why Choose ScrapeGraphAI for Web Scraping?

ScrapeGraphAI shifts the focus of web scraping from hand-written parsing code to natural-language instructions, which can make data extraction faster to build and simpler to maintain.

Significant Reduction in Code

With traditional scraping, you might use requests, BeautifulSoup, Selenium, or other libraries. A typical script could easily climb to 200–300 lines once you factor in error handling, CSS selectors, pagination, and more. In contrast, ScrapeGraphAI uses natural-language prompts to describe what you want—meaning most of the heavy lifting is done by an AI model in the background.
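To make the contrast concrete, here is a minimal sketch of the hand-written parsing that traditional scraping involves. It uses only Python’s standard-library HTMLParser on a tiny inline snippet. The markup and the class names (product, title, price) are invented for illustration and do not reflect Amazon’s real HTML.

```python
from html.parser import HTMLParser

# Hypothetical markup: real listing pages use different, frequently changing class names.
SAMPLE_HTML = """
<div class="product"><h2 class="title">Wooden Bedside Table</h2><span class="price">1,499</span></div>
<div class="product"><h2 class="title">Metal Side Table</h2><span class="price">979</span></div>
"""


class ProductParser(HTMLParser):
    """Collects text from tags whose class is 'title' or 'price'."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished product dicts
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "product":
            self.products.append({})          # start a new record
        elif css_class in ("title", "price") and self.products:
            self._field = css_class           # next text node belongs to this field

    def handle_data(self, data):
        if self._field and data.strip():
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None                    # stop capturing text


parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.products)
```

Even this toy version hard-codes three class names. Every extra field, and every rename on the site, means another branch to write or fix, which is how real scripts grow long.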

Faster Prototyping

Because you don’t have to manually craft selectors for every piece of HTML or worry about minor DOM changes, you can spin up a prototype in minutes.

Higher-Level Approach

By describing your data requirements in everyday English, you focus on what you want rather than how to get it. This approach can be more robust to small layout changes than brittle CSS or XPath queries (though site redesigns can still break any automated approach).

Ease of Maintenance

When Amazon (or any other site) changes its layout, you often have to rummage through HTML again to find the correct selectors. With ScrapeGraphAI, you mostly just update your prompt if the headings or page structure shift.

Getting Started with ScrapeGraphAI

Embarking on your web scraping journey with ScrapeGraphAI is straightforward and hassle-free. By leveraging its intuitive interface and AI-powered capabilities, you can skip the usual complexities of traditional scraping setups.

The steps below will guide you through acquiring the ScrapeGraphAI API key; installing the necessary tools and setting up your environment follow in the implementation guide. Whether you’re a seasoned developer or a beginner, the process takes only a few steps.

  • Go to: ScrapeGraphAI
  • Click: Get Started
  • Log In: You can sign in using your Google account.
  • Copy Your API Key: On the next page, your API key will be displayed. Simply copy it.

Note: ScrapeGraphAI provides 100 free credits to get you started!

Step-by-Step Implementation Guide

Below, we’ll show you how to scrape Amazon’s bedside table search results page and extract details like title, price, rating, number of ratings, and delivery info with only a handful of lines of code.

Step 1: Install Dependencies

Before starting, you’ll need to install the required libraries. These will provide the tools necessary for web scraping and data handling.

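pip install --quiet -U langchain-scrapegraph pandas openpyxl
  • langchain-scrapegraph: The LangChain integration package for ScrapeGraphAI, which provides SmartScraperTool.
  • pandas: We’ll use this to store the results in a DataFrame or Excel file.
  • openpyxl: The engine pandas uses to write .xlsx files in Step 6.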

Step 2: Import and Configure Your API Key

To interact with ScrapeGraphAI, you’ll need to set up your API key. If the key isn’t already in your environment, you’ll be prompted to enter it securely.

import os
import getpass
import pandas as pd
from langchain_scrapegraph.tools import SmartScraperTool

# If you haven't set your API key in your environment, you'll be prompted for it:
if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")
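The getpass prompt works well in a notebook, but a scheduled or non-interactive script would hang waiting for input. A minimal sketch of a fail-fast alternative is shown below. The helper name require_api_key is our own and is not part of ScrapeGraphAI. It assumes the key lives in the same SGAI_API_KEY environment variable used above.

```python
import os


def require_api_key(env=os.environ, name="SGAI_API_KEY"):
    """Return the API key from `env`, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the scraper.")
    return key


# Demo with a stand-in mapping; a real script would call require_api_key() with no arguments.
print(require_api_key({"SGAI_API_KEY": "example-key"}))
```

Passing the environment as a parameter keeps the helper easy to test without touching your real credentials.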

Step 3: Create the SmartScraperTool

This step initializes the ScrapeGraphAI SmartScraper, which serves as the heart of the scraping process.

smartscraper = SmartScraperTool()

This one line of code gives you access to an AI-based web scraper that accepts a simple prompt.

Step 4: Write the Prompt

Instead of writing lines of CSS or XPath selectors, you tell the tool what to do in plain English. For example:

scraper_prompt = """
1. Go to the Amazon search results page: https://www.amazon.in/s?k=bedside+table
2. For each product listing, extract:
   - Product Title
   - Price
   - Star Rating
   - Number of Ratings
   - Delivery details
3. Return the results as a JSON array of objects, each with keys:
   "title", "price", "rating", "num_ratings", "delivery".
4. Ignore sponsored listings if possible.
"""

Feel free to add or remove instructions. You might also include “product link” or “prime eligibility.”
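If you expect to change the field list often, one option is to generate the prompt from a mapping instead of editing the text by hand. The sketch below is one possible way to do that. The build_prompt helper is our own illustration, not a ScrapeGraphAI function, and the wording it produces simply mirrors the prompt above.

```python
def build_prompt(url, fields):
    """Build a numbered extraction prompt from a URL and a {key: description} mapping."""
    field_lines = "\n".join(f"   - {description}" for description in fields.values())
    keys = ", ".join(f'"{key}"' for key in fields)
    return (
        f"1. Go to the page: {url}\n"
        f"2. For each product listing, extract:\n{field_lines}\n"
        f"3. Return the results as a JSON array of objects, each with keys:\n   {keys}.\n"
        f"4. Ignore sponsored listings if possible.\n"
    )


fields = {
    "title": "Product Title",
    "price": "Price",
    "rating": "Star Rating",
    "num_ratings": "Number of Ratings",
    "delivery": "Delivery details",
}
fields["product_link"] = "Product link"  # adding a field is a one-line change

prompt = build_prompt("https://www.amazon.in/s?k=bedside+table", fields)
print(prompt)
```

The generated string can be passed as user_prompt in exactly the same way as the hand-written version.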

Step 5: Invoke the Scraper

With the prompt and scraper ready, you can now execute the scraping task.

search_url = "https://www.amazon.in/s?k=bedside+table"

result = smartscraper.invoke({
    "user_prompt": scraper_prompt,
    "website_url": search_url
})

print("Scraped Results:\n", result)

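What you’ll get back is typically a list (array) of dictionaries, sometimes wrapped in an outer dictionary under a key such as "products" (as in the actual run shown further below). Each dictionary contains the data you requested: title, price, rating, num_ratings, delivery, etc.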

Example (simplified):

[
  {
    "title": "XYZ Interiors Wooden Bedside Table...",
    "price": "₹1,499",
    "rating": "4.3 out of 5 stars",
    "num_ratings": "1,234",
    "delivery": "Get it by Monday, January 10"
  },
  ...
]

Output:

[Screenshot of the notebook output; the text did not survive extraction intact.] In the author’s run, result was a dictionary with a single "products" key holding a list of product dictionaries. The fields visible in the screenshot are title, price, rating, num_ratings, delivery, and product_link. The first entry, for example, is titled "Studio Kook SEZ Sofa Mate Engineered Wood Side Table (Junglewood, Matte Finish)" with a rating of "4.5 out of 5 stars" and "19" ratings. The notebook truncated the rest of the output.
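Because the returned shape can vary (a bare list in the simplified example, a dictionary with a "products" key in the actual run), it helps to normalize the result before using it. The sketch below is one way to do that. The normalize_result helper is hypothetical, and the assumption that a JSON string might also come back is ours, added as a defensive case.

```python
import json


def normalize_result(result):
    """Return a list of record dicts from a list, a wrapping dict, or a JSON string."""
    if isinstance(result, str):
        result = json.loads(result)        # defensive: handle JSON text
    if isinstance(result, dict):
        # Unwrap the first list-valued entry, e.g. {"products": [...]}
        for value in result.values():
            if isinstance(value, list):
                result = value
                break
        else:
            result = [result]              # no list inside: treat as a single record
    if not isinstance(result, list):
        raise TypeError(f"Unexpected result type: {type(result).__name__}")
    return [row for row in result if isinstance(row, dict)]


wrapped = {"products": [{"title": "A", "price": "₹1,499"}]}
print(normalize_result(wrapped))
```

After this step, downstream code can always assume a plain list of dictionaries.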

Step 6: Optional: Export to Excel or CSV

If you want to store your results, pandas makes it easy:

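# The scraper may return a bare list or a dict wrapping it (e.g. under "products")
rows = result.get("products", [result]) if isinstance(result, dict) else result
df = pd.DataFrame(rows)
df.to_excel("bedside_tables.xlsx", index=False)  # writing .xlsx needs the openpyxl package
df.to_csv("bedside_tables.csv", index=False)     # CSV alternative
print("Data exported to bedside_tables.xlsx and bedside_tables.csv")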
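The scraped values arrive as display strings such as "₹1,499" or "4.3 out of 5 stars", which sort and aggregate poorly in a spreadsheet. The sketch below shows one way to convert them to numbers before exporting. It uses the standard-library csv module rather than pandas so that it runs without extra packages. The helper names to_number and write_csv are our own.

```python
import csv
import io
import re


def to_number(text):
    """Extract the first number from strings such as '₹1,499' or '4.3 out of 5 stars'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    return float(match.group().replace(",", "")) if match else None


def write_csv(rows, handle, columns):
    """Write rows to an open text handle, leaving missing fields blank."""
    writer = csv.DictWriter(handle, fieldnames=columns)
    writer.writeheader()
    for row in rows:
        writer.writerow({col: row.get(col, "") for col in columns})


# Sample row copied from the simplified example earlier in the article.
rows = [{
    "title": "XYZ Interiors Wooden Bedside Table...",
    "price": "₹1,499",
    "rating": "4.3 out of 5 stars",
    "num_ratings": "1,234",
    "delivery": "Get it by Monday, January 10",
}]

cleaned = [
    {**row,
     "price": to_number(row.get("price")),
     "rating": to_number(row.get("rating")),
     "num_ratings": to_number(row.get("num_ratings"))}
    for row in rows
]

# In-memory buffer for the demo; swap in
# open("bedside_tables.csv", "w", newline="", encoding="utf-8") to write a file.
buffer = io.StringIO()
write_csv(cleaned, buffer, ["title", "price", "rating", "num_ratings", "delivery"])
print(buffer.getvalue())
```

The same cleaned list can also be handed to pd.DataFrame if you prefer the Excel route.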

Advantages of Using ScrapeGraphAI

Below are the advantages of using ScrapeGraphAI, which make it a standout choice for efficient and intelligent web scraping.

Simplicity

  • Traditional scraping with requests + BeautifulSoup or Selenium can easily bloat to 200–300 lines once you factor in error handling, pagination, dynamic loading, and data parsing.
  • With ScrapeGraphAI, you can often achieve the same result in under 20 lines (sometimes even fewer than 10).

Time Savings

  • You don’t need to figure out each CSS selector or XPath expression. You simply say, “Extract the title, price, rating…”
  • The LLM does the heavy HTML parsing behind the scenes.

Rapid Iteration

  • Instead of rewriting complex logic for every new data point, you just rephrase your prompt to capture the additional fields you need.

Evolving with the Page

  • If Amazon changes class names or modifies the HTML structure slightly, you might only need a small prompt tweak, rather than rewriting entire CSS or XPath queries.

Challenges and Considerations

Below are the challenges and considerations to keep in mind while using ScrapeGraphAI to ensure seamless and effective web scraping.

Amazon’s Terms of Service

  • Amazon generally prohibits automated data extraction. Repeated or large-scale scraping may get you blocked or lead to legal consequences.
  • If you plan to do anything beyond small-scale testing, get explicit permission or consider an official data feed.

CAPTCHAs / Anti-bot Measures

  • Amazon can detect unusual traffic patterns. If you’re blocked, you may need advanced solutions: rotating proxies, headless browsers, or carefully timed requests.
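“Carefully timed requests” can be as simple as waiting longer after each failure before trying again. The sketch below is a generic retry wrapper with exponential backoff. It is not a ScrapeGraphAI feature, the helper name call_with_backoff is our own, and it will not get you past a block. It only keeps a script from hammering a site that is already refusing requests.

```python
import time


def call_with_backoff(func, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise                          # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)   # 2s, 4s, 8s, ... with the defaults


# Demo with a stand-in function that fails twice before succeeding.
calls = []
delays = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("blocked")
    return "ok"


print(call_with_backoff(flaky, sleep=delays.append), delays)
```

With the scraper from the steps above, the call would look like call_with_backoff(lambda: smartscraper.invoke({"user_prompt": scraper_prompt, "website_url": search_url})).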

Data Volumes

  • If you want thousands of listings from multiple pages, ensure your approach is robust to handle pagination and big data sets.
  • Also watch your ScrapeGraphAI credits for large-scale usage.
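For pagination, one simple approach is to generate one URL per results page and scrape them in turn. The sketch below builds those URLs with the standard library. It assumes the site accepts a page query parameter, which you should verify in your browser before relying on it, and the helper name page_urls is our own.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(search_url, pages, param="page"):
    """Return one URL per page number by setting a page query parameter."""
    parts = urlsplit(search_url)
    # Keep the existing query (e.g. k=bedside+table) but drop any old page value.
    base_query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    urls = []
    for number in range(1, pages + 1):
        query = urlencode(base_query + [(param, str(number))])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


urls = page_urls("https://www.amazon.in/s?k=bedside+table", pages=3)
for url in urls:
    print(url)
```

Each URL would then be one separate smartscraper.invoke call, so the number of pages directly drives how many credits a run consumes.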

Dynamic Content

  • If certain info (like shipping details or Prime badges) is loaded dynamically via JavaScript, a static approach might miss it. More advanced techniques (like Selenium or Puppeteer) might be needed to capture every detail.
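A quick way to notice that dynamically loaded fields were missed is to check every record for empty or absent values after each run. The sketch below does that with a small helper. The name find_incomplete and the sample records are our own illustration.

```python
def find_incomplete(records, required):
    """Return (index, missing_fields) pairs for records lacking required values."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing if it is absent, None, or only whitespace.
        missing = [f for f in required if not str(record.get(f) or "").strip()]
        if missing:
            problems.append((index, missing))
    return problems


records = [
    {"title": "Table A", "price": "979", "delivery": "Get it by 6-7 Jan"},
    {"title": "Table B", "price": "", "delivery": None},
]
print(find_incomplete(records, ["title", "price", "delivery"]))
```

If the same field is missing from most records, that is a hint the value is rendered by JavaScript and a browser-based approach may be needed.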

Conclusion

ScrapeGraphAI brings a revolutionary approach to web scraping. Instead of painstakingly coding parse logic, you delegate that complexity to an AI model—shrinking your codebase from hundreds of lines down to a concise, easy-to-read script.

For many use cases—like quick product comparisons, one-off data extraction, or small-scale research—this can be a massive time-saver. However, you still need to be mindful of Amazon’s policies, and for large-scale scraping, advanced techniques and compliance considerations remain essential.

In short:

  • If you only need a handful of data points from a few pages, ScrapeGraphAI can be your best friend.
  • For bigger jobs, make sure you’re well within the site’s terms of service and prepared to handle CAPTCHAs or other anti-bot roadblocks.

Key Takeaways

  • ScrapeGraphAI reduces the effort and complexity of web scraping from hundreds of lines of code to concise, prompt-based instructions.
  • With natural language prompts, you can quickly extract data without worrying about HTML selectors or layout changes.
  • Minor updates to prompts can handle site structure changes, minimizing the need for extensive code rewrites.
  • Scraping Amazon at scale may violate their Terms of Service and require solutions for CAPTCHAs and anti-bot measures.
  • Ideal for quick, small-scale data extraction, but large-scale projects require compliance with Amazon’s policies and robust handling mechanisms.

Frequently Asked Questions

Q1. Is it legal to scrape Amazon?

A. Scraping Amazon at scale is generally not allowed under their Terms of Service. Amazon employs anti-bot measures (CAPTCHAs, IP blocking) to prevent unauthorized scraping. For a small-scale, personal project—such as collecting a limited number of listings for research—you may be okay, but you should always check the current Amazon Terms of Service and confirm you have permission. Large-scale or commercial scraping could be legally risky and may violate Amazon’s policies.

Q2. Why do we need ScrapeGraphAI for this task?

A. ScrapeGraphAI simplifies the scraping process by using prompt-based instructions with large language models under the hood. Rather than manually parsing HTML with CSS selectors or XPath, you can describe the data you want (“product titles, prices, etc.”) in plain English. This can save you from writing 200–300 lines of custom parsing code.

Q3. Will ScrapeGraphAI always be able to retrieve the data I request?

A. Not always. Some sites (including Amazon) rely heavily on JavaScript to load or update product information. If the data is injected dynamically and is not present in the initial HTML source, ScrapeGraphAI might not see it through a simple HTTP request. Additionally, websites might employ CAPTCHAs or block requests. In such cases, you might need advanced techniques (headless browsers, proxies, etc.).

Q4. Can I scrape multiple pages or entire categories?

A. Yes, in theory, you can instruct ScrapeGraphAI to follow pagination links and scrape more results. However, be mindful of rate limits, potential CAPTCHA challenges, and Amazon’s Terms of Service. If you repeatedly scrape many pages, you risk getting blocked or violating their usage policies.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hi! I'm Adarsh, a Business Analytics graduate from ISB, currently deep into research and exploring new frontiers. I'm super passionate about data science, AI, and all the innovative ways they can transform industries. Whether it's building models, working on data pipelines, or diving into machine learning, I love experimenting with the latest tech. AI isn't just my interest, it's where I see the future heading, and I'm always excited to be a part of that journey!
