Web Scraping with LLMs

By seematiwari0116 · Last Updated: 22 Dec, 2024 · 6 min read

Web scraping has long been a vital technique for extracting information from the internet, enabling developers to gather insights from various domains. Paired with Large Language Models (LLMs), accessed here through LangChain's ChatGroq interface, web scraping becomes even more powerful, offering enhanced flexibility and precision. This article shows how to combine LLMs with browser automation to fetch structured data from webpages effectively.


Learning Objectives

  • Understand how to integrate Large Language Models (LLMs) like ChatGroq with web scraping tools.
  • Learn how to extract structured data from webpages using Playwright and LLMs.
  • Gain practical knowledge of setting up an environment for web scraping with LLMs.
  • Explore techniques for processing and converting web content into structured formats like Markdown.
  • Learn how to automate and scale web scraping tasks for efficient data extraction.

Setting Up the Environment

Before diving into web scraping, ensure your environment is properly configured. Install the required libraries:

!pip install -Uqqq pip --progress-bar off  # Updates the Python package installer to the latest version
!pip install -qqq playwright==1.46.0 --progress-bar off  # Playwright for browser automation and web scraping
!pip install -qqq html2text==2024.2.26 --progress-bar off  # Converts HTML content to plain text or Markdown format
!pip install -qqq langchain-groq==0.1.9 --progress-bar off  # LangChain Groq for leveraging LLMs in data extraction workflows

!playwright install chromium

These commands upgrade pip, install playwright for browser automation, html2text for HTML-to-Markdown conversion, and langchain-groq for LLM-based data extraction, and then download the Chromium browser that Playwright drives.
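If you want to confirm the pinned versions actually installed (an optional check, not part of the original walkthrough), the standard library can report them:

from importlib.metadata import version

# Optional sanity check: print the installed version of each pinned package
for pkg in ("playwright", "html2text", "langchain-groq"):
    print(pkg, version(pkg))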

Import the Necessary Modules

Next, import the modules used throughout the tutorial:

import re
from pprint import pprint
from typing import List, Optional

import html2text
import nest_asyncio
import pandas as pd
from google.colab import userdata
from langchain_groq import ChatGroq
from playwright.async_api import async_playwright
from pydantic import BaseModel, Field
from tqdm import tqdm

nest_asyncio.apply()  # allow nested event loops so async Playwright calls work inside notebooks
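Note that google.colab.userdata, imported above, exists only inside Google Colab. If you run this elsewhere, a common alternative (my assumption, not part of the original notebook) is to read the key from an environment variable:

import os

# Outside Colab: read the Groq API key from the environment instead of
# google.colab.userdata; export GROQ_API_KEY in your shell beforehand.
groq_api_key = os.environ.get("GROQ_API_KEY")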

Fetching Web Content as Markdown

The first step in scraping involves fetching the web content. Using Playwright, we load the webpage and retrieve its HTML content:

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch()

context = await browser.new_context(user_agent=USER_AGENT)

page = await context.new_page()
await page.goto("https://playwright.dev/")
content = await page.content()

await browser.close()
await playwright.stop()
print(content)

This code fetches the HTML content of a webpage using Playwright. It starts a Chromium browser instance, sets a custom user agent for the browsing context, navigates to the specified URL (https://playwright.dev/), and retrieves the page’s HTML content. After fetching, the browser and Playwright are cleanly closed to release resources.
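For JavaScript-heavy pages, the HTML may still be incomplete when goto returns. Playwright's goto accepts wait_until and timeout options; a hedged variation of the fetch above (not in the original article) waits for network activity to settle first:

# Wait until the network has been idle for 500 ms before reading the HTML,
# with a 30-second cap so slow pages fail fast instead of hanging.
await page.goto("https://playwright.dev/", wait_until="networkidle", timeout=30_000)
content = await page.content()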

To simplify text processing, convert the HTML content to Markdown format using the html2text library:

markdown_converter = html2text.HTML2Text()
markdown_converter.ignore_links = False
markdown_content = markdown_converter.handle(content)
print(markdown_content)
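html2text exposes a few more switches worth knowing. ignore_links = False (as above) keeps hyperlinks in the Markdown; the sketch below shows other commonly used options that help keep LLM prompts short (illustrative settings, not from the original article):

markdown_converter = html2text.HTML2Text()
markdown_converter.ignore_links = False   # keep hyperlinks as [text](url)
markdown_converter.ignore_images = True   # drop inline images to shrink the prompt
markdown_converter.body_width = 0         # disable hard line-wrapping
markdown_content = markdown_converter.handle(content)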

Setting Up Large Language Models (LLMs)

Next, configure the LLM to process and extract structured information. We use ChatGroq, LangChain's interface to Groq-hosted models, which supports structured output:

MODEL = "llama-3.1-70b-versatile"

llm = ChatGroq(temperature=0, model_name=MODEL, api_key=userdata.get("GROQ_API_KEY"))

SYSTEM_PROMPT = """
You're an expert text extractor. You extract information from webpage content.
Always extract data without changing it, and do not add any other output.
"""

def create_scrape_prompt(page_content: str) -> str:
    return f"""
Extract the information from the following web page:
```
{page_content}
```
""".strip()

This code configures ChatGroq for extracting structured data from webpage content. It initializes the llama-3.1-70b-versatile model with temperature=0 for deterministic output and a system prompt instructing it to extract information without altering it. The create_scrape_prompt function wraps arbitrary webpage text into an extraction prompt.
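Before wiring the model into the scraping pipeline, a quick smoke test (my addition, not in the original) confirms the API key and model name are valid:

# One-off sanity check: any reply proves the key and model work.
response = llm.invoke("Reply with the single word: ready")
print(response.content)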

Scraping Landing Pages

Define the data structure for landing page extraction using Pydantic models:

class ProjectInformation(BaseModel):
    """Information about the project"""

    name: str = Field(description="Name of the project e.g. Excel")
    tagline: str = Field(
        description="What this project is about e.g. Get deep insights from your numbers",
    )
    benefits: List[str] = Field(
        description="""A list of main benefits of the project including 3-5 words to summarize each one.
    e.g. [
        'Your spreadsheets everywhere you go - cloud-backed files with your account',
        'Accuracy without manual calculations - vast amount of built-in formulas ready to use'
    ]
    """
    )

Invoke the LLM with structured output:

page_scraper_llm = llm.with_structured_output(ProjectInformation)
extraction = page_scraper_llm.invoke(
    [("system", SYSTEM_PROMPT), ("user", create_scrape_prompt(markdown_content))]
)
pprint(extraction.__dict__, sort_dicts=False, width=120)

You can extend this process to multiple URLs:

Fetching the Web Page

async def fetch_page(url, user_agent=USER_AGENT) -> str:
    # Launch browser and navigate to the URL
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        context = await browser.new_context(user_agent=user_agent)
        page = await context.new_page()
        await page.goto(url)
        content = await page.content()
        await browser.close()

    # Convert HTML to Markdown
    markdown_converter = html2text.HTML2Text()
    markdown_converter.ignore_links = False
    return markdown_converter.handle(content)

This function launches a headless Chromium browser with Playwright, loads the specified URL, extracts its HTML, and converts it to Markdown (links included) using html2text. The async with block guarantees the browser is closed even if the page load fails.
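The await fetch_page(...) calls below work because notebooks (helped by nest_asyncio) allow top-level await. In a plain Python script you would drive the coroutine yourself; a minimal sketch:

import asyncio

# In a regular script (no event loop running yet), asyncio.run drives the coroutine.
markdown = asyncio.run(fetch_page("https://playwright.dev/"))
print(markdown[:500])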

Processing Multiple URLs

urls = [
    "https://videogen.io/",
    "https://blaze.today/aiblaze/",
    "https://www.insightpipeline.com/",
    "https://apps.apple.com/us/app/today-app-to-do-list-habits/id6461726826",
    "https://brainybear.ai/",
]

Here, a list of URLs is defined to be processed.

Extracting Content

extractions = []
for url in tqdm(urls):
    content = await fetch_page(url)
    extractions.append(
        page_scraper_llm.invoke(
            [("system", SYSTEM_PROMPT), ("user", create_scrape_prompt(content))]
        )
    )

The loop fetches each URL, converts the page to Markdown, and passes the text to page_scraper_llm for structured extraction; the results accumulate in the extractions list, with tqdm showing progress.
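Real pages time out or block automated browsers, and one failure shouldn't lose the whole batch. A hedged variation (not in the original) skips failures and records which URLs succeeded:

extractions, ok_urls = [], []
for url in tqdm(urls):
    try:
        content = await fetch_page(url)
        extractions.append(
            page_scraper_llm.invoke(
                [("system", SYSTEM_PROMPT), ("user", create_scrape_prompt(content))]
            )
        )
        ok_urls.append(url)
    except Exception as exc:  # e.g. timeouts or blocked requests
        print(f"Skipping {url}: {exc}")

If you use this variant, zip extractions with ok_urls instead of urls when building the DataFrame below.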

Save the Results to a DataFrame

rows = []
for extraction, url in zip(extractions, urls):
    row = extraction.__dict__
    row["url"] = url
    rows.append(row)
projects_df = pd.DataFrame(rows)
projects_df
Inspect the benefits extracted for the first project:

projects_df.iloc[0].benefits

Finally, save the results:

projects_df.to_csv("projects.csv", index=False)

Scraping Car Listings

For more complex data, define additional models like CarListing and CarListings:

url = "https://www.autoscout24.com/lst?atype=C&cy=D%2CA%2CB%2CE%2CF%2CI%2CL%2CNL&desc=0&fregfrom=2018&gear=M&powerfrom=309&powerto=478&powertype=hp&search_id=1tih4oks815&sort=standard&ustate=N%2CU"
auto_content = await fetch_page(url)

print(auto_content)
class CarListing(BaseModel):
    """Information about a car listing"""

    make: str = Field(description="Make of the car e.g. Toyota")
    model: str = Field(description="Model of the car, maximum 3 words e.g. Land Cruiser")
    horsepower: int = Field(description="Horsepower of the engine e.g. 231")
    price: int = Field(description="Price in euro e.g. 34000")
    mileage: Optional[int] = Field(None, description="Number of kilometers on the odometer e.g. 73400")
    year: Optional[int] = Field(None, description="Year of registration (if available) e.g. 2015")
    url: str = Field(
        description="Url to the listing e.g. https://www.autoscout24.com/offers/lexus-rc-f-advantage-coupe-gasoline-grey-19484ec1-ee56-4bfd-8769-054f03515792"
    )

class CarListings(BaseModel):
    """List of car listings"""

    cars: List[CarListing] = Field(description="List of cars for sale.")

car_listing_scraper_llm = llm.with_structured_output(CarListings)

extraction = car_listing_scraper_llm.invoke(
    [("system", SYSTEM_PROMPT), ("user", create_scrape_prompt(auto_content))]
)

extraction.cars
def filter_model(row):
    # Strip non-alphanumeric characters, then keep at most the first 3 words
    row = re.sub("[^0-9a-zA-Z]+", " ", row)
    parts = row.split(" ")
    return " ".join(parts[:3])

rows = [listing.__dict__ for listing in extraction.cars]

listings_df = pd.DataFrame(rows)
listings_df["model"] = listings_df.model.apply(filter_model)
listings_df
listings_df.to_csv("car-listings.csv", index=False)

This code defines two Pydantic models, CarListing and CarListings, representing individual car details (make, model, price, mileage, and so on) and a list of such cars. The LLM extracts structured listings from the page; the filter_model helper then strips punctuation and truncates model names to three words before the data is loaded into a pandas DataFrame and saved as a CSV file (car-listings.csv).
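Once the listings are in a DataFrame, ordinary pandas filtering applies. As an illustration (the column names come from the models above; the thresholds are arbitrary), you could shortlist the cheapest high-power cars:

# Shortlist: at least 350 hp and under 60,000 euro, cheapest first.
shortlist = listings_df[
    (listings_df["horsepower"] >= 350) & (listings_df["price"] < 60_000)
].sort_values("price")
print(shortlist[["make", "model", "horsepower", "price", "url"]])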

Conclusion

Combining the power of LLMs with traditional scraping tools like Playwright unlocks new possibilities for structured data extraction. Whether extracting project details or car listings, this approach ensures accuracy and scalability, paving the way for advanced data workflows. Happy scraping!

Key Takeaways

  • LLMs like ChatGroq enhance web scraping by providing precise and structured data extraction.
  • Playwright automates browser interactions, allowing for effective content retrieval.
  • Markdown conversion simplifies text processing for better data extraction.
  • Structured data models (e.g., Pydantic) ensure organized and clean output.
  • Combining LLMs with traditional scraping tools improves scalability and accuracy.

Frequently Asked Questions

Q1. What is the role of Playwright in web scraping?

A. Playwright automates browser interactions, enabling you to load and retrieve dynamic content from webpages efficiently.

Q2. How does ChatGroq enhance web scraping?

A. ChatGroq, a large language model, processes webpage content and extracts structured information accurately, improving data extraction precision.

Q3. Why convert HTML content to Markdown?

A. Converting HTML to Markdown simplifies text processing by stripping unnecessary tags, making it easier to extract relevant information.

Q4. What is the benefit of using structured data models like Pydantic?

A. Structured data models ensure the extracted data is organized and validated, making it easier to process and analyze.

Q5. Can this approach scale for large-scale scraping projects?

A. Yes, by combining Playwright for automation and LLMs for data extraction, this method can handle large datasets efficiently, ensuring scalability.
