Web scraping has long been a vital technique for extracting information from the internet, enabling developers to gather insights across many domains. With Large Language Models (LLMs) in the loop, accessed here through LangChain's ChatGroq integration, web scraping becomes even more powerful, offering greater flexibility and precision. This article shows how to combine browser automation with an LLM to fetch structured data from webpages.
Before diving into web scraping, ensure your environment is properly configured. Install the required libraries:
!pip install -Uqqq pip --progress-bar off # Updates the Python package installer to the latest version
!pip install -qqq playwright==1.46.0 --progress-bar off # Playwright for browser automation and web scraping
!pip install -qqq html2text==2024.2.26 --progress-bar off # Converts HTML content to plain text or Markdown format
!pip install -qqq langchain-groq==0.1.9 --progress-bar off # LangChain Groq for leveraging LLMs in data extraction workflows
!playwright install chromium
This code sets up the environment: it updates pip, installs playwright for browser automation, html2text for HTML-to-Markdown conversion, and langchain-groq for LLM-based data extraction, and finally downloads the Chromium browser that Playwright will drive.
Next, import the required modules:
import re
from pprint import pprint
from typing import List, Optional
import html2text
import nest_asyncio
import pandas as pd
from google.colab import userdata
from langchain_groq import ChatGroq
from playwright.async_api import async_playwright
from pydantic import BaseModel, Field
from tqdm import tqdm
nest_asyncio.apply()  # Patch the running event loop so await can be used directly in notebooks such as Colab
The first step in scraping involves fetching the web content. Using Playwright, we load the webpage and retrieve its HTML content:
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch()
context = await browser.new_context(user_agent=USER_AGENT)
page = await context.new_page()
await page.goto("https://playwright.dev/")
content = await page.content()
await browser.close()
await playwright.stop()
print(content)
This code fetches the HTML content of a webpage using Playwright. It starts a Chromium browser instance, sets a custom user agent for the browsing context, navigates to the specified URL (https://playwright.dev/), and retrieves the page’s HTML content. After fetching, the browser and Playwright are cleanly closed to release resources.
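Some sites render most of their content with client-side JavaScript, so the HTML captured immediately after navigation can be incomplete. As an optional variation (not used in the rest of this walkthrough), Playwright can wait until network activity settles before the page content is read:

await page.goto("https://playwright.dev/", wait_until="networkidle")  # wait until network activity quiets down before reading the DOM
content = await page.content()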
To simplify text processing, convert the HTML content to Markdown format using the html2text library:
markdown_converter = html2text.HTML2Text()
markdown_converter.ignore_links = False
markdown_content = markdown_converter.handle(content)
print(markdown_content)
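html2text exposes a few more switches that can make the Markdown cleaner before it reaches the LLM. As an optional tweak (these settings are not part of the original walkthrough), you can drop image references and disable hard line wrapping:

markdown_converter = html2text.HTML2Text()
markdown_converter.ignore_links = False
markdown_converter.ignore_images = True  # drop image references from the output
markdown_converter.body_width = 0  # 0 disables hard line wrapping
markdown_content = markdown_converter.handle(content)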
Next, configure the LLM to process and extract structured information. Use ChatGroq, LangChain's interface to Groq-hosted models, which works well for structured data extraction:
MODEL = "llama-3.1-70b-versatile"
llm = ChatGroq(temperature=0, model_name=MODEL, api_key=userdata.get("GROQ_API_KEY"))
SYSTEM_PROMPT = """
You're an expert text extractor. You extract information from webpage content.
Always extract data without changing it and any other output.
"""
def create_scrape_prompt(page_content: str) -> str:
    return f"""
Extract the information from the following web page:
```
{page_content}
```
""".strip()
This code sets up the ChatGroq LLM for extracting structured data from webpage content. It initializes the model (`llama-3.1-70b-versatile`) with specific parameters and a system prompt instructing it to extract information accurately without altering the content. A function generates prompts for processing webpage text dynamically.
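To sanity-check the prompt the model will receive, you can call the helper on a throwaway snippet (the string below is just an example, not part of the scraping flow):

sample_prompt = create_scrape_prompt("# Example Project\nGet deep insights from your numbers.")
print(sample_prompt)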
Define the data structure for landing page extraction using Pydantic models:
class ProjectInformation(BaseModel):
    """Information about the project"""

    name: str = Field(description="Name of the project e.g. Excel")
    tagline: str = Field(
        description="What this project is about e.g. Get deep insights from your numbers",
    )
    benefits: List[str] = Field(
        description="""A list of main benefits of the project including 3-5 words to summarize each one.
e.g. [
    'Your spreadsheets everywhere you go - cloud-backed files with your account',
    'Accuracy without manual calculations - vast amount of built-in formulas ready to use'
]
"""
    )
Invoke the LLM with structured output:
page_scraper_llm = llm.with_structured_output(ProjectInformation)
extraction = page_scraper_llm.invoke(
[("system", SYSTEM_PROMPT), ("user", create_scrape_prompt(markdown_content))]
)
pprint(extraction.__dict__, sort_dicts=False, width=120)
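Because extraction is a ProjectInformation instance, its fields are also available as plain attributes:

print(extraction.name)
print(extraction.tagline)
print(extraction.benefits)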
You can extend this process to multiple URLs:
async def fetch_page(url, user_agent=USER_AGENT) -> str:
    # Launch browser and navigate to the URL
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        context = await browser.new_context(user_agent=user_agent)
        page = await context.new_page()
        await page.goto(url)
        content = await page.content()
        await browser.close()
    # Convert HTML to Markdown
    markdown_converter = html2text.HTML2Text()
    markdown_converter.ignore_links = False
    return markdown_converter.handle(content)
This function launches a headless browser using Playwright, loads the specified URL, extracts its HTML content, and then converts it to Markdown using html2text. It ensures the browser starts and stops properly, and transforms the HTML content into Markdown format, including links.
urls = [
"https://videogen.io/",
"https://blaze.today/aiblaze/",
"https://www.insightpipeline.com/",
"https://apps.apple.com/us/app/today-app-to-do-list-habits/id6461726826",
"https://brainybear.ai/",
]
Here, we define the list of landing-page URLs to process.
Extracting Content
extractions = []

for url in tqdm(urls):
    content = await fetch_page(url)
    extractions.append(
        page_scraper_llm.invoke(
            [("system", SYSTEM_PROMPT), ("user", create_scrape_prompt(content))]
        )
    )
A loop is used to go through each URL, fetch the page content, and then pass it to a language model (page_scraper_llm) for further processing. The results are stored in the extractions list.
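Network hiccups and API rate limits can interrupt a long run. Below is a minimal, hedged sketch (assuming you are happy to retry once after a short pause and skip URLs that still fail) that wraps the same calls in basic error handling:

import time

extractions = []
failed_urls = []
for url in tqdm(urls):
    for attempt in range(2):  # try each URL at most twice
        try:
            content = await fetch_page(url)
            extractions.append(
                page_scraper_llm.invoke(
                    [("system", SYSTEM_PROMPT), ("user", create_scrape_prompt(content))]
                )
            )
            break
        except Exception:
            time.sleep(5)  # brief pause before retrying or giving up
    else:
        failed_urls.append(url)  # record URLs that failed both attempts

If any URL ends up in failed_urls, remember to exclude it from urls before the next step, since the zip(extractions, urls) below assumes one extraction per URL.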
rows = []
for extraction, url in zip(extractions, urls):
    row = extraction.__dict__
    row["url"] = url
    rows.append(row)
projects_df = pd.DataFrame(rows)
projects_df
projects_df.iloc[0].benefits
projects_df.to_csv("projects.csv", index=None)
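One caveat when persisting this DataFrame: list-valued columns such as benefits are written to CSV as their string representation. A small optional sketch for reading the file back and restoring the lists:

import ast

reloaded_df = pd.read_csv("projects.csv")
# benefits was serialized as a string like "['...', '...']"; parse it back into a Python list
reloaded_df["benefits"] = reloaded_df["benefits"].apply(ast.literal_eval)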
For more complex data, define additional models like CarListing and CarListings:
url = "https://www.autoscout24.com/lst?atype=C&cy=D%2CA%2CB%2CE%2CF%2CI%2CL%2CNL&desc=0&fregfrom=2018&gear=M&powerfrom=309&powerto=478&powertype=hp&search_id=1tih4oks815&sort=standard&ustate=N%2CU"
auto_content = await fetch_page(url)
print(auto_content)
class CarListing(BaseModel):
    """Information about a car listing"""

    make: str = Field(description="Make of the car e.g. Toyota")
    model: str = Field(description="Model of the car, maximum 3 words e.g. Land Cruiser")
    horsepower: int = Field(description="Horsepower of the engine e.g. 231")
    price: int = Field(description="Price in euro e.g. 34000")
    mileage: Optional[int] = Field(None, description="Number of kilometers on the odometer e.g. 73400")
    year: Optional[int] = Field(None, description="Year of registration (if available) e.g. 2015")
    url: str = Field(
        description="Url to the listing e.g. https://www.autoscout24.com/offers/lexus-rc-f-advantage-coupe-gasoline-grey-19484ec1-ee56-4bfd-8769-054f03515792"
    )

class CarListings(BaseModel):
    """List of car listings"""

    cars: List[CarListing] = Field(description="List of cars for sale.")
car_listing_scraper_llm = llm.with_structured_output(CarListings)
extraction = car_listing_scraper_llm.invoke(
[("system", SYSTEM_PROMPT), ("user", create_scrape_prompt(auto_content))]
)
extraction.cars
def filter_model(row):
    # Replace non-alphanumeric characters with spaces, then keep only the first three tokens of the model name
    row = re.sub("[^0-9a-zA-Z]+", " ", row)
    parts = row.split(" ")
    return " ".join(parts[:3])
rows = [listing.__dict__ for listing in extraction.cars]
listings_df = pd.DataFrame(rows)
listings_df["model"] = listings_df.model.apply(filter_model)
listings_df
listings_df.to_csv("car-listings.csv", index=None)
This code defines two Pydantic models, CarListing and CarListings, for representing individual car details (e.g., make, model, price, mileage) and a list of such cars. It uses an LLM to extract structured car listing data from a webpage. The extracted data is converted into a pandas DataFrame and saved as a CSV file (car-listings.csv) for further use.
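With the listings in a DataFrame, ordinary pandas operations apply. As an illustrative example (not part of the original workflow), you could sort by price to inspect the cheapest matches:

cheapest = listings_df.sort_values("price").head(5)
print(cheapest[["make", "model", "horsepower", "price"]])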
Combining the power of LLMs with traditional scraping tools like Playwright unlocks new possibilities for structured data extraction. Whether extracting project details or car listings, this approach ensures accuracy and scalability, paving the way for advanced data workflows. Happy scraping!
Q1. What role does Playwright play in this workflow?
A. Playwright automates browser interactions, enabling you to load and retrieve dynamic content from webpages efficiently.
Q2. How does ChatGroq improve web scraping?
A. ChatGroq gives access to a large language model that processes webpage content and extracts structured information accurately, improving data extraction precision.
Q3. Why convert HTML to Markdown before extraction?
A. Converting HTML to Markdown simplifies text processing by stripping unnecessary tags, making it easier to extract relevant information.
Q4. Why use structured data models such as Pydantic?
A. Structured data models ensure the extracted data is organized and validated, making it easier to process and analyze.
Q5. Can this approach scale to many pages?
A. Yes, by combining Playwright for automation and LLMs for data extraction, this method can handle large datasets efficiently, ensuring scalability.