Empowering Real-Time Insights with Website Monitoring Using Python

Introduction

The purpose of this project is to develop a Python program that automates the process of monitoring and tracking changes across multiple websites. We aim to streamline the meticulous task of detecting and documenting modifications in web-based content by utilizing Python. This capability is invaluable for real-time news tracking, immediate product updates, and conducting competitive analyses. As the digital landscape evolves rapidly, identifying website changes is essential for maintaining continuous awareness and comprehension.

Learning Objectives

Our learning objectives for this project will cover the following components:

  1. Enhance knowledge of web scraping methods using Python libraries like BeautifulSoup and Scrapy. We aim to extract valuable data from websites proficiently, navigate HTML structures, identify specific elements, and handle diverse content types.
  2. Improve skills in identifying subtle changes in the website content. We aspire to learn techniques for comparing newly scraped data with existing references to detect insertions, removals, or modifications. We also aim to handle the various data formats and structures encountered during these comparisons.
  3. Leverage Python’s automation capabilities to track website updates. We plan to employ scheduling mechanisms such as cron jobs or Python’s scheduling libraries to automate data gathering and eliminate repetitive tasks; a minimal scheduling sketch follows this list.
  4. Develop a comprehensive understanding of HTML’s architecture. We aim to navigate HTML documents proficiently, identify crucial elements during data extraction, and effectively manage changes in website layouts and structures.
  5. Enhance text processing skills by exploring data manipulation techniques. We will learn to clean and refine extracted data, address data encoding complexities, and manipulate data for insightful analysis and versatile reporting.
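
Objective 3 above mentions scheduling; as a hedged illustration, here is a minimal sketch using the third-party schedule library (installed separately with pip). The function name run_monitoring and the daily run time are placeholders rather than part of the final tool.

import time
import schedule  # third-party scheduling library: pip install schedule

def run_monitoring():
    # Placeholder for the scraping and comparison steps developed later in this article
    print("Checking monitored websites...")

# Run the job once a day at 09:00 (the time of day is an arbitrary example)
schedule.every().day.at("09:00").do(run_monitoring)

while True:
    schedule.run_pending()
    time.sleep(60)  # poll the scheduler once a minute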

This article was published as a part of the Data Science Blogathon.

Project Description

In this project, we aim to devise a Python application that oversees and catalogs alterations on selected websites. This application will incorporate the following:

  1. Website Checks: Consistent evaluations of the assigned website to spot updates in particular content or sections.
  2. Data Retrieval: Using web scraping methods to draw out required details from the website, such as text, graphics, or pertinent data.
  3. Change Identification: Contrasting freshly scraped data with earlier stored data to spot differences or amendments.
  4. Notification Mechanism: Implementing an alert mechanism to keep the user informed when changes are noticed.
  5. Logging: Keeping a detailed record of modifications over time, with timestamps and information about the changes.

This application can be tailored to monitor any given website and particular content based on user preferences. The anticipated results include immediate alerts about website alterations and comprehensive change records for understanding the nature and timing of changes.

Problem Statement

The main aim of this project is to streamline the process of keeping tabs on specific websites. By crafting a Python application, we plan to track and catalog changes on a website of interest. This tool will offer timely updates about recent modifications in news articles, product listings, and other web-based content. Automating this tracking process will be time-saving and ensure immediate awareness about any modifications or additions made to the website.

Approach

To implement this project successfully, we will follow a high-level approach that involves the following steps:

  1. Our project will use Python’s powerful libraries like BeautifulSoup or Scrapy. These libraries make it easy to gather information from websites and sift through HTML content.
  2. We’ll pull information from the website to create a baseline at the outset. This benchmark data will help us identify any changes later on.
  3. We can match incoming data with a set benchmark to track any new additions or changes. Our techniques may involve comparing text or analyzing differences in HTML structures.
  4. We’ll keep track of our project’s runs through log files. These logs will have useful details like run times, websites tracked, and changes found. They will help us keep track of updates and find patterns.
  5. As part of the system, a notification feature will be integrated. If a change is detected, alerts will be sent via email, SMS, or other methods, keeping users updated in real-time.

Scenario

Imagine a company that gathers information about kids’ activities from numerous websites and consolidates it on its own website. Manually tracking changes on each source website and updating their own platform accordingly poses significant challenges. This is where our specialized tool comes in, providing an efficient solution for overcoming these obstacles.

Examples of Monitored Websites:

We monitor various websites to curate information on kids’ activities. Here are a few examples:

Super Duper Tennis

This organization offers engaging programs such as lessons, camps, and parties to introduce children aged 2 to 7 to the world of tennis. Their focus is teaching tennis fundamentals, promoting fitness and coordination, and fostering good sportsmanship.

Next Step Broadway

This performing arts school in Jersey City provides high-quality dance, voice, and acting classes. They cater to students of all skill levels, nurturing their creativity and confidence in a supportive and inspiring environment.

The School of Nimbus

Renowned in Jersey City, this institution offers dance education for all ages and skill levels. With a diverse range of dance genres and the organization of performances and community outreach programs, they contribute to the local arts scene and foster an appreciation for dance.

We have developed a Python-based solution that utilizes web scraping techniques to automate the process. Our tool periodically monitors the selected websites to detect changes in the information about kids’ activities. Once a change is identified, the tool seamlessly updates the company’s website, consistently reflecting the most up-to-date information.

In addition to updating the website, our tool maintains a detailed log of changes, providing valuable data for analysis and reference purposes. It can also be configured to send real-time notifications, keeping the company’s team informed of any detected changes. Utilizing our tool allows the company to streamline its operations, ensuring its website always showcases the latest information from multiple sources.

Warning on Web Scraping

It is important to note that web scraping activities may have legal and ethical implications. Before engaging in any scraping activities, it is essential to verify whether the target websites permit scraping or obtain necessary permissions from the website owners. Adhering to the websites’ terms of service and respecting their policies is crucial. Additionally, it is important to be mindful of the frequency of requests and avoid practices that may disrupt the website’s operations. Always approach web scraping carefully and follow best practices to ensure a positive and compliant experience.
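
One practical way to act on this warning is to consult a site’s robots.txt file before scraping it. The sketch below uses Python’s standard urllib.robotparser; the user-agent value is an illustrative assumption, and the URL is simply the example site used later in this article.

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def is_scraping_allowed(page_url, user_agent='*'):
    # Build the robots.txt URL for the site and ask whether this agent may fetch the page
    robots_url = urljoin(page_url, '/robots.txt')
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, page_url)

# Usage:
print(is_scraping_allowed('https://www.superdupertennis.com/'))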

We begin by extracting the page links from the target website’s home page. In the following code snippet, we use Python’s BeautifulSoup library to extract the page links from the home page of Super Duper Tennis:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

# URL of the home page
url = 'https://www.superdupertennis.com/'

# Retrieve the HTML content via a GET request
response = requests.get(url)
html_content = response.text

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the page links from anchor tags (<a>)
links = soup.find_all('a')

# Create a list to store the data
data = []
for link in links:
    # Remove leading/trailing whitespace from the page name
    page_name = link.text.strip()
    # Use get() to retrieve the 'href' attribute safely
    web_link = link.get('href')
    # Only keep anchors that actually have an 'href' attribute
    if web_link:
        # Construct the complete web link using urljoin
        complete_link = urljoin(url, web_link)
        data.append({
            'Service Provider': 'Super Duper Tennis',
            'Page Name': page_name,
            'Complete Web Link': complete_link
        })

# Create a pandas DataFrame from the data
df = pd.DataFrame(data)

  • We import the essential tools: the requests, BeautifulSoup, and pandas libraries. We also pick the website we want to explore, for example ‘https://www.superdupertennis.com/’.
  • A GET request is sent to this URL through the requests library, and the resulting HTML content of the home page is saved in ‘html_content’.
  • BeautifulSoup parses the HTML content and identifies ‘<a>’ tags, which usually contain the links we’re interested in.
  • Each ‘<a>’ tag is processed to pull out the page name and the ‘href’ value, the actual link. We also tidy up the extracted data by removing extra whitespace.
  • With the help of the urljoin() function from urllib.parse, we combine the base URL and each relative link into a full URL.
  • All the cleaned and prepared data is then put into a ‘data’ list of dictionaries, each holding the service provider name, the page name, and the full URL.
  • Finally, we convert the ‘data’ list into a pandas DataFrame with three columns: ‘Service Provider’, ‘Page Name’, and ‘Complete Web Link’.

Initial Data Capture

To establish a baseline for future comparisons, perform an initial data capture by scraping the desired
content from the website and storing it in a data structure such as a database or file. Here is the continuation of the previous code:

import time
from datetime import date

import requests
from bs4 import BeautifulSoup

for a, b, c in zip(df['Service Provider'].to_list(),
                   df['Page Name'].to_list(),
                   df['Complete Web Link'].to_list()):
    url = c
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/39.0.2171.95 Safari/537.36'}

    # Pause between requests to avoid overloading the server
    time.sleep(60)

    # Download the page
    response = requests.get(url, headers=headers)
    # Parse the downloaded page
    soup = BeautifulSoup(response.text, "lxml")

    # Discard scripts and styles
    for script in soup(["script", "style"]):
        script.extract()
    page_text = soup.get_text()

    current_ver = page_text.splitlines()
    # Save today's snapshot as PATH\<provider>\<page>\<provider>_<page>_<date>.txt
    file_name = r"PATH\{}\{}\{}_{}_{}.txt".format(
        a, b, a, b, date.today().strftime('%Y-%m-%d'))
    with open(file_name, "a", encoding="utf-8") as file:
        file.write("\n".join(current_ver))

Using the pandas DataFrame, we iterate over the rows to access the service provider, page name, and web link. These variables are assigned to a, b, and c, respectively.

We set the url variable within the iteration to the web link (c). Additionally, we define the headers variable for the requests library.

To introduce a reasonable delay between requests, we use the time.sleep() function to pause for 60 seconds.

Next, we send a GET request to download the page’s content using the specified URL and headers. The response is stored in the response variable.

Using BeautifulSoup with the “lxml” parser, we parse the downloaded homepage and extract the text
content. Scripts and styles are removed from the parsed content.

The extracted text is split into lines and assigned to the current_ver variable.

Finally, we open a file in append mode and write the current_ver text content to it. The file name is constructed from the service provider, page name, and current date. This captured data will serve as the baseline for future comparisons in the website monitoring project.
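
The snippet above builds the file name from a “PATH” placeholder, the service provider, the page name, and today’s date. As a hedged illustration of the same naming scheme, here is one way to construct that path with pathlib; BASE_DIR is an assumed placeholder for the real storage location, and the helper also creates the folder if it does not exist yet.

from datetime import date
from pathlib import Path

BASE_DIR = Path("PATH")  # placeholder, same as in the snippet above

def baseline_file(provider, page):
    # PATH/<provider>/<page>/<provider>_<page>_<YYYY-MM-DD>.txt
    folder = BASE_DIR / provider / page
    folder.mkdir(parents=True, exist_ok=True)
    return folder / "{}_{}_{}.txt".format(provider, page,
                                          date.today().strftime('%Y-%m-%d'))

# Usage (hypothetical page name):
print(baseline_file("Super Duper Tennis", "Home"))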

Comparison and Change Detection

During subsequent executions, we retrieve the updated webpage content and contrast it with our stored baseline data to identify any deviations or alterations. Here’s a continuation of the previous script:

import glob
import os
import difflib
from datetime import date

import pandas as pd

change_logs = pd.DataFrame()

for provider, page, link in zip(df['Service Provider'].to_list(),
                                df['Page Name'].to_list(),
                                df['Complete Web Link'].to_list()):
    # Collect all saved snapshots for this provider/page, newest first
    files = glob.glob(r"PATH\{}\{}\*.txt".format(provider, page))
    files_sorted = sorted(files, key=os.path.getctime, reverse=True)
    current_content = open(files_sorted[0], 'r', encoding="utf-8").readlines()
    prior_content = open(files_sorted[1], 'r', encoding="utf-8").readlines()

    # Compare the newest snapshot with the previous one, line by line
    comparison = difflib.context_diff(current_content, prior_content,
                                      n=3, lineterm='\n')
    compared_text = "\n".join([line.rstrip() for line
                               in '\n'.join(comparison).splitlines() if line.strip()])

    # Date of the prior snapshot, taken from its file name
    prior_date = files_sorted[1].split('_')[2].split('.')[0]
    if compared_text == '':
        change_description = ('No alterations detected on '
                              + date.today().strftime('%Y-%m-%d')
                              + ' compared to ' + prior_date)
    elif "We couldn't find the page you were looking for" in compared_text:
        change_description = ('URL modified on '
                              + date.today().strftime('%Y-%m-%d')
                              + ' compared to ' + prior_date)
    else:
        change_description = ('Alterations detected on '
                              + date.today().strftime('%Y-%m-%d')
                              + ' compared to ' + prior_date)

    # Append this page's result to the running change log
    temp_log = pd.DataFrame({'Service Provider': pd.Series(provider),
                             'Section': pd.Series(page),
                             'Changes': pd.Series(change_description),
                             'Link': pd.Series(link)})
    change_logs = pd.concat([change_logs, temp_log], ignore_index=True)

We create an empty DataFrame called change_logs to store the details of any identified changes. Using the pandas DataFrame, we iterate over the rows to fetch the service provider, page name, and webpage link. Denote these as provider, page, and link.

Inside the loop, we gather a collection of files matching the pattern of the previously saved files. This collection is sorted according to the file creation time, with the most recent file coming first.

We then read the content of the current and prior files for comparison. The difflib.context_diff() function performs the comparison, storing the result in the comparison variable.

Depending on the content of compared_text, we can determine if there are any changes or if specific messages indicate the page is missing or the URL has changed.

Subsequently, we construct the change_description variable, noting the date and the reference date of the prior file for comparison. Using the retrieved data, we generate a temporary DataFrame, temp_log, which includes the service provider, page name, change description, and webpage link.

Finally, we add temp_log to the change_logs DataFrame, which gathers the details of all detected changes.
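
Once the loop finishes, the change log can be filtered and saved so that reviewers only see the pages where something actually happened. This is a minimal sketch that assumes the change_logs DataFrame built above; the output file name is an arbitrary choice.

from datetime import date

# Keep only the rows whose description starts with 'Alterations detected'
detected = change_logs[change_logs['Changes'].str.startswith('Alterations detected')]
print(detected[['Service Provider', 'Section', 'Link']])

# Persist the full log for later analysis and reporting
change_logs.to_csv('change_logs_{}.csv'.format(date.today().strftime('%Y-%m-%d')),
                   index=False)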

Notification Mechanism

Employ a notification mechanism to alert the user upon detecting changes. You can use Python libraries or external APIs for notification delivery. First, import the necessary libraries for sending notifications, and depending on the method you choose, you may need to install additional libraries or APIs.

For email notifications, we will utilize the smtplib library to send emails via an SMTP server. Ensure you provide your email credentials and SMTP server details.

Below is a code snippet showcasing an email notification:

import smtplib

def send_email_notification(subject, message, recipient):
    sender = 'your-email@example.com'
    password = 'your-email-password'
    smtp_server = 'smtp.example.com'
    smtp_port = 587

    email_body = f'Subject: {subject}\n\n{message}'
    with smtplib.SMTP(smtp_server, smtp_port) as server:
        server.starttls()
        server.login(sender, password)
        server.sendmail(sender, recipient, email_body)

# Usage:
subject = 'Website Change Notification'
message = 'Changes have been detected on the website. Please review.'
recipient = 'recipient@example.com'
send_email_notification(subject, message, recipient)

For SMS notifications, you can integrate external APIs like Twilio or Nexmo. These APIs enable programmatic SMS message sending. Register for an account, acquire the necessary API credentials, and install the respective Python libraries.

Below is an example code snippet demonstrating SMS notification using the Twilio API:

from twilio.rest import Client

def send_sms_notification(message, recipient):
    account_sid = 'your-account-sid'
    auth_token = 'your-auth-token'
    twilio_number = 'your-twilio-phone-number'

    client = Client(account_sid, auth_token)
    message = client.messages.create(
        body=message,
        from_=twilio_number,
        to=recipient
    )

# Usage:
message = 'Changes have been detected on the website. Please review.'
recipient = '+1234567890'
send_sms_notification(message, recipient)

Logging and Reporting

As the organization would run this script frequently to track changes, it’s essential to maintain a record of the output of each run. Logging each execution, including the time, duration, and changes detected, can facilitate this process. We can use this data to generate summary reports that show trends over time and aid in understanding the frequency and nature of website updates.
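
Because the paragraph above mentions recording the duration of each run, here is a hedged sketch of one way to capture it; the basicConfig call simply mirrors the logging configuration described next, and the commented placeholder stands in for the scraping and comparison steps.

import logging
import time

logging.basicConfig(filename='website_monitoring.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

run_start = time.perf_counter()
# ... run the scraping and comparison steps from the previous sections here ...
run_duration = time.perf_counter() - run_start

logging.info('Monitoring run completed in %.1f seconds.', run_duration)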

We initiate the process by importing the suitable libraries, which include the logging library in Python. Simultaneously, we must set up the logging level and the log file format.

import logging

# Configure logging settings
logging.basicConfig(filename='website_monitoring.log', 
level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Log an informational message
logging.info('Website monitoring started.')

# Log the detected changes
logging.info('Changes detected on {date}: {details}'
.format(date='2023-06-15', details='Updated content on homepage.'))

# Log an error message
logging.error('Error occurred while retrieving website content.')

To generate a report from the logging data, we can employ libraries like matplotlib or seaborn to create visualizations summarizing the changes over time. The choice of reports and visualizations will depend on the tracked changes.

Here’s an example code snippet for generating a simple line plot to illustrate the change frequency over time:

import matplotlib.pyplot as plt
import pandas as pd

# Read the log file into a pandas DataFrame
log_data = pd.read_csv('website_monitoring.log', delimiter=' - ', engine='python',
                       header=None, names=['Timestamp', 'Level', 'Message'])

# Convert the Timestamp column to datetime format
# (the default asctime format ends in ",milliseconds", hence the ",%f")
log_data['Timestamp'] = pd.to_datetime(log_data['Timestamp'],
                                       format='%Y-%m-%d %H:%M:%S,%f')

# Group the data by date and count the number of changes logged per day
changes_per_day = (log_data[log_data['Level'] == 'INFO']
                   .groupby(log_data['Timestamp'].dt.date).size())

# Plot the changes over time
plt.plot(changes_per_day.index, changes_per_day.values)
plt.xlabel('Date')
plt.ylabel('Number of Changes')
plt.title('Website Content Changes Over Time')
plt.xticks(rotation=45)
plt.show()

Limitations

Several challenges may arise during the project implementation, requiring careful consideration. These
limitations include website structure changes, legal or ethical constraints, and errors in the web scraping or data comparison processes.

Website Structure Changes: Dynamic websites often undergo modifications, impacting the web scraping process. Adapting the scraping code to accommodate these changes becomes necessary. Regularly monitoring and updating the scraping code can ensure compatibility with evolving website structures.

Legal and Ethical Constraints: Following legal and ethical guidelines is crucial for web scraping. Websites may have terms of service that prohibit scraping or impose restrictions on data collection. Respecting these terms and using the scraped data responsibly is essential to ensure compliance.

Errors in Web Scraping and Data Comparison: Web scraping involves interaction with external websites, introducing the possibility of errors. Connection failures, timeouts, or server issues may occur during scraping. Employing robust error-handling mechanisms to gracefully handle such situations is vital. Additionally, ensuring the accuracy of data comparison processes and accounting for potential
errors like false positives or negatives are crucial for reliable results.
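
To make the scraping step resilient to the connection failures and timeouts mentioned above, a simple retry wrapper around requests.get can help. This is a minimal sketch under assumed retry counts and timeouts, not part of the original tool.

import time
import requests

def fetch_with_retries(url, headers=None, retries=3, timeout=30):
    # Try the request a few times, backing off a little longer after each failure
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as error:
            print(f"Attempt {attempt} failed for {url}: {error}")
            time.sleep(5 * attempt)  # simple linear back-off
    return None

# Usage:
response = fetch_with_retries('https://www.superdupertennis.com/')
if response is None:
    print('Giving up on this page for now.')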

Permissions and Website Policies: Verifying whether the target website permits scraping, or obtaining the necessary permissions from the website owner, is essential before initiating web scraping. Complying with the website’s robots.txt file, respecting their terms of service, and being mindful of the frequency of requests are important considerations to avoid policy violations.

Conclusion

In conclusion, this project has produced a powerful Python tool for tracking website updates through web scraping. We have developed a tool with essential features such as web scraping, data comparison, notifications, logging, and reporting.

Throughout this project, we have deepened our understanding of HTML, honed our skills in text processing, and mastered the art of data manipulation. Leveraging the capabilities of BeautifulSoup and requests, we have become proficient in web scraping and automated tasks using Python. Furthermore, we have developed a robust error-handling mechanism and acquired data analysis and reporting expertise.

Our tool is a reliable solution for tracking changes in news articles, product listings, and other web content. Automating the process eliminates the need for manual updates, ensuring that the information remains current and accurate.

Throughout this journey, we have gained valuable knowledge and skills, including:

  1. Web scraping techniques using BeautifulSoup and requests.
  2. Effective extraction of valuable information from HTML structures.
  3. Automation of tasks to streamline processes.
  4. Robust error handling in web scraping procedures.
  5. Advanced data analysis and comparison to identify changes.
  6. Creation of comprehensive notifications, logs, and reports for efficient tracking and insightful analysis.

Frequently Asked Questions

Q1: What is web scraping, and how does it work?

A. Web scraping is a process that automatically pulls out data from websites. It works by reviewing the website’s HTML code, looking at its pages, and collecting necessary information. Tools like BeautifulSoup and requests in Python make this process easier.

Q2: How does web scraping aid in monitoring websites?

A. Web scraping helps keep tabs on changes on websites. It regularly pulls data from sites and compares it with past data to spot any updates or changes. This is particularly handy for tracking updates on news, product details, or even competitor data.

Q3. What legal issues should one keep in mind while web scraping?

A. It’s crucial to abide by the law when scraping websites. Websites have terms of service that explain their rules about scraping. Always make sure you understand and follow these rules. Also, use the data you collect responsibly and avoid gathering sensitive or personal information.

Q4. How can we manage errors that come up during web scraping?

A. Sometimes, problems like changes in a website’s design or internet issues can lead to errors in web scraping. Good error-handling techniques can help address these problems. This includes using try-except blocks to handle errors, retry mechanisms for connection issues, and log files to track recurring issues.

Q5. What are the recommended practices for using Python for web scraping and monitoring?

A. To scrape and monitor websites successfully, respect the website’s rules, use headers and user-agent strings that mimic real browsing behavior, and avoid sending too many requests in a short period. You should also update your scraping code regularly to keep up with changes to the website.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I am a data scientist with a passion for innovative problem-solving. With years of experience in the industry, I work with one of the world's leading cloud-based software companies, where I apply my expertise in data analysis, modelling, and visualization to drive meaningful insights and help businesses make data-driven decisions. Through my work, I have gained a deep understanding of the latest technologies and tools in the field of data science and machine learning, and I am always eager to learn more and stay up-to-date with the latest trends and developments. In my free time, I enjoy exploring new coding techniques, experimenting with new data sets, and sharing my knowledge and experience with other data scientists and enthusiasts.
