The purpose of this project is to develop a Python program that automates the process of monitoring and tracking changes across multiple websites. We aim to streamline the meticulous task of detecting and documenting modifications in web-based content by utilizing Python. This capability is invaluable for real-time news tracking, immediate product updates, and conducting competitive analyses. As the digital landscape evolves rapidly, identifying website changes is essential for maintaining continuous awareness and comprehension.
Learning Objectives
Our learning objectives for this project cover the following components: scraping web pages with the requests and BeautifulSoup libraries, capturing baseline snapshots of page content, detecting changes through data comparison, sending email and SMS notifications, and logging and reporting the results.
We aim to devise a Python application in this project to oversee and catalog alterations on select websites. This application will incorporate web scraping, data comparison, change notifications, logging, and reporting.
The main aim of this project is to streamline the process of keeping tabs on specific websites. By crafting a Python application, we plan to track and catalog changes on a website of interest. This tool will offer timely updates about recent modifications in news articles, product listings, and other web-based content. Automating this tracking process will be time-saving and ensure immediate awareness about any modifications or additions made to the website.
To implement this project successfully, we will follow a high-level approach that involves the following steps: extract the page links from each target website, capture a baseline snapshot of the page content, compare subsequent snapshots against the baseline to detect changes, notify the team when changes are found, and log every run for reporting. A minimal sketch of how these steps fit together follows.
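This sketch is illustrative only: the helper functions named here (extract_links, capture_snapshot, compare_with_baseline, notify, log_run) are hypothetical placeholders standing in for the routines we build step by step in the rest of this article.
# Hypothetical driver sketch: each helper is a stub standing in for a routine
# developed later in this article (link extraction, baseline capture,
# comparison, notification, logging).

def extract_links(home_url):
    return []          # later: scrape <a> tags from the home page

def capture_snapshot(page_url):
    return ''          # later: download and save the page text

def compare_with_baseline(page_url, snapshot):
    return None        # later: diff the snapshot against the stored baseline

def notify(page_url, changes):
    pass               # later: send an email or SMS alert

def log_run(page_url, changes):
    pass               # later: record the outcome for reporting

def monitor_site(home_url):
    for page in extract_links(home_url):
        snapshot = capture_snapshot(page)
        changes = compare_with_baseline(page, snapshot)
        if changes:
            notify(page, changes)
        log_run(page, changes)

monitor_site('https://www.superdupertennis.com/')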
Imagine a company that gathers information about kids’ activities from numerous websites and consolidates them onto their own website. However, manually tracking changes on each website and updating their own platform accordingly poses significant challenges. This is where our specialized tool comes in to save the day, providing an efficient solution for overcoming these obstacles.
Examples of Monitored Websites:
We monitor various websites to curate information on kids’ activities. Here are a few examples:
This organization offers engaging programs such as lessons, camps, and parties to introduce children aged 2 to 7 to the world of tennis. Their focus is teaching tennis fundamentals, promoting fitness and coordination, and fostering good sportsmanship.
This performing arts school in Jersey City provides high-quality dance, voice, and acting classes. They cater to students of all skill levels, nurturing their creativity and confidence in a supportive and inspiring environment.
Renowned in Jersey City, this institution offers dance education for all ages and skill levels. With a diverse range of dance genres and the organization of performances and community outreach programs, they contribute to the local arts scene and foster an appreciation for dance.
We have developed a Python-based solution that utilizes web scraping techniques to automate the process. Our tool periodically monitors the selected websites to detect changes in the information about kids’ activities. Once a change is identified, the tool seamlessly updates the company’s website, consistently reflecting the most up-to-date information.
In addition to updating the website, our tool maintains a detailed log of changes, providing valuable data for analysis and reference purposes. It can also be configured to send real-time notifications, keeping the company’s team informed of any detected changes. Utilizing our tool allows the company to streamline its operations, ensuring its website always showcases the latest information from multiple sources.
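How often the tool runs is a deployment choice rather than part of the scraping logic. Below is a minimal sketch, assuming a simple in-process loop; the run_monitoring_cycle stub and the six-hour interval are illustrative placeholders, and a cron job or task scheduler would serve equally well.
import time

def run_monitoring_cycle():
    # Placeholder for one full scrape-compare-notify pass; the real logic
    # is developed in the sections that follow.
    print("Checking monitored sites for changes...")

# Illustrative interval: one cycle every 6 hours. Tune it to the sites'
# update frequency and keep the request rate polite.
CHECK_INTERVAL_SECONDS = 6 * 60 * 60

while True:
    run_monitoring_cycle()
    time.sleep(CHECK_INTERVAL_SECONDS)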
It is important to note that web scraping activities may have legal and ethical implications. Before engaging in any scraping activities, it is essential to verify whether the target websites permit scraping or obtain necessary permissions from the website owners. Adhering to the websites’ terms of service and respecting their policies is crucial. Additionally, it is important to be mindful of the frequency of requests and avoid practices that may disrupt the website’s operations. Always approach web scraping carefully and follow best practices to ensure a positive and compliant experience.
The first step is to extract the page links from the target website’s home page. Here’s a code snippet that uses Python’s BeautifulSoup library to extract the page links from the home page of Super Duper Tennis:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

# URL of the home page
url = 'https://www.superdupertennis.com/'

# Retrieve the HTML content via a GET request
response = requests.get(url)
html_content = response.text

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the page links from anchor tags (<a>)
links = soup.find_all('a')

# Create a list to store the data
data = []
for link in links:
    # Remove leading/trailing whitespace from the page name
    page_name = link.text.strip()
    # Use get() to retrieve the 'href' attribute safely
    web_link = link.get('href')
    # Only keep anchors that actually have an 'href' attribute
    if web_link:
        # Construct the absolute web link with urljoin
        complete_link = urljoin(url, web_link)
        data.append({
            'Service Provider': 'Super Duper Tennis',
            'Page Name': page_name,
            'Web Link': complete_link
        })

# Create a pandas DataFrame from the collected data
df = pd.DataFrame(data)
To establish a baseline for future comparisons, perform an initial data capture by scraping the desired content from the website and storing it in a data structure such as a database or file. Here is the continuation of the previous code:
import time
from datetime import date

# Iterate over the provider, page name, and web link for each row
for a, b, c in zip(df['Service Provider'].to_list(),
                   df['Page Name'].to_list(),
                   df['Web Link'].to_list()):
    url = c
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/39.0.2171.95 Safari/537.36'}

    # Pause between requests to avoid overloading the site
    time.sleep(60)

    # Download the page
    response = requests.get(url, headers=headers)

    # Parse the downloaded page
    soup = BeautifulSoup(response.text, "lxml")

    # Discard scripts and styles
    for script in soup(["script", "style"]):
        script.extract()

    # Keep only the visible text, split into lines
    current_ver = soup.get_text().splitlines()

    # Save today's snapshot to a text file named after the provider, page, and date
    file_name = r"PATH\{}\{}\{}_{}_{}.txt".format(
        a, b, a, b, date.today().strftime('%Y-%m-%d'))
    with open(file_name, "a", encoding="utf-8") as file:
        file.write("\n".join(current_ver))
Using the pandas DataFrame, we iterate over the rows to access the service provider, page name, and web link. These variables are assigned to a, b, and c, respectively.
We set the url variable within the iteration to the web link (c). Additionally, we define the headers variable for the requests library.
To introduce a reasonable delay between requests, we use the time.sleep() function to pause for 60 seconds.
Next, we send a GET request to download the page’s content using the specified URL and headers. The response is stored in the response variable.
Using BeautifulSoup with the "lxml" parser, we parse the downloaded homepage and extract the text content. Scripts and styles are removed from the parsed content.
The extracted text is split into lines and assigned to the current_ver variable.
Finally, we open a file in append mode and write the current_ver text content to it. The file name is constructed from the service provider, page name, and current date. This captured data will serve as the baseline for future comparisons in the website monitoring project.
During subsequent executions, we retrieve the updated webpage content and contrast it with our stored baseline data to identify any deviations or alterations. Here’s a continuation of the previous script:
import glob
import os
import difflib

# DataFrame that will collect one row per monitored page
change_logs = pd.DataFrame()

for provider, page, link in zip(df['Service Provider'].to_list(),
                                df['Page Name'].to_list(),
                                df['Web Link'].to_list()):
    # Gather all saved snapshots for this provider/page,
    # newest first (sorted by file creation time)
    files = glob.glob(r"PATH\{}\{}\*.txt".format(provider, page))
    files_sorted = sorted(files, key=os.path.getctime, reverse=True)

    # Read the most recent snapshot and the one before it
    with open(files_sorted[0], 'r', encoding="utf-8") as f:
        current_content = f.readlines()
    with open(files_sorted[1], 'r', encoding="utf-8") as f:
        prior_content = f.readlines()

    # Compare the two snapshots line by line
    comparison = difflib.context_diff(current_content, prior_content,
                                      n=3, lineterm='\n')
    compared_text = "\n".join([line.rstrip() for line in
                               '\n'.join(comparison).splitlines() if line.strip()])

    # Date of the prior snapshot, taken from its file name
    prior_date = files_sorted[1].split('_')[2].split('.')[0]
    today = date.today().strftime('%Y-%m-%d')

    if compared_text == '':
        change_description = ('No alterations detected on ' + today +
                              ' compared to ' + prior_date)
    elif "We couldn't find the page you were looking for" in compared_text:
        change_description = ('URL modified on ' + today +
                              ' compared to ' + prior_date)
    else:
        change_description = ('Alterations detected on ' + today +
                              ' compared to ' + prior_date)

    # Append this page's result to the change log
    temp_log = pd.DataFrame({'Service Provider': pd.Series(provider),
                             'Section': pd.Series(page),
                             'Changes': pd.Series(change_description),
                             'Link': pd.Series(link)})
    change_logs = pd.concat([change_logs, temp_log], ignore_index=True)
We create an empty DataFrame called change_logs to store the details of any identified changes. Using the pandas DataFrame, we iterate over the rows to fetch the service provider, page name, and webpage link. Denote these as provider, page, and link.
Inside the loop, we gather a collection of files matching the pattern of the previously saved files. This collection is sorted according to the file creation time, with the most recent file coming first.
We then read the content of the current and prior files for comparison. The difflib.context_diff() function performs the comparison, storing the result in the comparison variable.
Depending on the content of compared_text, we can determine if there are any changes or if specific messages indicate the page is missing or the URL has changed.
Subsequently, we construct the change_description variable, noting the date and the reference date of the prior file for comparison. Using the retrieved data, we generate a temporary DataFrame, temp_log, which includes the service provider, page name, change description, and webpage link.
Finally, we add temp_log to the change_logs DataFrame, which gathers the details of all detected changes.
Employ a notification mechanism to alert the user upon detecting changes. You can use Python libraries or external APIs for notification delivery. First, import the necessary libraries for sending notifications, and depending on the method you choose, you may need to install additional libraries or APIs.
For email notifications, we will utilize the smtplib library to send emails via an SMTP server. Ensure you provide your email credentials and SMTP server details.
Below is a code snippet showcasing an email notification:
import smtplib

def send_email_notification(subject, message, recipient):
    # Sender credentials and SMTP server details (replace with your own)
    sender = 'your-email@example.com'
    password = 'your-email-password'
    smtp_server = 'smtp.example.com'
    smtp_port = 587

    # Build a minimal email body with a subject header
    email_body = f'Subject: {subject}\n\n{message}'

    with smtplib.SMTP(smtp_server, smtp_port) as server:
        server.starttls()
        server.login(sender, password)
        server.sendmail(sender, recipient, email_body)

# Usage:
subject = 'Website Change Notification'
message = 'Changes have been detected on the website. Please review.'
recipient = 'recipient@example.com'
send_email_notification(subject, message, recipient)
For SMS notifications, you can integrate external APIs like Twilio or Nexmo. These APIs enable programmatic SMS message sending. Register for an account, acquire the necessary API credentials, and install the respective Python libraries.
Below is an example code snippet demonstrating SMS notification using the Twilio API:
from twilio.rest import Client

def send_sms_notification(message, recipient):
    # Twilio account credentials and sender number (replace with your own)
    account_sid = 'your-account-sid'
    auth_token = 'your-auth-token'
    twilio_number = 'your-twilio-phone-number'

    client = Client(account_sid, auth_token)
    message = client.messages.create(
        body=message,
        from_=twilio_number,
        to=recipient
    )

# Usage:
message = 'Changes have been detected on the website. Please review.'
recipient = '+1234567890'
send_sms_notification(message, recipient)
As the organization would run this script frequently to track changes, it’s essential to maintain a record of the output of each run. Logging each execution, including the time, duration, and changes detected, can facilitate this process. We can use this data to generate summary reports that show trends over time and aid in understanding the frequency and nature of website updates.
We start by importing Python’s built-in logging library and configuring the logging level and log file format.
import logging

# Configure logging settings
logging.basicConfig(filename='website_monitoring.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# Log an informational message
logging.info('Website monitoring started.')

# Log the detected changes
logging.info('Changes detected on {date}: {details}'.format(
    date='2023-06-15', details='Updated content on homepage.'))

# Log an error message
logging.error('Error occurred while retrieving website content.')
To generate a report from the logging data, we can employ libraries like matplotlib or seaborn to create visualizations summarizing the changes over time. The choice of reports and visualizations will depend on the tracked changes.
Here’s an example code snippet for generating a simple line plot to illustrate the change frequency over time:
import matplotlib.pyplot as plt
import pandas as pd

# Read the log file into a pandas DataFrame
# (the multi-character ' - ' separator requires the Python parser engine)
log_data = pd.read_csv('website_monitoring.log', sep=' - ', engine='python',
                       header=None, names=['Timestamp', 'Level', 'Message'])

# Convert the Timestamp column to datetime format
# (asctime includes milliseconds, e.g. '2023-06-15 10:30:45,123')
log_data['Timestamp'] = pd.to_datetime(log_data['Timestamp'],
                                       format='%Y-%m-%d %H:%M:%S,%f')

# Group the data by date and count the number of changes per day
info_rows = log_data[log_data['Level'] == 'INFO']
changes_per_day = info_rows.groupby(info_rows['Timestamp'].dt.date).size()

# Plot the changes over time
plt.plot(changes_per_day.index, changes_per_day.values)
plt.xlabel('Date')
plt.ylabel('Number of Changes')
plt.title('Website Content Changes Over Time')
plt.xticks(rotation=45)
plt.show()
Several challenges may arise during the project implementation, requiring careful consideration. These limitations include website structure changes, legal or ethical constraints, and errors in the web scraping or data comparison processes.
Website Structure Changes: Dynamic websites often undergo modifications, impacting the web scraping process. Adapting the scraping code to accommodate these changes becomes necessary. Regularly monitoring and updating the scraping code can ensure compatibility with evolving website structures.
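One way to soften the impact of such structural changes is to try several selectors for the same piece of content and fall back gracefully when none match. The sketch below is illustrative only; the extract_activity_title function and the selectors it tries are hypothetical and would need to reflect the real markup of each monitored site.
from bs4 import BeautifulSoup

def extract_activity_title(html):
    # Hypothetical fallback chain: try the current selector first, then
    # older ones, so a markup change degrades gracefully instead of crashing.
    soup = BeautifulSoup(html, 'html.parser')
    candidate_selectors = ['h1.activity-title', 'div.program-name h2', 'h1']
    for selector in candidate_selectors:
        element = soup.select_one(selector)
        if element and element.get_text(strip=True):
            return element.get_text(strip=True)
    # Returning None signals that the scraper itself may need updating.
    return None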
Legal and Ethical Constraints: Following legal and ethical guidelines is crucial for web scraping. Websites may have terms of service that prohibit scraping or impose restrictions on data collection. Respecting these terms and using the scraped data responsibly is essential to ensure compliance.
Errors in Web Scraping and Data Comparison: Web scraping involves interaction with external websites, introducing the possibility of errors. Connection failures, timeouts, or server issues may occur during scraping. Employing robust error-handling mechanisms to gracefully handle such situations is vital. Additionally, ensuring the accuracy of data comparison processes and accounting for potential errors like false positives or negatives are crucial for reliable results.
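A minimal sketch of such error handling, assuming we simply retry failed downloads a few times with a pause in between, might look like this (the fetch_with_retries helper, retry count, and delay are illustrative choices):
import time
import requests

def fetch_with_retries(url, headers=None, retries=3, delay=30):
    # Retry transient failures (timeouts, connection errors, HTTP error codes)
    # a few times before giving up and reporting the problem.
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as error:
            print(f"Attempt {attempt} failed for {url}: {error}")
            if attempt < retries:
                time.sleep(delay)
    # The caller can treat None as "page could not be retrieved this run"
    return None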
Permissions and Website Policies: Verifying whether the target website permits scraping, or obtaining the necessary permissions from the website owner, is essential before initiating web scraping. Complying with the website’s robots.txt file, respecting their terms of service, and being mindful of the frequency of requests are important considerations to avoid policy violations.
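Python’s standard library ships urllib.robotparser for exactly this kind of check. Here is a minimal sketch, reusing the Super Duper Tennis URL from earlier, of consulting a site’s robots.txt before fetching a page (the 'programs' path is a hypothetical example):
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

site = 'https://www.superdupertennis.com/'
page = urljoin(site, 'programs')   # hypothetical page to check

parser = RobotFileParser()
parser.set_url(urljoin(site, 'robots.txt'))
parser.read()

# Only fetch the page if robots.txt allows our user agent to crawl it
if parser.can_fetch('*', page):
    print('Scraping allowed for', page)
else:
    print('robots.txt disallows scraping', page)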
In conclusion, this project has produced a Python tool for tracking website updates through web scraping, with essential features such as data capture, data comparison, notifications, logging, and reporting.
Throughout this project, we have deepened our understanding of HTML, honed our skills in text processing, and mastered the art of data manipulation. Leveraging the capabilities of BeautifulSoup and requests, we have become proficient in web scraping and automated tasks using Python. Furthermore, we have developed a robust error-handling mechanism and acquired data analysis and reporting expertise.
Our tool is a reliable solution for tracking changes in news articles, product listings, and other web content. Automating the process eliminates the need for manual updates, ensuring that the information remains current and accurate.
Throughout this journey, we have gained valuable knowledge and skills in web scraping with BeautifulSoup and requests, data comparison with difflib, notification delivery, logging, and reporting.
Frequently Asked Questions
Q. What is web scraping, and how does it work?
A. Web scraping is a process that automatically pulls data from websites. It works by retrieving the website’s HTML code, parsing its pages, and collecting the necessary information. Tools like BeautifulSoup and requests in Python make this process easier.
Q. How can web scraping help monitor website changes?
A. Web scraping helps keep tabs on changes on websites. It regularly pulls data from sites and compares it with past data to spot any updates or changes. This is particularly handy for tracking updates on news, product details, or even competitor data.
Q. Are there legal considerations when scraping websites?
A. It’s crucial to abide by the law when scraping websites. Websites have terms of service that explain their rules about scraping. Always make sure you understand and follow these rules. Also, use the data you collect responsibly and avoid gathering sensitive or personal information.
Q. How do you handle errors during web scraping?
A. Sometimes, problems like changes in a website’s design or internet issues can lead to errors in web scraping. Good error-handling techniques can help address these problems. This includes using try-except blocks to handle errors, retry mechanisms for connection issues, and log files to track recurring issues.
Q. What are the best practices for scraping and monitoring websites?
A. To scrape and monitor websites successfully, respect the website’s rules, use headers and user-agent strings that mimic real browsing behavior, and avoid sending too many requests in quick succession. You should also update your scraping code regularly to deal with changes in the website.