In today’s digital world, we have an ocean of information waiting to be explored online. From tracking the latest trends to understanding what makes a website tick, digging into this data can reveal all sorts of valuable insights. And that’s where web scraping comes in—a nifty technique that lets us gather data from websites automatically. Rather than picking an unknown website, I decided to analyze Analytics Vidhya’s blogathon page, since we are all familiar with it. Because the current leaderboard does not have much data to work with, I am using an older leaderboard page with more data points.
Web scraping involves extracting data from websites and converting unstructured information into structured datasets for analysis and visualization. Python offers several libraries, such as BeautifulSoup and Scrapy, which facilitate this process.
The target webpage (AV Blogathon Leaderboard) contains a leaderboard displaying user names and their corresponding views. The idea is to inspect the HTML structure of the webpage, identify the relevant elements, and extract the desired data using BeautifulSoup’s intuitive syntax.
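To give a feel for that syntax, here is a minimal, self-contained sketch on a made-up table; the column layout (name in the third cell, views in the last) mirrors what we will extract from the real page later in the article.

# A minimal sketch of BeautifulSoup's syntax on a hypothetical leaderboard table;
# the real page is inspected and scraped later in the article.
from bs4 import BeautifulSoup

html = """
<table class="table-responsive">
  <tr><th>Rank</th><th></th><th>Name</th><th>Views</th></tr>
  <tr><td>1</td><td></td><td>alice</td><td>1200</td></tr>
  <tr><td>2</td><td></td><td>bob</td><td>950</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if cells:  # skip the header row, which only has <th> cells
        print(cells[2].text, cells[-1].text)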
To achieve this, we will leverage Python’s Tkinter library to build a graphical user interface (GUI), Selenium to scrape the data, and Plotly to visualize the leaderboard results.
Let’s start, but don’t worry if you’re not a coding whiz just yet! We’ll break down the process step by step.
import re
import requests
import pandas as pd
import tkinter as tk
from PIL import Image
from bs4 import BeautifulSoup
from selenium import webdriver
import plotly.graph_objects as go
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
In the above code block, we import the libraries and modules required for various tasks within the script.
def scrape_leaderboard_requests():
    # URL of the Analytics Vidhya leaderboard
    url = "https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard"
    # Headers for the HTTP request
    headers = {
        'authority': 'datahack.analyticsvidhya.com',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'max-age=0',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    }
    # Sending an HTTP GET request to fetch the webpage
    response = requests.get(url, headers=headers, verify=False, timeout=80)
    # Checking if the request was successful (status code 200)
    if response.status_code == 200:
        # Parsing the HTML content of the webpage using BeautifulSoup
        soup = BeautifulSoup(response.content, 'lxml')
        # Finding the leaderboard table on the webpage
        table = soup.find('table', attrs={'class': 'table-responsive'})
        # Checking if the table exists
        if table:
            print(table)  # Print the table content (for debugging)
            # parse the table here to get names and views data
        else:
            print('no such element found')  # Print a message if the table is not found
    else:
        print('invalid status code')  # Print a message if the HTTP request fails
    return names, views  # the global lists defined in the full script below
The function above uses the Python requests module to fetch the page content but fails because the content is dynamically loaded using JavaScript. In such cases, we can use Selenium. With Selenium, we can automate web interactions such as clicking buttons, filling out forms, and scrolling through web pages, mimicking human behavior in the virtual realm.
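Before reaching for Selenium, it is worth confirming that diagnosis. A quick check along these lines (a sketch using the same URL and selector as the function above, with headers and error handling omitted for brevity) shows that the table is simply absent from the raw HTML:

# A quick way to confirm the leaderboard is rendered by JavaScript: fetch the
# raw HTML with requests and look for the table markup.
import requests
from bs4 import BeautifulSoup

url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard'
response = requests.get(url, timeout=30)
soup = BeautifulSoup(response.content, 'lxml')

# If this prints None, the table is not in the initial HTML and must be
# injected later by JavaScript -- which is why Selenium is needed here.
print(soup.find('table', attrs={'class': 'table-responsive'}))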
def get_data(driver, url):
    cur_names = []
    cur_views = []
    driver.get(url)
    driver.implicitly_wait(10)
    all_elements = driver.find_elements(By.CLASS_NAME, 'table-responsive')
    if all_elements:
        last_ele = all_elements[-1]
        leaderboard_table = last_ele.get_attribute('outerHTML')
        soup = BeautifulSoup(leaderboard_table, 'html.parser')
        rows = soup.find_all('tr')
        for row in rows:
            cells = row.find_all('td')
            if len(cells) >= 3:  # Ensure the row contains the required data
                cur_names.append(cells[2].text.strip())
                cur_views.append(int(cells[-1].text.strip()))
    return cur_names, cur_views
def scrape_leaderboard():
    print('fetching')
    update_message(msg="Fetching leaderboard results, please wait...")
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--log-level=3")
    chrome_driver_path = "path to chromedriver executable file"
    service = Service(chrome_driver_path)
    driver = webdriver.Chrome(service=service, options=chrome_options)
    url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard'
    cur_names, cur_views = get_data(driver, url)
    names.extend(cur_names)
    views.extend(cur_views)
    last_page = None
    pagination_ele = driver.find_element(By.CLASS_NAME, 'page-link')
    if pagination_ele:
        pagination_ele = pagination_ele.get_attribute('outerHTML')
        last_page = re.search(r'Page\s+\d+\s+of\s+(\d+)', pagination_ele)
        if last_page:
            last_page = int(last_page.group(1))
    if last_page:
        for i in range(2, last_page + 1):
            url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/lb/%s/' % i
            cur_names, cur_views = get_data(driver, url)
            names.extend(cur_names)
            views.extend(cur_views)
    driver.quit()
    return names, views
The scrape_leaderboard() function coordinates the scraping process. It initializes a headless Chrome browser using WebDriver, then calls the get_data() function to fetch data from the main leaderboard page and subsequent pages if pagination exists. The script appends the extracted names and views to global lists (names and views), ensuring comprehensive data collection.
The get_data() function is responsible for scraping user names and views from a specified URL. It utilizes Selenium to navigate the webpage and extract data from the leaderboard table using BeautifulSoup.
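A small illustration of the pagination check: the regex used in scrape_leaderboard() pulls the total page count out of text such as "Page 1 of 5" found in the pagination widget (the sample string below is hypothetical).

import re

sample = 'Page 1 of 5'  # hypothetical fragment of the pagination element's outerHTML
match = re.search(r'Page\s+\d+\s+of\s+(\d+)', sample)
if match:
    print(int(match.group(1)))  # -> 5, so pages 2..5 remain to be scraped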
Data, in its raw form, can be overwhelming and difficult to comprehend. Data visualization serves as a beacon of light, illuminating patterns, trends, and insights hidden within the data. Plotly, a Python library for interactive data visualization, empowers us to create stunning visualizations that captivate and inform.
From scatter plots to bar charts, Plotly offers a diverse range of visualization options, each tailored to convey specific insights effectively. With its interactive features and customization capabilities, Plotly enables us to engage with data in meaningful ways, unlocking its full potential.
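As a small, hypothetical illustration of that flexibility, the same Name/Views data could be rendered as a bar chart with only minor changes to the scatter-plot approach used below (the sample DataFrame here is made up):

# A sketch of the same leaderboard data as a Plotly bar chart.
import pandas as pd
import plotly.graph_objects as go

df = pd.DataFrame({'Name': ['alice', 'bob', 'carol'], 'Views': [1200, 950, 730]})

fig = go.Figure(go.Bar(x=df['Name'], y=df['Views'],
                       marker=dict(color=df['Views'], colorscale='Viridis')))
fig.update_layout(template='plotly_dark', title='Views by User',
                  xaxis_title='User', yaxis_title='Views')
fig.show()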
The plot_data function transforms the extracted data into interactive scatter plots using Plotly, a versatile visualization library. These plots offer dynamic exploration capabilities, including hover tooltips with user details, customizable color schemes, and axis labels for enhanced clarity.
def plot_data(df, msg=''):
    update_message(msg="Generating report, please wait...")
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df['Name'], y=df['Views'], mode='markers',
                             marker=dict(color=df['Views'], colorscale='Viridis', size=10),
                             text=[f"User: {name}<br>Views: {view}" for name, view in zip(df['Name'], df['Views'])],
                             hoverinfo='text'))
    bg_image = Image.open("bg.png")  # Replace "bg.png" with your actual image file
    fig.update_layout(images=[dict(source=bg_image, xref="paper", yref="paper", x=0, y=1, sizex=1, sizey=1, opacity=0.1, layer="below")])
    fig.update_layout(
        xaxis=dict(tickangle=45),
        yaxis=dict(range=[0, df['Views'].max() + 10]),
        template='plotly_dark',
        title='Views by User%s' % msg,
        xaxis_title='User',
        yaxis_title='Views'
    )
    fig.show()
    update_message('Report Generated...')
The code integrates a user-friendly GUI using Tkinter, a popular Python GUI toolkit. The GUI features interactive buttons that enable users to generate reports, access additional features, and receive real-time progress updates.
root = tk.Tk()
root.geometry("400x400")
root.title("AV Blogathon Report")
button_frame = tk.Frame(root)
button_frame.pack(side="bottom", pady=20)
button_width = 40
execute_button1 = tk.Button(button_frame, text="Get Leaderboard Report", command=get_full_report, width=button_width)
execute_button1.pack(pady=5)
execute_button2 = tk.Button(button_frame, text="Get Top 'N'", command=get_top_ten, width=button_width)
execute_button2.pack(pady=5)
execute_button3 = tk.Button(button_frame, text='Get article Links of user', command=get_article_link, width=button_width)
execute_button3.pack(pady=5)
message_label = tk.Label(button_frame, text="")
message_label.pack(side="bottom", pady=5)
disable_buttons()
root.after(100, check_data)
root.mainloop()
To optimize the user experience, data loading is deferred until after the GUI initializes: the check_data function is scheduled with root.after, so the window appears immediately and the leaderboard data is fetched once the event loop starts.
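Note that root.after only delays the scrape until the window is drawn; while scrape_leaderboard runs, the event loop is still blocked. One possible refinement (not part of the original script, and assuming scrape_leaderboard is adapted so its own GUI updates happen only on the main thread) is to run the scraper in a background thread and poll for its result:

# A sketch: run the scraper off the main thread so the window stays responsive.
# Tkinter widgets should only be touched from the main thread, so the worker
# stores its result and the GUI polls for it with root.after().
import threading

result = {}

def scrape_in_background():
    names, views = scrape_leaderboard()  # assumed adapted to avoid GUI calls
    result['names'], result['views'] = names, views

def poll_for_data():
    if 'names' in result:
        enable_buttons()
        update_message(msg='Data Fetched, please proceed to generate report..')
    else:
        root.after(200, poll_for_data)  # check again in 200 ms

threading.Thread(target=scrape_in_background, daemon=True).start()
root.after(200, poll_for_data)

With that caveat noted, here is the complete script, combining all the pieces discussed above.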
import re
import requests
import pandas as pd
import tkinter as tk
from PIL import Image
from bs4 import BeautifulSoup
from selenium import webdriver
import plotly.graph_objects as go
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
names = []
views = []
def scrape_leaderboard_requests():
    url = "https://datahack.analyticsvidhya.com/blogathon/#LeaderBoard"
    headers = {
        'authority': 'datahack.analyticsvidhya.com',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'max-age=0',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    }
    # Pass headers as a keyword argument; passing them positionally would send
    # them as query parameters instead of HTTP headers.
    response = requests.get(url, headers=headers, verify=False, timeout=80)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'lxml')
        table = soup.find('table', attrs={'class': 'table-responsive'})
        if table:
            print(table)
        else:
            print('no such element found')
    else:
        print('invalid status code')
    return names, views
def update_message(msg, level='info'):
    color = 'green'
    if level == 'alert':
        color = 'red'
    message_label.config(text=msg, fg=color)
    root.update()
    return
def get_data(driver, url):
    cur_names = []
    cur_views = []
    driver.get(url)
    driver.implicitly_wait(10)
    all_elements = driver.find_elements(By.CLASS_NAME, 'table-responsive')
    if all_elements:
        last_ele = all_elements[-1]
        leaderboard_table = last_ele.get_attribute('outerHTML')
        soup = BeautifulSoup(leaderboard_table, 'html.parser')
        rows = soup.find_all('tr')
        for row in rows:
            cells = row.find_all('td')
            if len(cells) >= 3:  # Ensure the row contains the required data
                cur_names.append(cells[2].text.strip())
                cur_views.append(int(cells[-1].text.strip()))
    return cur_names, cur_views
def scrape_leaderboard():
    print('fetching')
    update_message(msg="Fetching leaderboard results, please wait...")
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--log-level=3")
    chrome_driver_path = "path/to/chromedriver"  # add the correct path here
    service = Service(chrome_driver_path)
    driver = webdriver.Chrome(service=service, options=chrome_options)
    # url = "https://datahack.analyticsvidhya.com/blogathon/#LeaderBoard"
    url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard'
    cur_names, cur_views = get_data(driver, url)
    names.extend(cur_names)
    views.extend(cur_views)
    last_page = None
    pagination_ele = driver.find_element(By.CLASS_NAME, 'page-link')
    if pagination_ele:
        pagination_ele = pagination_ele.get_attribute('outerHTML')
        last_page = re.search(r'Page\s+\d+\s+of\s+(\d+)', pagination_ele)
        if last_page:
            last_page = int(last_page.group(1))
    if last_page:
        for i in range(2, last_page + 1):
            url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/lb/%s/' % i
            cur_names, cur_views = get_data(driver, url)
            names.extend(cur_names)
            views.extend(cur_views)
    driver.quit()
    return names, views
def plot_data(df, msg=''):
    update_message(msg="Generating report, please wait...")
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df['Name'], y=df['Views'], mode='markers',
                             marker=dict(color=df['Views'], colorscale='Viridis', size=10),
                             text=[f"User: {name}<br>Views: {view}" for name, view in zip(df['Name'], df['Views'])],
                             hoverinfo='text'))
    bg_image = Image.open("bg.png")  # Replace "bg.png" with your actual image file
    fig.update_layout(images=[dict(source=bg_image, xref="paper", yref="paper", x=0, y=1, sizex=1, sizey=1, opacity=0.1, layer="below")])
    fig.update_layout(
        xaxis=dict(tickangle=45),
        yaxis=dict(range=[0, df['Views'].max() + 10]),
        template='plotly_dark',
        title='Views by User%s' % msg,
        xaxis_title='User',
        yaxis_title='Views'
    )
    fig.show()
    update_message('Report Generated...')
def get_full_report():
    plot_data(df)

def get_top_ten():
    df_sorted = df.sort_values(by='Views', ascending=False)
    top_10 = df_sorted.head(10)
    plot_data(top_10, msg='(Top 10)')

def get_article_link():
    update_message('error error error!!! Feature not developed yet.', level='alert')

def disable_buttons():
    execute_button1.config(state="disabled")
    execute_button2.config(state="disabled")
    execute_button3.config(state="disabled")

def enable_buttons():
    execute_button1.config(state="normal")
    execute_button2.config(state="normal")
    execute_button3.config(state="normal")
def check_data():
    global df
    names, views = scrape_leaderboard()
    if not names or not views:
        update_message(msg="No results found. Please try after some time...", level='alert')
        root.destroy()
        return  # safer than exit() inside a Tk callback
    enable_buttons()
    update_message(msg='Data Fetched, please proceed to generate report..')
    df = pd.DataFrame({'Name': names, 'Views': views})
root = tk.Tk()
root.geometry("400x400")
root.title("AV Blogathon Report")
button_frame = tk.Frame(root)
button_frame.pack(side="bottom", pady=20)
button_width = 40
execute_button1 = tk.Button(button_frame, text="Get Leaderboard Report", command=get_full_report, width=button_width)
execute_button1.pack(pady=5)
execute_button2 = tk.Button(button_frame, text="Get Top 'N'", command=get_top_ten, width=button_width)
execute_button2.pack(pady=5)
execute_button3 = tk.Button(button_frame, text='Get article Links of user', command=get_article_link, width=button_width)
execute_button3.pack(pady=5)
message_label = tk.Label(button_frame, text="")
message_label.pack(side="bottom", pady=5)
disable_buttons()
root.after(100, check_data)
root.mainloop()
A smooth user experience is paramount for engagement and usability. The codebase incorporates several strategies to this end: the buttons stay disabled until the leaderboard data is available, update_message reports progress in real time (in green for status updates, red for alerts), and data loading is kicked off via root.after so the window appears without delay.
While the provided code offers a solid foundation for real-time blogathon analytics, the journey doesn’t end here. Several enhancements and advanced applications are worth exploring: implementing the “Get article Links of user” feature (currently a placeholder), caching results between runs, scheduling periodic refreshes, or adding richer visualizations. One of these is sketched below.
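As one concrete example, here is a sketch of result caching (the file name leaderboard_cache.csv is hypothetical): persisting the scraped data to disk lets the app reload instantly instead of re-scraping on every launch.

# A sketch of result caching: reuse a saved CSV if one exists, otherwise
# scrape and save the results for next time.
import os
import pandas as pd

CACHE_FILE = 'leaderboard_cache.csv'  # hypothetical cache location

def load_or_scrape():
    if os.path.exists(CACHE_FILE):
        df = pd.read_csv(CACHE_FILE)
        return df['Name'].tolist(), df['Views'].tolist()
    names, views = scrape_leaderboard()  # fall back to the scraper above
    pd.DataFrame({'Name': names, 'Views': views}).to_csv(CACHE_FILE, index=False)
    return names, views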
This article provides a comprehensive exploration of web scraping, data visualization, and GUI development in Python. By dissecting the codebase, learners gain insight into automated data extraction using BeautifulSoup and Selenium, interactive visualization with Plotly, and building user-friendly interfaces with Tkinter. The article focuses on the analysis of the Analytics Vidhya Blogathon leaderboard, offering a practical application of these concepts. Learners can now embark on their own data-driven projects: extracting insights, creating engaging visualizations, and designing user interfaces.