Disclaimer: This post is for educational purposes only. Web scraping is not encouraged, especially when there are terms and conditions against such actions.
This post is the third in a series of tutorials on building web scrapers.
The purpose of this series is to learn how to extract data from websites. Most of the data on websites is in HTML format, so the first tutorial explains the basics of this markup language. The second guide shows how to scrape data easily with an intuitive web scraping tool that doesn't require any knowledge of HTML. This third and last tutorial focuses on gathering data from the web with Python. In this case, you need to interact directly with the HTML pages, so some previous knowledge of HTML is required.
In recent years, the exponential growth of data available on the web has led to a growing need for web scraping. What is web scraping? It is a set of techniques for extracting and processing huge amounts of data from web pages. Many scraping tools can be used to gather information from websites.
Before beginning the tutorial, you need Python 3 installed on your PC. I used a Jupyter Notebook to write the code on Windows 10. To install Selenium, you only need to copy this line into your notebook:
!pip install selenium
After that, you have to install a browser driver, which you choose depending on the browser you usually use. In my case, I have Chrome, so I installed the ChromeDriver. Below are the links to the most popular browser drivers:
Here, I show the Selenium methods for finding multiple elements in a web page [1]. These methods return lists.
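For reference, these are the locator methods available in Selenium (listed here from the Selenium 3 API, the version used throughout this post; in recent Selenium 4 releases they have been replaced by find_elements with a By argument):

find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector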
To locate only one element on the page, there are corresponding methods: you only need to remove the 's' from 'elements' in the method name. In this guide, I use find_elements_by_class_name, for which you need to know the class name of the selected tag in the HTML code, and find_elements_by_xpath, which specifies the path of the elements using XPath.
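For example, assuming a WebDriver instance named driver (created later in this tutorial), the singular version returns a single WebElement instead of a list:

title = driver.find_element_by_class_name('headerSort')  # first matching element only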
XPath is a language that uses path expressions to select nodes or sets of nodes in an XML document. It is similar to the paths you usually see in your computer's file system. The most useful path expressions are:
nodename: takes the nodes with that name
/: gets from the root node
//: gets nodes in the document from the current node
.: gets the current node
..: gets the "parent" of the current node
@: gets the attribute of that node, such as id and class
To grasp XPath better, check the w3schools website.
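To make this concrete, here are a few illustrative expressions (my own examples, not taken from any particular page):

# '//table'           selects every table in the document
# '/html/body'        selects the body node, starting from the root
# '//tr/td[1]'        selects the first td child of each tr
# '//div[@id="toc"]'  selects every div whose id attribute is "toc"
# '//a/@href'         selects the href attribute of every link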
Let’s start by importing the libraries:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
First, we create an instance of Chrome WebDriver, specifying the path of the installed ChromeDriver:
driver = webdriver.Chrome(r"C:\Users\Eugenia\Downloads\chromedriver_win32\chromedriver.exe")
Given the URL, driver.get is used to navigate to the web page.
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_greenhouse_gas_emissions'
driver.get(url)
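If the page is slow to load, the elements we look for next may not be in the DOM yet. As a precaution (this line is my addition, not part of the original code), you can tell Selenium to keep retrying element lookups for a few seconds before giving up:

# Optional: retry element lookups for up to 10 seconds before raising an error
driver.implicitly_wait(10)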
We are interested in extracting the data from the table, saving it in a Pandas DataFrame, and exporting it to a CSV file.
The first step is to extract the header row of the table. As shown before, find_elements_by_class_name only needs the class name as input.
titles = driver.find_elements_by_class_name('headerSort')
for t in titles:
    print(t.text)
In this way, we obtain a list containing all the titles of the table. We can already create an empty DataFrame, specifying the names of the columns.
df = pd.DataFrame(columns=[t.text for t in titles])
df.head()
Now, it's time to gather the data contained in each column. As you can observe, the tbody tag contains the body content of an HTML table, so all the cells we want to extract are within it.
states = driver.find_elements_by_xpath('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]/tbody/tr/th')
To check the states found, I print the list:
for idx, s in enumerate(states):
    print('row {}:'.format(idx))
    print('{}'.format(s.text))
In a similar way, I extracted the content of the other columns. After the column of states, all the remaining columns are contained in td tags. The index needs to be specified, since we go row by row through the tr tags.
col2 = driver.find_elements_by_xpath('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]/tbody/tr/td[1]')
col3 = driver.find_elements_by_xpath('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]/tbody/tr/td[2]')
col4 = driver.find_elements_by_xpath('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]/tbody/tr/td[3]')
col5 = driver.find_elements_by_xpath('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]/tbody/tr/td[4]')
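Since these four queries differ only in the td index, a more compact and equivalent alternative (a sketch of mine, not the original code) is to build them in a loop:

# Same result as the four separate calls above: one query per td index
xpath = '//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]/tbody/tr/td[{}]'
col2, col3, col4, col5 = [driver.find_elements_by_xpath(xpath.format(i)) for i in range(1, 5)]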
Finally, we can add the columns to the DataFrame previously created:
df[df.columns[0]] = [s.text for s in states]
df[df.columns[1]] = [s.text for s in col2]
df[df.columns[2]] = [s.text for s in col3]
df[df.columns[3]] = [s.text for s in col4]
df[df.columns[4]] = [s.text for s in col5]
df.head()
Let’s export the dataset into a CSV file:
df.to_csv('greenhouse_gas_emissions.csv')
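By default, to_csv also writes the DataFrame index as the first column. If you prefer a cleaner file, and to close the browser once you are done (both lines are my additions), you can write:

df.to_csv('greenhouse_gas_emissions.csv', index=False)  # omit the row index
driver.quit()  # close the browser and end the WebDriver session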
I hope this tutorial helped you learn web scraping with Selenium. There are many other scraping tools available, such as BeautifulSoup and Scrapy. You are spoiled for choice! Thanks for reading. Have a nice day!