I’m an avid YouTube user. The sheer amount of content I can watch on a single platform is staggering. In fact, a lot of my data science learning has happened through YouTube videos!
So, I was browsing YouTube a few weeks ago searching for a certain category to watch. That’s when my data scientist thought process kicked in. Given my love for web scraping and machine learning, could I extract data about YouTube videos and build a model to classify them into their respective categories?
I was intrigued! This sounded like the perfect opportunity to combine my existing Python and data science knowledge with my curiosity to learn something new. And Analytics Vidhya’s internship challenge offered me the chance to pen down my learning in article form.
Web scraping is a skill I feel every data science enthusiast should know. It is immensely helpful when we’re looking for data for our project or want to analyze specific data present only on a website. Keep in mind though, web scraping should not cross ethical and legal boundaries.
In this article, we’ll learn how to use web scraping to extract YouTube video data using Selenium and Python. We will then use the NLTK library to clean the data and then build a model to classify these videos based on specific categories.
Selenium is a popular tool for automating browsers. It’s primarily used for testing in the industry but is also very handy for web scraping. You must have come across Selenium if you’ve worked in the IT field.
We can easily program a Python script to automate a web browser using Selenium. It gives us the freedom we need to efficiently extract the data and store it in our preferred format for future use.
Selenium requires a driver to interface with our chosen browser. Chrome, for example, requires ChromeDriver, which needs to be installed before we start scraping. The Selenium web driver speaks directly to the browser using the browser's own engine to control it. This makes it incredibly fast.
Prerequisites for our Web Scraping Project
There are a few things we must know before jumping into web scraping:
Basic knowledge of HTML and CSS is a must. We need this to understand the structure of a webpage we’re about to scrape
Python is required to clean the data, explore it, and build models
Knowledge of some basic libraries like Pandas and NumPy would be the cherry on the cake
Setting up the Python Environment
Time to power up your favorite Python IDE (that’s Jupyter notebooks for me)! Let’s get our hands dirty and start coding.
Select the compatible ChromeDriver for your Chrome version
To check which Chrome version you are using, click on the three vertical dots in the top-right corner
Then go to Help -> About Google Chrome
Step 3: Move the driver file to a directory on your PATH:
Go to the Downloads directory, unzip the file, and move it to /usr/local/bin, which is on the PATH.
$ cd Downloads
$ unzip chromedriver_linux64.zip
$ mv chromedriver /usr/local/bin/
We’re all set to begin web scraping now.
Scraping Data from YouTube
In this article, we’ll be scraping the video ID, video title, and video description of a particular category from YouTube. The categories we’ll be scraping are:
Travel
Science
Food
History
Manufacturing
Art & Dance
So let’s begin!
First, let’s import some libraries:
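Here is a minimal sketch of those imports, assuming Selenium 4-style locators (older Selenium releases expose helpers such as find_elements_by_xpath instead); the original notebook may import a few more:

```python
# Core libraries for scraping and data handling.
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
```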
Before we do anything else, open YouTube in your browser. Type in the category you want to search videos for and set the filter to “videos”. This will display only the videos related to your search. Copy the URL after doing this.
Next, we need to set up the driver to fetch the content of the URL from YouTube:
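A minimal sketch, assuming ChromeDriver is already on your PATH; the URL below is only a placeholder for the link you copied:

```python
# Launch Chrome through ChromeDriver and open the filtered search-results page.
driver = webdriver.Chrome()
driver.get("Your Link Here")  # placeholder -- paste your copied URL here
```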
Paste the link into the driver.get("Your Link Here") function and run the cell. This will open a new browser window for that link. We will do all the following tasks in this browser window.
Next, fetch all the video links present on that particular page. We will create a list to store those links.
Now, go to the browser window, right-click on the page, and select ‘inspect element’
Search for the anchor tag with id = "video-title" and then right-click on it -> Copy -> XPath. The XPath should look something like: //*[@id="video-title"]
With me so far? Now, write the below code to start fetching the links from the page and run the cell. This should fetch all the links present on the web page and store them in a list.
Note: Scroll all the way down the page to load all the videos before running this cell.
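A sketch of that step, assuming the Selenium 4 locator API (the variable names are illustrative):

```python
# Grab every anchor tag whose id is "video-title" and collect its href.
links = []
video_elements = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
for element in video_elements:
    links.append(element.get_attribute("href"))
print(len(links))  # quick sanity check on how many links were collected
```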
The above code will fetch the “href” attribute of the anchor tag we searched for.
Now, we need to create a dataframe with 4 columns – “link”, “title”, “description”, and “category”. We will store the details of videos for different categories in these columns:
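For example:

```python
# Empty dataframe that the scraping loop below will fill row by row.
df = pd.DataFrame(columns=["link", "title", "description", "category"])
```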
We are all set to scrape the video details from YouTube. Here’s the Python code to do it:
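The sketch below is hedged: the CSS selectors are illustrative, since YouTube's markup changes frequently and you may need to re-inspect the page and adjust them.

```python
# Wait up to 10 seconds for each element to appear before giving up.
wait = WebDriverWait(driver, 10)
v_category = "Travel"  # the category we searched for earlier

for x in links:
    driver.get(x)                 # open each video page in turn
    v_id = x.split("v=")[-1]      # stripped video ID (kept for reference)
    v_title = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    # append one row to the dataframe created above
    df.loc[len(df)] = [x, v_title, v_description, v_category]
```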
Let’s break down this code block to understand what we just did:
“wait” will ignore instances of NotFoundException that are encountered (thrown) by default in the ‘until’ condition. It will immediately propagate all others
Parameters:
driver: The WebDriver instance to pass to the expected conditions
timeout: The number of seconds to wait before an expectation times out
v_category stores the video category name we searched for earlier
The “for” loop is applied on the list of links we created above
driver.get(x) traverses through all the links one-by-one and opens them in the browser to fetch the details
v_id stores the stripped video ID from the link
v_title stores the video title fetched by using the CSS path
Similarly, v_description stores the video description by using the CSS path
During each iteration, our code saves the extracted data inside the dataframe we created earlier.
We have to follow the aforementioned steps for the remaining five categories. We should have six different dataframes once we are done with this. Now, it’s time to merge them together into a single dataframe:
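A sketch, assuming each category's dataframe was saved under its own (illustrative) name:

```python
# Stack the six per-category dataframes into one.
frames = [df_travel, df_science, df_food, df_history, df_manufacturing, df_art]
df = pd.concat(frames, ignore_index=True)
```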
Voila! We have our final dataframe containing all the desired details of a video from all the categories mentioned above.
Cleaning the Scraped Data using the NLTK Library
In this section, we’ll use the popular NLTK library to clean the data present in the “title” and “description” columns. NLP enthusiasts will love this section!
Before we start cleaning the data, we need to store all the columns separately so that we can perform different operations quickly and easily:
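A minimal sketch of that step, together with the NLTK pieces the cleaning below relies on:

```python
# Keep each column separately so the cleaning loops stay simple.
df_title = df["title"]
df_description = df["description"]
df_category = df["category"]

# NLTK utilities for stopword removal and stemming.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords")
ps = PorterStemmer()
stop_words = set(stopwords.words("english"))
```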
Now, create a list in which we can store our cleaned data. We will store this data in a dataframe later. Write the following code to create a list and do some data cleaning on the “title” column from df_title:
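A sketch of that cleaning loop, keeping only letters, lower-casing, dropping English stopwords, and stemming each word:

```python
corpus = []  # cleaned titles go here
for i in range(len(df_title)):
    text = re.sub("[^a-zA-Z]", " ", str(df_title.iloc[i]))  # letters only
    words = [ps.stem(w) for w in text.lower().split()
             if w not in stop_words]                         # stem + drop stopwords
    corpus.append(" ".join(words))
```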
Did you see what we did here? We removed all the punctuation from the titles and only kept the English root words. After all these iterations, we are ready with our list full of data.
We need to follow the same steps to clean the “description” column from df_description:
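The same routine, stored in a second list:

```python
corpus1 = []  # cleaned descriptions go here
for i in range(len(df_description)):
    text = re.sub("[^a-zA-Z]", " ", str(df_description.iloc[i]))
    words = [ps.stem(w) for w in text.lower().split() if w not in stop_words]
    corpus1.append(" ".join(words))
```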
Note: The loop range is set according to the number of rows in our dataset.
Now, convert these lists into dataframes:
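For example (the dataframe names here are illustrative):

```python
# Wrap the cleaned lists back into dataframes.
df_clean_title = pd.DataFrame(corpus, columns=["title"])
df_clean_description = pd.DataFrame(corpus1, columns=["description"])
```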
Next, we need to label encode the categories. The LabelEncoder() function encodes labels with a value between 0 and n_classes - 1, where n_classes is the number of distinct labels.
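A sketch using scikit-learn's LabelEncoder:

```python
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
dfcategory = labelencoder.fit_transform(df_category)  # six classes -> 0..5
```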
Here, we have applied label encoding on df_category and stored the result in dfcategory. We can now combine our cleaned and encoded data into a new dataframe:
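For example (df_final is an illustrative name):

```python
# Cleaned text plus encoded labels in one dataframe.
df_final = pd.DataFrame({
    "title": corpus,
    "description": corpus1,
    "category": dfcategory,
})
```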
We’re not quite done with the cleaning and transformation yet.
We need to create a bag-of-words so that our model can use the keywords in that bag to classify videos accordingly. Here’s the code to create a bag-of-words:
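A sketch using scikit-learn's CountVectorizer; concatenating each cleaned title with its cleaned description before vectorizing is an assumption on my part:

```python
from sklearn.feature_extraction.text import CountVectorizer

# 1500-feature bag-of-words over the cleaned titles and descriptions.
cv = CountVectorizer(max_features=1500)
documents = [t + " " + d for t, d in zip(corpus, corpus1)]
X = cv.fit_transform(documents).toarray()
y = dfcategory
```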
Note: Here, we created 1500 features from data stored in the lists – corpus and corpus1. “X” stores all the features and “y” stores our encoded data.
We are all set for the most anticipated part of a data scientist’s role – model building!
Building our Model to Classify YouTube Videos
Before we build our model, we need to divide the data into a training set and a test set:
Training set: A subset of the data to train our model
Test set: Contains the remaining data to test the trained model
Make sure that your test set meets the following two conditions:
Large enough to yield statistically meaningful results
Representative of the dataset as a whole. In other words, don’t pick a test set with different characteristics than the training set
We can use the following code to split the data:
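For example (the 80/20 split ratio is illustrative):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```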
Time to train the model! We will use the random forest algorithm here. So let’s go ahead and train the model using the RandomForestClassifier() function:
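A sketch; the hyperparameter values are illustrative, with criterion="entropy" matching the parameter discussed below:

```python
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                    random_state=0)
classifier.fit(X_train, y_train)
```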
criterion: The function to measure the quality of a split. Supported criteria are “gini” for Gini impurity and “entropy” for information gain
Note: These parameters are tree-specific.
We can now check the performance of our model on the test set:
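For example:

```python
from sklearn.metrics import accuracy_score

y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))
```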
We get an impressive 96.05% accuracy. Our entire process went pretty smoothly! But we’re not done yet – we need to analyze our results as well to fully understand what we achieved.
Analyzing the Results
Let’s check the classification report:
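For example:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
```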
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Precision = TP / (TP + FP)
Recall is the ratio of correctly predicted positive observations to all the observations in the actual class. Recall = TP / (TP + FN)
F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. F1 Score = 2*(Recall * Precision) / (Recall + Precision)
We can check our results by creating a confusion matrix as well:
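For example:

```python
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))  # 6x6: one row/column per category
```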
The confusion matrix will be a 6×6 matrix since we have six classes in our dataset.
End Notes
I’ve always wanted to combine my interest in scraping and extracting data with NLP and machine learning. So I loved immersing myself in this project and penning down my approach.
In this article, we just witnessed Selenium’s potential as a web scraping tool and used the random forest algorithm to classify the videos we collected. Congratulations on successfully scraping and creating a dataset to classify videos!
I look forward to hearing your thoughts and feedback on this article.
A Data Science Enthusiast who loves reading and writing about Data Science and its applications. He has done many projects in this field, and his recent work includes concepts like Web Scraping and NLP. He is a Data Science Content Strategist Intern at Analytics Vidhya and is currently pursuing a BTech in Computer Science from DIT University, Dehradun.
Thanks Shubham. Pretty methodical approach. I wish you could have shown the output at each step. That way it's easier to follow along and see how the output changes in each step. Do you have the Jupyter notebook somewhere?
Hi, thank you for your feedback and suggestion. I'll try to include the outputs in my future posts. You can also go through the notebook in my GitHub (https://github.com/shubham-singh-ss/Youtube-scraping-using-Selenium)
Is it legal to scrape data for analysis or academic purposes?
It depends on the policy of the website you want to scrape data from; it isn't automatically legal. If the policies allow you to scrape data for academic or research purposes, then it's legal.
It is really quite difficult to find such detailed information about any new or still-evolving technology. Brilliant article for beginners like me.
Thank you, it's good to know that my content helped you.