I’m an avid YouTube user. The sheer amount of content I can watch on a single platform is staggering. In fact, a lot of my data science learning has happened through YouTube videos!
So, I was browsing YouTube a few weeks ago searching for a certain category to watch. That’s when my data scientist thought process kicked in. Given my love for web scraping and machine learning, could I extract data about YouTube videos and build a model to classify them into their respective categories?
I was intrigued! This sounded like the perfect opportunity to combine my existing Python and data science knowledge with my curiosity to learn something new. And Analytics Vidhya’s internship challenge offered me the chance to pen down my learning in article form.
Web scraping is a skill I feel every data science enthusiast should know. It is immensely helpful when we’re looking for data for our project or want to analyze specific data present only on a website. Keep in mind though, web scraping should not cross ethical and legal boundaries.
In this article, we’ll learn how to use web scraping to extract YouTube video data using Selenium and Python. We will then use the NLTK library to clean the data and then build a model to classify these videos based on specific categories.
Selenium is a popular tool for automating browsers. It’s primarily used for testing in the industry but is also very handy for web scraping. You must have come across Selenium if you’ve worked in the IT field.
We can easily program a Python script to automate a web browser using Selenium. It gives us the freedom we need to efficiently extract the data and store it in our preferred format for future use.
Selenium requires a driver to interface with our chosen browser. Chrome, for example, requires ChromeDriver, which needs to be installed before we start scraping. The Selenium web driver speaks directly to the browser using the browser's own engine to control it. This makes it incredibly fast.
Prerequisites for our Web Scraping Project
There are a few things we must know before jumping into web scraping:
Basic knowledge of HTML and CSS is a must. We need this to understand the structure of a webpage we’re about to scrape
Python is required to clean the data, explore it, and build models
Knowledge of some basic libraries like Pandas and NumPy would be the cherry on the cake
Setting up the Python Environment
Time to power up your favorite Python IDE (that’s Jupyter notebooks for me)! Let’s get our hands dirty and start coding.
Select the compatible ChromeDriver for your Chrome version
To check which Chrome version you are using, click on the three vertical dots in the top-right corner
Then go to Help -> About Google Chrome
Step 3: Move the driver file to a directory on your PATH:
Go to the Downloads directory, unzip the file, and move it to /usr/local/bin, which is on the PATH.
$ cd Downloads
$ unzip chromedriver_linux64.zip
$ mv chromedriver /usr/local/bin/
We’re all set to begin web scraping now.
Scraping Data from YouTube
In this article, we’ll be scraping the video ID, video title, and video description of a particular category from YouTube. The categories we’ll be scraping are:
Travel
Science
Food
History
Manufacturing
Art & Dance
So let’s begin!
First, let’s import some libraries:
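Here is a minimal sketch of those imports, assuming Selenium 4-style locators (older Selenium releases expose helpers such as find_elements_by_xpath instead); the original notebook may import a few more:

```python
# Core libraries for scraping and data handling.
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
```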
Before we do anything else, open YouTube in your browser. Type in the category you want to search videos for and set the filter to “videos”. This will display only the videos related to your search. Copy the URL after doing this.
Next, we need to set up the driver to fetch the content of the URL from YouTube:
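A minimal sketch, assuming ChromeDriver is already on your PATH; the URL below is only a placeholder for the link you copied:

```python
# Launch Chrome through ChromeDriver and open the filtered search-results page.
driver = webdriver.Chrome()
driver.get("Your Link Here")  # placeholder -- paste your copied URL here
```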
Paste the link into the driver.get("Your Link Here") function and run the cell. This will open a new browser window for that link. We will do all the following tasks in this browser window.
Next, fetch all the video links present on that particular page. We will create a list to store those links.
Now, go to the browser window, right-click on the page, and select ‘inspect element’
Search for the anchor tag with id = "video-title" and then right-click on it -> Copy -> XPath. The XPath should look something like: //*[@id="video-title"]
With me so far? Now, write the below code to start fetching the links from the page and run the cell. This should fetch all the links present on the web page and store them in a list.
Note: Scroll all the way down the page to load all the videos before running this cell.
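A sketch of that step, assuming the Selenium 4 locator API (the variable names are illustrative):

```python
# Grab every anchor tag whose id is "video-title" and collect its href.
links = []
video_elements = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
for element in video_elements:
    links.append(element.get_attribute("href"))
print(len(links))  # quick sanity check on how many links were collected
```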
The above code will fetch the “href” attribute of the anchor tag we searched for.
Now, we need to create a dataframe with 4 columns – “link”, “title”, “description”, and “category”. We will store the details of videos for different categories in these columns:
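For example:

```python
# Empty dataframe that the scraping loop below will fill row by row.
df = pd.DataFrame(columns=["link", "title", "description", "category"])
```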
We are all set to scrape the video details from YouTube. Here’s the Python code to do it:
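The sketch below is hedged: the CSS selectors are illustrative, since YouTube's markup changes frequently and you may need to re-inspect the page and adjust them.

```python
# Wait up to 10 seconds for each element to appear before giving up.
wait = WebDriverWait(driver, 10)
v_category = "Travel"  # the category we searched for earlier

for x in links:
    driver.get(x)                 # open each video page in turn
    v_id = x.split("v=")[-1]      # stripped video ID (kept for reference)
    v_title = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    # append one row to the dataframe created above
    df.loc[len(df)] = [x, v_title, v_description, v_category]
```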
Let’s break down this code block to understand what we just did:
“wait” will ignore instances of NotFoundException that are encountered (thrown) by default in the ‘until’ condition. It will immediately propagate all others
Parameters:
driver: The WebDriver instance to pass to the expected conditions
timeout: The number of seconds to wait before an expectation times out
v_category stores the video category name we searched for earlier
The “for” loop is applied on the list of links we created above
driver.get(x) traverses through all the links one-by-one and opens them in the browser to fetch the details
v_id stores the stripped video ID from the link
v_title stores the video title fetched by using the CSS path
Similarly, v_description stores the video description by using the CSS path
During each iteration, our code saves the extracted data inside the dataframe we created earlier.
We have to follow the aforementioned steps for the remaining five categories. We should have six different dataframes once we are done with this. Now, it’s time to merge them together into a single dataframe:
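A sketch, assuming each category's dataframe was saved under its own (illustrative) name:

```python
# Stack the six per-category dataframes into one.
frames = [df_travel, df_science, df_food, df_history, df_manufacturing, df_art]
df = pd.concat(frames, ignore_index=True)
```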
Voila! We have our final dataframe containing all the desired details of a video from all the categories mentioned above.
Cleaning the Scraped Data using the NLTK Library
In this section, we’ll use the popular NLTK library to clean the data present in the “title” and “description” columns. NLP enthusiasts will love this section!
Before we start cleaning the data, we need to store all the columns separately so that we can perform different operations quickly and easily:
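A minimal sketch of that step, together with the NLTK pieces the cleaning below relies on:

```python
# Keep each column separately so the cleaning loops stay simple.
df_title = df["title"]
df_description = df["description"]
df_category = df["category"]

# NLTK utilities for stopword removal and stemming.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords")
ps = PorterStemmer()
stop_words = set(stopwords.words("english"))
```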
Now, create a list in which we can store our cleaned data. We will store this data in a dataframe later. Write the following code to create a list and do some data cleaning on the “title” column from df_title:
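A sketch of that cleaning loop, keeping only letters, lower-casing, dropping English stopwords, and stemming each word:

```python
corpus = []  # cleaned titles go here
for i in range(len(df_title)):
    text = re.sub("[^a-zA-Z]", " ", str(df_title.iloc[i]))  # letters only
    words = [ps.stem(w) for w in text.lower().split()
             if w not in stop_words]                         # stem + drop stopwords
    corpus.append(" ".join(words))
```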
Did you see what we did here? We removed all the punctuation from the titles and only kept the English root words. After all these iterations, we are ready with our list full of data.
We need to follow the same steps to clean the “description” column from df_description:
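The same routine, stored in a second list:

```python
corpus1 = []  # cleaned descriptions go here
for i in range(len(df_description)):
    text = re.sub("[^a-zA-Z]", " ", str(df_description.iloc[i]))
    words = [ps.stem(w) for w in text.lower().split() if w not in stop_words]
    corpus1.append(" ".join(words))
```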
Note: The loop range is set according to the number of rows in our dataset.
Now, convert these lists into dataframes:
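For example (the dataframe names here are illustrative):

```python
# Wrap the cleaned lists back into dataframes.
df_clean_title = pd.DataFrame(corpus, columns=["title"])
df_clean_description = pd.DataFrame(corpus1, columns=["description"])
```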
Next, we need to label encode the categories. The LabelEncoder() function encodes labels with a value between 0 and n_classes - 1, where n_classes is the number of distinct labels.
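A sketch using scikit-learn's LabelEncoder:

```python
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
dfcategory = labelencoder.fit_transform(df_category)  # six classes -> 0..5
```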
Here, we have applied label encoding on df_category and stored the result in dfcategory. We can now combine our cleaned and encoded data into a new dataframe:
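For example (df_final is an illustrative name):

```python
# Cleaned text plus encoded labels in one dataframe.
df_final = pd.DataFrame({
    "title": corpus,
    "description": corpus1,
    "category": dfcategory,
})
```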
We’re not quite done with the cleaning and transformation yet.
We need to create a bag-of-words so that our model can use the keywords in that bag to classify videos accordingly. Here’s the code to create a bag-of-words:
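A sketch using scikit-learn's CountVectorizer; concatenating each cleaned title with its cleaned description before vectorizing is an assumption on my part:

```python
from sklearn.feature_extraction.text import CountVectorizer

# 1500-feature bag-of-words over the cleaned titles and descriptions.
cv = CountVectorizer(max_features=1500)
documents = [t + " " + d for t, d in zip(corpus, corpus1)]
X = cv.fit_transform(documents).toarray()
y = dfcategory
```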
Note: Here, we created 1500 features from data stored in the lists – corpus and corpus1. “X” stores all the features and “y” stores our encoded data.
We are all set for the most anticipated part of a data scientist’s role – model building!
Building our Model to Classify YouTube Videos
Before we build our model, we need to divide the data into a training set and a test set:
Training set: A subset of the data to train our model
Test set: Contains the remaining data to test the trained model
Make sure that your test set meets the following two conditions:
Large enough to yield statistically meaningful results
Representative of the dataset as a whole. In other words, don’t pick a test set with different characteristics than the training set
We can use the following code to split the data:
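For example (the 80/20 split ratio is illustrative):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```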
Time to train the model! We will use the random forest algorithm here. So let’s go ahead and train the model using the RandomForestClassifier() function:
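A sketch; the hyperparameter values are illustrative, with criterion="entropy" matching the parameter discussed below:

```python
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                    random_state=0)
classifier.fit(X_train, y_train)
```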
criterion: The function to measure the quality of a split. Supported criteria are “gini” for Gini impurity and “entropy” for information gain
Note: These parameters are tree-specific.
We can now check the performance of our model on the test set:
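For example:

```python
from sklearn.metrics import accuracy_score

y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))
```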
We get an impressive 96.05% accuracy. Our entire process went pretty smoothly! But we’re not done yet – we need to analyze our results as well to fully understand what we achieved.
Analyzing the Results
Let’s check the classification report:
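For example:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
```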
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Precision = TP / (TP + FP)
Recall is the ratio of correctly predicted positive observations to all the observations in the actual class. Recall = TP / (TP + FN)
F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. F1 Score = 2*(Recall * Precision) / (Recall + Precision)
We can check our results by creating a confusion matrix as well:
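For example:

```python
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))  # 6x6: one row/column per category
```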
The confusion matrix will be a 6×6 matrix since we have six classes in our dataset.
End Notes
I’ve always wanted to combine my interest in scraping and extracting data with NLP and machine learning. So I loved immersing myself in this project and penning down my approach.
In this article, we just witnessed Selenium’s potential as a web scraping tool and used the random forest algorithm to classify the videos we collected. Congratulations on successfully scraping and creating a dataset to classify videos!
I look forward to hearing your thoughts and feedback on this article.
A Data Science Enthusiast who loves reading and writing about Data Science and its applications. He has done many projects in this field, and his recent work includes concepts like Web Scraping and NLP. He is a Data Science Content Strategist Intern at Analytics Vidhya and is currently pursuing a BTech in Computer Science from DIT University, Dehradun.
Thanks Shubham. Pretty methodical approach. I wish you could have shown the output at each step. That way it's easier to follow along and see how the output changes in each step. Do you have the Jupyter notebook somewhere?
Hi, thank you for your feedback and suggestion. I'll try to include the outputs in my future posts. You can also go through the notebook in my GitHub (https://github.com/shubham-singh-ss/Youtube-scraping-using-Selenium)
Is it legal to scrape data for analysis or academic purposes?
It depends on the policy of the website you want to scrape data from; it isn't automatically legal. If the policies allow you to scrape data for academic or research purposes, then it's legal.
It is really quite difficult to find such detailed information about any new or still-evolving technology. Brilliant article for beginners like me.
Thank you, it's good to know that my content helped you.