A Tool for Investor – The Art of Web Scraping

Aryan Last Updated : 11 Nov, 2021


This article was published as a part of the Data Science Blogathon

INTRODUCTION

Investing is an important part of one’s financial life: it helps secure both the present and the future, and it lets your money grow. Investing is also a process of compounding. Money invested in the right place at the right time grows. For example, suppose someone invests Rs. 100 and after two days it becomes Rs. 120. If they leave the money invested, further profits are earned on Rs. 120, not on the original Rs. 100. That is why investing is described as a process of compounding.
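To make the compounding idea concrete, here is a minimal Python sketch. It assumes that the Rs. 100 to Rs. 120 move (a 20% return) simply repeats in every period, which is an assumption made only for illustration; the function names are hypothetical. With reinvestment, the second period's profit is earned on Rs. 120, so the balance becomes Rs. 144, whereas earning 20% on the original Rs. 100 each time would give only Rs. 140.

```python
def compound(principal: float, rate: float, periods: int) -> float:
    """Value after `periods` rounds of growth when profits are reinvested."""
    return principal * (1 + rate) ** periods


def simple(principal: float, rate: float, periods: int) -> float:
    """Value after `periods` rounds when profit is earned on the original amount only."""
    return principal * (1 + rate * periods)


# Assumed for illustration: the same 20% return in every period.
for n in range(1, 4):
    reinvested = round(compound(100, 0.20, n), 2)
    not_reinvested = round(simple(100, 0.20, n), 2)
    print(f"period {n}: reinvested = {reinvested}, not reinvested = {not_reinvested}")
```

The gap between the two columns widens with every period, which is the whole point of leaving the profits invested.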

But how do you decide where to invest? Take the stock market as an example: which type of stock should you invest in, and, once you have chosen the type, which company should you invest in?

If you want to know, you are in the right place.

Suppose you are an investor who likes to put money into various stocks, commodities, and other instruments. The reality of being an investor is that you need to do a great deal of research. You need to explore every market for the top-performing companies so that you can invest in them with reasonable confidence that, at the end of the day, you will earn something.


Consider the traditional way of researching a trade: you first have to understand the various industries, then research a particular industry, then search Google for the different companies in it, and finally analyze each stock on the NSE or BSE website by clicking through different tabs and links.


Imagine being able to speed up this process by analyzing the BSE/NSE website in a few seconds. Let me show you how.

This can be done with web scraping, a technique for extracting useful data from a website or any other online source.


WHAT IS WEB SCRAPING?

Web scraping is the process of extracting information from the web. It can be done with various software tools and can extract almost any type of information published on a page. It is very convenient because it can be done anywhere and at any time; the only requirement is an internet connection.

Web scraping is an important method because it makes it possible to extract online data quickly and efficiently. That data can then be processed to gather insights as required. It also makes it possible to monitor the brand and reputation of an organization.

How To Perform Web Scraping?

After understanding web-scraping, the most common question is – How do I learn web scraping?

The process of web scraping is really simple. To extract data using web scraping with Python, you need to follow these basic steps:

1. Find the URL that you want to scrape.

2. Inspect the page.

3. Find the data you want to extract.

4. Write the code.

5. Run the code and extract the data.

6. Store the data in the desired format.
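The steps above can be sketched end to end as follows. This is only a minimal sketch and not the code used later in this article: the HTML snippet, the `row`, `name` and `value` class names, and the two helper functions are hypothetical stand-ins, chosen so that the example runs without a browser or an internet connection. On a real site, the HTML would come from the URL you picked in step 1, and the tag and class names would come from inspecting the page in steps 2 and 3.

```python
import csv
import io

from bs4 import BeautifulSoup

# Steps 1-3: a hypothetical stand-in for the page you found and inspected.
HTML = """
<html><body>
  <div class="row"><p class="name">Index A</p><p class="value">100.50</p></div>
  <div class="row"><p class="name">Index B</p><p class="value">200.25</p></div>
</body></html>
"""


def extract_rows(html: str) -> list[tuple[str, str]]:
    """Steps 4-5: parse the page and pull out (name, value) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("div", class_="row"):
        name = row.find("p", class_="name").get_text(strip=True)
        value = row.find("p", class_="value").get_text(strip=True)
        rows.append((name, value))
    return rows


def to_csv(rows: list[tuple[str, str]]) -> str:
    """Step 6: store the result in the desired format (CSV text here)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(HTML)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping extraction and storage in separate functions means that, if the site changes its markup, only `extract_rows` needs to be updated.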

All the steps mentioned above are shown below by performing an actual web scrape that can help with investing.

Let’s begin with the Art of Web Scraping

Now, we will extract data from the official NSE website, i.e., https://www.nseindia.com/

“Be fearful when others are greedy. Be greedy when others are fearful.” – Warren Buffett

With the help of web scraping, one can get a sense of when people are fearful and which stocks may be worth investing in, even in a bearish market.

To perform web scraping as described above, we first need to install some libraries:

· pandas

· beautifulsoup4 (imported as bs4, which provides the BeautifulSoup class)

· selenium

· webdriver-manager (its webdriver_manager.chrome module provides ChromeDriverManager)

The code for importing them, opening the NSE website, and parsing the page is:

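import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Download a matching ChromeDriver and start Chrome.
# Selenium 4 expects the driver path wrapped in a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.nseindia.com/")

# Grab the rendered HTML, close the browser, and parse the page.
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')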

OUTPUT:

https://docs.google.com/document/d/1x-M-AoeQbibmudqKa6cXSHkSt2S0mdqRXf-vol7VRPM/edit?usp=sharing

Now, let’s check whether we are on the correct website.

For this check, we will use the Beautiful Soup library.

The code for the same is:

print("Title of the website is : ") 
for title in soup.find_all('title'): 
    print(title.get_text())

OUTPUT:

Title of the website is :

NSE - National Stock Exchange of India Ltd: Live Share/Stock Market News & Updates, Quotes- Nseindia.com

Now open the NSE site in another browser tab, look at it for a moment, and try to identify the different tags. To see the tag names used on the actual website, open Inspect Element.

What is Inspect Element?

Inspect Element is one of the developer tools built into the Google Chrome, Firefox, Safari, and Internet Explorer web browsers. With this tool, one can see, and even alter, the HTML and CSS source code behind the web content.

Inspect Element lets you view the source code of a website. There are two ways to open it:

1. Right-click on the web page and select inspect element

2. Use shortcut key – Ctrl + Shift + I

Source – It is a screenshot from my Laptop

After opening Inspect Element, look for the market or index for which you want to extract data. On the NSE home page, this information sits inside ‘p’ (paragraph) tags, most of which carry a class attribute. Hence, to extract everything that is inside ‘p’ tags, we will use the code:

para=soup.select('p')
para

OUTPUT:

[<p class="greenTxt">Debt Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p class="greenTxt" id="capMarStat">Normal Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p class="greenTxt">Currency Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p class="greenTxt">Commodity Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p class="greenTxt">Debt Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p class="greenTxt" id="">Normal Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p class="greenTxt">Currency Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p class="greenTxt">Commodity Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p class="greenTxt">Debt Market is Open</p>,

<p class="next_date">Current

                                            Trading Date - 01-Nov-2021</p>,

<p>Find everything about the leading stock exchange of India</p>,

<p>Check NSE's group of the companies</p>,

<p>Browse a comprehensive and innovative product and service offerings by NSE</p>,

<p>Live Analysis of top gainers/losers, most active securities/contracts, price band hitters, overview of the market.</p>,

<p>View Option chain for the exchange</p>,

We now have the information about the different markets along with their trading dates, but the raw tags are not very readable. To make the output easier to read, we will print only the text of each tag:

para = soup.find_all('p')
for p in para:
    print(p.get_text())

OUTPUT:

https://docs.google.com/document/d/1-YoeGDLMbpdUBQKRdYOqTfMy-fAWU_KPCnJnHFajdd0/edit?usp=sharing

Finally, the output is readable and easy to understand.
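If only the market status lines are of interest, the ‘p’ tags can also be filtered by class instead of printing every paragraph. The following is a minimal sketch, assuming the markup looks like the sample output shown above; the `html` string below is copied from that output rather than fetched live, and the variable names are my own. It uses Beautiful Soup's `select` method with a CSS selector.

```python
from bs4 import BeautifulSoup

# A few <p> tags copied from the sample output above (not fetched live).
html = """
<p class="greenTxt">Debt Market is Open</p>
<p class="next_date">Current
                                Trading Date - 01-Nov-2021</p>
<p class="greenTxt" id="capMarStat">Normal Market is Open</p>
<p class="next_date">Current
                                Trading Date - 01-Nov-2021</p>
<p>Find everything about the leading stock exchange of India</p>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: only <p> elements whose class list contains "greenTxt".
statuses = [p.get_text(strip=True) for p in soup.select("p.greenTxt")]

# The date text contains a line break and padding; split() + join() collapses it.
dates = [" ".join(p.get_text().split()) for p in soup.select("p.next_date")]

for status, date in zip(statuses, dates):
    print(f"{status} | {date}")
```

The paragraph without a class is skipped, so the descriptive text of the page no longer gets mixed in with the market status.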

Now, let’s dig deeper and look up an index. I will choose the NIFTY indices; you can choose any index you like.

To get the NIFTY index names, we will use the code:

Nifty = soup.find_all('p', {'class': 'tb_name'})
for name in Nifty:
    print(name.get_text())

OUTPUT:

NIFTY 50
NIFTY NEXT 50
NIFTY MIDCAP 50
NIFTY BANK
NIFTY FINANCIAL SERVICES

Now let’s find the value of each NIFTY index. For this, we’ll use the code:

Nifty = soup.find_all('p', {'class': 'tb_name'})
value = soup.find_all('p', {'class': 'tb_val'})
for Nifty_name in Nifty:
    print(Nifty_name.get_text())
for Nifty_value in value:
    print(Nifty_value.get_text())

OUTPUT:

NIFTY 50
NIFTY NEXT 50
NIFTY MIDCAP 50
NIFTY BANK
NIFTY FINANCIAL SERVICES
17,802.00
42,443.10
8,606.30
39,400.55
18,829.70

We now have the information we need to understand today’s index levels for options trading.
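The output above lists all the names first and all the values afterwards, so it is up to the reader to match them. A minimal sketch of pairing each index with its value and turning the text into numbers is shown below. The two lists are copied from the output above; on a live page they would be built from the `tb_name` and `tb_val` tags. The sketch assumes both lists have the same length and order, and the `to_number` helper is a hypothetical name.

```python
# Copied from the output above; on a live page these would come from
# the get_text() of the 'tb_name' and 'tb_val' tags respectively.
names = ["NIFTY 50", "NIFTY NEXT 50", "NIFTY MIDCAP 50",
         "NIFTY BANK", "NIFTY FINANCIAL SERVICES"]
values = ["17,802.00", "42,443.10", "8,606.30", "39,400.55", "18,829.70"]


def to_number(text: str) -> float:
    """Convert a string such as '17,802.00' to a float."""
    return float(text.replace(",", ""))


# zip() pairs the first name with the first value, and so on.
# Assumption: both lists are in the same order and have the same length.
index_values = {name: to_number(value) for name, value in zip(names, values)}

for name, value in index_values.items():
    print(f"{name:<26}{value:>12,.2f}")
```

Once the values are numbers rather than text, they can be compared, sorted, or handed to pandas for storage and further analysis.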

In this article, we extracted a few pieces of information, but you can use the same technique to extract more data.

Here is another example of web scraping.

Let’s use the ‘div’ tag this time.

For this, let’s use the code:

div=soup.find_all("div")
div

OUTPUT:

(The output for this is also not readable and understandable)

https://docs.google.com/document/d/1pNmTNJXFzslTnrxBA1IOoNUAraoKqr2i13o4_fT6gE8/edit?usp=sharing

Let’s make it easier to read.

For this, we’ll use the code:

t = soup.body
for T in t.find_all('div'):
    print(T.text)

OUTPUT:

https://docs.google.com/document/d/1KWpBqoGNYwi1r6ZRdDd0NpJ4BkBJhpUMYfOXlm0mE2U/edit?usp=sharing

Now everything is readable and easy to understand.
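One thing to keep in mind with the ‘div’ loop is that `find_all('div')` also returns nested ‘div’ tags, and the text of an outer ‘div’ includes the text of everything inside it, so the same words can be printed several times. A minimal sketch of one way around this is shown below: keep only the ‘div’ tags that contain no further ‘div’. The HTML snippet and its `id` values are hypothetical and are not taken from the NSE page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one outer <div> wrapping two inner ones.
html = """
<body>
  <div id="outer">
    <div id="a">Market Watch</div>
    <div id="b">
      Top Gainers
    </div>
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

all_divs = soup.body.find_all("div")

# Keep only "leaf" divs, i.e. those that contain no further <div>.
leaf_divs = [d for d in all_divs if d.find("div") is None]

# get_text(" ", strip=True) strips each text fragment and joins them with a space.
texts = [d.get_text(" ", strip=True) for d in leaf_divs]
print(texts)
```

Here the outer ‘div’ is dropped, so each piece of text appears only once.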

This is how one can practice the art of web scraping.

ABOUT THE AUTHOR

A 3rd-year (5th semester) student at CHRIST University, Lavasa, Pune Campus, currently pursuing a BBA (Business Analytics).

Website – acumenfinalysis.com (CHECK THIS OUT)

Contacts:

If you have any questions or suggestions on what my next article should be about, please leave a comment below or write to me at aryanbajaj104@gmail.com.

If you want to keep updated with my latest articles and projects, follow me on Medium.


The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion
