Welcome to the fascinating world of stock market anomaly detection! In this project, we’ll dive into the historical data of Google’s stock from 2014-2022 and use cutting-edge anomaly detection techniques to uncover hidden patterns and gain insights into the stock market. By identifying outliers and other anomalies, we aim to understand stock market trends better and potentially discover new investment opportunities. With the power of Python and the Scikit-learn library at our fingertips, we’re ready to embark on a thrilling data science journey that could change how we view the stock market forever. So, fasten your seatbelts and get ready to discover the unknown!
In this project-based blog, we will explore anomaly detection in Google stock data from 2014 to 2022. The dataset is obtained from Kaggle, and you can download it here. It contains 106 rows and 7 columns of monthly stock price data for Google, also known as Alphabet Inc. (GOOGL). Its features include the opening, closing, highest, and lowest prices, the volume of shares traded, the percentage change from the previous month, and the starting date of each month.
Problem statement
This project aims to analyze the Google stock data from 2014-2022 and use anomaly detection techniques to uncover hidden patterns and outliers in the data. We will use the Scikit-learn library in Python to construct and train a model to detect anomalous data points within the dataset. Finally, we will analyze and interpret our results to draw meaningful conclusions about the stock market.
Missing values
Missing values are a common issue in datasets. A missing value refers to a data point that is absent or unknown in a particular variable or column. This can occur for various reasons, such as incomplete data entry, data corruption, or data loss during collection or processing. Let’s check if we have any missing values in our dataset.
Python Code:
import pandas as pd

# Loading the dataset and checking for missing values in each column
data = pd.read_excel('Google Dataset.xlsx')
print(data.head())
print(data.isnull().sum())
Finding data points that have a 0.0% change from the previous month’s value:
data[data['Change %']==0.0]
Changing the ‘Month Starting’ column to a date datatype:
data['Month Starting'] = pd.to_datetime(data['Month Starting'], errors='coerce')

# Replacing the missing values after cross-verifying them against the source data
data.loc[31, 'Month Starting'] = pd.to_datetime('2020-05-01')
data.loc[43, 'Month Starting'] = pd.to_datetime('2019-05-01')
data.loc[55, 'Month Starting'] = pd.to_datetime('2018-05-01')
Exploratory Data Analysis (EDA) is an important first step in analyzing a dataset: it involves examining and summarizing the main characteristics of the data. Data visualization is one of the most powerful and widely used tools in EDA; it allows us to visually explore patterns and trends in the data and can reveal relationships, outliers, and potential errors.
Change in the stock price over the years:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(25,5))
plt.plot(data['Month Starting'], data['Open'], label='Open')
plt.plot(data['Month Starting'], data['Close'], label='Close')
plt.xlabel('Year')
plt.ylabel('Price')
plt.legend()
plt.title('Change in the stock price of Google over the years')
plt.show()
# Calculating the monthly returns
data['Returns'] = data['Close'].pct_change()

# Calculating the rolling average of the returns
data['Rolling Average'] = data['Returns'].rolling(window=30).mean()

plt.figure(figsize=(10,5))
# Creating a line plot with 'Month Starting' on the x-axis
# and 'Rolling Average' on the y-axis
sns.lineplot(x='Month Starting', y='Rolling Average', data=data)
plt.show()
Correlation between variables
Correlation is a statistical measure of the degree to which two or more variables are related. It is a useful tool in data analysis because it helps identify patterns and relationships between variables and shows the extent to which changes in one variable are associated with changes in another. To find the correlations between the variables in our data, we can use the built-in corr() function, which returns a correlation matrix with values ranging from -1.0 to 1.0. The closer a value is to 1.0, the stronger the positive correlation between the two variables; the closer it is to -1.0, the stronger the negative correlation. A heatmap visually represents the correlation intensity, with darker colors indicating stronger correlations and lighter colors indicating weaker ones, making it a quick way to spot relationships between variables and guide further analysis.
# Computing pairwise correlations between the numeric columns
corr = data.corr(numeric_only=True)

plt.figure(figsize=(10,10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
Scaling the returns using StandardScaler
To ensure that the returns are normalized to zero mean and unit variance, we use the StandardScaler from the Scikit-learn library. We first import the StandardScaler class and create an instance of it, then fit it to the Returns column and transform that column in one step using the fit_transform method. This scales the data to zero mean and unit variance, which many machine learning algorithms need to function properly.
from sklearn.preprocessing import StandardScaler

# Scaling the returns to zero mean and unit variance
scaler = StandardScaler()
data['Returns'] = scaler.fit_transform(data['Returns'].values.reshape(-1,1))
data.head()
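As a quick sanity check, and assuming the scaler was applied to the Returns column exactly as above, the transformed values should now have roughly zero mean and unit variance (the first entry is still NaN because pct_change() has no prior value, and pandas ignores it here):

# The mean should be close to 0 and the standard deviation close to 1
print(data['Returns'].mean(), data['Returns'].std())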
Handling Unexpected Missing Values
# Filling the NaNs introduced by pct_change() and the rolling window with the column means
data['Returns'] = data['Returns'].fillna(data['Returns'].mean())
data['Rolling Average'] = data['Rolling Average'].fillna(data['Rolling Average'].mean())
Now that the data has been preprocessed and analyzed, we are ready to develop a model for anomaly detection. We will use the Scikit-learn library in Python to construct and train a model to detect anomalous data points within the dataset.
We will use the Isolation Forest algorithm to detect anomalies. Isolation Forest is an unsupervised machine learning algorithm that isolates observations by randomly selecting a feature and then randomly selecting a split value between that feature’s minimum and maximum. The splitting is repeated recursively, and anomalous points, being few and different, tend to be isolated in fewer splits, so a short average path length across the trees marks a point as an anomaly.
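Before building the model on our returns, here is a minimal, self-contained sketch of the idea on a hypothetical toy array (not part of our dataset): the value far from the rest is isolated in fewer random splits and therefore receives the lowest anomaly score.

import numpy as np
from sklearn.ensemble import IsolationForest

# Toy 1-D data (illustrative only): most values cluster near zero,
# while the last value sits far away from the rest
toy = np.array([0.1, -0.2, 0.05, 0.0, 0.15, -0.1, 5.0]).reshape(-1, 1)

iso = IsolationForest(random_state=42)
iso.fit(toy)

print(iso.predict(toy))        # -1 typically flags the point that is easiest to isolate
print(iso.score_samples(toy))  # lower (more negative) scores indicate more anomalous points

Note the convention used by Scikit-learn: predict returns -1 for anomalies and 1 for normal points, which is why we remap these labels below.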
We will use the Scikit-learn library to construct and train our Isolation Forest model. The following code snippet shows how to construct and train the model.
from sklearn.ensemble import IsolationForest

# Training an Isolation Forest, assuming roughly 5% of the points are anomalous
model = IsolationForest(contamination=0.05)
model.fit(data[['Returns']])

# Predicting anomalies: the model returns -1 for anomalies and 1 for normal points
data['Anomaly'] = model.predict(data[['Returns']])
data['Anomaly'] = data['Anomaly'].map({1: 0, -1: 1})
# Plotting the results
plt.figure(figsize=(13,5))
plt.plot(data.index, data['Returns'], label='Returns')
plt.scatter(data[data['Anomaly'] == 1].index, data[data['Anomaly'] == 1]['Returns'], color='red')
plt.legend(['Returns', 'Anomaly'])
plt.show()
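To see exactly which months the model flagged, we can filter the rows where the Anomaly column equals 1:

# Listing the months flagged as anomalous, along with their scaled returns
anomalies = data[data['Anomaly'] == 1]
print(anomalies[['Month Starting', 'Returns']])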
This project-based blog explored anomaly detection in Google stock data from 2014-2022. We used the Scikit-learn library in Python to construct and train an Isolation Forest model to detect anomalous data points within the dataset.
Our model was able to uncover hidden patterns and outliers in the data, and we were able to draw meaningful conclusions about the stock market. We found that the stock price has increased since 2017 and that the rolling mean decreased in 2019. We also found that the Open price correlates more with the Close price than any other feature.
Overall, this project was a great success and has opened up new possibilities for stock market analysis and anomaly detection.