Welcome to the fascinating world of stock market anomaly detection! In this project, we’ll dive into the historical data of Google’s stock from 2014-2022 and use cutting-edge anomaly detection techniques to uncover hidden patterns and gain insights into the stock market. By identifying outliers and other anomalies, we aim to understand stock market trends better and potentially discover new investment opportunities. With the power of Python and the Scikit-learn library at our fingertips, we’re ready to embark on a thrilling data science journey that could change how we view the stock market forever. So, fasten your seatbelts and get ready to discover the unknown!
In this project-based blog, we will explore anomaly detection in Google stock data from 2014 to 2022. The dataset is obtained from Kaggle, and you can download it here. It contains 106 rows and 7 columns of monthly stock price data for Google, also known as Alphabet Inc. (GOOGL). Its features include the opening, closing, highest, and lowest prices, the volume of shares traded, the percentage change from the previous month, and the starting date of each month.
Problem statement
This project aims to analyze the Google stock data from 2014-2022 and use anomaly detection techniques to uncover hidden patterns and outliers in the data. We will use the Scikit-learn library in Python to construct and train a model to detect anomalous data points within the dataset. Finally, we will analyze and interpret our results to draw meaningful conclusions about the stock market.
Missing values
Missing values are a common issue in datasets. A missing value refers to a data point that is absent or unknown in a particular variable or column. This can occur for various reasons, such as incomplete data entry, data corruption, or data loss during collection or processing. Let’s check if we have any missing values in our dataset.
Python Code:
import pandas as pd

# Loading the dataset and checking for missing values in each column
data = pd.read_excel('Google Dataset.xlsx')
print(data.head())
print(data.isnull().sum())
Finding data points that have a 0.0% change from the previous month’s value:
data[data['Change %']==0.0]
Changing the ‘Month Starting’ column to a date datatype:
data['Month Starting'] = pd.to_datetime(data['Month Starting'], errors='coerce')

# Replacing the missing values after cross-verifying them against the source data
data.loc[31, 'Month Starting'] = pd.to_datetime('2020-05-01')
data.loc[43, 'Month Starting'] = pd.to_datetime('2019-05-01')
data.loc[55, 'Month Starting'] = pd.to_datetime('2018-05-01')
Exploratory Data Analysis (EDA) is an important first step in analyzing a dataset: it involves examining and summarizing the main characteristics of the data. Data visualization is one of the most powerful and widely used tools in EDA; it allows us to visually explore patterns and trends in the data and can reveal relationships, outliers, and potential errors.
Change in the stock price over the years:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(25,5))
plt.plot(data['Month Starting'], data['Open'], label='Open')
plt.plot(data['Month Starting'], data['Close'], label='Close')
plt.xlabel('Year')
plt.ylabel('Price')
plt.legend()
plt.title('Change in the stock price of Google over the years')
plt.show()
# Calculating the monthly returns
data['Returns'] = data['Close'].pct_change()

# Calculating the rolling average of the returns
data['Rolling Average'] = data['Returns'].rolling(window=30).mean()

plt.figure(figsize=(10,5))
# Creating a line plot with 'Month Starting' on the x-axis
# and 'Rolling Average' on the y-axis
sns.lineplot(x='Month Starting', y='Rolling Average', data=data)
plt.show()
Correlation between variables
Correlation is a statistical measure of the degree to which two or more variables are related. It is a useful tool in data analysis because it helps identify patterns and relationships between variables and shows the extent to which changes in one variable are associated with changes in another. To find the correlations between the variables in our data, we can use the built-in corr() function, which returns a correlation matrix with values ranging from -1.0 to 1.0. The closer a value is to 1.0, the stronger the positive correlation between the two variables; the closer it is to -1.0, the stronger the negative correlation. A heatmap visually represents the correlation intensity, with darker colors indicating stronger correlations and lighter colors indicating weaker ones, making it a quick way to spot relationships between variables and guide further analysis.
# Computing pairwise correlations between the numeric columns
corr = data.corr(numeric_only=True)

plt.figure(figsize=(10,10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
Scaling the returns using StandardScaler
To ensure that the returns are normalized to zero mean and unit variance, we use the StandardScaler from the Scikit-learn library. We first import the StandardScaler class and create an instance of it, then fit it to the Returns column and transform that column in one step using the fit_transform method. This scales the data to zero mean and unit variance, which many machine learning algorithms need to function properly.
from sklearn.preprocessing import StandardScaler

# Scaling the returns to zero mean and unit variance
scaler = StandardScaler()
data['Returns'] = scaler.fit_transform(data['Returns'].values.reshape(-1,1))
data.head()
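As a quick sanity check, and assuming the scaler was applied to the Returns column exactly as above, the transformed values should now have roughly zero mean and unit variance (the first entry is still NaN because pct_change() has no prior value, and pandas ignores it here):

# The mean should be close to 0 and the standard deviation close to 1
print(data['Returns'].mean(), data['Returns'].std())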
Handling Unexpected Missing Values
# Filling the NaNs introduced by pct_change() and the rolling window with the column means
data['Returns'] = data['Returns'].fillna(data['Returns'].mean())
data['Rolling Average'] = data['Rolling Average'].fillna(data['Rolling Average'].mean())
Now that the data has been preprocessed and analyzed, we are ready to develop a model for anomaly detection. We will use the Scikit-learn library in Python to construct and train a model to detect anomalous data points within the dataset.
We will use the Isolation Forest algorithm to detect anomalies. Isolation Forest is an unsupervised machine learning algorithm that isolates observations by randomly selecting a feature and then randomly selecting a split value between that feature’s minimum and maximum. The splitting is repeated recursively, and anomalous points, being few and different, tend to be isolated in fewer splits, so a short average path length across the trees marks a point as an anomaly.
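Before building the model on our returns, here is a minimal, self-contained sketch of the idea on a hypothetical toy array (not part of our dataset): the value far from the rest is isolated in fewer random splits and therefore receives the lowest anomaly score.

import numpy as np
from sklearn.ensemble import IsolationForest

# Toy 1-D data (illustrative only): most values cluster near zero,
# while the last value sits far away from the rest
toy = np.array([0.1, -0.2, 0.05, 0.0, 0.15, -0.1, 5.0]).reshape(-1, 1)

iso = IsolationForest(random_state=42)
iso.fit(toy)

print(iso.predict(toy))        # -1 typically flags the point that is easiest to isolate
print(iso.score_samples(toy))  # lower (more negative) scores indicate more anomalous points

Note the convention used by Scikit-learn: predict returns -1 for anomalies and 1 for normal points, which is why we remap these labels below.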
We will use the Scikit-learn library to construct and train our Isolation Forest model. The following code snippet shows how to construct and train the model.
from sklearn.ensemble import IsolationForest

# Training an Isolation Forest, assuming roughly 5% of the points are anomalous
model = IsolationForest(contamination=0.05)
model.fit(data[['Returns']])

# Predicting anomalies: the model returns -1 for anomalies and 1 for normal points
data['Anomaly'] = model.predict(data[['Returns']])
data['Anomaly'] = data['Anomaly'].map({1: 0, -1: 1})
# Plotting the results
plt.figure(figsize=(13,5))
plt.plot(data.index, data['Returns'], label='Returns')
plt.scatter(data[data['Anomaly'] == 1].index, data[data['Anomaly'] == 1]['Returns'], color='red')
plt.legend(['Returns', 'Anomaly'])
plt.show()
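To see exactly which months the model flagged, we can filter the rows where the Anomaly column equals 1:

# Listing the months flagged as anomalous, along with their scaled returns
anomalies = data[data['Anomaly'] == 1]
print(anomalies[['Month Starting', 'Returns']])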
This project-based blog explored anomaly detection in Google stock data from 2014-2022. We used the Scikit-learn library in Python to construct and train an Isolation Forest model to detect anomalous data points within the dataset.
Our model was able to uncover hidden patterns and outliers in the data, and we were able to draw meaningful conclusions about the stock market. We found that the stock price has increased since 2017 and that the rolling mean decreased in 2019. We also found that the Open price correlates more with the Close price than any other feature.
Overall, this project was a great success and has opened up new possibilities for stock market analysis and anomaly detection.