Stock Market Price Trend Prediction Using Time Series Forecasting

Prabhat Last Updated : 17 Oct, 2024

7 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Time series forecasting is used to predict future values based on previously observed values and one of the best tools for trend analysis and future prediction.

What is time-series data?

It is recorded at regular time intervals, and the order of these data points is important. Therefore, any predictive model based on time series data will have time as an independent variable. The output of a model would be the predicted value or classification at a specific time.

Time series analysis vs time series forecasting

Let’s talk about some possible confusion about the Time Series Analysis and Forecasting. Time series forecasting is an example of predictive modeling whereas time series analysis is a form of descriptive modeling.

For a new investor general research which is associated with the stock or share market is not enough to make the decision. The common trend towards the stock market among the society is highly risky for investment so most of the people are not able to make decisions based on common trends. The seasonal variance and steady flow of any index will help both existing and new investors to understand and make a decision to invest in the share market.

To solve this kind of problem time series forecasting is the best technique.

Stock market

Stock markets are where individual and institutional investors come together to buy and sell shares in a public venue. Nowadays these exchanges exist as electronic marketplaces.

That supply and demand help determine the price for each security or the levels at which stock market participants — investors and traders — are willing to buy or sell.

The concept behind how the stock market works is pretty simple. Operating much like an auction house, the stock market enables buyers and sellers to negotiate prices and make trades.

Definition of ‘Stock’

A Stock or share (also known as a company’s “equity”) is a financial instrument that represents ownership in a company

Machine learning in the stock market

The stock market is very unpredictable, any geopolitical change can impact the share trend of stocks in the share market, recently we have seen how covid-19 has impacted the stock prices, which is why on financial data doing a reliable trend analysis is very difficult. The most efficient way to solve this kind of issue is with the help of Machine learning and Deep learning.

In this tutorial, we will be solving this problem with ARIMA Model.

A popular and widely used statistical method for time series forecasting is the ARIMA model. It is one of the most popular models to predict linear time series data. This model has been used extensively in the field of finance and economics as it is known to be robust, efficient, and has a strong potential for short-term share market prediction. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the auto-correlation(Autocorrelation is the degree of similarity between a given time series and a lagged version of itself over successive time intervals) in the data.

To know about seasonality please refer to my previous blog, And to get a basic understanding of ARIMA I would recommend you to go through this blog, this will help you to get a better understanding of how Time Series analysis works.

Implementing stock price forecasting

I will be using nsepy library to extract the historical data for SBIN.

Imports and Reading Data

Python Code:

import os
import warnings
warnings.filterwarnings('ignore')
from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA
from pmdarima.arima import auto_arima
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math
import numpy as np
from nsepy import get_history
from datetime import date
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

sbin = get_history(symbol='SBIN',
                   start=date(2000,1,1),
                   end=date(2020,11,1))
print(sbin.head())

The data shows the stock price of SBIN from 2020-1-1 to 2020-11-1. The goal is to create a model that will forecast

the closing price of the stock.

Let us create a visualization which will show per day closing price of the stock-

^{plt.figure(figsize=(10,6))}
^{plt.grid(True)}
^{plt.xlabel('Dates')}
^{plt.ylabel('Close Prices')}
^{plt.plot(sbin['Close'])}
^{plt.title('SBIN closing price')}
^plt.show()

^{plt.figure(figsize=(10,6))}
^{df_close = sbin['Close']}
^{df_close.plot(style='k.')}
^{plt.title('Scatter plot of closing price')}
^plt.show()

^{plt.figure(figsize=(10,6))}
^{df_close = sbin['Close']}
^{df_close.plot(style='k.',kind='hist')}
^{plt.title('Hisogram of closing price')}
^plt.show()

First, we need to check if a series is stationary or not because time series analysis only works with stationary data.

Testing For Stationarity:

To identify the nature of the data, we will be using the null hypothesis.

H0: The null hypothesis: It is a statement about the population that either is believed to be true or is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.

H1: The alternative hypothesis: It is a claim about the population that is contradictory to H₀ and what we conclude when we reject H₀.

#Ho: It is non-stationary

#H1: It is stationary

If we fail to reject the null hypothesis, we can say that the series is non-stationary. This means that the series can be linear.

If both mean and standard deviation are flat lines(constant mean and constant variance), the series becomes stationary.

from statsmodels.tsa.stattools import adfuller
^{def test_stationarity(timeseries):}
^{#Determing rolling statistics}
^{rolmean = timeseries.rolling(12).mean()}
^{rolstd = timeseries.rolling(12).std()}
^{#Plot rolling statistics:}
^{plt.plot(timeseries, color='yellow',label='Original')}
^{plt.plot(rolmean, color='red', label='Rolling Mean')}
^{plt.plot(rolstd, color='black', label = 'Rolling Std')}
^{plt.legend(loc='best')}
^{plt.title('Rolling Mean and Standard Deviation')}
^{plt.show(block=False)}
^{print("Results of dickey fuller test")}
^{adft = adfuller(timeseries,autolag='AIC')}
^{# output for dft will give us without defining what the values are.}
^{#hence we manually write what values does it explains using a for loop}
^{output = pd.Series(adft[0:4],index=['Test Statistics','p-value','No. of lags used','Number of observations used'])}
^{for key,values in adft[4].items():}
^{output['critical value (%s)'%key] =  values}
^{print(output)}

^{test_stationarity(sbin['Close'])}

Rolling mean and SD - Time Series forecasting

After analysing the above graph, we can see the increasing mean and standard deviation and hence our series is not stationary.

Results of dickey fuller test
Test Statistics                  -1.914523
p-value                           0.325260
No. of lags used                  3.000000
Number of observations used    5183.000000
critical value (1%)              -3.431612
critical value (5%)              -2.862098
critical value (10%)             -2.567067
dtype: float64

We see that the p-value is greater than 0.05 so we cannot reject the Null hypothesis. Also, the test statistics is greater than the critical values. so the data is non-stationary.

For time series analysis we separate Trend and Seasonality from the time series.

^{result = seasonal_decompose(df_close, model='multiplicative', freq = 30)}
^{fig = plt.figure()}
^{fig = result.plot()}
^{fig.set_size_inches(16, 9)}

from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
df_log = np.log(sbin['Close'])
moving_avg = df_log.rolling(12).mean()
std_dev = df_log.rolling(12).std()
plt.legend(loc='best')
plt.title('Moving Average')
plt.plot(std_dev, color ="black", label = "Standard Deviation")
plt.plot(moving_avg, color="red", label = "Mean")
plt.legend()
plt.show()

Time Series forecasting - moving average

Now we are going to create an ARIMA model and will train it with the closing price of the stock on the train data. So let us split the data into training and test set and visualize it.

^{train_data, test_data = df_log[3:int(len(df_log)*0.9)], df_log[int(len(df_log)*0.9):]}
^{plt.figure(figsize=(10,6))}
^{plt.grid(True)}
^{plt.xlabel('Dates')}
^{plt.ylabel('Closing Prices')}
^{plt.plot(df_log, 'green', label='Train data')}
^{plt.plot(test_data, 'blue', label='Test data')}
^plt.legend()

^{model_autoARIMA = auto_arima(train_data, start_p=0, start_q=0,}^{test='adf',       # use adftest to find optimal 'd'}
^{max_p=3, max_q=3, # maximum p and q}
^{m=1,              # frequency of series}
^{d=None,           # let model determine 'd'}
^{seasonal=False,   # No Seasonality}
^start_P=0,
^D=0,
^trace=True,
^{error_action='ignore',}
^{suppress_warnings=True,}
^{stepwise=True)}
^{print(model_autoARIMA.summary())}

Performing stepwise search to minimize aic
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=-16607.561, Time=2.19 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=-16607.961, Time=0.95 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=-16608.035, Time=2.27 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=-16609.560, Time=0.39 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=-16606.477, Time=2.77 sec

Best model:  ARIMA(0,1,0)(0,0,0)[0]          
Total fit time: 9.079 seconds
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                 4665
Model:               SARIMAX(0, 1, 0)   Log Likelihood                8305.780
Date:                Tue, 24 Nov 2020   AIC                         -16609.560
Time:                        20:08:50   BIC                         -16603.113
Sample:                             0   HQIC                        -16607.293
                               - 4665                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
sigma2         0.0017   1.06e-06   1566.660      0.000       0.002       0.002
===================================================================================
Ljung-Box (Q):                       24.41   Jarque-Bera (JB):         859838819.58
Prob(Q):                              0.98   Prob(JB):                         0.00
Heteroskedasticity (H):               7.16   Skew:                           -37.54
Prob(H) (two-sided):                  0.00   Kurtosis:                      2105.12
===================================================================================

model_autoARIMA.plot_diagnostics(figsize=(15,8))
plt.show()

model = ARIMA(train_data, order=(3, 1, 2))
fitted = model.fit(disp=-1)
print(fitted.summary())

                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.Close   No. Observations:                 4664
Model:                 ARIMA(3, 1, 2)   Log Likelihood                8309.178
Method:                       css-mle   S.D. of innovations              0.041
Date:                Tue, 24 Nov 2020   AIC                         -16604.355
Time:                        20:09:37   BIC                         -16559.222
Sample:                             1   HQIC                        -16588.481
                                                                              
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const          8.761e-06      0.001      0.015      0.988      -0.001       0.001
ar.L1.D.Close     1.3689      0.251      5.460      0.000       0.877       1.860
ar.L2.D.Close    -0.7118      0.277     -2.567      0.010      -1.255      -0.168
ar.L3.D.Close     0.0094      0.021      0.445      0.657      -0.032       0.051
ma.L1.D.Close    -1.3468      0.250     -5.382      0.000      -1.837      -0.856
ma.L2.D.Close     0.6738      0.282      2.391      0.017       0.122       1.226
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            0.9772           -0.6979j            1.2008           -0.0987
AR.2            0.9772           +0.6979j            1.2008            0.0987
AR.3           74.0622           -0.0000j           74.0622           -0.0000
MA.1            0.9994           -0.6966j            1.2183           -0.0969
MA.2            0.9994           +0.6966j            1.2183            0.0969
-----------------------------------------------------------------------------

# Forecast

fc, se, conf = fitted.forecast(519, alpha=0.05)  # 95% confidence

fc_series = pd.Series(fc, index=test_data.index)
lower_series = pd.Series(conf[:, 0], index=test_data.index)
upper_series = pd.Series(conf[:, 1], index=test_data.index)
plt.figure(figsize=(12,5), dpi=100)
plt.plot(train_data, label='training')
plt.plot(test_data, color = 'blue', label='Actual Stock Price')
plt.plot(fc_series, color = 'orange',label='Predicted Stock Price')
plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.10)
plt.title('SBIN Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Actual Stock Price')
plt.legend(loc='upper left', fontsize=8)
plt.show()

Conclusion

Time Series forecasting is really useful when we have to take future decisions or we have to do analysis, we can quickly do that using ARIMA, there are lots of other Models from we can do the time series forecasting but ARIMA is really easy to understand.

I hope this article will help you and save a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

Prabhat Pathak (Linkedin profile) is a Senior Analyst and innovation Enthusiast.

Prabhat

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Gaurav Sinha

Very informative. Keep sharing the knowledge.

Bhargav Mahesh

Hey Prabhat, I thoroughly enjoyed reading your blog post on "How to Create an Arima Model" and this "Stock Price Prediction using Auto Arima". Both the posts are very informative for a Machine Learning beginner like me. But in this blog post I wanted to inform you about making a small correction in the code where you have splitted the Train and Test data. While plotting the training data, you have mistook the entire data as an argument instead of train set[ #### plt.plot(df_log, 'green', label='Train data')], please have a look at that. Also I wanted to ask you 4 questions. 1. Your code contains this statement [###fc, se, conf = fitted.forecast(519, alpha=0.05) # 95% confidence]. Can you explain this statement? What is 519 in that code? 2. While training your model you have used this code [model = ARIMA(train_data, order=(3, 1, 2)) ]. can you please tell me how did you use (p, d, q) = (3,1,2) without plotting ACF and PACF plots? 3. In the Model Summary, I'm getting NaN's for columns [std err, z, P>|z|, [0.025 0.975]]. What does this indicate? 4. What is the reason for using Log Transform for target variable? (anything related to guassian distribution?) Forgive me if the above mentioned questions look silly for you. If you could walk me through these questions, it would be more helpful to me. Thanks Regards Bhargav Mahesh

Ayushi Jaiswal

Quite informative and interesting 👍

Reading list

Introduction

Common Patterns

Validation Techniques

Time Series Forecasting

Exponential Smoothing

ARIMA

Prophet

Deep Learning

Stock Market Price Trend Prediction Using Time Series Forecasting

Introduction

What is time-series data?

Time series analysis vs time series forecasting

Stock market

Definition of ‘Stock’

Machine learning in the stock market

Implementing stock price forecasting

Imports and Reading Data

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme

li_theme_set

Google (11)

_gcl_au

SID

SAPISID

__Secure-#

APISID

SSID

HSID