Building a better forecasting model starts with detecting the components of the time series. Decomposing the series into its trend, seasonal, and cyclical components and removing their effects is essential for data quality, because stationary data with no significant trend or seasonal component is rare in practice. Many techniques are available, so understanding their advantages and disadvantages and choosing the right one plays a vital role in meeting the objective. In this article, we will learn the essential steps for selecting a suitable decomposition technique by applying each one in Python.
With this objective in mind, I have structured the article to give in-depth details on the techniques for detecting and removing the various time series components.
This article was published as a part of the Data Science Blogathon
Time-series data has four major components: trend, seasonality, cyclical, and irregular. Before we proceed, it is essential to get acquainted with these components and to understand how they differ from one another.
Graphically, these components can be distinguished as shown in the figure below:
Traditional forecasting techniques (moving average and exponential smoothing) work well for fairly stable data with no significant trend or seasonality. Before applying any forecasting model, however, it is best practice to check for trend and seasonality, since many time-series datasets are affected by both; finding and removing these components leads to a better forecast. The figure below shows a flow chart that can serve as a general procedure for handling time series data.
Let's apply what we have learned to a real-world industry dataset.
Case-1 uses the Steel Wastage Sales dataset covering a period of 4 years (2018-2022), in which a project-infrastructure company recorded its steel waste sales data and wanted to forecast the selling rate to reconcile project cost. Let's visualize the data using Python libraries.
The easiest way to begin detecting a trend is to draw a line plot with the seaborn library and look for any long-term upward or downward movement.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_excel(r"C:Users..Time Series ForecastingScrapRate04.xlsx")
plt.figure(figsize=(15,5))
sns.lineplot(x='Date', y='Rate in Rs./Kg.', data=df, legend=True, color='r', label='Actual Trend')
plt.ylabel('Scrap Rate', fontsize=15)
plt.xlabel('Date of Sale', fontsize=15)
plt.title('Steel Scrap Rate:2018-2022', fontsize=20);
Prima facie, our data appears to follow an upward trend from 2021 onward, but to support this observation more rigorously, we rely on the robust methods described below.
The Hodrick-Prescott (HP) filter is a widely used technique for detecting trends in time series datasets. Mathematically (fig-5), it minimizes the sum of two terms: (1) the sum of squared deviations of the series from the trend, (Yt – Tt), which penalizes the cyclical component, and (2) λ times the sum of squared second differences of the trend component, which penalizes variation in the growth rate of the trend.
In the above figure, L denotes the lag operator, which refers to previous values in the time series. In practice, common values of λ are 100 for yearly data, 1,600 for quarterly data, and 14,400 for monthly data. The larger the value of λ, the larger the penalty on variation in the growth rate of the trend.
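For reference, the objective described above can be written out in full. This is the standard textbook form of the HP filter using this article's symbols Yt and Tt; the symbol N for the number of observations is notation introduced here, not taken from the original figure.

```latex
\min_{T_1,\dots,T_N}\;
\sum_{t=1}^{N} \left(Y_t - T_t\right)^2
\;+\;
\lambda \sum_{t=2}^{N-1} \left[\left(T_{t+1} - T_t\right) - \left(T_t - T_{t-1}\right)\right]^2
```

The first sum keeps the trend close to the data, and the second sum keeps the trend smooth. With λ = 0 the trend equals the series itself, and as λ grows the trend approaches a straight line.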
from statsmodels.tsa.filters.hp_filter import hpfilter
sw_cycle,sw_trend = hpfilter(df['Rate in Rs./Kg.'], lamb=100)
sw_trend.plot(figsize=(10,5)).autoscale(axis='x',tight=True)
plt.title('Detecting Trend using HP Filter', fontsize=20)
plt.xlabel('Days', fontsize=15)
plt.ylabel('Steel Waste Sales Rate', fontsize=15)
plt.show()
Looking at the above figure, an upward trend is clearly visible, which supports the assumption we made from the line plot.
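To make the role of the two penalty terms concrete, here is a minimal NumPy sketch of what an HP filter computes. It is an illustration under stated assumptions, not the statsmodels implementation: the function name `hp_filter_sketch` and the synthetic series are hypothetical, and a dense matrix is used for readability, so it only suits short series.

```python
import numpy as np

def hp_filter_sketch(y, lamb):
    """Return (cycle, trend) by solving (I + lamb * D'D) trend = y."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # D is the (n-2) x n second-difference matrix; each row is [1, -2, 1].
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    # Minimizing sum((y - trend)^2) + lamb * sum((D @ trend)^2)
    # gives this linear system for the trend.
    trend = np.linalg.solve(np.eye(n) + lamb * D.T @ D, y)
    return y - trend, trend

# Hypothetical data: a straight line, and the same line plus a short wave.
t = np.arange(48)
line = 20.0 + 0.5 * t
series = line + 2.0 * np.sin(2 * np.pi * t / 6)

cycle_line, trend_line = hp_filter_sketch(line, lamb=100)
cycle, trend = hp_filter_sketch(series, lamb=100)
```

A perfectly straight line has zero second differences, so the filter returns it unchanged as the trend and the cycle is zero. For the wavy series, the extracted trend is much smoother than the input, and the wave ends up mostly in the cycle.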
Detrending is the process of removing the trend from time series data. Identifying, modeling, and sometimes removing the trend can noticeably improve the results of later modeling. The flow chart below shows the significance of detecting the trend before attempting any statistical modeling technique.
Differencing the original time series is a common approach for converting a non-stationary process into a stationary one. The first difference is simply the change from one observation to the next: it is computed by subtracting the previous day's value from the current day's value.
Mathematically, it can be expressed as:
$$\Delta Y_t = Y_t - Y_{t-1}$$
The pandas function diff() works on both Series and DataFrame objects and returns the differences directly. Its periods argument sets how many rows to shift when forming the difference. Let's plot the first difference (the current day's value minus the previous day's) as a line plot with the following lines of code.
df['diff'] = df['Rate in Rs./Kg.'].diff()
plt.figure(figsize=(15,6))
plt.plot(df['diff'], color='g', label='First difference')  # label is needed for plt.legend()
plt.title('Detrending using Differencing', fontsize=20)
plt.xlabel('Days', fontsize=15)
plt.ylabel('Steel Waste Rate', fontsize=15)
plt.legend()
plt.show()
Using the differencing method, we can see that the trend has been removed: the plot now has no apparent upward or downward movement. Here, first-order differencing was enough to eliminate the trend. If first-order differencing fails, second- or third-order differencing may be required to meet the objective.
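As a small illustration of when a higher order is needed, the sketch below differences a hypothetical series with a quadratic trend (the coefficients are made up for the example). One round of differencing still leaves a rising series, while a second round leaves a constant.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a quadratic trend: 3 + 2t + 0.5t^2
t = np.arange(10)
quadratic = pd.Series(3.0 + 2.0 * t + 0.5 * t**2)

# First-order differencing: the result still rises steadily (t + 1.5)
first_diff = quadratic.diff()

# Second-order differencing: difference the first difference again
second_diff = quadratic.diff().diff()

print(first_diff.dropna().tolist())
print(second_diff.dropna().tolist())
```

Each round of differencing produces one more leading NaN, which is why `dropna()` is applied before inspecting the values.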
A signal is another form of time series data that rises or falls over time. The SciPy library helps us remove a linear trend from such data. By importing SciPy's signal module, we can detrend the series and plot the result using the lines of code below.
from scipy import signal
import warnings
warnings.filterwarnings("ignore")
detrended = signal.detrend(df['Rate in Rs./Kg.'].values)  # removes the least-squares linear trend
plt.figure(figsize=(15,6))
plt.plot(detrended)
plt.xlabel('Days', fontsize =15)
plt.ylabel('Steel Waste Sales Rate', fontsize= 15)
plt.title('Detrending using Scipy Signal', fontsize=20)
plt.show()
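To see what `signal.detrend` does under the hood, the sketch below compares it with fitting a straight line by least squares and subtracting it. The series is synthetic and purely illustrative.

```python
import numpy as np
from scipy import signal

# Hypothetical series: a linear trend plus random noise
rng = np.random.default_rng(0)
t = np.arange(100)
series = 5.0 + 0.3 * t + rng.normal(scale=1.0, size=t.size)

# signal.detrend uses type='linear' by default
detrended = signal.detrend(series)

# Manual equivalent: fit a line by least squares and subtract it
slope, intercept = np.polyfit(t, series, 1)
manual = series - (slope * t + intercept)

print(np.allclose(detrended, manual, atol=1e-6))
```

Because only a straight line is removed, this method suits data whose trend is roughly linear; a curved trend would leave structure behind in the result.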
Along with detecting the trend (already explained in section 2.1.1), the HP filter has become a benchmark for removing trend movement. It is broadly employed in econometric methods in applied macroeconomic research (for example, by international economic agencies and in government macroeconomic research). This non-parametric technique has a tuning parameter, λ, that controls the degree of smoothing. The filter removes short-term fluctuations from the series, leaving a smooth trend that can then be subtracted.
Treating the dataset given here as yearly data, we use a λ value of 100 in the lines of code below.
from statsmodels.tsa.filters.hp_filter import hpfilter
import warnings
warnings.filterwarnings("ignore")
sw_cycle,sw_trend = hpfilter(df['Rate in Rs./Kg.'],lamb=100)
df['hptrend'] = sw_trend
df['hpdetrended'] = df['Rate in Rs./Kg.'] - df['hptrend']
plt.figure(figsize=(15,6))
plt.plot(df['hpdetrended'], color='darkorange')
plt.title('Detrending using HP Filter', fontsize=20)
plt.xlabel('Days', fontsize=15)
plt.ylabel('Steel Waste Sales Rate', fontsize=15)
plt.show()
The above plot (Fig-5) shows that the trend has been removed from the data.
Limitation :
Seasonality is a periodic fluctuation in which the same pattern recurs at regular intervals within the calendar year. It is measured by the seasonality index.
To detect seasonality, two popular methods are employed.
A box plot shows the spread of a dataset through its first quartile, median, and third quartile, along with its overall range. Using the lines of code below, seasonality can be detected.
df['year'] = pd.DatetimeIndex(df['Date']).year
plt.figure(figsize=(15,6))
sns.boxplot(x='year', y='Rate in Rs./Kg.', data=df).set_title("Multi Year-wise Box Plot")
plt.show()
Looking at the above plot (fig-6), the average rate increased in the months of January to March, which indicates the presence of a seasonality effect. A year-on-year comparison also helps to get more detail.
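The code above groups the data by year. To compare months directly, one option is to group by month instead and look at the same summary statistics a month-wise box plot would draw. The sketch below uses a synthetic monthly dataset in which January to March are deliberately higher; the column names `Rate` and `month` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data over 4 years with a bump in January-March
dates = pd.date_range("2018-01-01", periods=48, freq="MS")
bump = np.where(dates.month <= 3, 5.0, 0.0)
rate = 30.0 + bump + 0.1 * np.arange(48)
demo = pd.DataFrame({"Date": dates, "Rate": rate})

# Group by calendar month rather than by year
demo["month"] = demo["Date"].dt.month
monthly_median = demo.groupby("month")["Rate"].median()

print(monthly_median)
```

On real data, the same `month` column can be passed as the `x` argument of `sns.boxplot` to draw a month-wise box plot.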
Autocorrelation is used to check for randomness in data. For data with unknown periodicity, it helps identify the period of the pattern. For instance, for monthly data with a regular seasonal effect, we would expect to see large peaks at lags of 12 months and its multiples. The plot below demonstrates an example of detecting seasonality with the help of an autocorrelation plot.
from pandas.plotting import autocorrelation_plot
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.figsize':(10,4), 'figure.dpi':100})
autocorrelation_plot(df['Rate in Rs./Kg.'].tolist())
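The peaks in an autocorrelation plot can also be checked numerically. The sketch below builds a hypothetical monthly series with a 12-month seasonal pattern and computes the autocorrelation at two lags with `Series.autocorr`; the data is synthetic and only meant to show the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: level + 12-month seasonal wave + noise
rng = np.random.default_rng(0)
months = np.arange(72)
seasonal = 10.0 * np.sin(2 * np.pi * months / 12)
series = pd.Series(50.0 + seasonal + rng.normal(scale=1.0, size=months.size))

# Correlation of the series with itself shifted by a full and a half period
acf_12 = series.autocorr(lag=12)
acf_6 = series.autocorr(lag=6)

print(round(acf_12, 2), round(acf_6, 2))
```

A strongly positive value at lag 12 together with a strongly negative value at lag 6 is the numeric counterpart of the repeating peaks and troughs seen in the plot.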
Deseasoning means removing seasonality from time-series data, that is, stripping the pattern of seasonal effects from the data.
Decomposition splits a time series into its components, which helps in understanding the problems related to time-series forecasting. We can use the seasonal_decompose function from Python's statsmodels library to remove seasonality from the data. This leaves the data with only the trend, cyclical, and irregular variations.
from statsmodels.tsa.seasonal import seasonal_decompose
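# `period` replaces the older `freq` argument, which was deprecated in statsmodels 0.11
result_mul = seasonal_decompose(df['Rate in Rs./Kg.'], model='multiplicative', extrapolate_trend='freq', period=12)
# In a multiplicative model the seasonal component is a factor, so divide it out
deseason = df['Rate in Rs./Kg.'] / result_mul.seasonal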
plt.figure(figsize=(15,6))
plt.plot(deseason)
plt.title('Deseasoning using seasonal_decompose', fontsize=16)
plt.xlabel('Days')
plt.ylabel('Steel Waste Sales Rate')
plt.show()
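How the seasonal component is removed depends on the model. In a multiplicative model the seasonal component is a factor around 1, so it is removed by division; subtraction is the right operation only for an additive model. The sketch below shows the difference on a hypothetical series built as trend times seasonal index, with no noise so that the result is exact.

```python
import numpy as np

# Hypothetical multiplicative series: trend * seasonal index
months = np.arange(48)
trend = 100.0 + 2.0 * months
seasonal_index = 1.0 + 0.2 * np.sin(2 * np.pi * months / 12)
series = trend * seasonal_index

# Dividing by the seasonal index recovers the trend exactly
by_division = series / seasonal_index

# Subtracting the index barely changes the series and leaves the seasonality in
by_subtraction = series - seasonal_index

print(np.allclose(by_division, trend))
print(np.allclose(by_subtraction, trend))
```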
The variations in a time series that arise from business cycles are called the cyclical component. The cyclical component is the fluctuation around the trend line caused by macroeconomic changes such as recession and unemployment. Cyclical fluctuations have repetitive patterns, with more than a year between repetitions. The process is recurrent but less frequent than seasonality. We will use the HP filter again to detect the cyclical effect in the data.
As already explained in sections 2.1.1 and 2.2.3, we can again use the hpfilter function from statsmodels to derive the cyclical variation with the lines of code below.
sw_cycle,sw_trend = hpfilter(df['Rate in Rs./Kg.'], lamb=100)
df['cycle'] =sw_cycle
df['trend'] =sw_trend
df[['cycle']].plot(figsize=(15,6)).autoscale(axis='x',tight=True)
plt.title('Extracting Cyclic Variations', fontsize=20)
plt.xlabel('Days')
plt.ylabel('Steel Waste Sales Rate', fontsize =15)
plt.show()
When trend, seasonality, and cyclical behavior are removed, the remaining pattern, which cannot be explained, is called the irregular component. Various techniques are available to examine it, such as probability theory, moving averages, and autoregressive methods. In practice, cyclic variation is often treated as part of the residuals. Using time series decomposition, we can isolate these components, as shown in the code further below.
Time series data can be modeled as either the sum or the product of the trend (Tt), seasonal (St), cyclical (Ct), and irregular (It) components.
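Written out with the symbols above, and with Yt denoting the observed value at time t, the two forms are:

```latex
\text{Additive:}\qquad Y_t = T_t + S_t + C_t + I_t
\qquad\qquad
\text{Multiplicative:}\qquad Y_t = T_t \times S_t \times C_t \times I_t
```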
Additive models assume that the seasonal and cyclical components are independent of the trend. They are not very common because, in many cases, the seasonal component is not independent of the trend. The additive model suits time series with a linear trend, where changes are constant over time.
Multiplicative models are commonly used for many datasets across industries. For building a forecasting model, only the trend and seasonal components are considered. Modeling the cyclical component requires a large dataset spanning more than 10 years; because such datasets are rarely available, cyclical components are seldom used for modeling. Multiplicative models perform well for nonlinear types of trend (quadratic or exponential).
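One useful link between the two model types is that taking logarithms turns a multiplicative series into an additive one, provided all values are positive. The sketch below checks this on hypothetical components; the growth rate, seasonal amplitude, and irregular factor are made-up values.

```python
import numpy as np

# Hypothetical positive components of a multiplicative series
months = np.arange(1, 37)
trend = 50.0 * 1.02 ** months                           # exponential trend
seasonal = 1.0 + 0.1 * np.cos(2 * np.pi * months / 12)  # seasonal factor
irregular = np.full(months.size, 1.01)                  # constant irregular factor
series = trend * seasonal * irregular

# log(T * S * I) = log(T) + log(S) + log(I)
log_sum = np.log(trend) + np.log(seasonal) + np.log(irregular)

print(np.allclose(np.log(series), log_sum))
```

This is why an additive decomposition of the log-transformed series is sometimes used as an alternative to a multiplicative decomposition of the original series.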
We will use Python's statsmodels library to obtain the time series decomposition.
from statsmodels.tsa.seasonal import seasonal_decompose
# `period` replaces the older `freq` argument, which was deprecated in statsmodels 0.11
tsm_decompose = seasonal_decompose(np.array(df['Rate in Rs./Kg.']), model='multiplicative', period=12)
plt.figure(figsize = (15,5))
tsm_plot = tsm_decompose.plot()
We can see the increasing trend in the dataset. Seasonality is also visible, with a seasonal index ranging between -0.5 and 0.5. The lines below add two new columns from the decomposition, 'seasonal' and 'trend', to our dataset.
df['seasonal'] = tsm_decompose.seasonal
df['trend'] = tsm_decompose.trend
df[30:35] #Final Dataset Just for ref.
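Seasonality and cyclicity are both recurring patterns in data, but they differ in their predictability and timescale: seasonality repeats at a regular interval within the calendar year, whereas cyclical fluctuations recur over periods of more than a year and are less frequent.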
We have learned to isolate time series components such as trend, seasonality, and cyclical effects using multiple techniques for better forecasting accuracy. However, interpreting the outcomes of these techniques also plays an important role in the context of a domain-specific problem statement. When working on the real-world problem statements of a specific company, it is also beneficial for a data scientist to get acquainted with the business processes of that organization and to gain a fair understanding of the time-series data it provides, with input from domain experts.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.