What I learnt about Time Series Analysis in 3 hour Mini DataHack?

Aarshay Jain Last Updated : 02 Aug, 2019

Last weekend, I participated in the Mini DataHack by Analytics Vidhya, and I learnt more about time series in those 3 hours than in the many hours I spent preparing for the event. So I thought I would share my learnings with all of you.

What was Mini DataHack?

In short, Analytics Vidhya came up with the idea of shortening their Signature hackathons, and the result was the Mini DataHack: a 3-hour hackathon where the problem area was announced upfront. The philosophy behind it was to deliver a power pack of learnings on a focused area in a short duration.

My preparations

Since it had already been announced that the problem would be about time series, I made sure I was well equipped with the relevant knowledge and packages. In fact, I even wrote a guide to Time Series in Python.

Action packed 3 hours

If an AV Signature hackathon is equivalent to an ODI in cricket, the Mini DataHack was like a T20 match: shorter, action packed and full of twists! Close to 900 people registered, a very high number given that it was announced only 6 days before the event! It was a very intense competition from the word GO. Vopani made the first submission in under 5 minutes, and SRK (who won the competition) was not on top until a few minutes before the finish time.

Honestly, it was very difficult to guess the outcome until it happened.

Learnings from Mini DataHack

Needless to say, I learnt a lot about Time series in these 3 hours. Here is a brief summary of my learnings:

If one thing matters most in creating time series forecasts, it is plotting the data and looking at the trends with your own eyes. Very often, people get caught up in optimizing the evaluation metric, as many did in this competition. The winner made sure he plotted past data and forecasts on a single plot, and it definitely served him well!
Along with time series forecasting techniques like ARIMA, it is a good idea to formulate the problem as a supervised regression problem. According to the winner and experienced Kagglers, this works better in most cases. The supervised model can use variables such as day of the month, hour of the day, day of the week, days elapsed since the start of the series, month of the year, week of the year, etc. Here is what SRK (the winner) said about his approach:

Approach from SRK:

I used both xgboost and linear regression to get to my final score. Variables used in the models are:
1. Day of the month
2. Hour of the day
3. Day of the week
4. Ordinal date (Number of days from January 1 of year 1)
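As a rough illustration, the four variables above can be derived from a single timestamp with Python's standard library. This is only a sketch: the function name and the example timestamp are assumptions made here, not something taken from SRK's code or the competition data.

```python
from datetime import datetime


def date_features(ts: datetime) -> dict:
    """Derive the four calendar variables listed above from one timestamp."""
    return {
        "day_of_month": ts.day,          # 1..31
        "hour_of_day": ts.hour,          # 0..23
        "day_of_week": ts.weekday(),     # 0 = Monday .. 6 = Sunday
        "ordinal_date": ts.toordinal(),  # January 1 of year 1 is day 1
    }


# Hypothetical timestamp, used only to show the output shape.
example = date_features(datetime(2016, 2, 6, 14, 0))
print(example)
```

Applied to every row, a function like this turns the timestamp column into a feature table that any supervised regressor can consume.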

At first, I plotted the DV using a scatter plot and here were some of my observations:

There is an increasing trend
The trend at the initial part is different from the later part of the training data

I then trained an xgboost model on the full dataset, which I think helped to capture the overall trend across all these input variables. This is the one which scored an RMSE of 139 on the public LB. But since xgboost is a space-splitting algorithm, I thought it wouldn’t be able to capture the increasing trend and so might not be able to extrapolate it to the test set.
So, I decided to run a linear regression model to capture the increasing trend. Since the initial part has a different pattern compared to the later part, training the linear regression model only on the later part of the training set made more sense to me, and I did that. This one scored an RMSE of about 182 on the public LB.
I averaged both of these models and made the final submission, which scored an RMSE of 155 on the public LB and 196 on the private LB.
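The reasoning above can be sketched on a toy series. Everything in this example is an assumption made for illustration: the data is synthetic, and a simple mean-per-bin model stands in for xgboost's constant leaves. It is not SRK's code. It only shows why a space-splitting model predicts a flat value beyond the training range, while a line fitted on the later part keeps following the trend.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def fit_bins(xs, ys, width):
    """Mean target per bin of x: a crude stand-in for a tree's constant leaves."""
    grouped = {}
    for x, y in zip(xs, ys):
        grouped.setdefault(x // width, []).append(y)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


def predict_bins(bin_means, x, width):
    """Inputs outside the training range fall into the nearest outer bin."""
    key = min(max(x // width, min(bin_means)), max(bin_means))
    return bin_means[key]


def rmse(actual, predicted):
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


# Synthetic series (an assumption): slow growth for 50 days, then faster growth.
def level(day):
    return 10 + 0.5 * day if day < 50 else 35 + 2 * (day - 50)


train_x = list(range(100))
train_y = [level(d) for d in train_x]
test_x = list(range(100, 120))
test_y = [level(d) for d in test_x]

# Piecewise-constant model on the full history, line only on the later regime.
bin_means = fit_bins(train_x, train_y, width=10)
intercept, slope = fit_line(train_x[50:], train_y[50:])

flat_pred = [predict_bins(bin_means, d, 10) for d in test_x]
line_pred = [intercept + slope * d for d in test_x]
blend_pred = [(f + l) / 2 for f, l in zip(flat_pred, line_pred)]

for name, pred in [("flat", flat_pred), ("line", line_pred), ("blend", blend_pred)]:
    print(name, round(rmse(test_y, pred), 2))
```

On this noise-free toy series the line alone is exact, so the blend cannot beat it. With real, noisy data the two models make different kinds of errors, which is the usual motivation for averaging them.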

One more inference from the modeling is that:
Including month of the year, week of the year variables in XGB gave good results in public LB. But when I checked the plot of predicted counts in test set, it took a dip after a certain time period due to the way in which the xgb captures the information. So including these variables might give a good public LB score but most probably will not give a good private LB score. So I dropped these variables while building the models.
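One plausible mechanism for the dip described above can be shown with invented numbers. The date range and levels below are assumptions (the competition's actual range is not stated here), and a per-month average stands in for a tree that splits on month of the year alone, which is a simplification of what xgboost does with several features.

```python
from collections import defaultdict

# Hypothetical monthly series: 18 training months starting in January,
# with the level rising by 10 every month.
train = [(m, 100 + 10 * m) for m in range(18)]  # (months since start, level)
test_months = range(18, 24)                     # the following six months


def month_of_year(m):
    return m % 12 + 1  # 1 = January .. 12 = December


# "Model": average level per calendar month.
by_month = defaultdict(list)
for m, value in train:
    by_month[month_of_year(m)].append(value)
month_mean = {k: sum(v) / len(v) for k, v in by_month.items()}

last_train_level = train[-1][1]
test_pred = [month_mean[month_of_year(m)] for m in test_months]
print(last_train_level, test_pred)
```

In this setup, July to December were seen only in the first year, when the level was low, so the test predictions fall well below the last training value even though the underlying trend keeps rising.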

The code is available on my GitHub:
https://github.com/SudalaiRajkumar/ML/tree/master/AV_MiniHack1
—

I think every word in the approach above is worth its weight in gold! A natural question that came to my mind was “How can XGBoost perform better than time series methods?” Here is what Vopani added:

Note from Vopani:

I’m not surprised XGB and linear models performed so well. I tried out a lot of models and found XGB far superior to any other.
I’ve had good exposure to time series problems since I worked on many such projects, and in almost all of them I converted the problem into a structure which would fit any supervised algorithm, like what was done here by most people, including SRK and me.
It’s no fault of the dataset; it’s just that XGB is way too clever and powerful, and is able to capture linear and seasonal trends pretty well with the basic date features.
A real time-series challenge is one where the values are given in order without the date variable. Then, you can’t really use an XGB-type model, and that’s when the power of ARIMA-type models comes into the picture.
Unfortunately, it’s pointless keeping out the date variable since there is a lot of useful information there which can boost accuracy and hence, ultimately, XGB ends up the winner.

In Summary:

XGBoost is a powerful technique which can be used in this case, but it should be used wisely. It has a tendency to overfit in local regions and doesn’t capture the overall trend well.
Some of the experienced players who used only XGBoost ended up with ranks below 50. To cover for this flaw, the winner SRK averaged the XGBoost model with a linear regression model in his final submission.
The variables to be modeled with XGBoost should be selected carefully, as using all of them might overfit the data and not generalize well. The winner SRK dropped the last 2 candidate variables (month of the year and week of the year) from his final submission when he saw the model’s tendency to overfit.
The good performance of XGBoost models doesn’t mean that traditional forecasting techniques should be ignored. Applied properly, they work nicely: the #2 ranked player used an ARIMA model.

End Note:

I learnt a lot from participating in this Mini DataHack and I can’t help wanting more of these! I hope AV comes up with the next action packed weekend soon!

Aarshay Jain

Aarshay graduated with an MS in Data Science from Columbia University in 2017 and is currently an ML Engineer at Spotify New York. He works at the intersection of applied research and engineering while designing ML solutions to move product metrics in the required direction. He specializes in designing ML system architecture, developing offline models and deploying them in production for both batch and real time prediction use cases.

Responses From Readers

sr1407

Hi Aarshay, When I tried XGB, it gave me negative forecast. Any idea, what could be the reason?

Aarshay Jain

Well, there could be multiple reasons. You should note that many of the guys who used ONLY XGBoost dropped from a public LB rank of top 10-15 to a private LB rank below 50. So you were probably overfitting the model. It's hard to tell exactly what's going wrong unless you share the details. I'd recommend you start a thread on the discussion portal with details of the parameters you used in your model. It'll be easier to discuss there and others can also pitch in. :)

Deep8006

The problem set was quite interesting with little pattern identification logic which needed to be applied. My analysis and solution to Mini DataHack on 6th Feb is posted at http://powerofml.blog.com/mini-hackathon/ Please feel free to post your comments/queries

Aarshay Jain

Thanks for sharing your approach :)

Mathan

Can someone give the link to the data sets? I missed to participate even though I had registered. Thanks in advance.

Aarshay Jain

We haven't made the competition open yet. But we have received many requests for the data. We'll figure out the right solution and reach out to you soon.
