Winner’s Approach – Rampaging DataHulk MiniHack, AV DataFest 2017

Sunil Ray Last Updated : 07 Apr, 2017


Introduction

Who are you competing with?

While participating in a hackathon, many people think they are competing against the top data scientists. In reality, most of us compete with ourselves. Those who keep improving, who compete with their own previous selves and push their limits to become better, are always the eventual winners.

We see this happen very frequently on Analytics Vidhya. We saw this again in our first ML contest of DataFest 2017 – Rampaging DataHulk. In this minihack, we saw experienced professionals, students & previous winners compete with each other for the top 3 ranks. A total of 1458 people participated in the minihack. The competition began at 6 PM on 2 April marking the first competition in DataFest.

After a fist to fist battle in true “Hulk-athon” style, we saw something remarkable, something which hasn’t happened for a while on Analytics Vidhya: the top 3 ranks were bagged by first-time winners. To top it off, the winner is still in college! That is a testimony to the competitiveness and openness of the platform.

As always, the winners of the competition have generously shared their detailed approach and the code they used in the competition.

If you missed out on the fun this weekend, make sure you participate in the upcoming Machine Learning Hackathon & The QuickSolver MiniHack.

The problem statement

The problem statement revolved around a hedge fund company, “QuickMoney”. The company relies on automated systems to carry out trades in the stock market at inter-day frequencies. It wants a machine learning-based strategy that predicts the movement in stock prices in order to maximize profit, so it was seeking help from top data scientists.

Stock markets are known to have a high degree of unpredictability but it is possible to beat the odds and create a system which will outperform others.

The participants were required to create a trading strategy for maximizing their profit in the stock market. The task was to predict the probability that the price of a particular stock at the next day's market close will be higher (1) or lower (0) than the price at today's market close.

Winners

The winners used different approaches and rose up on the leaderboard. Below are the top 3 winners on the leaderboard:

Rank 1: Akash Gupta

Rank 2: Prince Atul

Rank 3: Santanu Pattanayak

Here are the final rankings of all the participants on the leaderboard.

All of the top 3 winners have shared their detailed approach and code from the competition. I am sure you are eager to know their secrets, so read on.

Rank 3, Santanu Pattanayak


Santanu Pattanayak is Lead Data Scientist at GE Digital. He often participates in machine learning competitions on Analytics Vidhya. He likes to challenge himself.

The following is the approach he took for the Analytics Vidhya Rampaging DataHulk competition. He secured 3rd place with a private leaderboard score of 0.678784:

1. First, I did some exploratory data analysis. I checked the number of records in the train and test datasets and whether there was any class imbalance to deal with. The training dataset was quite balanced, with 45% of the data belonging to the positive class. Since the classes were close to balanced and the datasets were reasonably large (702739 train records and 101946 test records), class imbalance adjustments were not necessary. Then I counted the distinct stocks in train and test and checked whether every stock in test also appears in train. The train dataset has 1955 stocks while the test dataset has 2118 stocks. Since test contains stocks that train does not, the stock ID cannot be used as a feature: the model would learn nothing about the stock IDs that appear only in test.

2. As in most machine learning tasks, the main job is proper feature engineering. So I spent quite a bit of time thinking about what would be good features for the output we are predicting: whether the price at tomorrow’s market close is going to be higher than at today’s market close.

There were missing values in the fields below:

Three_Day_Moving_Average
Five_Day_Moving_Average
Ten_Day_Moving_Average
Twenty_Day_Moving_Average

I replaced the missing values with 99999 and created indicator variables indicating whether these fields have missing values.
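The imputation step described above could look roughly like the following pandas sketch. The `_missing` suffix, the helper name and the toy rows are assumptions made for illustration; only the four field names and the fill value 99999 come from the write-up.

```python
import numpy as np
import pandas as pd

# Field names as listed in the write-up.
MA_COLS = [
    "Three_Day_Moving_Average",
    "Five_Day_Moving_Average",
    "Ten_Day_Moving_Average",
    "Twenty_Day_Moving_Average",
]

def add_missing_indicators(df, cols=MA_COLS, fill_value=99999):
    """Flag missing values, then replace them with a sentinel value."""
    out = df.copy()
    for col in cols:
        # Hypothetical naming scheme: <field>_missing is 1 where the field was NaN.
        out[col + "_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill_value)
    return out

# Toy rows, not competition data.
toy = pd.DataFrame({
    "Three_Day_Moving_Average": [1.0, np.nan],
    "Five_Day_Moving_Average": [1.1, 1.2],
    "Ten_Day_Moving_Average": [np.nan, 1.3],
    "Twenty_Day_Moving_Average": [1.4, np.nan],
})
filled = add_missing_indicators(toy)
print(filled)
```

A large sentinel such as 99999 works with tree-based models because a split can isolate it; the indicator columns make the same information explicit.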

Then I created a few variables capturing the differences between the moving averages, for example (Three_Day_Moving_Average - Ten_Day_Moving_Average). I created such a variable for each pair of moving average variables.

I created a couple of features by taking the sum and difference of the variables Positive_Directional_Movement and Negative_Directional_Movement. Similarly, I created two features by taking the sum and difference of the variables True_Range and Average_True_Range.
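As a rough sketch of these pairwise and sum/difference features, assuming a pandas DataFrame with the column names mentioned in the text. The new feature names (`*_minus_*`, `DM_sum`, `TR_diff` and so on) are made up here, and the write-up does not say whether the differences were computed before or after missing values were filled.

```python
from itertools import combinations

import pandas as pd

MA_COLS = [
    "Three_Day_Moving_Average",
    "Five_Day_Moving_Average",
    "Ten_Day_Moving_Average",
    "Twenty_Day_Moving_Average",
]

def add_pair_features(df):
    """Add one difference per pair of moving averages, plus sum/difference features."""
    out = df.copy()
    # 4 moving averages give 6 unordered pairs.
    for a, b in combinations(MA_COLS, 2):
        out[f"{a}_minus_{b}"] = out[a] - out[b]
    pdm = out["Positive_Directional_Movement"]
    ndm = out["Negative_Directional_Movement"]
    out["DM_sum"] = pdm + ndm    # hypothetical feature name
    out["DM_diff"] = pdm - ndm   # hypothetical feature name
    out["TR_sum"] = out["True_Range"] + out["Average_True_Range"]
    out["TR_diff"] = out["True_Range"] - out["Average_True_Range"]
    return out

# Toy row, not competition data.
toy = pd.DataFrame({
    "Three_Day_Moving_Average": [10.0],
    "Five_Day_Moving_Average": [11.0],
    "Ten_Day_Moving_Average": [13.0],
    "Twenty_Day_Moving_Average": [16.0],
    "Positive_Directional_Movement": [3.0],
    "Negative_Directional_Movement": [1.0],
    "True_Range": [2.0],
    "Average_True_Range": [1.5],
})
features = add_pair_features(toy)
print(features.filter(like="_minus_").T)
```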

I also created a few features that hold the moving average of the days prior to a specific period, as below:

df['MA_last_10_3'] = (df['Ten_Day_Moving_Average']*10 - df['Three_Day_Moving_Average']*3)/7

df['MA_last_10_5'] = (df['Ten_Day_Moving_Average']*10 - df['Five_Day_Moving_Average']*5)/5

df['MA_last_5_3'] = (df['Five_Day_Moving_Average']*5 - df['Three_Day_Moving_Average']*3)/2

Here the first variable computes the average of the 7 days that precede the most recent 3 days within the 10-day window.
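The arithmetic behind these three formulas can be checked on a toy price series. This sketch assumes the moving averages are simple (unweighted) means of daily closing prices ending on the current day, which the write-up does not state explicitly.

```python
import numpy as np

# Ten toy closing prices, oldest first (not competition data).
prices = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0])

ma3 = prices[-3:].mean()   # simple 3-day moving average
ma5 = prices[-5:].mean()   # simple 5-day moving average
ma10 = prices.mean()       # simple 10-day moving average

# MA * window length recovers the sum of prices over that window, so
# subtracting two sums leaves the sum over the older, non-overlapping days.
ma_last_10_3 = (ma10 * 10 - ma3 * 3) / 7   # days 1-7 of the 10-day window
ma_last_10_5 = (ma10 * 10 - ma5 * 5) / 5   # days 1-5 of the 10-day window
ma_last_5_3 = (ma5 * 5 - ma3 * 3) / 2      # days 6-7 of the 10-day window

print(ma_last_10_3, prices[:7].mean())
print(ma_last_10_5, prices[:5].mean())
print(ma_last_5_3, prices[5:7].mean())
```

Each derived value matches the direct average of the corresponding older days, which is why the divisors are 7, 5 and 2.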

3. Once I had built these features, I split the training data into two parts: 80% of the data for training the models and 20% for validation. Below are the models that I tried:
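A minimal sketch of such an 80/20 hold-out split follows. The write-up does not say whether the split was random or time-based; a random split is assumed here, and the helper name, the seed and the `Outcome` column are hypothetical.

```python
import numpy as np
import pandas as pd

def split_train_valid(df, valid_frac=0.2, seed=0):
    """Randomly hold out a fraction of the rows for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))          # shuffled row positions
    n_valid = int(round(len(df) * valid_frac))
    valid = df.iloc[idx[:n_valid]]
    train = df.iloc[idx[n_valid:]]
    return train, valid

# Toy frame, not competition data; "Outcome" is a made-up target column name.
toy = pd.DataFrame({"x": range(100), "Outcome": [0, 1] * 50})
train_part, valid_part = split_train_valid(toy)
print(len(train_part), len(valid_part))
```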

Gradient boosting from graphlab: it’s always easy to work with graphlab, since you can pass in a dataframe along with the features and target, unlike most other packages, where you have to create a numpy matrix or a sparse matrix before the algorithms can be invoked. I experimented with 300, 500 and 700 trees, with the class weights set to “auto”, a tree depth of 6, and min child weight and minimum loss reduction set to 4 each. The column subsample and the row subsample were both set to 80 percent.

It gave good performance, with a validation logloss of around 0.6820 and a public leaderboard logloss of around 0.6855.

I tried my hand at a small neural network in Keras with two hidden layers of 300 units each and a dropout of 0.5 on each hidden layer. I chose ReLU activations for the hidden layers and a sigmoid for the output layer, and got a logloss of around 0.688 both in validation and on the leaderboard.

Since the neural network and gradient boosting are very different models, I took the mean of their predicted probabilities, and the public leaderboard logloss improved to 0.6831.
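To make the averaging step concrete, here is a small sketch that blends two sets of predicted probabilities and scores them with log loss. The probabilities and labels are invented for illustration, and the `log_loss` helper is written out by hand rather than taken from any library.

```python
import numpy as np

def log_loss(y_true, p, eps=1e-15):
    """Binary log loss (lower is better); probabilities are clipped away from 0 and 1."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Made-up validation labels and model outputs, not competition data.
y_valid = np.array([1, 0, 1, 0])
p_gbm = np.array([0.9, 0.4, 0.3, 0.2])   # stands in for the gradient boosting model
p_nn = np.array([0.5, 0.1, 0.8, 0.6])    # stands in for the neural network

# Simple mean of the two models' predicted probabilities.
p_blend = (p_gbm + p_nn) / 2

print("gbm  :", log_loss(y_valid, p_gbm))
print("nn   :", log_loss(y_valid, p_nn))
print("blend:", log_loss(y_valid, p_blend))
print("p=0.5:", log_loss(y_valid, np.full(4, 0.5)))
```

Two facts help in reading scores like the ones quoted here. Predicting 0.5 for every row gives a log loss of ln 2 ≈ 0.693, so values around 0.68 are only slightly better than that baseline. And because log loss is convex in the predicted probability, the blend's log loss is never worse than the average of the two individual log losses.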

Still I was not able to enter the 0.67 range.

Next, I tried my luck with xgboost, using a configuration similar to that of the graphlab gradient boosting model.

I experimented a bit with the number of trees and finally got the best results with the below parameters.

Parameter	Value
No. of trees	700
Column subsample	0.8
Row subsample	0.8
L2 regularization (lambda)	2
L1 regularization (alpha)	0.02
Minimum child weight	4
Objective	binary:logistic
Booster	gbtree
Eta	0.02
Early stopping rounds	20

The above model gave me 0.6780 logloss on the public leaderboard (9th rank) and 0.6787 logloss on the private leaderboard (3rd place).
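For readers who want to reproduce a similar setup, the table maps onto xgboost's parameter names roughly as below. This is a sketch, not the author's code: mapping "column subsample" to `colsample_bytree` and using `logloss` as the evaluation metric are assumptions, and tree depth is left at the library default because the table does not list it.

```python
# Booster parameters, using the names from the xgboost parameter documentation.
xgb_params = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "eval_metric": "logloss",     # assumption: matches the leaderboard metric
    "eta": 0.02,                  # learning rate
    "colsample_bytree": 0.8,      # assumption: "column subsample" means per-tree sampling
    "subsample": 0.8,             # row subsample
    "lambda": 2,                  # L2 regularization
    "alpha": 0.02,                # L1 regularization
    "min_child_weight": 4,
}

# Arguments that belong to the training call rather than the booster.
train_kwargs = {"num_boost_round": 700, "early_stopping_rounds": 20}

# With xgboost installed, and dtrain / dvalid built as xgboost.DMatrix objects
# (hypothetical names), training would look like:
#   import xgboost as xgb
#   model = xgb.train(xgb_params, dtrain, evals=[(dvalid, "valid")], **train_kwargs)
print(xgb_params, train_kwargs)
```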

Solution: Code File

Rank 2, Prince Atul


Prince Atul is a Senior Scientist at Cognizant. Prince has been participating in various competitions at Analytics Vidhya. Prince is also a volunteer for Analytics Vidhya and helps us with our community efforts. This is his approach:

I decided to approach this hackathon with more focus on feature engineering than on model selection and data processing. After reading the problem, I decided to use gradient boosting with a binary logistic objective.

I always submit a preliminary model, generally with all the variables, to set a benchmark score.

There were 4 moving averages in the dataset and I expected them to be correlated. So I plotted the correlation matrix and, as expected, the 10-day and 20-day moving averages were highly correlated with the other moving averages. I removed these two variables and trained my model on the rest of the data. This model scored approximately 0.68 on the public leaderboard.
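A correlation check of this kind can be sketched with pandas. The competition data is not reproduced here, so the example builds four moving averages from a synthetic price series (a trend plus a random walk, an arbitrary choice) to show why such columns tend to be strongly correlated.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices: a gentle trend plus a random walk (not competition data).
rng = np.random.default_rng(0)
close = pd.Series(100 + 0.1 * np.arange(300) + np.cumsum(rng.normal(size=300)))

# Simple moving averages over the four window lengths used in the dataset.
ma = pd.DataFrame({
    "Three_Day_Moving_Average": close.rolling(3).mean(),
    "Five_Day_Moving_Average": close.rolling(5).mean(),
    "Ten_Day_Moving_Average": close.rolling(10).mean(),
    "Twenty_Day_Moving_Average": close.rolling(20).mean(),
}).dropna()

# Pearson correlation between every pair of moving averages.
corr = ma.corr()
print(corr.round(3))
```

All four columns smooth the same underlying series, so they carry largely overlapping information, which is the motivation for dropping some of them.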

I checked for null values and there were 4000+ rows with missing values. I left them as they were because they were a very small percentage of the train dataset. (I wanted to come back to this but didn’t get time.)

After this I started creating features. The features that improved my score took the values 1, 0 or -1: comparisons of the 3-day moving average with the other moving averages, comparisons of the 5-day moving average with the other moving averages, and the sum of these comparison values. I created these to capture the direction of price movement based on the moving averages. With these features, my model scored approximately 0.677 on the public leaderboard.

I think the hardest part of any mini-hackathon is creating features. It takes some thinking, and not every feature you create will add value. But it is important to keep at it even if the first few features do not improve your model.
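One way to build such 1/0/-1 comparison features is to take the sign of the difference between two moving averages, as in the sketch below. The exact pairs, the `cmp_*` names and the toy rows are assumptions; in particular, the write-up does not say whether the dropped 10-day and 20-day averages were still used inside the comparisons, and this sketch assumes they were.

```python
import numpy as np
import pandas as pd

# Assumed comparison pairs: 3-day vs. the others, then 5-day vs. the longer ones.
PAIRS = [
    ("Three_Day_Moving_Average", "Five_Day_Moving_Average"),
    ("Three_Day_Moving_Average", "Ten_Day_Moving_Average"),
    ("Three_Day_Moving_Average", "Twenty_Day_Moving_Average"),
    ("Five_Day_Moving_Average", "Ten_Day_Moving_Average"),
    ("Five_Day_Moving_Average", "Twenty_Day_Moving_Average"),
]

def add_ma_comparisons(df):
    """Add 1/0/-1 comparison flags for each pair and their row-wise sum."""
    out = df.copy()
    cmp_cols = []
    for a, b in PAIRS:
        name = f"cmp_{a}_vs_{b}"          # hypothetical feature name
        # np.sign gives 1 if a > b, 0 if equal, -1 if a < b (NaN stays NaN).
        out[name] = np.sign(out[a] - out[b])
        cmp_cols.append(name)
    out["cmp_sum"] = out[cmp_cols].sum(axis=1)
    return out

# Toy rows, not competition data.
toy = pd.DataFrame({
    "Three_Day_Moving_Average": [10.0, 5.0],
    "Five_Day_Moving_Average": [9.0, 4.0],
    "Ten_Day_Moving_Average": [10.0, 3.0],
    "Twenty_Day_Moving_Average": [12.0, 2.0],
})
compared = add_ma_comparisons(toy)
print(compared.filter(like="cmp_").T)
```

A short average above a long one suggests recent prices have been rising, so the summed flags act as a crude trend score.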

Solution: Code File

Rank 1, Akash Gupta


Akash Gupta is a final year student at IIT Roorkee. Akash is one of the most competitive students we have come across on Analytics Vidhya. He fetched his last win in The Ultimate Student Hunt competition by securing 5th rank.

Find out his secret for winning this minihack.

Initialization: I started out by trying a basic xgboost model using the given features and filling the missing values with -1. I generally start with xgboost because of its speed and good scores. I had removed the ID and timestamp features.

Cross Validation: To set up a quick cross validation, I randomly sampled 10% of the dataset and used it as the eval data. I had planned to write timestamp-based partitioning later, but the initial eval scores for this setup were similar to the ones I got on the public leaderboard, so I stuck with this setup.

Feature Engineering

On plotting the feature importances using the default set of features, I realized that the MA features were not contributing much. Also, to me using the absolute values of these features was not intuitive. Removing these gave me an improvement in the eval score as well as the public leaderboard score. Then I removed the volume traded feature because it was also having a low contribution and removing it gave me an improvement in both eval and public lb. Later, I created 3 new features:

difference between three day moving average and five day moving average
difference between five day moving average and ten day moving average
difference between positive directional movement and negative directional movement

I added these features one by one and saw an improvement in both the eval and public lb scores.

I tried creating a feature for the difference between the three day moving average on the nth day and the three day moving average on the (n-1)th day. This gave me an improvement on the eval dataset, but not on the public lb. Possibly it had overfit the data, so I removed this feature.
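A day-over-day feature like this has to be computed within each stock, in time order. The sketch below shows one way to do that with a pandas groupby; the column names `Stock_ID` and `timestamp` and the feature name `ma3_daily_change` are assumptions, since the write-up does not give the actual names.

```python
import pandas as pd

def add_ma3_daily_change(df, stock_col="Stock_ID", time_col="timestamp"):
    """Per stock: today's three day moving average minus the previous day's."""
    # Sort so that consecutive rows within a stock are consecutive days.
    out = df.sort_values([stock_col, time_col]).copy()
    # diff() within each group; the first row of each stock has no previous day (NaN).
    out["ma3_daily_change"] = out.groupby(stock_col)["Three_Day_Moving_Average"].diff()
    return out

# Toy rows in shuffled order, not competition data.
toy = pd.DataFrame({
    "Stock_ID": [1, 2, 1, 2, 1],
    "timestamp": [2, 1, 1, 2, 3],
    "Three_Day_Moving_Average": [10.5, 20.0, 10.0, 19.0, 11.5],
})
lagged = add_ma3_daily_change(toy)
print(lagged)
```

Grouping by stock keeps the first day of one stock from being differenced against the last day of another.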

Parameter Tuning

max depth

I usually start with shallow trees (max depth 3). I prefer shallow trees because they are less prone to overfitting. I tried increasing the max depth to 4 and 5, but that made the public lb scores worse. So I stuck with max depth 3.

min_child_weight

Initially, I set min_child_weight to 1000 because of the high number of data points. Later I tried 1500 and 500 and saw that 500 gave me a better score. Decreasing it further to 300 didn’t help, so I stuck with 500.

Learning Rate, num_rounds and early stopping

I set the early stopping parameter to 50, i.e. if the eval score doesn’t improve in 50 rounds, training stops. The learning rate was initially set to 0.05 and num rounds to 1500. But this was very slow and the score was still improving after 1500 rounds. So I changed the learning rate to 0.2 and reduced num rounds to 800. Training then stopped near the 600th round and was quicker as a result.

Well, that’s it. I did not have the time to try ensemble models, which I believe could have improved the score further.
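The early stopping rule itself is easy to state in code. The following is a simplified sketch of the rule for a lower-is-better metric such as log loss, not xgboost's internal implementation; the function name and the toy scores are invented.

```python
def best_round_with_early_stopping(eval_scores, patience=50):
    """Return (best_round, stopped_round) for a lower-is-better eval metric.

    Training stops once `patience` rounds have passed without a new best score.
    """
    best_score = float("inf")
    best_round = -1
    for rnd, score in enumerate(eval_scores):
        if score < best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return best_round, rnd            # stopped early
    return best_round, len(eval_scores) - 1   # ran through every round

# Toy eval log loss per boosting round, not competition data.
scores = [0.69, 0.68, 0.685, 0.684, 0.683, 0.70]
print(best_round_with_early_stopping(scores, patience=3))
print(best_round_with_early_stopping(scores, patience=50))
```

A higher learning rate reaches its best round sooner, so the stopping condition fires earlier. That is consistent with the faster training described above, at the possible cost of a slightly coarser fit.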

Running the code

Keep all the files (python script, train.csv and test.csv) in the same directory and set the working directory to that directory.
Run the script by command: python try1.py.
The submission is saved as submission_xgb.csv.

Solution: Code File

End Notes

It was great interacting with these winners and learning about their approach to the competition. Hopefully, you will be able to work out where you missed out.

Take a cue from these approaches and participate in the upcoming Machine Learning Hackathon & The QuickSolver MiniHack. If you have any questions, feel free to post them below.

Check out all the upcoming competitions here.

Sunil Ray

Sunil Ray is Chief Content Officer at Analytics Vidhya, India's largest Analytics community. I am deeply passionate about understanding and explaining concepts from first principles. In my current role, I am responsible for creating top notch content for Analytics Vidhya including its courses, conferences, blogs and Competitions.

I thrive in fast paced environment and love building and scaling products which unleash huge value for customers using data and technology. Over the last 6 years, I have built the content team and created multiple data products at Analytics Vidhya.

Prior to Analytics Vidhya, I have 7+ years of experience working with several insurance companies like Max Life, Max Bupa, Birla Sun Life & Aviva Life Insurance in different data roles.

Industry exposure: Insurance, and EdTech

Major capabilities: Content Development, Product Management, Analytics, Growth Strategy.


Sujatha Sivaraman

It was a very insightful article on how winners are approaching the hackathon. Their code file looks sleek and elegant. I would like to have a look at the data. Where can I get access to it?

J_ratt

It would be great if you could include the code file of one of the top codes in R as well. Sometimes the winners are all in Python.

HARI PRASAD

Hi Kunal, can you please open this competition as a practice problem or provide the complete test set for us to do further validations?
