Surprises arrive when you least expect them.
The Seer’s Accuracy turned out to be a challenging surprise for data scientists. So, what changed this time? There was no train or test file. Participants were given just one file to download. Would you believe it? Everyone was puzzled. The weak ones gave up at the start, but the determined ones stayed till the end and learned something new!
Did you miss out on this experience? If you didn’t, great! But if you did, you unfortunately missed a wonderful opportunity to learn something new. I can’t bring back the thrilling experience, but I can give you one more chance to learn (the data set is live again).
The Seer’s Accuracy was held from 29th April to 1st May 2016. This competition attracted more than 2,200 participants from across the world. In this 72-hour battle, the first thing participants had to do was create the train and test files themselves. Only after that could the race to become the seer begin.
Once again, XGBoost and ensemble modeling helped the winners discover highly accurate solutions. Below are the solutions of the top 3 winners, along with a short interview with each of them highlighting the approach and thought process that got them into the top 3.
If you participated in this competition, it’s time to analyze your hits and misses and become better for the next one.
Note: R was extensively used by winning team members. Special thanks to these winners for their immense cooperation in sharing their experience and knowledge.
The participants were required to help “ElecMart”, a chain of electronic superstores looking to increase its sales from existing customers. The evaluation metric used was AUC – ROC.
ElecMart, as the name suggests, is a supermarket for electronics. It serves the needs of both retail clients and various corporate clients. Customers not only get to see and feel a wide range of products, they also receive exciting discounts and excellent customer service. ElecMart started in 1999 and launched a customer loyalty program in 2003.
ElecMart aims to be the largest electronics superstore in the nation, but there is a big hurdle ahead!
The loyalty program is meant for customers who want to benefit from repeat purchases and who register at the time of purchase. They need to present the loyalty card at the point of sale at the time of purchase, and the benefits are non-transferable. Corporate sales automatically receive the benefits of the loyalty program.
In a recent benchmarking activity and market survey sponsored by ElecMart, it was found that the repeat purchase rate, i.e. the proportion of these customers who come back for further purchases, is very low compared to competitors. Increasing sales to these customers is the only way to run a successful loyalty program.
ElecMart has shared all the transactions it has had with its loyalty program customers since the program started. It wants to run focused campaigns with these customers highlighting the benefits of continued shopping with ElecMart. You are expected to identify the probability of each customer (in the loyalty program) making a purchase in the next 12 months.
You are expected to upload the solution in the format of “sample_submission.csv”. The public-private split is 20:80
Note: For practice, the data set is currently available for download on Link. Please note that the data set is available for your practice purpose and will be accessible until 12th May 2016.
Bishwarup is an entrepreneur and is currently the CEO of Alphinite Analytics. He is a Kaggle Master and is currently ranked 13 on Data Hack. He won INR 20,000 ($300).
He said:
The data for this particular competition was a bit different from conventional ML problems. It had no target column and no explicit separation between the training and test set. So I discovered there was more than one potential way to tackle such a problem.
However, since the evaluation metric for the competition was the area under the ROC curve (AUC), I preferred to first formulate the problem as a case of supervised learning, which I think the majority of participants did as well.
I used the data from 2003-2005 as my training set and matched the customers who repeated in 2006 to derive the labels for my data. That was pretty straightforward. Just formulating the problem in this way and using a very simple xgboost model, I could get > 0.83 on the public leaderboard.
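To make this formulation concrete, here is a minimal R sketch of the idea, not Bishwarup’s actual code; the file name and the columns Client_ID and Transaction_Date are assumptions for illustration.

```r
library(data.table)
library(xgboost)

txn <- fread("transactions.csv")                 # hypothetical file name
txn[, year := year(as.Date(Transaction_Date))]   # hypothetical column name

# Training universe: every customer with a transaction in 2003-2005
train_ids  <- unique(txn[year <= 2005, Client_ID])
# Label: did the customer come back in 2006?
repeat_ids <- unique(txn[year == 2006, Client_ID])
label      <- as.integer(train_ids %in% repeat_ids)

# Toy feature: number of past transactions (real features would replace this)
feats <- txn[year <= 2005, .(n_txn = .N), by = Client_ID]
feats <- feats[match(train_ids, Client_ID)]

dtrain <- xgb.DMatrix(data = as.matrix(feats[, .(n_txn)]), label = label)
model  <- xgb.train(params = list(objective   = "binary:logistic",
                                  eval_metric = "auc",
                                  eta = 0.1, max_depth = 6),
                    data = dtrain, nrounds = 200)
```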
Then, feature engineering played a huge role in my success. Since we were ultimately supposed to predict the probability of a repeat purchase on a per-user basis, I summarized multiple user records in the training data into one single training instance and derived my features from these per-customer summaries.
There were more features which I derived, but they did not help my model’s accuracy.
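To illustrate the kind of per-customer aggregation described above, here is a small data.table sketch; the column names (Transaction_Amount, Transaction_Date) are assumptions, and the real feature set was richer than this.

```r
library(data.table)

txn <- fread("transactions.csv")
txn[, Transaction_Date := as.Date(Transaction_Date)]

# Collapse many transaction rows into one row per customer
user_feats <- txn[year(Transaction_Date) <= 2005,
                  .(n_txn        = .N,
                    total_spend  = sum(Transaction_Amount),
                    avg_spend    = mean(Transaction_Amount),
                    last_txn     = max(Transaction_Date),
                    active_years = uniqueN(year(Transaction_Date))),
                  by = Client_ID]

# Recency relative to the end of the training window
user_feats[, recency_days := as.integer(as.Date("2005-12-31") - last_txn)]
```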
In the end, I trained two xgboost models, each on a different subset of the above features, and their rank average got me to 3rd position in this competition with a score of 0.874409.
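Rank averaging itself takes very little code. Here is one common way to do it in R; the prediction vectors pred_model_1 and pred_model_2 are hypothetical, standing in for the two models’ test-set predictions.

```r
# Average the per-model ranks instead of the raw probabilities, which makes
# the blend insensitive to differences in how the two models are calibrated.
rank_average <- function(...) {
  preds <- list(...)
  ranks <- lapply(preds, function(p) rank(p) / length(p))  # scale ranks to (0, 1]
  Reduce(`+`, ranks) / length(ranks)
}

final_pred <- rank_average(pred_model_1, pred_model_2)     # hypothetical vectors
```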
My Solution: Link
This was the first time a team (Team Or) managed to secure a position in the top 3. The team won INR 35,000 ($500).
Thakur Raj Anand (DataGeek) is a data science analyst with a Masters in Quantitative Finance, based out of Hyderabad. He mostly uses R and Python for data science competitions. Oleksii Renov (orenov) is a data scientist based out of Dnipro, Ukraine. He loves programming in Python, R and Scala.
They said:
We spent 40% of our time exploring the data and converting the problem into a supervised one.
We generated negative samples by assigning 0 to those IDs which had no transaction in the year 2006 but had a history before 2006. We constructed 4 different representations of the data, with the idea of capturing different signals from different representations.
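As a rough illustration of this negative-sampling step (not the team’s actual code), assuming a data.table txn of transactions with a Client_ID column and a derived year column:

```r
library(data.table)

ids_history <- unique(txn[year <  2006, Client_ID])  # customers seen before 2006
ids_2006    <- unique(txn[year == 2006, Client_ID])  # customers who bought in 2006

labels <- data.table(Client_ID = ids_history)
labels[, target := as.integer(Client_ID %in% ids_2006)]  # 0 = negative sample
```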
For modeling, we mainly used XGBoost, but we did try Random Forest and ExtraTrees, which unfortunately didn’t improve our final predictions.
Oleksii has a habit of looking for unusual patterns in data. He found that the predictions from the tree model and the linear model were very different, and averaging them gave a significant boost in CV as well as on the LB.
We kept exploring different approaches and finally built 4 tree models and 1 linear model, all using XGBoost. We built the linear model only on the final representation of the data, on which XGBoost was giving the best CV. We finished at rank 2 with a score of 0.876660.
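A hedged sketch of blending a tree booster with a linear booster in xgboost, along the lines the team describes: dtrain and dtest are assumed xgb.DMatrix objects, and the parameter values are illustrative only.

```r
library(xgboost)

tree_model <- xgb.train(params = list(booster = "gbtree",
                                      objective = "binary:logistic",
                                      eval_metric = "auc",
                                      eta = 0.05, max_depth = 6,
                                      subsample = 0.8, colsample_bytree = 0.8),
                        data = dtrain, nrounds = 500)

linear_model <- xgb.train(params = list(booster = "gblinear",
                                        objective = "binary:logistic",
                                        eval_metric = "auc",
                                        lambda = 1, alpha = 0),
                          data = dtrain, nrounds = 200)

# The two boosters make very different errors; a simple average of their
# predicted probabilities captures both signals.
blend <- (predict(tree_model, dtest) + predict(linear_model, dtest)) / 2
```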
In this competition, we learned a lot about sparse matrices. We decided to learn simple operations like aggregation and transformation on sparse matrices, which are very helpful for exploring large data sets efficiently.
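As a small illustration of the kind of sparse-matrix work they mention (not the team’s code; the Product_Category column is an assumption), the Matrix package makes a customer-by-category count matrix cheap to build and summarize:

```r
library(Matrix)

cust <- factor(txn$Client_ID)
cat_ <- factor(txn$Product_Category)   # hypothetical column

# One row per customer, one column per category; duplicate (i, j) pairs are
# summed, so each cell is the customer's purchase count in that category.
m <- sparseMatrix(i = as.integer(cust),
                  j = as.integer(cat_),
                  x = 1,
                  dims = c(nlevels(cust), nlevels(cat_)),
                  dimnames = list(levels(cust), levels(cat_)))

# Simple aggregations stay cheap even when m is very wide
total_per_customer <- Matrix::rowSums(m)
share_per_category <- Matrix::colSums(m) / sum(m)
```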
In the end, we would like to tell young aspiring data science folks to never give up. Every time you feel like giving up, try to make a different representation of data and try different models on them.
Our Solution: Link
Rohan Rao is currently working as a Data Scientist at AdWyze. He is a Kaggle Master and currently ranked 6 on Data Hack. He is a three time National Sudoku Champion and currently ranked 14th in the world. He won INR 70,000 ($1000).
He said:
Hackathons might be meant for quick and smart modelling, but this one restored my faith in focusing on smart.
I’ve been regularly participating in competitions at Data Hack. More than anything, I’ve learned many new things. I am glad I finally got my maiden win!
The road to achieve a seer’s accuracy turned out to be interesting. Unlike a majority of predictive modelling competitions, this hackathon did not have the standard train/test data format.
I started off with understanding how best to build a machine-learning based solution with the data, along with setting up a stable validation framework.
My CV and LB scores from an XGBoost model were quite well in sync, so I explored each variable and started working on feature engineering. I could see that there was subtle but good scope for creating new variables.
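One plausible way to set up such a time-based validation framework (an assumption on our part, not necessarily Rohan’s exact setup) is to build a validation label one year earlier than the final training label, so that the local AUC mimics the real task:

```r
library(data.table)

# Validation fold: features from 2003-2004, label = repeat purchase in 2005
val_ids   <- unique(txn[year <= 2004, Client_ID])
val_label <- as.integer(val_ids %in% unique(txn[year == 2005, Client_ID]))

# Final training fold: features from 2003-2005, label = repeat purchase in 2006
train_ids   <- unique(txn[year <= 2005, Client_ID])
train_label <- as.integer(train_ids %in% unique(txn[year == 2006, Client_ID]))

# If the AUC measured on the 2005 fold moves together with the public
# leaderboard, the framework can be trusted for feature selection and tuning.
```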
My final model was an ensemble of 3 XGBoost models, each with a different set of data points, features and parameters. The ensemble was mainly to ensure more stability in the predictions. I explored a few other ML models, but none were as good as XGBoost, and even ensembling them with XGBoost did not help. This way I won the competition with a score of 0.88002.
I feel it is always wonderful to work with clean datasets designed around a good problem statement, and this hackathon was very well organized. The stable CV-LB correlation was a huge plus because it let me focus on feature engineering, which is the most exciting part of building machine learning models.
It was nice to see and compete with many of the top data scientists in India, and at the end, I’m glad I finished 1st to win my maiden competition on AnalyticsVidhya.
The biggest learning for me from this competition was the importance of drilling down into understanding the problem statement inside out and building a robust and solid solution step-by-step. And, then practice more so that one can do these as quickly as possible. It might sound cliche but it actually works!
Finally, some of the tips I would like to give to aspiring data scientists:
Always trust your Cross-Validation (CV) score. And to trust your CV, you need to build the right validation method depending on the problem, data and evaluation. During the competition, explore and try out as many ideas as possible. You’d be surprised to know that sometimes, the simplest algorithm or the least obvious ones could also work out. In the end, always be ready to learn from others and never hesitate in asking for help. There’s always something to learn for everyone.
My Solution: Link
This competition provided a clean, well-structured data set, so no effort was required in data cleaning. But problem framing (which most of us overlook) paved the way to success. Moving away from the conventional ML competition format turned out to be challenging for participants, but it eventually gave them something new to learn. The key takeaways from our top 3 participants: frame the problem as a supervised one before modeling, invest in per-customer feature engineering, build a validation scheme you can trust, and ensemble diverse models for stability.
Some of you might have sought motivation, and some of you might take away knowledge from this article. If you have read the winners’ accounts thoroughly, you would have realized that winning this competition didn’t require any extraordinary technique. It wasn’t about knowing advanced machine learning algorithms; it required the simple approach of understanding the problem.
Therefore, the next time you take on a challenge, make sure you’ve understood what is being asked before you start working on predictive modeling. This way you’ll have more confidence while working. Last but not least, learn about cross-validation, xgboost and feature engineering.
Did you find this article helpful? Were you able to analyze your hits and misses? Don’t worry, there is always a next time. Winning is a good habit. Coming up soon is the Mini Data Hack.
Hi, I couldn't attend this hackathon. I want to try my own approach. It seems I can't see the problem statement as I have not registered for the hackathon. Can you please let me know how I can see the problem statement?
Hi Anurag, I just realized the problem statement is not available to unregistered users. Therefore, I have added the problem statement to this article above. Now you can practice.
Could you share the data?