Winning Solutions of DYD Competition – R and XGBoost Ruled

Analytics Vidhya Last Updated : 18 Mar, 2016

Introduction

It’s all about an extra mile one is willing to walk!

Winning a data science competition requires two things: persistence and a willingness to try new things. In every competition there comes a moment when participants feel that nothing is working and it’s time to give up. That’s when someone stands up and says, “Why don’t I try one more time, but in a different way?” That’s when champions are born.

Competitions organized at Data Hack are meant to challenge your skills & knowledge and give you a chance to learn more and become a better analyst / data scientist.

On a similar note, we organized the Date Your Data (DYD) competition from 26th Feb ’16 to 28th Feb ’16. It attracted more than 2,100 participants from around the world. Unlike other (romantic) dates, this one turned out to be dramatic. No signs of love were shown, only fierce attempts to slice and dice the data at the highest level of granularity.

The top 3 winners mainly used R and XGBoost to rule the leaderboard. Here are the complete solutions (approach and code) they used in this competition. You’ll shortly see how feature engineering turned out to be the game changer.

For R users, these solutions are especially helpful and can be used as practice material.

Note: A special thanks to the winners of this competition for their immense co-operation and time.

The DYD competition

This competition broke our previous record for number of submissions, with more than 3,100. We also got our first female data scientist winner in this competition.

This competition involved a supervised machine learning problem. Participants were required to predict the probability that a student’s profile would be highly relevant to an employer. In simple words, they had to predict whether a student would be shortlisted or not. The data set was provided by Internshala, India’s No. 1 platform for internships.

You can read the complete problem statement here: Link. The data set is available for download here. Please note that the data set is available for your practice purpose and will be accessible until 20th March 2016.

Evaluation Metric

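The winners were judged on the area under the ROC curve (AUC). The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false positive rate) across classification thresholds. To know more, visit here. An AUC of 0.5 corresponds to random ranking, and the closer the score is to 1, the better.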
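As a rough illustration of the metric, AUC can be computed directly from labels and predicted scores as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting half. The function name and toy data below are invented for illustration and are not taken from the competition; the pairwise loop is quadratic, so it is only meant for small examples.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (made up): 1 = shortlisted, 0 = not shortlisted
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

Because AUC depends only on how the scores rank the examples, it rewards models that order shortlisted students above the rest, regardless of the absolute probability values.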

After a live feedback session with participants on Slack, it was clear that this competition was challenging and that participants were keen to learn what they had missed!

Winners of DYD Competition

A common factor in their victories was sustained attention to feature engineering and data exploration. Boosting (XGBoost, GBM) gave their models the necessary accuracy, and ensemble modeling played a cameo in improving it further.

Since most of the coding has been done in R, this can be a great resource to practice for R users.

Rank 3 – Sonny Laskar (Used ensemble of 2 XGBoost models in R)

Sonny Laskar currently works as Manager – Strategy at Microland Limited.

Sonny says:

Like everyone, I started by taking a close look at the data. I call this the ‘data discovery’ stage. Since there were 4 files, the chances of oversight were high. I found that the data had spelling mistakes. Later, I discovered that some variables, such as internship profile and internship skills, had a good number of repeated values. It was evident that such values would dominate the prediction process.

This prompted me to one-hot encode such variables and add them as separate features. Later, I label encoded the binary features (0,1). In fact, the majority of my time went into encoding features.
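The encoding step described above can be sketched in plain Python. This is a minimal illustration and not Sonny’s actual R code: the column names (`skills`, `is_part_time`), the comma separator and the sample rows are all assumptions made up for the example.

```python
def one_hot_multi(rows, column, sep=","):
    """Add one 0/1 column per distinct token found in a delimited text column."""
    vocab = sorted({tok.strip().lower()
                    for row in rows
                    for tok in row[column].split(sep) if tok.strip()})
    encoded = []
    for row in rows:
        present = {tok.strip().lower() for tok in row[column].split(sep)}
        # Drop the raw text column and replace it with indicator columns
        new_row = {k: v for k, v in row.items() if k != column}
        for tok in vocab:
            new_row[f"{column}_{tok}"] = 1 if tok in present else 0
        encoded.append(new_row)
    return encoded


def label_encode_binary(rows, column, positive):
    """Map a two-valued column to 1 (equals `positive`) or 0 (anything else)."""
    return [{**row, column: 1 if row[column] == positive else 0} for row in rows]


# Hypothetical rows; column names and values are invented for illustration
rows = [
    {"skills": "Python, SQL", "is_part_time": "yes"},
    {"skills": "sql, Excel", "is_part_time": "no"},
]
encoded = label_encode_binary(one_hot_multi(rows, "skills"), "is_part_time", "yes")
```

Lower-casing and stripping tokens before building the vocabulary is one simple way to stop inconsistent spellings such as "SQL" and "sql" from becoming separate features.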

But this wasn’t enough. My score was still terrible at this point. Then, I created additional features based on means and percentages to supply more information to my model. It worked.
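The article does not say exactly which means and percentages were computed, so the following is only a sketch of the general idea: for each row, attach the mean of a numeric column within its group and the group’s share of all rows. The column names (`internship_id`, `stipend`) and the values are hypothetical.

```python
from collections import defaultdict


def add_group_stats(rows, group_col, value_col):
    """Attach the group mean of value_col and the group's share of all rows (in %)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_col]] += row[value_col]
        counts[row[group_col]] += 1
    n = len(rows)
    out = []
    for row in rows:
        g = row[group_col]
        out.append({
            **row,
            f"{value_col}_mean_by_{group_col}": totals[g] / counts[g],
            f"{group_col}_pct_of_rows": 100.0 * counts[g] / n,
        })
    return out


# Hypothetical data: four applications spread over two internships
rows = [
    {"internship_id": "A", "stipend": 5000},
    {"internship_id": "A", "stipend": 7000},
    {"internship_id": "B", "stipend": 3000},
    {"internship_id": "A", "stipend": 6000},
]
features = add_group_stats(rows, "internship_id", "stipend")
```

If the aggregated column were the target itself, the statistics would have to be computed on training folds only to avoid leaking labels into the features.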

I used the caret package. I built 2 XGBoost models with different seed values and nrounds. Due to lack of time, I didn’t experiment much with machine learning. I then simply ensembled my 2 XGBoost models.
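The article does not state how the two models were combined. One common and simple choice, assumed here, is to average the predicted probabilities row by row. The prediction values below are made up, and `average_predictions` is a hypothetical helper, not part of any library.

```python
def average_predictions(*prediction_lists, weights=None):
    """Weighted average of per-row probabilities from several models."""
    if not prediction_lists:
        raise ValueError("need at least one list of predictions")
    n_rows = len(prediction_lists[0])
    if any(len(p) != n_rows for p in prediction_lists):
        raise ValueError("all prediction lists must have the same length")
    if weights is None:
        weights = [1.0] * len(prediction_lists)
    if len(weights) != len(prediction_lists):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, prediction_lists)) / total
        for i in range(n_rows)
    ]


# Made-up probabilities from two models trained with different seeds
model_a = [0.80, 0.30, 0.55]
model_b = [0.70, 0.20, 0.65]
blended = average_predictions(model_a, model_b)
```

Averaging models that differ only in seed and number of rounds mainly reduces variance, which is why even a two-model blend can nudge the AUC upward.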

I think I could have achieved a higher score had I not removed duplicate rows from the student experience data. I’m sure that led to a loss of information, but it was a race against time too. My final score was 0.700698.

Link to Code

Rank 2 – Prarthana Bhat (Used ensemble of 50 XGBoost models in R)

Prarthana Bhat currently works as a Data Scientist at Flutura Decision Science and Analytics. She’s the first female participant on Data Hack to secure a rank in the Top 3.

Prarthana says:

When I looked at the data, I sensed that feature engineering would turn out to be a game changer. Hence, right from the beginning, I kept my focus on discovering new features.

Of course, I started with the basic hygiene steps of data cleaning. There was a lot of mixing and matching possible in this data set. Since the data was large, I used parallel computing in R for faster computation, and also so as not to run out of patience. R has awesome libraries such as doParallel, doSNOW and foreach for this job!

I think the features I created were able to add significant information to the model. That’s the key to predictive modeling. One should always attempt to extract as much (uncorrelated) information as possible from the available data.

For modeling, I used the XGBoost algorithm. I decided to test its full potential on this data, so I did parameter tuning. I decided to stick with only 3 parameters, namely eta, colsample_bytree and subsample. In fact, I’d suggest that R users pay the most attention to these parameters when tuning.
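A tuning loop over these three parameters can be sketched as an exhaustive grid search. This is not Prarthana’s code: the candidate values are invented, and `dummy_cv_auc` is a placeholder scoring function that stands in for training a model and measuring its cross-validated AUC.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values are assumptions for illustration, not the winner's settings
param_grid = {
    "eta": [0.01, 0.05, 0.1],
    "colsample_bytree": [0.5, 0.7, 0.9],
    "subsample": [0.6, 0.8, 1.0],
}


def dummy_cv_auc(params):
    """Placeholder: replace with the cross-validated AUC of a real model."""
    return (1.0
            - abs(params["eta"] - 0.05)
            - abs(params["colsample_bytree"] - 0.7)
            - abs(params["subsample"] - 0.8))


best_params, best_score = grid_search(param_grid, dummy_cv_auc)
```

With three values per parameter the grid has 27 combinations, so restricting the search to a few influential parameters keeps the tuning time manageable in a weekend competition.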

To avoid making it a repetitive process, I wrote some functions to do this job. This was time consuming but, in the end, turned out to be worth it. My final score stood at 0.709808.

Link to Code

Rank 1 – Santanu Dutta (Used GBM in Python and Data Cleaning in R)

Santanu Dutta currently works as a Senior Associate at ACME. He is an experienced analytics professional specializing in BFSI and marketing. He’s a self-taught data scientist.

Santanu says:

I had always been curious to know more about the science of data and how it can derive benefits in our daily lives. Since then I have been training myself to build good and stable predictive models by participating in hackathons.

In this competition, the biggest challenge was the shortage of time, as the data set was quite large and dirty. A lot of data cleaning had to be done before building models. An early, cursory look at the date variables hinted that pre-processing was going to be the real game changer.
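To give a flavour of what cleaning a dirty date column can involve, here is a minimal sketch that parses date strings, tolerates missing or malformed values, and turns each date into a numeric "days before the latest date" feature. It is not Santanu’s code; the date format (`%Y-%m-%d`) and the sample values are assumptions.

```python
from datetime import datetime


def parse_date(text, fmt="%Y-%m-%d"):
    """Return a date, or None when the value is missing or malformed."""
    if text is None or not text.strip():
        return None
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None


def days_before_latest(date_strings, fmt="%Y-%m-%d"):
    """Days between each date and the latest valid date; None where the date is missing."""
    parsed = [parse_date(s, fmt) for s in date_strings]
    valid = [d for d in parsed if d is not None]
    if not valid:
        return [None] * len(parsed)
    latest = max(valid)  # computed over valid dates only, so gaps do not poison it
    return [(latest - d).days if d is not None else None for d in parsed]


# Made-up raw values, including an empty string, junk text and a missing entry
raw = ["2016-02-01", "", "2016-02-28", "not a date", None]
gaps = days_before_latest(raw)
```

Taking the maximum over valid dates only matters: if a single missing value is allowed into the maximum, every derived difference can end up missing as well.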

I have specialized in R. But in the last few hackathons, I noticed that Python is quickly catching up and becoming the first love of hackathon winners. So, this time I promised myself to walk an extra mile. I used both R and Python to solve this problem (faster): R for data wrangling and Python for model building.

Python was a real challenge for me, because over the last few months I had struggled badly to get XGBoost working on my Windows machine. So, I selected the next best alternative, i.e. GBM. In addition, I built a few variations of Random Forest, Boosting and Matrix Factorization models as well, and relied on local CV to select the parameters and model.
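Local cross-validation of the kind mentioned here can be sketched without any modeling library: shuffle the row indices, split them into k folds, and average a score over the held-out folds. The number of folds, the seed, and the `fit_and_score` callback are assumptions for illustration; the article does not describe the actual CV scheme used.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is used for validation exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx


def cross_val_score(n_rows, fit_and_score, k=5, seed=0):
    """Average a user-supplied fit_and_score(train_idx, valid_idx) over k folds."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_rows, k, seed)]
    return sum(scores) / len(scores)


# Ten rows split into five folds of two validation rows each
splits = list(k_fold_indices(10, k=5, seed=42))
```

In practice `fit_and_score` would train a candidate model on the training indices and return its AUC on the validation indices, so that competing models and parameter settings can be compared without relying on the public leaderboard.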

It’s been a great privilege competing with leading data scientists from across the globe. Learning while competing steepens the learning curve. My public leaderboard score was 0.63, which ranked 17th, while my private leaderboard score was 0.72, which got me the first position.

Link to Code

Key Takeaways from this Competition

In this competition, participants got the chance to work on real life data. Real life data comes in all shapes and dimensions. Hence, it becomes essential to develop business understanding in order to work better with data sets. In DYD, participants worked deeply with data exploration, data engineering and feature engineering techniques. Below are the key takeaways one can take home from this article:

Data Cleaning & Engineering: This data set had all sorts of variables (continuous, categorical, high-cardinality) spread across 4 csv files. Some observations had spelling mistakes and others were repeated multiple times. The challenge was to combine them, clean them and prepare them for analysis. The winners did it and got incredible scores. You must learn this skill.
Feature Engineering: In this competition, participants were prudent enough to understand the game-changing influence of this concept. The number of features created in this competition varied from 10-15 to 300-400 features. Your aim should be to derive new features that supply unique information to the algorithm.
Boosting & Ensemble: The choice of ML algorithm is up to each participant. But in this competition, boosting algorithms (XGBoost & GBM) powered all three winning solutions. The cameo played by ensembling at the end helped further improve prediction accuracy. You must learn boosting and ensembling to perform better in competitions. You can start here.

End Notes

If you have followed this article closely, you will have noticed that feature engineering and boosting are extremely important in winning competitions. So, the next time you participate in a competition, make sure you don’t miss out on creating new features and applying boosting. In fact, the process is simple: clean the data, create new features, build the model, keep the best features, build the model again (boost) and you are done. If you are still undecided about whether to learn R or Python, you can start with R from scratch.

In this article, I’ve shared the winning approaches of the top 3 winners of the DYD Competition. These winners took home Amazon vouchers worth INR 55k ($800). For your practice, the data set is available for download until 20th March 2016. Make sure you make the most of this opportunity.

Did you like reading this article? Did you discover what you missed in the competition? Do share your opinions / suggestions in the comments section below.

You want to apply your analytical skills and test your potential? Then participate in our Hackathons and compete with Top Data Scientists from all over the world.

Analytics Vidhya

Analytics Vidhya Content team

sthothat

I have not registered to this Hackathon, how can I get the Train / Test data set for practice Thanks

Analytics Vidhya Content Team

Hi, the link to download the competition data is already available in this article. Please note that a one-time login is required for the download. Go ahead.

kabir ali

Manish: I would love to see the winners’ solutions, but I am not able to download the data set from http://discuss.analyticsvidhya.com/uploads/analyticsvidhya/original/2X/5/590decc7aff355cc145346df8b41f47a1e13a625.zip. It says “No File”. Please help me out.

Hello Kabir I just checked and found that data is accessible for download. Please note that one time login is required to download the data. Please go ahead with the login and download the data.

srayagarwal1234

The line train_test$Earliest_Start_Date_num<-as.Date(max(train_test$Earliest_Start_Date))-as.Date(train_test$Earliest_Start_Date) in Prarthana’s code is throwing NA values, I guess because there are a lot of missing values in Earliest_Start_Date. How did you tackle this?

Prarthana

Hi, I ran that part of the code but did not get any NA values. Please try re-running the code and check. If you still see the same problem, please paste your code and I will have a look at it.
