How I Won My First Public Data Science Competition?

Alifia Ghantiwala Last Updated : 22 Mar, 2022

8 min read

This article was published as a part of the Data Science Blogathon.

Data Science Competition — Source: Author

About the Competition

The competition was to predict the likelihood of having Autism. It was sponsored by Google Developers and hosted on Kaggle. The data was collected from people that filled out an app form. Research has proven that early detection of autism can help in incrementing mental development, communication, and social skills. This was the main driving force of the challenge, to provide an automated and reliable solution to healthcare workers so that they can prioritize their resources.

The Approach I Followed to Win the Competition

0) Understanding the data

1) Using label encoder to encode categorical features

2) Removal of features based on their correlation with the target

3) Using Grid Search and 5 fold cross-validation for hyperparameter optimization.

4) For modelling, I used XGBoost and LightGBM and combined their results based on CV performance.

Understanding the Data

Understanding the data would be our first step. The most powerful machine learning model would not produce desired results with garbage input. I did a univariate and bivariate analysis to understand the input completely, based on the same, I did feature selection.

Features in our dataset

ID – ID of the patient

A1_Score to A10_Score – Score based on Autism Spectrum Quotient (AQ)

age – Patient’s age in years

gender – Gender of the patient

ethnicity – Ethnicity of the patient

jaundice – Whether the person was having jaundice at birth

autism – If an immediate family member was diagnosed with autism

contry_of_res – Country of patient

used_app_before – If a person has undergone a screening test

result – Score for AQ1-10 screening test

age_desc – Age of the patient

relation – Relation of a patient who completed the test

Class/ASD – Target label

train.columns

Source: Author

Univariate Analysis

ncounts = pd.<a onclick="parent.postMessage({'referent':'.pandas.DataFrame'}, '*')">DataFrame([train.isna().mean(), test.isna().mean()]).T
ncounts = ncounts.rename(columns={0: "train_missing", 1: "test_missing"})

ncounts.query("train_missing > 0")

Source: Author

The input data does not have NA values.

Gender feature

sns.<a onclick="parent.postMessage({'referent':'.seaborn.set_theme'}, '*')">set_theme(style="darkgrid")
sns.<a onclick="parent.postMessage({'referent':'.seaborn.countplot'}, '*')">countplot(x='gender',data=train)

Ethnicity

plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(15,15))
sns.<a onclick="parent.postMessage({'referent':'.seaborn.countplot'}, '*')">countplot(x='ethnicity',data=train)

Relation feature

sns.<a onclick="parent.postMessage({'referent':'.seaborn.set_theme'}, '*')">set_theme(style="darkgrid")
plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(15,15))
sns.<a onclick="parent.postMessage({'referent':'.seaborn.countplot'}, '*')">countplot(x='relation',data=train)

Age feature

sns.<a onclick="parent.postMessage({'referent':'.seaborn.set_theme'}, '*')">set_theme(style="darkgrid")
plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(15,15))
sns.<a onclick="parent.postMessage({'referent':'.seaborn.countplot'}, '*')">countplot(x='age_desc',data=train)

We have 800 different age values but all of them are older than 800

Jaundice feature

sns.<a onclick="parent.postMessage({'referent':'.seaborn.set_theme'}, '*')">set_theme(style="darkgrid")
plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(15,15))
sns.<a onclick="parent.postMessage({'referent':'.seaborn.countplot'}, '*')">countplot(x='jaundice',data=train)

Autism feature

sns.<a onclick="parent.postMessage({'referent':'.seaborn.set_theme'}, '*')">set_theme(style="darkgrid")
plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(15,15))
sns.<a onclick="parent.postMessage({'referent':'.seaborn.countplot'}, '*')">countplot(x='austim',data=train)

Country of patient feature

train["contry_of_res"].value_counts()

A1 to A10 score feature

score_columns = ["A1_Score","A2_Score","A3_Score","A4_Score","A5_Score","A6_Score","A7_Score","A8_Score","A9_Score","A10_Score"]
i = 1
for col in score_columns:
    plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(10,15))
    plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.subplot'}, '*')">subplot(10,1,i)
    sns.<a onclick="parent.postMessage({'referent':'.seaborn.countplot'}, '*')">countplot(train[col])
    i += 1

Result (target feature)

train["result"].describe()

sns.<a onclick="parent.postMessage({'referent':'.seaborn.scatterplot'}, '*')">scatterplot(data=train,x="result",y="Class/ASD")

Observation: Result values less than 0 are only class 0 target

Insights from Univariate Analysis

No missing/NA data
No columns with constant values
Gender distribution in the train and test datasets is the same and is balanced.
The highest number of patients are from ethnicity ‘White European’ in both test and train data. We have a lot of values with ethnicity ?, what could be a suitable replacement strategy?
Most people are completing the test by themselves, but we do have entries with value ?, what would be the best value to replace them logically.
Age variable has all unique values but we know that all of them have an age of more than 18
Train data contains 75% of people that did not have jaundice when they were born. Test data has a similar distribution
The country feature has 61 unique values in the training dataset and the United States, United Arab Emirates, New Zealand, India, and United Kingdom have the highest record counts in that order in the training dataset. In the test dataset, we have 44 unique values of countries United States, United Arab Emirates, New Zealand, Jordan, and India are the top contributors in that order.
Majority of the data that we have did not have a family member having autism.
A1_score to A10_score are binary features that are encoded, we would look at its correlation with the target class variable to understand it better.
Result values in the train data have a min of -2 and a maximum of 13 with a mean of 7
A scatterplot between result and class, tell us that result values less than 0 have a class of negative in the train data

Bivariate Analysis

To understand the correlation of A1 to A10 feature scores with the class variable we plot the below heatmap.

correlation_columns = []
for col in score_columns:
    correlation_columns.append(col)
correlation_columns.append("Class/ASD")

correlation = train[correlation_columns].corr()
mask = np.<a onclick="parent.postMessage({'referent':'.numpy.triu'}, '*')">triu(np.<a onclick="parent.postMessage({'referent':'.numpy.ones_like'}, '*')">ones_like(correlation, dtype=np.<a onclick="parent.postMessage({'referent':'.numpy.bool'}, '*')">bool))
plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(11, 9))
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(correlation,annot=True,mask=mask)

Source: Author

A3, A6, A9, A4, A5, and A10 have a correlation higher than 0.4

Let’s find out how age correlates with the target

correlation = train[["age","result","Class/ASD"]].corr()
mask = np.<a onclick="parent.postMessage({'referent':'.numpy.triu'}, '*')">triu(np.<a onclick="parent.postMessage({'referent':'.numpy.ones_like'}, '*')">ones_like(correlation, dtype=np.<a onclick="parent.postMessage({'referent':'.numpy.bool'}, '*')">bool))
plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(11, 9))
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(correlation,annot=True,mask=mask)

Age does not seem to have a very high correlation value with the target

For understanding the relationship between gender and target we plot the below.

sns.<a onclick="parent.postMessage({'referent':'.seaborn.boxplot'}, '*')">boxplot(data=train,x='gender',y='Class/ASD')

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
transformed_gender = le.fit_transform(train["gender"])
train["transformed_gender"] = transformed_gender
corre = train[["transformed_gender","Class/ASD"]].corr()
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(corre,annot=True)

The correlation between gender and target seems to be weak.

Can the country where the patient resides have an effect? Let’s find out.

plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(20, 15))
top_5  = ["United States", "United Arab Emirates", "New Zealand", "India" , "United Kingdom"]
train_subset = train.loc[train["contry_of_res"].isin(top_5)]
sns.<a onclick="parent.postMessage({'referent':'.seaborn.boxplot'}, '*')">boxplot(data=train_subset,x='contry_of_res',y='Class/ASD')

le = LabelEncoder()
transformed_country = le.fit_transform(train["contry_of_res"])
train["transformed_country"] = transformed_country
corre = train[["transformed_country","Class/ASD"]].corr()
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(corre,annot=True)

Let us now see if ethnicity correlates with our target

chart = sns.<a onclick="parent.postMessage({'referent':'.seaborn.boxplot'}, '*')">boxplot(data=train,x='ethnicity',y='Class/ASD')
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)

It seems we have two values others and Others in the ethnicity feature which have the same meaning. Let’s clean it out

train['ethnicity'] = train['ethnicity'].replace('others','Others')

le = LabelEncoder()
transformed_ethnicity = le.fit_transform(train["ethnicity"])
train["transformed_ethnicity"] = transformed_ethnicity
corre = train[["transformed_ethnicity","Class/ASD"]].corr()
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(corre,annot=True)

Ethnicity seems to have a decent correlation with the target

Understanding the relationship between jaundice and the target

chart = sns.<a onclick="parent.postMessage({'referent':'.seaborn.boxplot'}, '*')">boxplot(data=train,x='jaundice',y='Class/ASD')

Jaundice and Target | Data Science Competition — Source: Author

le = LabelEncoder()
transformed_jaundice = le.fit_transform(train["jaundice"])
train["transformed_jaundice"] = transformed_jaundice
corre = train[["transformed_jaundice","Class/ASD"]].corr()
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(corre,annot=True)

Jaundice seems to have a decent correlation with the target

Correlation of autism feature with the target variable

le = LabelEncoder()
transformed_austim = le.fit_transform(train["austim"])
train["transformed_austim"] = transformed_austim
corre = train[["transformed_austim","Class/ASD"]].corr()
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(corre,annot=True)
transformed_austim_te = le.fit_transform(test["austim"])
test["transformed_austim"] = transformed_austim_te

le = LabelEncoder()
transformed_used_app_before = le.fit_transform(train["used_app_before"])
train["transformed_used_app_before"] = transformed_used_app_before
corre = train[["transformed_used_app_before","Class/ASD"]].corr()
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(corre,annot=True)

used_app_before does not have a high correlation with the target

correlation = train.corr()
mask = np.<a onclick="parent.postMessage({'referent':'.numpy.triu'}, '*')">triu(np.<a onclick="parent.postMessage({'referent':'.numpy.ones_like'}, '*')">ones_like(correlation, dtype=np.<a onclick="parent.postMessage({'referent':'.numpy.bool'}, '*')">bool))
plt.<a onclick="parent.postMessage({'referent':'.matplotlib.pyplot.figure'}, '*')">figure(figsize=(15, 15))
sns.<a onclick="parent.postMessage({'referent':'.seaborn.heatmap'}, '*')">heatmap(correlation,annot=True,mask=mask)

Heatmap | Data Science Competition — Source: Author

Features with the Highest Correlation to the Target

A3_Score

A2_Score

A4_Score

A5_Score

A6_Score

A7_Score

A9_Score

A10_Score

Autism

Ethnicity

Result

We would keep only these features in our data frame which would be used as input for our model.

train = train[["A3_Score","A2_Score","A4_Score","A5_Score","A6_Score","A7_Score","A9_Score","A10_Score","transformed_austim","Class/ASD","result"]]
test = test[["A3_Score","A2_Score","A4_Score","A5_Score","A6_Score","A7_Score","A9_Score","A10_Score","transformed_austim","result"]]
y = train["Class/ASD"]
train = train.drop(["Class/ASD"],axis=1)

Importing all required libraries

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Models
from xgboost import XGBClassifier, XGBRegressor
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

from sklearn.linear_model import LogisticRegression

Baseline Model

I used logistic regression as my baseline model for this problem. I have used grid search CV and Stratified K Fold with 5 folds for hyper-parameter optimization.

# Define model
model=LogisticRegression(random_state=0,class_weight='balanced')

# Parameters grid
grid_model = LogisticRegression(solver='saga',
                              C=0.22,
                              penalty='l2',class_weight='balanced')

# Cross validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Grid Search
#grid_model = GridSearchCV(model,param_grid,cv=kf)

# Train classifier with optimal parameters
grid_model.fit(train,y)

predictions = grid_model.predict(test)
pred_df = pd.<a onclick="parent.postMessage({'referent':'.pandas.DataFrame'}, '*')">DataFrame()
pred_df["ID"] = test_df["ID"]
pred_df["Class/ASD"] = predictions

pred_df[['ID', 'Class/ASD']].to_csv('submission.csv', index=False)

LightGBM

This gave me the highest performance on the public leaderboard with an AUC score of 0.931 and a private leaderboard score of 0.987

Conclusion

This could be a comprehensive starter for anyone who wants to try their hand out with Kaggle competitions.

What you must keep in mind before solving a data science problem.

1) Study the data religiously, understanding your data thoroughly will only help you to win a competition or for that matter get good results.

2) Start with a baseline model result.

3) Use bagging and boosting aggregation techniques if your problem involves classifying. If you have understood your data well, it would not be too difficult for you to select the correct model.

About the Author

I am an Analyst and like to interrogate data and share my findings through technical articles. You can read other articles published by me on Analytics Vidhya here. You can reach out to me here.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

Alifia Ghantiwala

I investigate data on a daily basis to find insights! I write so that I can understand more clearly. Have completed my graduation in Computer Engineering, and won my first public data science competition last month, March 2021, hosted on Kaggle by Google Developers.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

How I Won My First Public Data Science Competition?

About the Competition

The Approach I Followed to Win the Competition

Understanding the Data

Insights from Univariate Analysis

Bivariate Analysis

Features with the Highest Correlation to the Target

Baseline Model

LightGBM

Conclusion

About the Author

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#