Kaggle can often be intimidating for beginners, so here's a guide to help you get started with data science competitions
We'll use the House Prices prediction competition on Kaggle to walk you through how to solve a Kaggle project end to end
Kaggle your way to the top of the Data Science World!
Kaggle is the market leader when it comes to data science hackathons. I started my own data science journey by combining my learning on both Analytics Vidhya and Kaggle – a combination that helped me augment my theoretical knowledge with practical, hands-on coding.
Now, here’s the thing about Kaggle. It has a vast collection of datasets and data science competitions but that can quickly become overwhelming for any beginner. I remember browsing through Kaggle during my initial data science days and thinking, “where do I even begin?”. Given the expertise involved, it’s quite a daunting prospect for newcomers.
In this article, I am going to ease that transition for you.
We will understand how to make your first submission on Kaggle by working through their House Price competition. We’ll go through the different steps you would need to take in order to ace these Kaggle competitions, such as feature engineering, dealing with outliers (data cleaning), and of course, model building.
You can also check out the DataHack platform which has some very interesting data science competitions as well.
Please note that I’m assuming you’re familiar with Python and linear regression. If these are new concepts to you, you can learn or brush up here:
Kaggle notebooks are one of the best things about the entire Kaggle experience. They are free Jupyter notebooks that run in the browser, and they offer enough processing power to run most computationally hungry machine learning algorithms with ease!
They even come with free GPU acceleration, which you can switch on from the notebook settings.
As I mentioned earlier, we will be working on the House Prices prediction challenge. You can follow the processes in this article by working alongside your own Kaggle notebook.
Just head to the House Prices competition page, join the competition, then go to the Notebooks tab and click Create New Notebook.
Here, you have to choose the coding language and accelerator settings you require and hit the Create button.
Your very own Kaggle notebook will load up with the basic libraries already imported for you. Additionally, you can access the training data directly from here and whatever changes you make here will be automatically saved. What more do you need?
Now let’s get cracking on that competition!
Importing the Dataset in Kaggle
Once we have our Kaggle notebook ready, we will load all the datasets in the notebook. In this competition, we are provided with two files – the training and test files. We will load these datasets using Pandas’ read_csv() function:
import pandas as pd
# on Kaggle, the competition files typically live under /kaggle/input/<competition-folder>/;
# adjust the paths below to match where you are running this
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.shape, test.shape)
Let’s have a look at our dataset using the DataFrame.head() function which by default outputs the top 5 rows of the dataset:
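A one-liner does the job:

train.head()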
The dataset has 81 columns. The 'SalePrice' column is our target variable, which we will predict using the remaining columns in the dataset. We can also observe that there is a mix of categorical and continuous columns and that there are some missing values in the data. Let us explore the data in detail in the next section.
Let’s Explore the Data
The first step in data exploration is to have a look at the columns in the dataset and what values they represent. We can do this using the DataFrame.info() function:
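Again, a single call is enough:

train.info()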
Note: You can read about what these features represent in the data description file provided on the competition page.
You will notice that quite a few of the features contain missing values. Before the model building process, we will have to impute these missing values. That’s a preprocessing step and we will handle it in a later section.
But first, let us explore our target feature using the DataFrame.describe() function:
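We only need the target column here:

train['SalePrice'].describe()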
Here, 25%, 50%, and 75% denote the values at 25th, 50th, and 75th percentile respectively. So, from the output, we can make out that 75% of our values are below 214,000 whereas the maximum sale price of a house is 755,000. There is a significant difference between these two which clearly denotes that the target variable has some outliers.
Let’s visualize the distribution in the SalePrice feature using the sns.distplot() function in Seaborn:
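A minimal sketch of the plot (sns.distplot is deprecated in newer Seaborn releases, where sns.histplot serves the same purpose):

import seaborn as sns
import matplotlib.pyplot as plt

sns.distplot(train['SalePrice'])
plt.show()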
You can see that a lot of the sale prices are clustered between the 100,000 to 200,000 range. But, due to some high sale prices of a few houses, our data does not seem to be centered around any value. This means that the sale prices are not symmetrical about any value. This asymmetry present in our data distribution is called Skewness. In our case, the data distribution is positively-skewed (or right-skewed).
We can check the skewness in our data explicitly using the DataFrame.skew() function:
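One line gives us the number:

train['SalePrice'].skew()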
We have got a positive value here because our data distribution is skewed towards the right due to the high sale prices of some houses.
Our problem requires us to predict the sale price of houses – a regression problem. So, the first model that we will fit to our dataset is a linear regression model. But the skewness in our target feature poses a problem for a linear model because the extreme values have an outsized effect on the fit. Having roughly normally distributed data is one of the assumptions behind linear regression! We'll handle this later when we transform our features.
For now, let’s have a look at how our features are correlated with each other using a heatmap in Seaborn:
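Here is one way to draw it, as a sketch, using the Seaborn and Matplotlib imports from above; restricting the correlation matrix to numeric columns keeps newer Pandas versions happy:

plt.figure(figsize=(14, 12))
sns.heatmap(train.select_dtypes('number').corr(), cmap='coolwarm')
plt.show()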
Heatmaps are a great tool to quickly visualize how each feature correlates with the rest. Some striking correlations between features that stand out in the heatmap are:
GrLivArea and TotRmsAbvGrd
GarageYrBlt and YearBuilt
1stFlrSF and TotalBsmtSF
OverallQual and SalePrice
GarageArea and GarageCars
We can plot these features to understand the relationship between them:
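A sketch of the first plot:

sns.scatterplot(data=train, x='GrLivArea', y='TotRmsAbvGrd')
plt.show()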
It seems obvious that the total number of rooms above the ground should increase with increasing living area above ground:
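A sketch, with GarageYrBlt on the x-axis so that the points fall on or below the y = x line:

sns.scatterplot(data=train, x='GarageYrBlt', y='YearBuilt')
plt.show()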
This relationship is interesting because we can see some linear relationship forming between the Year the house was built and the Year the garage was built. Think about it – it seems intuitive that garages would have been built either simultaneously with the house or after it was constructed, and not before it. Therefore, you can see that most of the points stay on or below the linear line.
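A sketch of this plot:

sns.scatterplot(data=train, x='1stFlrSF', y='TotalBsmtSF')
plt.show()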
Again, we can see a linear relationship between 1stFlrSF and TotalBsmtSF, and most of the dots lie below the line: most houses have a basement area less than or equal to the first-floor area, although a few houses have a basement larger than the first floor. What do you think the reason could be? I would love to read it in the comments below!
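And the last pair:

sns.scatterplot(data=train, x='GarageArea', y='GarageCars')
plt.show()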
Again, the number of cars that can fit in a garage would increase with its area. You can do a lot more analysis and I encourage you to explore all the features and think of how to deal with them. While you’re at it, don’t forget to share your insights in the comments!
For now, let’s see how the features correlate with our target feature – SalePrice:
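A quick way to list these correlations, as a sketch:

corr = train.select_dtypes('number').corr()
corr['SalePrice'].sort_values(ascending=False).head(10)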
We can see that most of the features that we looked at above are also highly correlated with our target feature. So let’s try to visualize their relationship with the target feature.
I will save all of them in my “top_features” list for reference later on.
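A sketch of this step; the exact list of features is my illustrative choice based on the correlations above:

top_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
                'TotalBsmtSF', '1stFlrSF', 'TotRmsAbvGrd', 'YearBuilt']

fig, axes = plt.subplots(4, 2, figsize=(12, 16))
for ax, col in zip(axes.flatten(), top_features):
    ax.scatter(train[col], train['SalePrice'], alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel('SalePrice')
plt.tight_layout()
plt.show()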
OK, we have plotted these values, but what do you infer from them?
Well, you must have noticed that in most of these plots a few points sit far away from the rest and break the overall pattern of the feature. These are called outliers. Outliers distort the mean and standard deviation of the dataset, which in turn can distort our predicted values.
For example, in the GrLivArea feature, notice those two points in the bottom right? An above-ground living area of more than 4,500 square feet selling for barely 200,000, while plenty of houses with around 3,000 square feet sell for well upwards of 200,000! Seems a bit strange, doesn't it?
Let’s take another example, this time of TotalBsmtSF. Notice the point in the bottom right? It doesn’t make sense.
These outlier values need to be dealt with or they will affect our predictions. There are a number of ways to do that, and we'll handle them in the preprocessing step next.
Right – we saw how there were a few outliers in our top correlated features above. Although there are a couple of ways to deal with outliers in data, I will be dropping them here.
A common rule of thumb is to treat any value lying more than 1.5 times the IQR (interquartile range) below the first quartile or above the third quartile as an outlier. We will use that rule to detect our outliers:
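A minimal sketch of the idea; the features I filter here are an illustrative choice, and you may prefer a different list or a different rule:

def drop_outliers(df_in, col):
    # keep only rows within 1.5 * IQR of the first and third quartiles
    q1, q3 = df_in[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df_in[(df_in[col] >= q1 - 1.5 * iqr) & (df_in[col] <= q3 + 1.5 * iqr)]

for col in ['GrLivArea', 'TotalBsmtSF', 'GarageArea', 'TotRmsAbvGrd']:
    train = drop_outliers(train, col)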
These were our top features containing outlier points. Since we have dropped these points, let’s have a look at how many rows we are left with:
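A quick check:

print(train.shape)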
We have dropped a few rows as they would have affected our predictions later on.
Feature Transformation
Before we start handling the missing values in the data, I am going to make a few tweaks to the train and test dataframes.
I am going to concatenate the train and test dataframes into a single dataframe. This will make it easier to manipulate their data. Along with that, I will make a few changes to each of them:
Store the number of rows in train dataframe to separate train and test dataframe later on
Drop Id from train and test because it is not relevant for predicting sale prices
Take the log transformation of target feature using np.log() to deal with the skewness in the data
Drop the target feature as it is not present in test dataframe
Concatenate train and test datasets
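Here is a sketch of those steps; keeping a copy of the test Ids is an extra convenience for building the submission file later:

import numpy as np

ntrain = train.shape[0]            # remember where train ends so we can split again later
test_id = test['Id']               # kept aside for the submission file

train = train.drop('Id', axis=1)
test = test.drop('Id', axis=1)

y = np.log(train['SalePrice'])     # log transform to reduce the right skew
train = train.drop('SalePrice', axis=1)

df = pd.concat([train, test]).reset_index(drop=True)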
Have a look at how the log transformation affected our target feature. The distribution now seems to be symmetrical and is more normally distributed:
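You can plot the transformed target the same way as before:

sns.distplot(y)
plt.show()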
Now it’s time to handle the missing data!
Handling missing data
Let’s have a look at how many missing values are present in our data:
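A quick count per column:

df.isnull().sum().sort_values(ascending=False).head(20)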
There seem to be quite a few missing values in our dataset. What do you think could be the reason for this? Here’s a hint – take a look at the data description file and try to figure it out.
For some features, 'NA' is actually a legitimate category meaning the feature is absent, rather than a missing measurement! This sounds strange, so let me show you why that's the case:
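A small sanity check along these lines makes the point – almost every house has a missing PoolQC, and almost every house has a pool area of zero:

print(df['PoolQC'].isnull().sum())
print((df['PoolArea'] == 0).sum())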
For example, NA in PoolQC feature means no pool is present in the house! This is treated as a null (or np.nan) value by Pandas and similar values are present in quite a few categorical features.
I will replace the null values in categorical features with a ‘None’ value.
For ordinal features, however, I will replace the null values with 0 and the remaining values with an increasing set of numbers. This is called Label Encoding and is used to capture the trend in an ordinal feature.
The null values in nominal features will be handled by replacing them with ‘None’ value which will be treated during One-Hot Encoding of the dataset.
Finally, the missing values in numerical features will be treated by replacing them with either a 0 or some other statistical value:
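A sketch of this, assuming an ordinal quality scale for the quality-type columns; the exact mapping and column lists are an illustrative choice:

qual_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# ordinal features: a missing value means "not present", so it becomes 0
for col in ['PoolQC', 'FireplaceQu']:
    df[col] = df[col].map(qual_map).fillna(0)

# nominal features: a missing value simply means the feature is absent
for col in ['Alley', 'Fence', 'MiscFeature']:
    df[col] = df[col].fillna('None')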
A null value in Garage features means that there is no garage in the house. These values will be handled the same way as mentioned above:
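A sketch for the garage columns, reusing the qual_map defined above:

for col in ['GarageType', 'GarageFinish']:
    df[col] = df[col].fillna('None')
for col in ['GarageQual', 'GarageCond']:
    df[col] = df[col].map(qual_map).fillna(0)
for col in ['GarageYrBlt', 'GarageCars', 'GarageArea']:
    df[col] = df[col].fillna(0)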
A null value in basement features indicates an absence of the basement and will be handled as mentioned above:
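And similarly for the basement columns:

for col in ['BsmtQual', 'BsmtCond']:
    df[col] = df[col].map(qual_map).fillna(0)
for col in ['BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']:
    df[col] = df[col].fillna('None')
for col in ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
            'BsmtFullBath', 'BsmtHalfBath']:
    df[col] = df[col].fillna(0)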
Null values in the remaining features can also be handled in a similar fashion:
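A sketch for the rest; filling with the median or the mode is one reasonable choice of statistic:

df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())
df['MasVnrType'] = df['MasVnrType'].fillna('None')
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
for col in ['Electrical', 'MSZoning', 'KitchenQual', 'Functional',
            'SaleType', 'Exterior1st', 'Exterior2nd', 'Utilities']:
    df[col] = df[col].fillna(df[col].mode()[0])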
Now that we have dealt with the missing values, we can label encode a few other ordinal features to convert them to numerical values. This retains the ordering within each feature, so the regression model can make use of it.
Honestly, feature engineering is perhaps THE most important aspect of Kaggle competitions. A quick glance at previous winning solutions will show you how important feature engineering is. It’s often the difference between a top 20 percentile finish and a mid-leaderboard position.
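For example, the exterior, heating, and kitchen quality columns use the same Ex/Gd/TA/Fa/Po scale, so we can reuse qual_map (an illustrative selection of columns):

for col in ['ExterQual', 'ExterCond', 'HeatingQC', 'KitchenQual']:
    df[col] = df[col].map(qual_map)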
We can make new features from existing data in the dataset to capture some trends in the data that might not be explicit. This makes the already existing data more useful. For example, adding a new feature that indicates the total square feet of the house is important as a house with a greater area will sell for a higher price. Similarly, a feature telling whether the house is new or not will be important as new houses tend to sell for higher prices compared to older ones.
I have made some new features below. I encourage you to go through the data yourself and see if you can come up with other useful features.
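A few illustrative examples of such features:

df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']   # total living space
df['HouseAge'] = df['YrSold'] - df['YearBuilt']                       # age at the time of sale
df['IsNew'] = (df['YrSold'] == df['YearBuilt']).astype(int)           # sold in the year it was built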
All these steps that I performed here are part of feature engineering. You can read more about them in detail in this article.
Preparing Data for Prediction
Since there are a lot of categorical features in the dataset, we need to apply one-hot encoding. This converts the categorical data into numbers so that the regression model can understand which category each value belongs to:
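Pandas does this in one line on the combined dataframe:

df = pd.get_dummies(df)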
Because we had combined training and testing datasets into a single dataframe at the beginning, it is now time to separate the two:
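Using the row count we stored earlier:

train_final = df.iloc[:ntrain]
test_final = df.iloc[ntrain:]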
Finally, I will split our train dataframe into training and validation datasets. This will allow us to train our model and validate its predictions without having to look at the testing dataset!
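A sketch with scikit-learn's train_test_split; the 80/20 split and the random seed are my choices:

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(train_final, y, test_size=0.2, random_state=42)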
Let’s try to predict the values using linear regression. It is the simplest regression model and you can read more about it in detail in this article.
Linear regression model
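A minimal sketch of fitting the model and scoring it on the validation set:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)
val_preds = lr.predict(X_val)
print(np.sqrt(mean_squared_error(y_val, val_preds)))   # RMSE on the log-price scale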
We are looking at the RMSE score here because that is the competition's evaluation metric (strictly, the RMSE between the logarithms of the predicted and observed sale prices, which is exactly what we are computing on our log-transformed target). We got a pretty decent score without doing a lot. Now let's see whether we can improve it using another classic machine learning technique.
Ridge regression model
Ridge regression is a type of linear regression that adds regularization to the model. Now, what is regularization?
Regularization shrinks some feature coefficients towards zero to minimize their effect on predicting the output value.
You can study more about regularization in this article.
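A sketch of trying a few alpha values; the grid itself is an illustrative choice:

from sklearn.linear_model import Ridge

for alpha in [0.1, 1, 3, 5, 10]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    val_preds = ridge.predict(X_val)
    print(alpha, np.sqrt(mean_squared_error(y_val, val_preds)))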
We are getting the lowest RMSE score with an alpha value of 3. Since I got the lowest RMSE with Ridge regression, I will be using this model for my final submission:
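One way to produce the final predictions; refitting on all the training rows is my choice here, and you could equally keep the train/validation split:

ridge = Ridge(alpha=3)
ridge.fit(train_final, y)
test_preds = ridge.predict(test_final)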
But before submitting, we need to take the inverse of the log transformation that we did while training the model. This is done using the np.exp() function:
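That's a single call:

final_preds = np.exp(test_preds)   # undo the earlier np.log on SalePrice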
Now we can create a new dataframe for submitting the results:
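The submission needs an Id column and a SalePrice column; here we use the test Ids we kept aside earlier:

submission = pd.DataFrame({'Id': test_id.values, 'SalePrice': final_preds})
submission.to_csv('submission.csv', index=False)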
Once you have created your submission file, it will appear in the output folder, which you can access from the right-hand side panel of the notebook.
You can download your submission file from there. Once you have done that, just drag and drop it into the upload space provided in the Submit Predictions tab on the competition page.
End Notes
And just like that, you have made your very first Kaggle submission. Congrats!
Going forward, I encourage you to get your hands dirty with this competition and try to improve on the score we achieved here. You can go on to explore feature engineering further and employ ensemble learning for better results.
Now go on and Kaggle your way to becoming a data science master!
I am on a journey to becoming a data scientist. I love to unravel trends in data, visualize them, and predict the future with ML algorithms! But the most satisfying part of this journey is sharing my learnings, from the challenges that I face, with the community to make the world a better place!
Hello Annirudh Sir, I am Prajwal Adhav from the second year (Automobile Engineering), but I want to switch to the data science field. Please tell me, is it possible to do so? And how do I start this long journey?
Siddharth Chi
Informative.. thank you
Finder lards
Hello, good job! Can you explain why np.log is required? It is not clear why it normalizes the distribution.
Hi! Log brings large values closer together. If we have data containing values like 10, 20, 50, ... and then some values on the higher end like 1000, 2000, etc., then taking the log transformation gives us values like 1, 1.3, 1.69, ... for the former and around 3, 3.3 for the latter, bringing all of them much closer together. This way we get a more normal distribution. I hope this helps.