Kaggle is the market leader when it comes to data science hackathons. I started my own data science journey by combining my learning on both Analytics Vidhya and Kaggle – a combination that helped me augment my theoretical knowledge with practical, hands-on coding.
Now, here’s the thing about Kaggle. It has a vast collection of datasets and data science competitions but that can quickly become overwhelming for any beginner. I remember browsing through Kaggle during my initial data science days and thinking, “where do I even begin?”. Given the expertise involved, it’s quite a daunting prospect for newcomers.
In this article, I am going to ease that transition for you.
We will understand how to make your first submission on Kaggle by working through their House Price competition. We’ll go through the different steps you would need to take in order to ace these Kaggle competitions, such as feature engineering, dealing with outliers (data cleaning), and of course, model building.
You can also check out the DataHack platform, which hosts some very interesting data science competitions.
Please note that I’m assuming you’re familiar with Python and linear regression. If these are new concepts to you, you can learn or brush up here:
Kaggle notebooks are one of the best things about the entire Kaggle experience. These are free Jupyter notebooks that run in the browser and come with enough processing power to run most computationally hungry machine learning algorithms with ease!
Just check out the power of these notebooks (with the GPU on):
As I mentioned earlier, we will be working on the House Prices prediction challenge. You can follow the processes in this article by working alongside your own Kaggle notebook.
Just head to the House Prices competition page, join the competition, then head to the Notebooks tab and click Create New Notebook. You should see the following screen:
Here, you have to choose the coding language and accelerator settings you require and hit the Create button:
Your very own Kaggle notebook will load up with the basic libraries already imported for you. Additionally, you can access the training data directly from here and whatever changes you make here will be automatically saved. What more do you need?
Now let’s get cracking on that competition!
Once we have our Kaggle notebook ready, we will load all the datasets in the notebook. In this competition, we are provided with two files – the training and test files. We will load these datasets using Pandas’ read_csv() function:
import pandas as pd

# Adjust the paths to wherever the competition files live; in a Kaggle notebook
# they are under '../input/house-prices-advanced-regression-techniques/'
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.shape, test.shape)
Let’s have a look at our dataset using the DataFrame.head() function which by default outputs the top 5 rows of the dataset:
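A minimal snippet for this step, using the train dataframe loaded above:

# Preview the first five rows of the training data
train.head()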
The dataset has 81 columns. The ‘SalePrice’ column is our target variable, which we will predict using the remaining columns. We can also observe that there is a mix of categorical and continuous columns, and that there are some missing values in the data. Let us explore the data in detail in the next section.
The first step in data exploration is to have a look at the columns in the dataset and what values they represent. We can do this using the DataFrame.info() function:
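A one-liner does the job here:

# Column names, non-null counts and data types for all columns
train.info()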
Note: You can read about what these features represent in the data description file provided on the competition page.
You will notice that quite a few of the features contain missing values. Before the model building process, we will have to impute these missing values. That’s a preprocessing step and we will handle it in a later section.
But first, let us explore our target feature using the DataFrame.describe() function:
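A short snippet for this step:

# Summary statistics (count, mean, std, min, quartiles, max) of the target
train['SalePrice'].describe()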
Here, 25%, 50%, and 75% denote the values at 25th, 50th, and 75th percentile respectively. So, from the output, we can make out that 75% of our values are below 214,000 whereas the maximum sale price of a house is 755,000. There is a significant difference between these two which clearly denotes that the target variable has some outliers.
Read more about percentiles here.
Let’s visualize the distribution in the SalePrice feature using the sns.distplot() function in Seaborn:
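A minimal plotting sketch. Note that distplot() is deprecated in recent Seaborn releases, so the snippet below uses histplot() with a KDE overlay, which produces an equivalent plot:

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram plus kernel density estimate of the target
sns.histplot(train['SalePrice'], kde=True)
plt.show()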
You can see that a lot of the sale prices are clustered in the 100,000 to 200,000 range. But, due to the very high sale prices of a few houses, the distribution is not symmetric about its center. This asymmetry in a data distribution is called skewness. In our case, the distribution is positively skewed (or right-skewed).
Note: You can read more about skewness here.
We can check the skewness in our data explicitly using the DataFrame.skew() function:
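For example:

# A positive value indicates a right-skewed (positively skewed) distribution
print(train['SalePrice'].skew())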
We have got a positive value here because our data distribution is skewed towards the right due to the high sale prices of some houses.
Our problem requires us to predict the sale price of houses – a regression problem. So, the first model that we will fit to our dataset is a linear regression model. But the skewness in our target feature poses a problem for a linear model: normally distributed errors are one of the assumptions of linear regression, and a heavily skewed target lets a few extreme values have an outsized effect on the fit. We’ll handle this later when we transform our features.
For now, let’s have a look at how our features are correlated with each other using a heatmap in Seaborn:
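A sketch of the heatmap, restricted to the numeric columns (non-numeric columns cannot be correlated directly):

# Correlation matrix of the numeric features, visualised as a heatmap
corr = train.select_dtypes(include='number').corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap='coolwarm')
plt.show()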
Heatmaps are a great tool to quickly visualize how a feature correlates with the remaining features. Some striking correlations between features that I can see from the heatmap are:
It seems obvious that the total number of rooms above the ground should increase with increasing living area above ground:
This relationship is interesting because we can see a linear relationship forming between the year the house was built and the year the garage was built. Think about it – it seems intuitive that garages would have been built either at the same time as the house or after it was constructed, not before it. That is why most of the points lie on or below the line.
Again, we can see a linear relationship between these two features, and most of the dots lie below the line: most houses have a basement area less than or equal to the first-floor area, although some houses do have a basement area larger than the first-floor area. What do you think the reason could be? I would love to read it in the comments below!
Again, the number of cars that can fit in a garage would increase with its area. You can do a lot more analysis and I encourage you to explore all the features and think of how to deal with them. While you’re at it, don’t forget to share your insights in the comments!
For now, let’s see how the features correlate with our target feature – SalePrice:
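A quick way to see this, reusing the correlation matrix computed in the sketch above:

# Correlation of every numeric feature with the target, strongest first
corr_with_target = corr['SalePrice'].sort_values(ascending=False)
print(corr_with_target.head(10))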
We can see that most of the features that we looked at above are also highly correlated with our target feature. So let’s try to visualize their relationship with the target feature.
I will save all of them in my “top_features” list for reference later on.
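A sketch of how such a list might be built and plotted. The column names below are ones that typically rank highest in this dataset; your own list may differ depending on the correlation output above:

# Features most strongly correlated with SalePrice
top_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
                'TotalBsmtSF', '1stFlrSF', 'YearBuilt', 'GarageYrBlt']

# Scatter plot of each top feature against the target
for col in top_features:
    sns.scatterplot(data=train, x=col, y='SalePrice', alpha=0.5)
    plt.show()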
Ok, we have plotted these values, but what do you infer from them?
Well, you must have noticed that in most of these plots some points lie far away from the rest and break the pattern of the feature. These are called outliers. Outliers affect the mean and standard deviation of the dataset, which can distort our predictions.
For example, in the feature GrLivArea, notice those two points in the bottom right? An above-ground living area of more than 4,500 square feet selling for under 200,000, while houses with around 3,000 square feet sell for considerably more! Seems a bit strange, doesn’t it?
Let’s take another example, this time of TotalBsmtSF. Notice the point in the bottom right? It doesn’t make sense.
These outlier values need to be dealt with or they will affect our predictions. There are a number of ways to handle them, and we’ll do so in the preprocessing section next.
Note: You can read more about outliers here.
Right – we saw how there were a few outliers in our top correlated features above. Although there are a couple of ways to deal with outliers in data, I will be dropping them here.
A common rule of thumb is that any value lying more than 1.5 * IQR (interquartile range) below the first quartile or above the third quartile of a feature is considered an outlier. We will use this rule to detect our outliers:
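A sketch of the IQR rule applied to a few of the top features; which columns you filter on (and whether you clip or drop) is a judgement call:

# Drop rows whose value in any of these features lies outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
for col in ['GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'GarageArea']:
    q1, q3 = train[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    train = train[(train[col] >= lower) & (train[col] <= upper)]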
These were our top features containing outlier points. Since we have dropped these points, let’s have a look at how many rows we are left with:
(1327, 81)
We have dropped a few rows as they would have affected our predictions later on.
Before we start handling the missing values in the data, I am going to make a few tweaks to the train and test dataframes.
I am going to concatenate the train and test dataframes into a single dataframe. This will make it easier to manipulate their data. Along with that, I will make a few changes to each of them:
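A sketch of these tweaks, under the assumption that we log-transform the target, keep the test Ids for the submission file, and drop the Id and SalePrice columns before concatenating (the names y, test_ids, n_train and df are my own):

import numpy as np

# Log-transform the skewed target to make its distribution more symmetric
train['SalePrice'] = np.log(train['SalePrice'])
y = train['SalePrice']

# Keep the test Ids for the submission file and remember where train ends
test_ids = test['Id']
n_train = train.shape[0]

# Concatenate the two dataframes, dropping columns that are not predictors
df = pd.concat([train.drop(['Id', 'SalePrice'], axis=1),
                test.drop('Id', axis=1)], ignore_index=True)
print(df.shape)

# Visualise the transformed target
sns.histplot(y, kde=True)
plt.show()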
Have a look at how the log transformation affected our target feature. The distribution now seems to be symmetrical and is more normally distributed:
Now it’s time to handle the missing data!
Let’s have a look at how many missing values are present in our data:
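A quick check:

# Count missing values per column, most affected first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])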
There seem to be quite a few missing values in our dataset. What do you think could be the reason for this? Here’s a hint – take a look at the data description file and try to figure it out.
There are some features where ‘NA’ is a legitimate value according to the data description rather than genuinely missing data! This may seem strange, so let me show you why that’s the case:
For example, NA in the PoolQC feature means that the house has no pool! Pandas still reads it as a null (or np.nan) value, and similar values are present in quite a few categorical features.
I will replace the null values in categorical features with a ‘None’ value.
For ordinal features, however, I will replace the null values with 0 and the remaining values with an increasing set of numbers. This is called Label Encoding and is used to capture the trend in an ordinal feature.
The null values in nominal features will be handled by replacing them with a ‘None’ value, which will then simply become another category during one-hot encoding of the dataset.
Finally, the missing values in numerical features will be treated by replacing them with either a 0 or some other statistical value such as the median.
A null value in the garage features means that there is no garage in the house, and a null value in the basement features indicates the absence of a basement – both groups will be handled in the way described above.
Null values in the remaining features can be handled in a similar fashion:
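Below is a consolidated imputation sketch. The column groupings are based on the data description file; treat the exact lists and fill values as my assumptions rather than the only valid choice:

# Categorical features where NA means "not present" -> fill with 'None'
none_cols = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
             'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
             'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
             'BsmtFinType2', 'MasVnrType']
df[none_cols] = df[none_cols].fillna('None')

# Numerical features where NA means "not present" -> fill with 0
zero_cols = ['GarageYrBlt', 'GarageArea', 'GarageCars', 'BsmtFinSF1',
             'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath',
             'BsmtHalfBath', 'MasVnrArea']
df[zero_cols] = df[zero_cols].fillna(0)

# Remaining numerical feature -> fill with a statistic such as the median
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())

# Remaining categorical features -> fill with the most frequent value
for col in ['MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st',
            'Exterior2nd', 'SaleType', 'Functional', 'Utilities']:
    df[col] = df[col].fillna(df[col].mode()[0])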
Now that we have dealt with the missing values, we can label encode a few other ordinal features to convert them to numerical values. This retains the order inherent in the feature so that the regression model can make use of it.
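A sketch of label encoding the quality and condition scales, assuming they were filled with ‘None’ above; the feature list is illustrative:

# Ordinal quality/condition scales -> map to increasing integers (0 = absent)
qual_map = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
for col in ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
            'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC']:
    df[col] = df[col].map(qual_map)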
Honestly, feature engineering is perhaps THE most important aspect of Kaggle competitions. A quick glance at previous winning solutions will show you how important feature engineering is. It’s often the difference between a top 20 percentile finish and a mid-leaderboard position.
We can make new features from existing data in the dataset to capture some trends in the data that might not be explicit. This makes the already existing data more useful. For example, adding a new feature that indicates the total square feet of the house is important as a house with a greater area will sell for a higher price. Similarly, a feature telling whether the house is new or not will be important as new houses tend to sell for higher prices compared to older ones.
I have made some new features below. I encourage you to go through the data yourself and see if you can come up with other useful features.
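Here is a sketch of a few such features; the new column names (TotalSF, IsNew, TotalBath) are my own:

# Total living area of the house (basement + first floor + second floor)
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

# Flag for houses sold in the same year they were built
df['IsNew'] = (df['YrSold'] == df['YearBuilt']).astype(int)

# Total number of bathrooms, counting half-baths as 0.5
df['TotalBath'] = (df['FullBath'] + 0.5 * df['HalfBath'] +
                   df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath'])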
All these steps that I performed here are part of feature engineering. You can read more about them in detail in this article.
Since there are a lot of categorical features in the dataset, we need to apply one-hot encoding to it. This converts categorical data into numbers so that the regression model can understand which category each value belongs to:
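Pandas’ get_dummies() handles this in one line:

# One-hot encode the remaining categorical columns
df = pd.get_dummies(df)
print(df.shape)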
Because we combined the training and testing datasets into a single dataframe earlier, it is now time to separate the two:
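Using the n_train value stored in the earlier sketch, before the concatenation:

# Split the combined dataframe back into its train and test parts
X = df[:n_train]
X_test = df[n_train:]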
Finally, I will split our train dataframe into training and validation datasets. This will allow us to train our model and validate its predictions without having to look at the testing dataset!
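A standard hold-out split with scikit-learn:

from sklearn.model_selection import train_test_split

# Hold out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)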
Let’s try to predict the values using linear regression. It is the simplest regression model and you can read more about it in detail in this article.
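A minimal fit-and-evaluate sketch; note that the RMSE is computed on the log-transformed target:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)

# RMSE on the validation set (target is still on the log scale)
val_preds = lr.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, val_preds))
print('Linear Regression RMSE:', rmse)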
We are looking at the RMSE here because the competition evaluates submissions on the root mean squared error between the logarithm of the predicted and observed sale prices. We got a pretty decent score without doing a lot. Now let’s see whether we can improve it using another classic machine learning technique.
Ridge regression is a type of linear regression model that applies regularization to the feature coefficients. Now, what is regularization?
Regularization shrinks some feature coefficients towards zero to minimize their effect on predicting the output value.
You can study more about regularization in this article.
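A sketch of trying a few regularisation strengths on the validation set; the alpha grid below is my own choice:

from sklearn.linear_model import Ridge

# Compare validation RMSE for a few values of alpha
for alpha in [0.1, 1, 3, 5, 10]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    preds = ridge.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    print('alpha =', alpha, 'RMSE =', rmse)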
We are getting the lowest RMSE score with an alpha value of 3. Since I got the lowest RMSE with Ridge regression, I will be using this model for my final submission:
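A sketch of the final fit, refitting on all the training data:

# Refit Ridge with the chosen alpha and predict the test set
final_model = Ridge(alpha=3)
final_model.fit(X, y)
test_preds = final_model.predict(X_test)   # still on the log scale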
But before submitting, we need to take the inverse of the log transformation that we did while training the model. This is done using the np.exp() function:
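Continuing with the names from the sketch above:

# Undo the log transformation applied to SalePrice during training
final_preds = np.exp(test_preds)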
Now we can create a new dataframe for submitting the results:
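Assuming the test_ids saved earlier:

# Build the submission file in the format expected by the competition
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': final_preds})
submission.to_csv('submission.csv', index=False)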
Once you have created your submission file, it will appear in the output folder which you can access on the right-hand side panel as shown below:
You can download your submission file from here. Once you have done that, just drag and drop it in the upload space provided in the Submit Predictions tab on the competition page:
And just like that, you have made your very first Kaggle submission. Congrats!
Going forward, I encourage you to get your hands dirty with this competition and try to improve the score that we have achieved here. You can go on to explore feature engineering further and employ ensemble learning for better results.
Now go on and Kaggle your way to becoming a data science master!
Hello, good job! Can you explain why np.log is required? It is not clear why it normalizes the distribution.
Hi! The log brings large values much closer to the rest. If we have data containing values like 10, 20, 50, ... and then some values on the higher end like 1,000, 2,000, etc., then taking logs (base 10 here for illustration – np.log uses the natural log, but the effect is the same) gives us values like 1, 1.3, 1.7, ..., while the higher values become 3, 3.3, etc., bringing everything much closer together. This way we get a more symmetric, roughly normal distribution. I hope this helps.