Introduction
Categorical variables are known to hide and mask lots of interesting information in a data set. It’s crucial to learn the methods of dealing with such variables. If you won’t, many a times, you’d miss out on finding the most important variables in a model. It has happened with me. Initially, I used to focus more on numerical variables. Hence, never actually got an accurate model. But, later I discovered my flaws and learnt the art of dealing with such variables.
If you are a smart data scientist, you’d hunt down the categorical variables in the data set, and dig out as much information as you can. Right? But if you are a beginner, you might not know the smart ways to tackle such situations. Don’t worry. I am here to help you out.
After receiving a lot of requests on this topic, I decided to write down a clear approach to help you improve your models using categorical variables.
Note: This article is best written for beginners and newly turned predictive modelers. If you are an expert, you are welcome to share some useful tips of dealing with categorical variables in the comments section below.
What are the key challenges with categorical variable?
I’ve had nasty experience dealing with categorical variables. I remember working on a data set, where it took me more than 2 days just to understand the science of categorical variables. I’ve faced many such instances where error messages didn’t let me move forward. Even, my proven methods didn’t improve the situation.
But during this process, I learnt how to solve these challenges. I’d like to share all the challenges I faced while dealing with categorical variables. You’d find:
- A categorical variable has too many levels. This pulls down performance level of the model. For example, a cat. variable “zip code” would have numerous levels.
- A categorical variable has levels which rarely occur. Many of these levels have minimal chance of making a real impact on model fit. For example, a variable ‘disease’ might have some levels which would rarely occur.
- There is one level which always occurs i.e. for most of the observations in data set there is only one level. Variables with such levels fail to make a positive impact on model performance due to very low variation.
- If the categorical variable is masked, it becomes a laborious task to decipher its meaning. Such situations are commonly found in data science competitions.
- You can’t fit categorical variables into a regression equation in their raw form. They must be treated.
- Most of the algorithms (or ML libraries) produce better result with numerical variable. In python, library “sklearn” requires features in numerical arrays. Look at the below snapshot. I have applied random forest using sklearn library on titanic data set (only two features sex and pclass are taken as independent variables). It has returned an error because feature “sex” is categorical and has not been converted to numerical form.Python Code:
import pandas as pd
import numpy as np
import sklearn as sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
features = ['pclass','sibsp','parch','sex']
x_train = train[list(features)].values
y_train = train['survived'].values
x_test = test[list(features)].values
rf = RandomForestClassifier(n_estimators=50)
rf.fit(x_train, y_train)
sur=rf.predict(x_test)
Proven methods to deal with Categorical Variables
Here are some methods I used to deal with categorical variable(s). A trick to get good result from these methods is ‘Iterations’. You must know that all these methods may not improve results in all scenarios, but we should iterate our modeling process with different techniques. Later, evaluate the model performance. Below are the methods:
Convert to Number
- Convert to number: As discussed above, some ML libraries do not take categorical variables as input. Thus, we convert them into numerical variables. Below are the methods to convert a categorical (string) input to numerical nature:
- Label Encoder: It is used to transform non-numerical labels to numerical labels (or nominal categorical variables). Numerical labels are always between 0 and n_classes-1. A common challenge with nominal categorical variable is that, it may decrease performance of a model. For example: We have two features “age” (range: 0-80) and “city” (81 different levels). Now, when we’ll apply label encoder to ‘city’ variable, it will represent ‘city’ with numeric values range from 0 to 80. The ‘city’ variable is now similar to ‘age’ variable since both will have similar data points, which is certainly not a right approach.
- Convert numeric bins to number: Let’s say, bins of a continuous variable are available in the data set (shown below).
Above, you can see that variable “Age” has bins (0-17, 17-25, 26-35 …). We can convert these bins into definite numbers using the following methods:
-
-
- Using label encoder for conversion. But, these numerical bins will be treated same as multiple levels of non-numeric feature. Hence, wouldn’t provide any additional information
- Create a new feature using mean or mode (most relevant value) of each age bucket. It would comprise of additional weight for levels.
- Create two new features, one for lower bound of age and another for upper bound. In this method, we’ll obtain more information about these numerical bins compare to earlier two methods.
Combine Levels
- Combine levels: To avoid redundant levels in a categorical variable and to deal with rare levels, we can simply combine the different levels. There are various methods of combining levels. Here are commonly used ones:
- Using Business Logic: It is one of the most effective method of combining levels. It makes sense also to combine similar levels into similar groups based on domain or business experience. For example, we can combine levels of a variable “zip code” at state or district level. This will reduce the number of levels and improve the model performance also.
- Using frequency or response rate: Combining levels based on business logic is effective but we may always not have the domain knowledge. Imagine, you are given a data set from Aerospace Department, US Govt. How would you apply business logic here? In such cases, we combine levels by considering the frequency distribution or response rate.
- To combine levels using their frequency, we first look at the frequency distribution of of each level and combine levels having frequency less than 5% of total observation (5% is standard but you can change it based on distribution). This is an effective method to deal with rare levels.
- We can also combine levels by considering the response rate of each level. We can simply combine levels having similar response rate into same group.
- Finally, you can also look at both frequency and response rate to combine levels. You first combine levels based on response rate then combine rare levels to relevant group.
Dummy Coding
- Dummy Coding: Dummy coding is a commonly used method for converting a categorical input variable into continuous variable. ‘Dummy’, as the name suggests is a duplicate variable which represents one level of a categorical variable. Presence of a level is represent by 1 and absence is represented by 0. For every level present, one dummy variable will be created. Look at the representation below to convert a categorical variable using dummy variable.
Note: Assume, we have 500 levels in categorical variables. Then, should we create 500 dummy variables? If you can automate it, very well. Or else, I’d suggest you to first, reduce the levels by using combining methods and then use dummy coding. This would save your time.This method is also known as “One Hot Encoding“.
End Notes
In this article, we discussed the challenges you might face while dealing with categorical variable in modelling. We also discussed various methods to overcome those challenge and improve model performance. I’ve used Python for demonstration purpose and kept the focus of article for beginners.
In order to keep article simple and focused towards beginners, I have not described advanced methods like “feature hashing”. I will take it up as a separate article in itself in future.
You must understand that these methods are subject to the data sets in question. I’ve seen even the most powerful methods failing to bring model improvement. Whereas, a basic approach can do wonders. Hence, you must understand the validity of these models in context to your data set. If you still face any trouble, I shall help you out in comments section below.
Did you find this article helpful ? Do you know of other methods which work well with categorical variables? Please share your thoughts in the comments section below. I’d love to hear you.
Sunil Ray is Chief Content Officer at Analytics Vidhya, India's largest Analytics community. I am deeply passionate about understanding and explaining concepts from first principles. In my current role, I am responsible for creating top notch content for Analytics Vidhya including its courses, conferences, blogs and Competitions.
I thrive in fast paced environment and love building and scaling products which unleash huge value for customers using data and technology. Over the last 6 years, I have built the content team and created multiple data products at Analytics Vidhya.
Prior to Analytics Vidhya, I have 7+ years of experience working with several insurance companies like Max Life, Max Bupa, Birla Sun Life & Aviva Life Insurance in different data roles.
Industry exposure: Insurance, and EdTech
Major capabilities: Content Development, Product Management, Analytics, Growth Strategy.
Sunil, Thanks for sharing your thoughts and experience on how to treat Categorical Variables in a dataset! Can you elaborate more on combining levels based on Response Rate and Frequnecy Distribution? Or any pointers is highly appreciated. Thanks!
I would also appreciate the answer to this question. What do you mean by "response rate' in this context?
Hi Sunil, Thank you for great article. Can you explain how to calculate response rate or what does response rate mean ?. I tried googling but I am unable to relate to this particular data science context.
Please elaborate more on response rate ?