Employee Attrition prediction using Machine Learning is a crucial task for organizations aiming to retain valuable talent. This blog explores the process of building a predictive model for employee attrition using various machine learning techniques. We’ll dive into data exploration, cleaning, and preprocessing steps essential for creating an effective Employee Attrition prediction model. By analyzing factors such as job satisfaction, promotion history, and work environment, we can identify employees at risk of leaving. This predictive approach enables HR departments to take proactive measures, ultimately improving employee retention and maintaining a stable workforce.
This article was published as a part of the Data Science Blogathon
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import datasets
from sklearn.metrics import accuracy_score
attrdata = pd.read_csv("Table_1.csv")
Let’s look at our dataset and plan to torture it!
attrdata.head()
Output:
attrdata.drop(0,inplace=True)
attrdata.isnull().sum()
Output:
table id 0
name 0
phone number 0
Location 0
Emp. Group 0
Function 0
Gender 0
Tenure 0
Tenure Grp. 0
Experience (YY.MM) 4
Marital Status 0
Age in YY. 0
Hiring Source 0
Promoted/Non Promoted 0
Job Role Match 2
Stay/Left 0
dtype: int64
As we can see that there are null values in the Experience and Job role so we have to drop them.
attrdata.dropna(axis=0,inplace=True)
Output:
table id 0
name 0
phone number 0
Location 0
Emp. Group 0
Function 0
Gender 0
Tenure 0
Tenure Grp. 0
Experience (YY.MM) 0
Marital Status 0
Age in YY. 0
Hiring Source 0
Promoted/Non Promoted 0
Job Role Match 0
Stay/Left 0
dtype: int64
The shape of our dataset
attrdata.shape
Output:
(895, 16)
Now, we will use the value_counts function so that we can get the unique values from every categorical type of data.
Gender
gender_dict = attrdata["Gender "].value_counts()
gender_dict
Output:
Male 655
Female 234
other 6
Name: Gender , dtype: int64
attrdata['Gender '].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Count of different gender")
Output:
Here, from the chart, it’s visible that the count of males is more than another category of the gender.
Now, let’s figure out that how gender could be the reason for employees to leave the company or to stay in.
#Create a plot for crosstab
pd.crosstab(attrdata['Gender '],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Gender")
plt.xlabel("Stay/Left")
plt.ylabel("No of people who left based on gender")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)
Output:
Here, from the chart it’s visible that it heavily depends on males, also we can see that it’s either male, female or others but more number of them are staying in the company.
promoted_dict = attrdata["Promoted/Non Promoted"].value_counts()
promoted_dict
Output:
Promoted 457
Non Promoted 438
Name: Promoted/Non Promoted, dtype: int64
attrdata['Promoted/Non Promoted'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Promoted and Non Promoted")
Output:
Now, from the above chart, we can see that when it comes to Promoted and Non-Promoted employees it’s quiet in balanced number.
Now, let’s figure out that how promotion could be the reason for employees to leave the company or to stay in.
#Create a plot for crosstab
pd.crosstab(attrdata['Promoted/Non Promoted'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Promoted/Non Promoted")
plt.xlabel("Stay/Left")
plt.ylabel("No. of people who left/stay based on promotion")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)
Output:
Here, from the chart, it’s visible that the ones who are not promoted are leaving the company more as compared to the ones who are promoted which is also an obvious thing likely to happen.
func_dict = attrdata["Function"].value_counts()
func_dict
Output:
Operation 831
Support 52
Sales 12
Name: Function, dtype: int64
attrdata['Function'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Functions in organization")
Output:
Now, we can see that majority of the function performed by employees are Operation itself then support and at the last it’s sales.
Now, let’s figure out that how function could be the reason for employees to leave the company or to stay in.
#Create a plot for crosstab
pd.crosstab(attrdata['Function'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Function")
plt.xlabel("Stay/Left")
plt.ylabel("No. of people who left/stay based on function of organization")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)
Output:
Here, in the chart, we can see that the maximum number of employees are in the operation section and a high number of employees in the same section are staying in the company.
Hiring_dict = attrdata["Hiring Source"].value_counts()
Hiring_dict
Output:
Direct 708
Agency 116
Employee Referral 71
Name: Hiring Source, dtype: int64
Marital_dict = attrdata["Marital Status"].value_counts()
print(Marital_dict)
Output:
Single 533
Marr. 356
Sep. 2
Div. 2
NTBD 2
Name: Marital Status, dtype: int64
Employee group
Emp_dict = attrdata["Emp. Group"].value_counts()
Emp_dict['other group'] = 1
print(Emp_dict)
Output:
B1 537
B2 275
B3 59
B0 8
B4 7
B5 4
B7 2
D2 1
C3 1
B6 1
other group 1
Name: Emp. Group, dtype: int64
Job role match (Yes/ No)
job_dict = attrdata["Job Role Match"].value_counts()
job_dict
Output:
Yes 480
No 415
Name: Job Role Match, dtype: int64
attrdata['Job Role Match'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Job Role Match")
Output:
Now, we can see that majority of the employees have their correct role in Job.
#Create a plot for crosstab
pd.crosstab(attrdata['Job Role Match'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Job Role Match")
plt.xlabel("Stay/Left")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)
Output:
Here, in the above chart, we can see that the number of employees who got the correct job role is staying in the company rather than the ones who don’t have their right job role.
Tenure group
tenure_dict = attrdata["Tenure Grp."].value_counts()
print(tenure_dict)
Output:
> 1 & < =3 626
< =1 269
Name: Tenure Grp., dtype: int64
# Its Age vs stay/left
sns.jointplot(x='Stay/Left',y='Age in YY.',data=attrdata)
Output:
In the above graph, we can see that the ones who are having more age are staying back in the company rather than the ones who have comparatively less age.
sns.jointplot(x='Stay/Left',y='Experience (YY.MM)',data=attrdata)
Output:
Here in the above graph, we can see that the employees who have got more experience will be staying back in the company rather than the ones who have comparatively less experience.
Here, first, we are trying to get the correlation between variables where the dataset is not processed that’s why we are not able to see the results in the manner we want to, but in the latter half of the article, we will see the better correlation plot with the help of processed data.
# Let's make our correlation matrix visual
corr_matrix=attrdata.corr()
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,
annot=True,
linewidths=0.5,
fmt=".2f"
)
Output:
Build a new dictionary (location) to be used to categorize data columns after values are encoded. Here, in location_dict_new we are using integer values instead of the actual region name so that our machine learning model could interpret it.
location_dict = attrdata["Location"].value_counts()
print(location_dict)
location_dict_new = {
'Chennai': 7,
'Noida': 6,
'Bangalore': 5,
'Hyderabad': 4,
'Pune': 3,
'Madurai': 2,
'Lucknow': 1,
'other place': 0,
}
print(location_dict_new)
Output:
Chennai 255
Noida 236
Bangalore 210
Hyderabad 62
Pune 55
Madurai 29
Lucknow 20
Nagpur 14
Vijayawada 6
Mumbai 4
Gurgaon 3
Kolkata 1
Name: Location, dtype: int64
{'Chennai': 7, 'Noida': 6, 'Bangalore': 5, 'Hyderabad': 4, 'Pune': 3, 'Madurai': 2, 'Lucknow': 1, 'other place': 0}
Now we will make a function for the location column to make a new column where encoded location values will be there because our machine learning algorithm will only understand int/float values.
def location(x):
if str(x) in location_dict_new.keys():
return location_dict_new[str(x)]
else:
return location_dict_new['other place']
data_l = attrdata["Location"].apply(location)
attrdata['New Location'] = data_l
attrdata.head()
Output:
Pandas get_dummies() function is used for manipulating data, this function is used to convert the categorical values to dummy variables and the same thing has been done with:
gen = pd.get_dummies(attrdata["Function"])
gen.head()
Output:
hr = pd.get_dummies(attrdata["Hiring Source"])
hr.head()
Output:
Here, in Mar() functionwe are using Maritial dictionary keys to convert those categorical values into acceptable type values for our ML models.
def Mar(x):
if str(x) in Marital_dict.keys() and Marital_dict[str(x)] > 100:
return str(x)
else:
return 'other status'
data_l = attrdata["Marital Status"].apply(Mar)
attrdata['New Marital'] = data_l
attrdata.head()
Output:
Using the get_dummies to function for New Marital we are converting categorical values into dummy variables
Mr = pd.get_dummies(attrdata["New Marital"])
Mr.head()
Output:
Promoted/Not Promoted
Here, with the help of Promoted function, we are converting Promoted and Non promoted values into 1 and 0 respectively for encoding purposes.
def Promoted(x):
if x == 'Promoted':
return int(1)
else:
return int(0)
data_l = attrdata["Promoted/Non Promoted"].apply(Promoted)
attrdata['New Promotion'] = data_l
attrdata.head()
Output:
Here first, we are creating a dictionary for the employee group and tagging each group to the respective integer values, later we are creating an emp() function where the encoding of the categorical values is done – similar to marital status.
Emp_dict_new = {
'B1': 4,
'B2': 3,
'B3': 2,
'other group': 1,
}
def emp(x):
if str(x) in Emp_dict_new.keys():
return str(x)
else:
return 'other group'
data_l = attrdata["Emp. Group"].apply(emp)
attrdata['New EMP'] = data_l
emp = pd.get_dummies(attrdata["New EMP"])
attrdata.head()
Output:
Here, we are using the Job() function where categorical values are Yes and No which needs to be converted into integer values i.e. 1/0 then we are assigning the New Job Role Match.
def Job(x):
if x == 'Yes':
return int(1)
else:
return int(0)
data_l = attrdata["Job Role Match"].apply(Job)
attrdata['New Job Role Match'] = data_l
attrdata.head()
Output:
Here, we are using the Gen() function using gender_dict (dictionary) which will be encoded first using the dictionary keys, and then the changes will be applied to the dataset based on changes that are done.
def Gen(x):
if x in gender_dict.keys():
return str(x)
else:
return 'other'
data_l = attrdata["Gender "].apply(Gen)
attrdata['New Gender'] = data_l
attrdata.head()
Output:
get_dummies() function for the same purposes for New gender and Tenure groups.
gend = pd.get_dummies(attrdata["New Gender"])
gend.head()
Output:
tengrp = pd.get_dummies(attrdata["Tenure Grp."])
tengrp.head()
Output:
Now, let’s concatenate the columns which are being cleaned, sorted, and manipulated by us as processed data.
dataset = pd.concat([attrdata, hr, Mr, emp, tengrp, gen, gend], axis = 1)
dataset.head()
Output:
dataset.columns
Output:
Index(['table id', 'name', 'phone number', 'Location', 'Emp. Group',
'Function', 'Gender ', 'Tenure', 'Tenure Grp.', 'Experience (YY.MM)',
'Marital Status', 'Age in YY.', 'Hiring Source',
'Promoted/Non Promoted', 'Job Role Match', 'Stay/Left', 'New Location',
'New Marital', 'New Promotion', 'New EMP', 'New Job Role Match',
'New Gender', 'Agency', 'Direct', 'Employee Referral', 'Marr.',
'Single', 'other status', 'B1', 'B2', 'B3', 'other group', ' 1 & < =3', 'Operation', 'Sales', 'Support', 'Female', 'Male',
'other'],
dtype='object')
Let’s drop the columns which are not important anymore
dataset.drop(["table id", "name", "Marital Status","Promoted/Non Promoted","Function","Emp. Group","Job Role Match","Location"
,"Hiring Source","Gender ", 'Tenure', 'New Gender', 'New Marital', 'New EMP'],axis=1,inplace=True)
dataset1 = dataset.drop(['Tenure Grp.', 'phone number'], axis = 1)
dataset1.columns
Output:
Index(['Experience (YY.MM)', 'Age in YY.', 'Stay/Left', 'New Location',
'New Promotion', 'New Job Role Match', 'Agency', 'Direct',
'Employee Referral', 'Marr.', 'Single', 'other status', 'B1', 'B2',
'B3', 'other group', ' 1 & < =3', 'Operation', 'Sales',
'Support', 'Female', 'Male', 'other'],
dtype='object')
As I mentioned, this is the correlation plot on the processed dataset.
# Let's make our correlation matrix visual
corr_matrix=dataset1.corr()
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,
annot=True,
linewidths=0.5,
fmt=".2f"
)
Output:
Let’s see our target column
# Target
"""
def Target(x):
if x in "Stay":
return False
else:
return True
data_l = dataset1["Stay/Left"].apply(Target)
dataset1['Stay/Left'] = data_l
"""
dataset1['Stay/Left'].head()
Output:
1 Stay
2 Stay
3 Stay
4 Stay
5 Stay
Name: Stay/Left, dtype: object
Saving the cleaned dataset into another CSV file
dataset1.to_csv("processed table.csv")
Now, from the processed data we have to separate the features and target column again.
dataset = pd.read_csv("processed table.csv")
dataset = pd.DataFrame(dataset)
y = dataset["Stay/Left"]
X = dataset.drop("Stay/Left",axis=1)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=4)
X_train.head()
Output:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
lr=LogisticRegression(C = 0.1, random_state = 42, solver = 'liblinear')
dt=DecisionTreeClassifier()
rm=RandomForestClassifier()
gnb=GaussianNB()
knn = KNeighborsClassifier(n_neighbors=3)
svm = svm.SVC(kernel='linear')
Now, from one block of code, we will check the accuracy of all the model
for a,b in zip([lr,dt,knn,svm,rm,gnb],["Logistic Regression","Decision Tree","KNN","SVM","Random Forest","Naive Bayes"]):
a.fit(X_train,y_train)
prediction=a.predict(X_train)
y_pred=a.predict(X_test)
score1=accuracy_score(y_train,prediction)
score=accuracy_score(y_test,y_pred)
msg1="[%s] training data accuracy is : %f" % (b,score1)
msg2="[%s] test data accuracy is : %f" % (b,score)
print(msg1)
print(msg2)
Output:
[Logistic Regression] training data accuracy is : 0.891061
[Logistic Regression] test data accuracy is : 0.877095
[Decision Tree] training data accuracy is : 1.000000
[Decision Tree] test data accuracy is : 0.849162
[KNN] training data accuracy is : 0.804469
[KNN] test data accuracy is : 0.586592
[SVM] training data accuracy is : 0.878492
[SVM] test data accuracy is : 0.865922
[Random Forest] training data accuracy is : 1.000000
[Random Forest] test data accuracy is : 0.888268
[Naive Bayes] training data accuracy is : 0.870112
[Naive Bayes] test data accuracy is : 0.826816
model_scores={'Logistic Regression':lr.score(X_test,y_test),
'KNN classifier':knn.score(X_test,y_test),
'Support Vector Machine':svm.score(X_test,y_test),
'Random forest':rm.score(X_test,y_test),
'Decision tree':dt.score(X_test,y_test),
'Naive Bayes':gnb.score(X_test,y_test)
}
model_scores
Output:
{'Logistic Regression': 0.8770949720670391,
'KNN classifier': 0.5865921787709497,
'Support Vector Machine': 0.8659217877094972,
'Random forest': 0.888268156424581,
'Decision tree': 0.8491620111731844,
'Naive Bayes': 0.8268156424581006}
Here, we can see that Logistic regression and Random forest have the best accuracy.
from sklearn.metrics import classification_report
rm_y_preds = rm.predict(X_test)
print(classification_report(y_test,rm_y_preds))
Output:
precision recall f1-score support
Left 0.81 0.77 0.79 61
Stay 0.88 0.91 0.90 118
accuracy 0.86 179
macro avg 0.85 0.84 0.84 179
weighted avg 0.86 0.86 0.86 179
from sklearn.metrics import classification_report
lr_y_preds = lr.predict(X_test)
print(classification_report(y_test,lr_y_preds))
Output:
precision recall f1-score support
Left 0.81 0.84 0.82 61
Stay 0.91 0.90 0.91 118
accuracy 0.88 179
macro avg 0.86 0.87 0.86 179
weighted avg 0.88 0.88 0.88 179
Based on the accuracy
model_compare=pd.DataFrame(model_scores,index=['accuracy'])
model_compare
Output:
model_compare.T.plot(kind='bar') # (T is here for transpose)
Output:
Yes, we can see that Random Forest has 1% better accuracy than Logistic regression but Random Forest is an overfitted model hence we will select Logistic regression.
These “coef’s” tell how much and in what way did each one of them contribute to predicting the target variable
# Logistic regression
feature_dict=dict(zip(dataset.columns,list(lr.coef_[0])))
feature_dict
This is a type of Model-driven Exploratory data analysis.
Output:
{'Unnamed: 0': -0.000369638613255323,
'Experience (YY.MM)': 0.13976826898148667,
'Age in YY.': -0.01962203036690505,
'Stay/Left': 0.024627352503955716,
'New Location': 0.09666512057880872,
'New Promotion': 2.7533361395664873,
'New Job Role Match': -0.31312348489873837,
'Agency': -0.020270885128439962,
'Direct': 0.2047872081147744,
'Employee Referral': 0.38796318779035893,
'Marr.': -0.5042922677790985,
'Single': -0.012278081923656183,
'other status': -0.22031197156191443,
'B1': 0.17649315194124499,
'B2': -0.09367965851449515,
'B3': 0.008891316222766718,
'other group': -0.12993373252552307,
' 1 & < =3': -0.05949484149777713,
'Operation': -0.018200464251483424,
'Sales': -0.05091185616314798,
'Support': 0.12249364800096194,
'Female': -0.1951606022872554,
'Male': -0.055940207626112265}
feature_df=pd.DataFrame(feature_dict,index=[0])
feature_df.T.plot(kind="bar",legend=False,title="Feature Importance")
Output:
As we can see that “New promotion” column has the highest feature importance.
Logistic Regression model because it has the best accuracy as well it is neither overfitted nor under fitted
import pickle
# Save the trained model as a pickle string.
saved_model = pickle.dumps(lr)
# Load the pickled model
lr_from_pickle = pickle.loads(saved_model)
# Use the loaded pickled model to make predictions
lr_from_pickle.predict(X_test)
Output:
array(['Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Left',
'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay',
'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left',
'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay',
'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay',
'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left',
'Left', 'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay',
'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left',
'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left',
'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay',
'Left', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Stay',
'Left', 'Stay', 'Stay', 'Left', 'Left', 'Left', 'Stay', 'Stay',
'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Left', 'Stay',
'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay',
'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay',
'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay',
'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Left',
'Left', 'Left', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left',
'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay',
'Stay', 'Stay', 'Stay'], dtype=object)
# loading dependency
import joblib
# saving our model - model - model , filename - model_lr
joblib.dump(lr , 'model_lr')
# opening the file- model_jlib
m_jlib = joblib.load('model_lr')
# check prediction
m_jlib.predict(X_test) # similar output
Output:
Okay, so that’s a wrap from my side!
In this blog, we’ve walked through the entire process of developing an Employee Attrition prediction model. From data exploration to model selection and evaluation, we’ve covered key steps in creating an effective predictive tool. The Logistic Regression model emerged as the best performer for Employee Attrition prediction model, balancing accuracy and avoiding overfitting. By implementing such models, organizations can anticipate potential employee departures and take timely action. Remember, while Employee Attrition prediction using machine learning is a powerful tool, it should be used ethically and in conjunction with positive employee engagement strategies to foster a healthy work environment.
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.
A. Employee attrition prediction involves forecasting the likelihood of employees leaving an organization based on historical data and predictive analytics. It helps in proactive retention strategies and workforce planning.
A. Forecasted attrition is calculated using historical data on employee turnover rates, analyzing trends, patterns, and factors contributing to attrition. Machine learning models such as logistic regression or random forest can be used for prediction.
A. Employee turnover rate prediction forecasts the percentage or number of employees expected to leave an organization within a specific period. It assists in workforce management and retention strategies.
A. Algorithms for predicting attrition rate include logistic regression, decision trees, random forest, and neural networks. These models analyze factors such as employee demographics, job satisfaction, and performance metrics to predict attrition likelihood accurately.