Employee Attrition Prediction – A Comprehensive Guide

Aman Preet Last Updated : 11 Nov, 2024

14 min read

Employee Attrition prediction using Machine Learning is a crucial task for organizations aiming to retain valuable talent. This blog explores the process of building a predictive model for employee attrition using various machine learning techniques. We’ll dive into data exploration, cleaning, and preprocessing steps essential for creating an effective Employee Attrition prediction model. By analyzing factors such as job satisfaction, promotion history, and work environment, we can identify employees at risk of leaving. This predictive approach enables HR departments to take proactive measures, ultimately improving employee retention and maintaining a stable workforce.

Learning Objective:

Understand the importance of predicting employee attrition prediction model for organizational stability and talent retention.
Explore the process of building a predictive model using machine learning techniques.
Learn essential steps such as data exploration, cleaning, preprocessing, and model development.
Gain insights into factors influencing employee attrition prediction model like job satisfaction and promotion history.

This article was published as a part of the D ata Science Blogathon

Need of Employee Attrition prediction
Importing libraries
- Reading the dataset
Data exploration
Data cleaning
Splitting data – Train test split
Model Development
Conclusion
Frequently Asked Questions

Need of Employee Attrition prediction

Managing workforce: If the supervisors or HR came to know about some employees that they will be planning to leave the company then they could get in touch with those employees which can help them to stay back or they can manage the workforce by hiring the new alternative of those employees.
Smooth pipeline: If all the employees in the current project are working continuously on a project then the pipeline of that project will be smooth but if suppose one efficient asset of the project(employee) suddenly leave that company then the workflow will be not so smooth
Hiring Management: If HR of one particular project came to know about the employee who is willing to leave the company then, they can manage the number of hiring and they can get the valuable asset whenever they need so for the efficient flow of work.

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import datasets
from sklearn.metrics import accuracy_score

Reading the dataset

attrdata = pd.read_csv("Table_1.csv")

Let’s look at our dataset and plan to torture it!

attrdata.head()

Output:

looking at our dataset for employee attrition prediction

Data exploration

Dropping the index column

attrdata.drop(0,inplace=True)
attrdata.isnull().sum()

Output:

table id                 0
name                     0
phone number             0
Location                 0
Emp. Group               0
Function                 0
Gender                   0
Tenure                   0
Tenure Grp.              0
Experience (YY.MM)       4
Marital Status           0
Age in YY.               0
Hiring Source            0
Promoted/Non Promoted    0
Job Role Match           2
Stay/Left                0
dtype: int64

As we can see that there are null values in the Experience and Job role so we have to drop them.

attrdata.dropna(axis=0,inplace=True)

Output:

table id                 0
name                     0
phone number             0
Location                 0
Emp. Group               0
Function                 0
Gender                   0
Tenure                   0
Tenure Grp.              0
Experience (YY.MM)       0
Marital Status           0
Age in YY.               0
Hiring Source            0
Promoted/Non Promoted    0
Job Role Match           0
Stay/Left                0
dtype: int64

The shape of our dataset

attrdata.shape

Output:

(895, 16)

Let’s explore all the categorical values and visualize them

Now, we will use the value_counts function so that we can get the unique values from every categorical type of data.

Gender

gender_dict = attrdata["Gender "].value_counts()
gender_dict

Output:

Male      655
Female    234
other       6
Name: Gender , dtype: int64

Understanding the balancing of the Gender column visually

attrdata['Gender '].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Count of different gender")

Output:

Understanding Gender column for employee attrition prediction

Here, from the chart, it’s visible that the count of males is more than another category of the gender.

Male: 655
Female: 234
Other: 6

Now, let’s figure out that how gender could be the reason for employees to leave the company or to stay in.

#Create a plot for crosstab

pd.crosstab(attrdata['Gender '],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Gender")
plt.xlabel("Stay/Left")
plt.ylabel("No of people who left based on gender")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)

Output:

Here, from the chart it’s visible that it heavily depends on males, also we can see that it’s either male, female or others but more number of them are staying in the company.

Promotion (Promoted/ Non-Promoted)

promoted_dict = attrdata["Promoted/Non Promoted"].value_counts()
promoted_dict

Output:

Promoted        457
Non Promoted    438
Name: Promoted/Non Promoted, dtype: int64

attrdata['Promoted/Non Promoted'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Promoted and Non Promoted")

Output:

Plot for Promoted/ Non-Promoted for employee attrition prediction

Now, from the above chart, we can see that when it comes to Promoted and Non-Promoted employees it’s quiet in balanced number.

Now, let’s figure out that how promotion could be the reason for employees to leave the company or to stay in.

#Create a plot for crosstab

pd.crosstab(attrdata['Promoted/Non Promoted'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Promoted/Non Promoted")
plt.xlabel("Stay/Left")
plt.ylabel("No. of people who left/stay based on promotion")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)

Output:

Plot showing how promotion could be the reason for employees to leave the company or to stay in.

Here, from the chart, it’s visible that the ones who are not promoted are leaving the company more as compared to the ones who are promoted which is also an obvious thing likely to happen.

Function (Operation/ Support/ Sales)

func_dict = attrdata["Function"].value_counts()
func_dict

Output:

Operation    831
Support       52
Sales         12
Name: Function, dtype: int64

attrdata['Function'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Functions in organization")

Output:

Now, we can see that majority of the function performed by employees are Operation itself then support and at the last it’s sales.

Now, let’s figure out that how function could be the reason for employees to leave the company or to stay in.

#Create a plot for crosstab

pd.crosstab(attrdata['Function'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Function")
plt.xlabel("Stay/Left")
plt.ylabel("No. of people who left/stay based on function of organization")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)

Output:

Plot showing No. of people who have left/stayed based on function

Here, in the chart, we can see that the maximum number of employees are in the operation section and a high number of employees in the same section are staying in the company.

Hiring (Direct/ Agency/ Employee referral)

Hiring_dict = attrdata["Hiring Source"].value_counts()
Hiring_dict

Output:

Direct               708
Agency               116
Employee Referral     71
Name: Hiring Source, dtype: int64

Marital Status (Single/ Married/ Seperated/ Div./ NTBD)

Marital_dict = attrdata["Marital Status"].value_counts()
print(Marital_dict)

Output:

Single    533
Marr.     356
Sep.        2
Div.        2
NTBD        2
Name: Marital Status, dtype: int64

Employee group

Emp_dict = attrdata["Emp. Group"].value_counts()
Emp_dict['other group'] = 1
print(Emp_dict)

Output:

B1             537
B2             275
B3              59
B0               8
B4               7
B5               4
B7               2
D2               1
C3               1
B6               1
other group      1
Name: Emp. Group, dtype: int64

Job role match (Yes/ No)

job_dict = attrdata["Job Role Match"].value_counts()
job_dict

Output:

Yes    480
No     415
Name: Job Role Match, dtype: int64

attrdata['Job Role Match'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Job Role Match")

Output:

Job role plot for employee attrition prediction

Now, we can see that majority of the employees have their correct role in Job.

#Create a plot for crosstab

pd.crosstab(attrdata['Job Role Match'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Job Role Match")
plt.xlabel("Stay/Left")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)

Output:

employee stayed/left based on Job role match for employee attrition prediction

Here, in the above chart, we can see that the number of employees who got the correct job role is staying in the company rather than the ones who don’t have their right job role.

Tenure group

tenure_dict = attrdata["Tenure Grp."].value_counts()
print(tenure_dict)

Output:

> 1 & < =3    626
< =1          269
Name: Tenure Grp., dtype: int64

Now let’s visualize some continuous data

# Its Age vs stay/left
sns.jointplot(x='Stay/Left',y='Age in YY.',data=attrdata)

Output:

In the above graph, we can see that the ones who are having more age are staying back in the company rather than the ones who have comparatively less age.

sns.jointplot(x='Stay/Left',y='Experience (YY.MM)',data=attrdata)

Output:

Visualizing Experience vs Stay/Left for employee attrition prediction

Here in the above graph, we can see that the employees who have got more experience will be staying back in the company rather than the ones who have comparatively less experience.

Here, first, we are trying to get the correlation between variables where the dataset is not processed that’s why we are not able to see the results in the manner we want to, but in the latter half of the article, we will see the better correlation plot with the help of processed data.

# Let's make our correlation matrix visual
corr_matrix=attrdata.corr()
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,
               annot=True,
               linewidths=0.5,
               fmt=".2f"
              )

Output:

Correltion Matrix plot for employee attrition prediction

Data cleaning

Encoding the locations column (categorized)

Build a new dictionary (location) to be used to categorize data columns after values are encoded. Here, in location_dict_new we are using integer values instead of the actual region name so that our machine learning model could interpret it.

location_dict = attrdata["Location"].value_counts()
print(location_dict)

location_dict_new = {
    'Chennai':       7,
    'Noida':         6,
    'Bangalore':     5,
    'Hyderabad':     4,
    'Pune':          3,
    'Madurai':       2,
    'Lucknow':       1,
    'other place':   0,
}

print(location_dict_new)

Output:

Chennai       255
Noida         236
Bangalore     210
Hyderabad      62
Pune           55
Madurai        29
Lucknow        20
Nagpur         14
Vijayawada      6
Mumbai          4
Gurgaon         3
Kolkata         1
Name: Location, dtype: int64
{'Chennai': 7, 'Noida': 6, 'Bangalore': 5, 'Hyderabad': 4, 'Pune': 3, 'Madurai': 2, 'Lucknow': 1, 'other place': 0}

Now we will make a function for the location column to make a new column where encoded location values will be there because our machine learning algorithm will only understand int/float values.

def location(x):
    if str(x) in location_dict_new.keys():
        return location_dict_new[str(x)]
    else:
        return location_dict_new['other place']
    
data_l = attrdata["Location"].apply(location)
attrdata['New Location'] = data_l
attrdata.head()

Output:

Adding a new location column — A new location column has been added

get_dummies()

Pandas get_dummies() function is used for manipulating data, this function is used to convert the categorical values to dummy variables and the same thing has been done with:

Function
Hiring Source
New Marital
New Gender
Tenure group

gen = pd.get_dummies(attrdata["Function"])
gen.head()

Output:

hr = pd.get_dummies(attrdata["Hiring Source"])
hr.head()

Output:

Marital Status

Here, in Mar() functionwe are using Maritial dictionary keys to convert those categorical values into acceptable type values for our ML models.

def Mar(x):
    if str(x) in Marital_dict.keys() and Marital_dict[str(x)] > 100:
        return str(x)
    else:
        return 'other status'
    
data_l = attrdata["Marital Status"].apply(Mar)
attrdata['New Marital'] = data_l
attrdata.head()

Output:

Using the get_dummies to function for New Marital we are converting categorical values into dummy variables

Mr = pd.get_dummies(attrdata["New Marital"])
Mr.head()

Output:

Promoted/Not Promoted

Here, with the help of Promoted function, we are converting Promoted and Non promoted values into 1 and 0 respectively for encoding purposes.

def Promoted(x):
    if x == 'Promoted':
        return int(1)
    else:
        return int(0)

data_l = attrdata["Promoted/Non Promoted"].apply(Promoted)
attrdata['New Promotion'] = data_l
attrdata.head()

Output:

Employee Group

Here first, we are creating a dictionary for the employee group and tagging each group to the respective integer values, later we are creating an emp() function where the encoding of the categorical values is done – similar to marital status.

Emp_dict_new = {
    'B1': 4,
    'B2': 3,
    'B3': 2,
    'other group': 1,
}

def emp(x):
    if str(x) in Emp_dict_new.keys():
        return str(x)
    else:
        return 'other group'
 
data_l = attrdata["Emp. Group"].apply(emp)
attrdata['New EMP'] = data_l

emp = pd.get_dummies(attrdata["New EMP"])
attrdata.head()

Output:

Job Role Match

Here, we are using the Job() function where categorical values are Yes and No which needs to be converted into integer values i.e. 1/0 then we are assigning the New Job Role Match.

def Job(x):
    if x == 'Yes':
        return int(1)
    else:
        return int(0)
    
data_l = attrdata["Job Role Match"].apply(Job)
attrdata['New Job Role Match'] = data_l
attrdata.head()

Output:

Gender

Here, we are using the Gen() function using gender_dict (dictionary) which will be encoded first using the dictionary keys, and then the changes will be applied to the dataset based on changes that are done.

def Gen(x):
    if x in gender_dict.keys():
        return str(x)
    else:
        return 'other'

data_l = attrdata["Gender "].apply(Gen)
attrdata['New Gender'] = data_l
attrdata.head()

Output:

get_dummies() function for the same purposes for New gender and Tenure groups.

gend = pd.get_dummies(attrdata["New Gender"])
gend.head()

Output:

tengrp = pd.get_dummies(attrdata["Tenure Grp."])
tengrp.head()

Output:

Now, let’s concatenate the columns which are being cleaned, sorted, and manipulated by us as processed data.

dataset = pd.concat([attrdata, hr, Mr, emp, tengrp, gen, gend], axis = 1)
dataset.head()

Output:

concatenate the columns which are being cleaned, sorted, and manipulated for employee attrition prediction

dataset.columns

Output:

Index(['table id', 'name', 'phone number', 'Location', 'Emp. Group',
       'Function', 'Gender ', 'Tenure', 'Tenure Grp.', 'Experience (YY.MM)',
       'Marital Status', 'Age in YY.', 'Hiring Source',
       'Promoted/Non Promoted', 'Job Role Match', 'Stay/Left', 'New Location',
       'New Marital', 'New Promotion', 'New EMP', 'New Job Role Match',
       'New Gender', 'Agency', 'Direct', 'Employee Referral', 'Marr.',
       'Single', 'other status', 'B1', 'B2', 'B3', 'other group', ' 1 & < =3', 'Operation', 'Sales', 'Support', 'Female', 'Male',
       'other'],
      dtype='object')

Let’s drop the columns which are not important anymore

dataset.drop(["table id", "name", "Marital Status","Promoted/Non Promoted","Function","Emp. Group","Job Role Match","Location"
              ,"Hiring Source","Gender ", 'Tenure', 'New Gender', 'New Marital', 'New EMP'],axis=1,inplace=True)

dataset1 = dataset.drop(['Tenure Grp.', 'phone number'], axis = 1)
dataset1.columns

Output:

Index(['Experience (YY.MM)', 'Age in YY.', 'Stay/Left', 'New Location',
       'New Promotion', 'New Job Role Match', 'Agency', 'Direct',
       'Employee Referral', 'Marr.', 'Single', 'other status', 'B1', 'B2',
       'B3', 'other group', ' 1 & < =3', 'Operation', 'Sales',
       'Support', 'Female', 'Male', 'other'],
      dtype='object')

As I mentioned, this is the correlation plot on the processed dataset.

# Let's make our correlation matrix visual
corr_matrix=dataset1.corr()
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,
               annot=True,
               linewidths=0.5,
               fmt=".2f"
              )

Output:

Let’s see our target column

# Target 
"""
def Target(x):
    if x in "Stay":
        return False
    else:
        return True
    
data_l = dataset1["Stay/Left"].apply(Target)
dataset1['Stay/Left'] = data_l
"""
dataset1['Stay/Left'].head()

Output:

1    Stay
2    Stay
3    Stay
4    Stay
5    Stay
Name: Stay/Left, dtype: object

Saving the cleaned dataset into another CSV file

dataset1.to_csv("processed table.csv")

Now, from the processed data we have to separate the features and target column again.

dataset = pd.read_csv("processed table.csv")
dataset = pd.DataFrame(dataset)
y = dataset["Stay/Left"]
X = dataset.drop("Stay/Left",axis=1)

Splitting data – Train test split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=4)
X_train.head()

Output:

Splitting data - Train test split | employee attrition prediction

Model Development

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

Initializing the models

Logistic Regression : C: Inverse of regularization strength (float), random state: (int), solver: sag,saga,liblinear (Here, we are using liblinear).
Decision trees: Default parameters
Random forest: Default parameters
Gaussian Naive Bayes: Default parameters
K-nearest neighbors: n_neighbors=3 – we can have another number of neighbors too.
Support vector machines: kernel can be linear, polynomial, RBF, sigmoid. Here we are using a linear kernel function.

lr=LogisticRegression(C = 0.1, random_state = 42, solver = 'liblinear')
dt=DecisionTreeClassifier()
rm=RandomForestClassifier()
gnb=GaussianNB()
knn = KNeighborsClassifier(n_neighbors=3)
svm = svm.SVC(kernel='linear')

Now, from one block of code, we will check the accuracy of all the model

for a,b in zip([lr,dt,knn,svm,rm,gnb],["Logistic Regression","Decision Tree","KNN","SVM","Random Forest","Naive Bayes"]):
    a.fit(X_train,y_train)
    prediction=a.predict(X_train)
    y_pred=a.predict(X_test)
    score1=accuracy_score(y_train,prediction)
    score=accuracy_score(y_test,y_pred)
    msg1="[%s] training data accuracy is : %f" % (b,score1)
    msg2="[%s] test data accuracy is : %f" % (b,score)
    print(msg1)
    print(msg2)

Output:

[Logistic Regression] training data accuracy is : 0.891061
[Logistic Regression] test data accuracy is : 0.877095
[Decision Tree] training data accuracy is : 1.000000
[Decision Tree] test data accuracy is : 0.849162
[KNN] training data accuracy is : 0.804469
[KNN] test data accuracy is : 0.586592
[SVM] training data accuracy is : 0.878492
[SVM] test data accuracy is : 0.865922
[Random Forest] training data accuracy is : 1.000000
[Random Forest] test data accuracy is : 0.888268
[Naive Bayes] training data accuracy is : 0.870112
[Naive Bayes] test data accuracy is : 0.826816

Model Scores (accuracy)

model_scores={'Logistic Regression':lr.score(X_test,y_test),
             'KNN classifier':knn.score(X_test,y_test),
             'Support Vector Machine':svm.score(X_test,y_test),
             'Random forest':rm.score(X_test,y_test),
              'Decision tree':dt.score(X_test,y_test),
              'Naive Bayes':gnb.score(X_test,y_test)
             }
model_scores

Output:

{'Logistic Regression': 0.8770949720670391,
 'KNN classifier': 0.5865921787709497,
 'Support Vector Machine': 0.8659217877094972,
 'Random forest': 0.888268156424581,
 'Decision tree': 0.8491620111731844,
 'Naive Bayes': 0.8268156424581006}

Here, we can see that Logistic regression and Random forest have the best accuracy.

Classification Report of Random forest

from sklearn.metrics import classification_report

rm_y_preds = rm.predict(X_test)

print(classification_report(y_test,rm_y_preds))

Output:

              precision    recall  f1-score   support

        Left       0.81      0.77      0.79        61
        Stay       0.88      0.91      0.90       118

    accuracy                           0.86       179
   macro avg       0.85      0.84      0.84       179
weighted avg       0.86      0.86      0.86       179

Classification Report of Logistic Regression

from sklearn.metrics import classification_report

lr_y_preds = lr.predict(X_test)

print(classification_report(y_test,lr_y_preds))

Output:

              precision    recall  f1-score   support

        Left       0.81      0.84      0.82        61
        Stay       0.91      0.90      0.91       118

    accuracy                           0.88       179
   macro avg       0.86      0.87      0.86       179
weighted avg       0.88      0.88      0.88       179

Model Comparison

Based on the accuracy

model_compare=pd.DataFrame(model_scores,index=['accuracy'])
model_compare

Output:

Model Comparison for employee attrition prediction

Visualize the accuracy of each model

model_compare.T.plot(kind='bar') # (T is here for transpose)

Output:

Visualizing the accuracy of each model | employee attrition rate

Yes, we can see that Random Forest has 1% better accuracy than Logistic regression but Random Forest is an overfitted model hence we will select Logistic regression.

Feature importance

These “coef’s” tell how much and in what way did each one of them contribute to predicting the target variable

# Logistic regression

feature_dict=dict(zip(dataset.columns,list(lr.coef_[0])))
feature_dict

This is a type of Model-driven Exploratory data analysis.

Output:

{'Unnamed: 0': -0.000369638613255323,
 'Experience (YY.MM)': 0.13976826898148667,
 'Age in YY.': -0.01962203036690505,
 'Stay/Left': 0.024627352503955716,
 'New Location': 0.09666512057880872,
 'New Promotion': 2.7533361395664873,
 'New Job Role Match': -0.31312348489873837,
 'Agency': -0.020270885128439962,
 'Direct': 0.2047872081147744,
 'Employee Referral': 0.38796318779035893,
 'Marr.': -0.5042922677790985,
 'Single': -0.012278081923656183,
 'other status': -0.22031197156191443,
 'B1': 0.17649315194124499,
 'B2': -0.09367965851449515,
 'B3': 0.008891316222766718,
 'other group': -0.12993373252552307,
 ' 1 & < =3': -0.05949484149777713,
 'Operation': -0.018200464251483424,
 'Sales': -0.05091185616314798,
 'Support': 0.12249364800096194,
 'Female': -0.1951606022872554,
 'Male': -0.055940207626112265}

Visualize feature importance

feature_df=pd.DataFrame(feature_dict,index=[0])
feature_df.T.plot(kind="bar",legend=False,title="Feature Importance")

Output:

As we can see that “New promotion” column has the highest feature importance.

Saving the best model

Approach -1

Logistic Regression model because it has the best accuracy as well it is neither overfitted nor under fitted

import pickle

# Save the trained model as a pickle string.
saved_model = pickle.dumps(lr)

# Load the pickled model
lr_from_pickle = pickle.loads(saved_model)

# Use the loaded pickled model to make predictions
lr_from_pickle.predict(X_test)

Output:

array(['Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Left',
       'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay',
       'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left',
       'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay',
       'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
       'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay',
       'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left',
       'Left', 'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay',
       'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left',
       'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left',
       'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay',
       'Left', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
       'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Stay',
       'Left', 'Stay', 'Stay', 'Left', 'Left', 'Left', 'Stay', 'Stay',
       'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Left', 'Stay',
       'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay',
       'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay',
       'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
       'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay',
       'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Left',
       'Left', 'Left', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left',
       'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay',
       'Stay', 'Stay', 'Stay'], dtype=object)

Approach – 2

# loading dependency
import joblib

# saving our model - model - model , filename - model_lr
joblib.dump(lr , 'model_lr')

# opening the file- model_jlib
m_jlib = joblib.load('model_lr')

# check prediction
m_jlib.predict(X_test) # similar output

Output:

Okay, so that’s a wrap from my side!

Conclusion

In this blog, we’ve walked through the entire process of developing an Employee Attrition prediction model. From data exploration to model selection and evaluation, we’ve covered key steps in creating an effective predictive tool. The Logistic Regression model emerged as the best performer for Employee Attrition prediction model, balancing accuracy and avoiding overfitting. By implementing such models, organizations can anticipate potential employee departures and take timely action. Remember, while Employee Attrition prediction using machine learning is a powerful tool, it should be used ethically and in conjunction with positive employee engagement strategies to foster a healthy work environment.

Key Takeaways:

Strategic Importance: Predicting employee attrition helps organizations proactively manage workforce stability and ensure project continuity.
Technical Skills: Gain proficiency in data manipulation, preprocessing, and applying machine learning algorithms to real-world HR data.
Impact of Factors: Discover how variables such as promotion history and job role alignment affect employee retention.
Model Evaluation: Evaluate model performance metrics like accuracy and feature importance to optimize retention strategies.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

Frequently Asked Questions

Q1. What is an employee attrition prediction?

A. Employee attrition prediction involves forecasting the likelihood of employees leaving an organization based on historical data and predictive analytics. It helps in proactive retention strategies and workforce planning.

Q2. How do you calculate forecasted attrition?

A. Forecasted attrition is calculated using historical data on employee turnover rates, analyzing trends, patterns, and factors contributing to attrition. Machine learning models such as logistic regression or random forest can be used for prediction.

Q3. What is employee turnover rate prediction?

A. Employee turnover rate prediction forecasts the percentage or number of employees expected to leave an organization within a specific period. It assists in workforce management and retention strategies.

Q4. What is the algorithm for attrition rate?

A. Algorithms for predicting attrition rate include logistic regression, decision trees, random forest, and neural networks. These models analyze factors such as employee demographics, job satisfaction, and performance metrics to predict attrition likelihood accurately.

Aman Preet

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Employee Attrition Prediction – A Comprehensive Guide

Learning Objective:

Table of contents

Need of Employee Attrition prediction

Importing libraries

Reading the dataset

Let’s look at our dataset and plan to torture it!

Data exploration

Dropping the index column

Let’s explore all the categorical values and visualize them

Understanding the balancing of the Gender column visually

Promotion (Promoted/ Non-Promoted)

Function (Operation/ Support/ Sales)

Hiring (Direct/ Agency/ Employee referral)

Marital Status (Single/ Married/ Seperated/ Div./ NTBD)

Now let’s visualize some continuous data

Data cleaning

Encoding the locations column (categorized)

get_dummies()

Marital Status

Employee Group

Job Role Match

Gender

Splitting data – Train test split

Model Development

Initializing the models

Model Scores (accuracy)

Classification Report of Random forest

Classification Report of Logistic Regression

Model Comparison

Visualize the accuracy of each model

Feature importance

Visualize feature importance

Saving the best model

Approach -1

Approach – 2

Conclusion

Key Takeaways:

Frequently Asked Questions