Predicting Employee Attrition using Orange(.ows) Visual Programming Software

Aryan Last Updated : 23 Nov, 2020

6 min read

This article was published as a part of the Data Science Blogathon.

Introduction

If you want to know – How to use MACHINE LEARNING ALGORITHMS for prediction, then –Keep Reading & Keep Learning.

Although there aren’t any industry standards on ascertaining the expense of losing a salaried employee, a few studies, (for instance SHMR) foresee that each time a business replaces a salaried employee, it costs about 5 to 8 months’ compensation all expenses considered. For a supervisor making $50,000 per year, that is $30,000 to $40,000 in enrolling and hiring costs.

Yet, others foresee that the expense is considerably more – than losing a salaried worker can cost as much as 2x their yearly compensation, particularly for a top/middle level manager. Huge costs incorporate the expense of recruiting, onboarding, and long loss of efficiency.

By reading this article, one can understand two different perspectives: –

Individual perspective – How an organization calculates attrition rate using machine learning algorithms, and it can also estimate the chances of there being an opportunity to apply in an organization or not.

Company Perspective: An organization also gets to know if they should start recruiting or not. If an attrition rate is high, then, an organization also gets to know the time for making amendments in the current employee retention policies.

Analyzing the factors and requisites that can influence the IBM employees attrition rate and finally classifying — on an average what percentage of employees position has not been filled yet.

What is Orange?

Orange is a visual programming software package used for this domain. It has uses widely ranging from machine learning, data mining, and data analysis, etc. Orange tools (called widgets) are within the realm of simple data visualization & pre-processing empirical evaluation of learning algorithms and predictive modeling. Visual programming is implemented via a combination in which workflows are designed by linking user-designed widgets.

At the same time, proficient users can use Orange as a Python library to manipulate data and alter widget.

Orange (click here to download).

Attrition Rate

An attrition rate is utilized to quantify employees lost over a period who aren’t yet replaced. The rate appears to be a percentage contrasted with the total workforce. HR(s) often use an attrition rate to determine the number of vacant positions.

Firstly, the .CSV File was uploaded (containing IBM employees data), then all the target column(s) were selected, i.e., Attrition & then the RANK widget was chosen from the Data Column, as ranking helps in giving a gist of what is required the most in a data. Then the first 20 Data Heads have been selected according to the different ranks.

Viewing the data according to the RANK and then SELECTING the Data from the FILE.

The Data has been checked using Data Table — to know whether the Data has any missing values or not, according to the Data Table — the Data has no missing values.

It shows that Data has NO MISSING VALUES.

The prediction was made by orange using different models and then evaluated by using Test & Score.

Test and Score NN — Using three different models to get the best possible result (1st image is of Random Forest Model) (2nd image is of Support Vector Machine (SVM) Model) (3rd image is of Artificial Neural Network (ANN) Model).

In the previous article, it was mentioned that the RANDOM FOREST was the BEST MODEL. So, this time three different models have been used — RANDOM FOREST, SUPPORT VECTOR MACHINE (SVM) & ARTIFICIAL NEURAL NETWORK (ANN), and then the comparison was made to understand which model will be more efficient and effective for a better prediction.

– Random Forest model was used in the prediction because: –

Random Forest is a tree-based learning algorithm with the power to form accurate decisions as it has many decision trees together. As its name says — it’s a forest of trees. Hence, Random Forest takes more training time than a single decision tree. Each branch and leaf within the decision tree works on the random features to predict the output. Then this algorithm combines all the predictions of individual decision trees to generate the final prediction, and it can also deal with the missing values.

– Support Vector Machine (SVM) model was used in the prediction because: –

SVM has a regularization feature. So, it has good generalization capabilities, which prevent it from over-fitting, and it can also be used to solve both categorical and numerical problems. A small change to the Data does not significantly affect the SVM. So, the SVM model is stable.

– Artificial Neural Network (ANN) model was used in the prediction because: –

ANN is like our brain; millions and billions of cells – called neurons, which processes information in the form of electric signals. Similarly, in ANN, the network structure has an input layer, a hidden layer, and the output layer. It is also called Multi-Layer Perceptron as it has multiple layers. The hidden layer is known as a “distillation layer” that distills some critical patterns from the data/information and passes it onto the next layer. It then makes the network quicker and more productive by distinguishing the data from the data sources, leaving out the excess data.

It captures a non-linear relationship between the inputs.
It helps in converting the information/data into more useful insight.

After Test & Score (it helps analyze and translate qualitative data(characters) into quantitative data(numbers)), Confusion Metrics was used to see all the True Positives, False Negative values, misclassified data & correct data. And lastly, the Distribution visualization was used to understand the data.

Conclusion

The Final Prediction turned out to be different — so, now there is a need to take the average (Average of Total population) (Average of YES CATEGORY) (Average of NO CATEGORY).

The final results were turned out to be different as all the three models have some different values. So, there is a need to take an average of all the results.

Orange Model Results

An average has taken to know the Attrition rate.

By doing this, it can be said that in ATTRITION, only 7.45% of the population (Total Population is 1236) comes under the YES category, and the rest, i.e., 92.55% of the population, comes under the NO category.

And, it can also be said that Artificial Neural Network (ANN) is a better model than Support Vector Machine (SVM) & Random Forest because it has a higher AUC (Area Under Curve) as:

AUC is scale-invariant — i.e., — it measures how well the predictions are ranked.
AUC is also a classification threshold invariant — i.e., — it measures how well the quality of the model’s prediction is.

In this case, it can be seen that — Artificial Neural Network (ANN) is a better model as it has a HIGHER AUC (Area Under Curve).

This is the full MODEL VIEW.

This dataset (.CSV file) is taken from Kaggle.

Filename: IBM HR Analytics Employee Attrition Dataset

Contacts

In case you have any questions or any suggestions on what my next article should be about, please leave a comment below or E-mail me at [email protected]

If you want to keep updated with my latest articles and projects, follow me on Medium.

Connect with me via:

LinkedIn

Instagram

Aryan

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Predicting Employee Attrition using Orange(.ows) Visual Programming Software

Introduction

What is Orange?

Attrition Rate

Conclusion

Contacts

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie