In this article, we are going to solve the Loan Approval Prediction Hackathon hosted by Analytics Vidhya. This is a classification problem in which we need to classify whether a loan will be approved or not. Classification refers to a predictive modeling problem where a class label is predicted for a given example of input data. A few examples of classification problems are spam email detection, cancer detection, sentiment analysis, etc.
To understand more about classification problems, you can go through this link.
Understanding the problem statement
About the dataset
Load essential Python Libraries
Load Training/Test datasets
Data Preprocessing
Exploratory Data Analysis (EDA)
Feature Engineering
Build Machine Learning Model
Make predictions on the test dataset
Prepare submission file
Conclusion
Dream Housing Finance company deals in all kinds of home loans. They have a presence across all urban, semi-urban and rural areas. The customer first applies for a home loan and after that, the company validates the customer's eligibility for the loan.
The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling out the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.
To automate this process, they have provided a dataset to identify the customer segments that are eligible for loan amounts so that they can specifically target these customers.
You can find the complete details about the problem statement here and also download the training and test data.
As mentioned above, this is a binary classification problem in which we need to predict our target label, “Loan Status”.
Loan Status can have two values: Y or N.
Y: the loan is approved
N: the loan is not approved
Using the training dataset, we will train our model and then predict the target column, “Loan Status”, on the test dataset.
The train and test datasets therefore have the same columns, except for the target column, “Loan Status”.
Train dataset:
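A minimal sketch of loading the data with pandas; the file names below are assumptions (use the names of the files downloaded from the hackathon page):

```python
import pandas as pd

# Load the training and test datasets (file names are assumptions)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape)  # expected: (614, 13)
print(test.shape)   # expected: (367, 12)
print(train.head())
```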
We have 614 rows and 13 columns in our training dataset.
The test data has 367 rows and 12 columns, because the target column is not included in the test set.
Categorical Columns: Gender (Male/Female), Married (Yes/No), Number of Dependents (possible values: 0, 1, 2, 3+), Education (Graduate/Not Graduate), Self-Employed (Yes/No), Credit History (Yes/No), Property Area (Rural/Semiurban/Urban), and Loan Status (Y/N), i.e. the target variable
Numerical Columns: Loan ID, Applicant Income, Co-applicant Income, Loan Amount, and Loan Amount Term
Concatenating the train and test data for data preprocessing:
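A sketch of the concatenation step. The name of the combined frame (data) and the helper column used to split it back later are my assumptions, not the article's original code:

```python
# Tag each frame so the combined data can be split back later
# ('Source' is an assumed helper column, not from the original code)
train['Source'] = 'train'
test['Source'] = 'test'

# Concatenate for consistent preprocessing
data = pd.concat([train, test], ignore_index=True, sort=False)
print(data.shape)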
Dropping the unwanted column:
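The article does not name the column here, but it is presumably the Loan_ID identifier; a minimal sketch:

```python
# Loan_ID is only an identifier with no predictive value, so drop it
data = data.drop('Loan_ID', axis=1)
```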
Identifying missing values:
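A quick way to count the missing values per column:

```python
# Count missing values per column
print(data.isnull().sum())
```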
Imputing the missing values:
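One common approach for the categorical columns is to fill them with their mode; the article does not show its exact strategy for these columns, so treat this as an assumed sketch:

```python
# Fill missing categorical values with the most frequent value (mode)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    data[col] = data[col].fillna(data[col].mode()[0])
```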
Next, we will use scikit-learn's IterativeImputer to fill the missing values of LoanAmount and Loan_Amount_Term.
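A sketch of the IterativeImputer step (note that the experimental enable_iterative_imputer import is required; the set of numeric columns passed in is my choice):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Impute LoanAmount and Loan_Amount_Term iteratively from the other numeric columns
num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']
imputer = IterativeImputer(random_state=0)
data[num_cols] = imputer.fit_transform(data[num_cols])
```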
Now that we have imputed all the missing values, we move on to mapping the categorical variables to integers.
We map these values because the model cannot accept string values as input.
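A sketch of the mapping. The exact integer codes are assumptions (any consistent encoding works), except for Gender, where the article's observations indicate males map to 0 and females to 1:

```python
# Map string categories to integers (codes are assumptions, except Gender)
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})
data['Married'] = data['Married'].map({'No': 0, 'Yes': 1})
data['Education'] = data['Education'].map({'Not Graduate': 0, 'Graduate': 1})
data['Self_Employed'] = data['Self_Employed'].map({'No': 0, 'Yes': 1})
data['Property_Area'] = data['Property_Area'].map({'Rural': 0, 'Semiurban': 1, 'Urban': 2})
data['Dependents'] = data['Dependents'].replace('3+', '3').astype(int)
```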
Splitting the data back into new_train and new_test so that we can perform EDA.
Mapping ‘N’ to 0 and ‘Y’ to 1 in the target column.
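A sketch of both steps, using the assumed helper 'Source' column introduced during concatenation:

```python
# Split the combined frame back into train and test
new_train = data[data['Source'] == 'train'].drop('Source', axis=1).copy()
new_test = data[data['Source'] == 'test'].drop(['Source', 'Loan_Status'], axis=1).copy()

# Map the target: 'N' -> 0, 'Y' -> 1
new_train['Loan_Status'] = new_train['Loan_Status'].map({'N': 0, 'Y': 1})
```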
Univariate Analysis:
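A sketch of the kind of count plots behind the observations below, using seaborn (the exact plotting code is not shown in the source):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count plots for the categorical columns
cat_cols = ['Loan_Status', 'Gender', 'Married', 'Education',
            'Self_Employed', 'Credit_History', 'Dependents', 'Property_Area']
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.ravel(), cat_cols):
    sns.countplot(x=col, data=new_train, ax=ax)
plt.tight_layout()
plt.show()
```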
Output:
More loans are approved than rejected
The count of male applicants is higher than that of female applicants
The count of married applicants is higher than that of unmarried applicants
The count of graduates is higher than that of non-graduates
The count of self-employed applicants is lower than that of non-self-employed applicants
Most properties are located in semiurban areas
Credit history is present for many applicants
The count of applicants with 0 dependents is the highest
The mean ApplicantIncome of classes 0 and 1 is almost the same (0: not approved, 1: approved)
The mean CoapplicantIncome of class 1 is slightly higher than that of class 0 (0: not approved, 1: approved)
The mean loan amount applied for by males (0) is slightly higher than that of females (1)
Married applicants request a slightly higher loan amount than unmarried applicants
Males have a higher co-applicant income than females in all three property areas
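The comparisons above can be reproduced with simple group-bys; a rough sketch:

```python
# Mean incomes grouped by the target (0: not approved, 1: approved)
print(new_train.groupby('Loan_Status')[['ApplicantIncome', 'CoapplicantIncome']].mean())

# Mean loan amount by gender and by marital status
print(new_train.groupby('Gender')['LoanAmount'].mean())
print(new_train.groupby('Married')['LoanAmount'].mean())

# Co-applicant income by gender within each property area
print(new_train.groupby(['Property_Area', 'Gender'])['CoapplicantIncome'].mean())
```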
Correlation matrix
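A sketch of computing and visualising the correlation matrix as a heatmap:

```python
# Correlation matrix of the processed training data, shown as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(new_train.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
```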
Output:
Total Income:
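Total income is simply the applicant's income plus the co-applicant's income; a sketch applied to both frames:

```python
# Total household income = applicant income + co-applicant income
for df in (new_train, new_test):
    df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
```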
EMI:
Let's assume that the interest rate = 10%, hence the monthly rate r = ((10/12)/100) = 0.00833
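The EMI can be computed with the standard formula EMI = P·r·(1+r)^n / ((1+r)^n − 1), where P is the loan amount, r the monthly interest rate, and n the term in months; a sketch:

```python
# EMI = P * r * (1 + r)^n / ((1 + r)^n - 1)
# P: LoanAmount (reportedly in thousands), r: monthly rate, n: term in months
r = (10 / 12) / 100  # 0.00833, assuming a 10% annual interest rate
for df in (new_train, new_test):
    p = df['LoanAmount']
    n = df['Loan_Amount_Term']
    df['EMI'] = p * r * (1 + r) ** n / ((1 + r) ** n - 1)
```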
Additional Features:
Bin Information:
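The article does not list which columns were binned or the bin edges used, so the following is only an assumed illustration of the idea, shown for total income with quartile-based bins learned from the training data:

```python
# Bin total income into quartiles (the binning scheme is an assumption);
# edges are learned from the training data and reused for the test data
_, income_edges = pd.qcut(new_train['Total_Income'], q=4, retbins=True)
income_edges[0], income_edges[-1] = -float('inf'), float('inf')  # cover out-of-range values
for df in (new_train, new_test):
    df['Total_Income_Bin'] = pd.cut(df['Total_Income'], bins=income_edges, labels=False)
```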
Drop Unwanted Column:
Size after feature engineering:
We have added 8 new features
Building the Machine Learning Model:
Creating X (input variables) and y (target variable) from the new_train data.
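A minimal sketch of this step:

```python
# Features and target from the processed training data
X = new_train.drop('Loan_Status', axis=1)
y = new_train['Loan_Status']
```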
Using train_test_split on the training data for validation.
We use a 70:30 split on the training data.
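A sketch of the split; the random_state and stratification are my choices, not necessarily those used in the article:

```python
from sklearn.model_selection import train_test_split

# 70:30 train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```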
We have used multiple algorithms for training, such as Decision Tree, Random Forest, SVC, Logistic Regression, XGBoost, etc.
Among all the algorithms, logistic regression performs best on the validation data, with an accuracy score of 82.7%.
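A sketch of the kind of comparison loop used; the hyperparameters are assumptions, and XGBoost is omitted here to keep the sketch dependency-free:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train several classifiers and compare validation accuracy
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVC': SVC(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f'{name}: {accuracy_score(y_val, model.predict(X_val)):.4f}')
```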
After getting an accuracy of 82.7%, I tried fine-tuning the model using GridSearchCV to improve the accuracy score.
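A sketch of the grid search; the parameter grid below is a hypothetical example, not the grid actually used in the article:

```python
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid for logistic regression
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'liblinear'],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring='accuracy', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
best_model = grid.best_estimator_
```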
After fine-tuning the logistic regression model with the best parameters found by the grid search, the accuracy score improved from 82.7% to 83.24%.
Predicting on test data
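A sketch of generating predictions on the processed test set with the tuned model (best_model from the grid-search sketch above) and mapping them back to Y/N:

```python
# Predict on the processed test data and map 1/0 back to 'Y'/'N'
test_preds = best_model.predict(new_test)
test_labels = pd.Series(test_preds).map({1: 'Y', 0: 'N'})
```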
Prepare Submission file:
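A sketch of writing the submission file, assuming the standard Loan_ID / Loan_Status format used by this hackathon:

```python
# Write the submission file in the Loan_ID / Loan_Status format
submission = pd.DataFrame({
    'Loan_ID': test['Loan_ID'],
    'Loan_Status': test_labels.values,
})
submission.to_csv('submission.csv', index=False)
```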
After the final submission on the test data, my accuracy score was 78%.
Feature engineering helped me increase my accuracy.
Interestingly, Logistic Regression worked better than all the ensemble models.