Financial institutions suffer heavy losses every year from credit card fraud and defaulted loans. The industry has therefore moved from a reactive approach to an a priori predictive approach, designing fraud and risk detection algorithms that screen applicants before approval.
This case study is focused on giving you an idea of how to apply Exploratory Data Analysis (EDA) in a real business scenario. Apart from using various EDA techniques, you will also develop a basic understanding of risk analytics and how data can be used to minimise the risk of loss when lending to customers.
This article was published as a part of the Data Science Blogathon.
Loan-providing companies find it hard to give loans to people with an inadequate or missing credit history. Some consumers use this to their advantage by becoming defaulters. Suppose you work for a consumer finance company that specialises in lending various types of loans to customers. You need to use Exploratory Data Analysis (EDA) to analyse the patterns present in the data and ensure that loans are not rejected for applicants who are capable of repaying them. When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risks are associated with the bank’s or company’s decision: if an applicant is likely to repay the loan but is not approved, the company loses business; if an applicant is likely to default but is approved, the company incurs a financial loss.
The data contains information about the loan application.
When a client applies for a loan, four types of decisions could be taken by the bank/company:
In this case study, you will use Exploratory Data Analysis(EDA) to understand how consumer and loan attributes impact the default tendency.
This case study aims to identify patterns that indicate whether an applicant will repay their instalments. These patterns can be used to take further actions, such as denying the loan, reducing the loan amount, or lending at a higher interest rate. This ensures that applicants capable of repaying the loan are not rejected. Recognising such applicants using Exploratory Data Analysis (EDA) techniques is the main focus of this case study.
You can get access to data here.
# Filtering Warnings
import warnings
warnings.filterwarnings('ignore')
# Other imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
import plotly.graph_objects as go
pd.set_option('display.max_columns', 300) #Setting column display limit
plt.style.use('ggplot') #Applying style to graphs
Here, we will use two datasets for our analysis:
df1 – the current loan application data
df2 – the previous loan application data
Let’s start by reading those files. We’ll begin with df1.
df1 = pd.read_csv("application_data.csv")
df1.head()
We will begin by inspecting the data before applying Exploratory Data Analysis (EDA) procedures.
df1.shape
df1.info(verbose = True)
Here, if you pass verbose = True, info() prints details for every column. Try it and see the results.
df1.describe()
describe() gives you the summary statistics for the numeric columns, which helps you understand their distributions and inspect potential outliers.
After inspecting the data, let’s check for null values.
(df1.isnull().sum()/len(df1)*100).sort_values(ascending = False).head(50)
As you can see, we are getting lots of null values. Let’s analyse this further.
null_col = df1.isnull().sum().sort_values(ascending = False)
null_col = null_col[null_col.values >(0.35*len(df1))]
#Plotting Bar Graph for null values greater than 35%
plt.figure(figsize=(20,4))
null_col.plot(kind='bar', color="#4CB391")
plt.title('List of Columns & null counts where null values are more than 35%')
plt.xlabel("Null Columns",fontdict={"fontsize":12,"fontweight":5}) #Setting X-label and Y-label
plt.ylabel("Count of null values",fontdict={"fontsize":12,"fontweight":5})
plt.show()
Theoretically, up to about 25-30% missing values in a variable are acceptable; beyond that, we might want to drop it from the analysis. In practice, we sometimes get variables with ~50% missing values that the customer still insists on keeping, and in those cases we have to treat them accordingly. Here, we will remove columns with more than 35% null values after observing those columns.
Let’s check how many columns have null values greater than 35%, and remove them.
len(null_col)
label = list(null_col.index.values) #Making list of column names having null values greater than 35%
df1.drop(labels = label,axis=1,inplace = True) # Dropping those columns
df1.shape
After removing null values, check the percentage of null values for each column again.
null = (df1.isnull().sum()/len(df1)*100).sort_values(ascending = False).head(50)
null.head(30)
Let’s handle these missing values by observing them.
df1.AMT_REQ_CREDIT_BUREAU_YEAR.fillna(df1.AMT_REQ_CREDIT_BUREAU_YEAR.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_YEAR
df1.AMT_REQ_CREDIT_BUREAU_MON.fillna(df1.AMT_REQ_CREDIT_BUREAU_MON.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_MON
df1.AMT_REQ_CREDIT_BUREAU_WEEK.fillna(df1.AMT_REQ_CREDIT_BUREAU_WEEK.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_WEEK
df1.AMT_REQ_CREDIT_BUREAU_DAY.fillna(df1.AMT_REQ_CREDIT_BUREAU_DAY.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_DAY
df1.AMT_REQ_CREDIT_BUREAU_HOUR.fillna(df1.AMT_REQ_CREDIT_BUREAU_HOUR.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_HOUR
df1.AMT_REQ_CREDIT_BUREAU_QRT.fillna(df1.AMT_REQ_CREDIT_BUREAU_QRT.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_QRT
df1.NAME_TYPE_SUITE.fillna(df1.NAME_TYPE_SUITE.mode()[0],inplace = True) #NAME_TYPE_SUITE
df1.OBS_30_CNT_SOCIAL_CIRCLE.fillna( df1.OBS_30_CNT_SOCIAL_CIRCLE.mode()[0],inplace = True) #OBS_30_CNT_SOCIAL_CIRCLE
df1.DEF_30_CNT_SOCIAL_CIRCLE.fillna( df1.DEF_30_CNT_SOCIAL_CIRCLE.mode()[0],inplace = True) #DEF_30_CNT_SOCIAL_CIRCLE
df1.OBS_60_CNT_SOCIAL_CIRCLE.fillna( df1.OBS_60_CNT_SOCIAL_CIRCLE.mode()[0],inplace = True) #OBS_60_CNT_SOCIAL_CIRCLE
df1.DEF_60_CNT_SOCIAL_CIRCLE.fillna( df1.DEF_60_CNT_SOCIAL_CIRCLE.mode()[0],inplace = True) #DEF_60_CNT_SOCIAL_CIRCLE
df1.CNT_FAM_MEMBERS.fillna(df1.CNT_FAM_MEMBERS.mode()[0] , inplace = True) #CNT_FAM_MEMBERS
df1.DAYS_LAST_PHONE_CHANGE.fillna(df1.DAYS_LAST_PHONE_CHANGE.mode()[0],inplace = True) #DAYS_LAST_PHONE_CHANGE
df1.EXT_SOURCE_2.fillna(df1.EXT_SOURCE_2.median() , inplace = True) #EXT_SOURCE_2
df1.EXT_SOURCE_3.fillna(df1.EXT_SOURCE_3.median() , inplace = True) #EXT_SOURCE_3
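For reference, the same imputation can also be written more compactly as a loop. A minimal equivalent sketch of the fillna calls above (no new columns or strategies are introduced):
mode_cols = ['AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_WEEK',
             'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_QRT',
             'NAME_TYPE_SUITE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
             'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'CNT_FAM_MEMBERS', 'DAYS_LAST_PHONE_CHANGE']
median_cols = ['EXT_SOURCE_2', 'EXT_SOURCE_3']
for col in mode_cols:
    df1[col].fillna(df1[col].mode()[0], inplace=True)   # impute with the most frequent value
for col in median_cols:
    df1[col].fillna(df1[col].median(), inplace=True)    # impute continuous scores with the median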
Rechecking null values after imputing null values.
(df1.isnull().sum()/len(df1)*100).sort_values(ascending=False)
We didn’t impute OCCUPATION_TYPE yet because it may contain some useful information, and since it is a categorical column, imputing it with a mean or median doesn’t make any sense.
We’ll impute OCCUPATION_TYPE later, after analysing it.
If you observe the columns carefully, you will find that some columns contain errors. So, let’s make some changes.
df1[["DAYS_BIRTH","DAYS_EMPLOYED","DAYS_REGISTRATION","DAYS_ID_PUBLISH","DAYS_LAST_PHONE_CHANGE"]]
If you look at the data carefully, you will see that, though these are days, they contain negative values, which are not valid. So, let’s make changes accordingly.
As you can see, all the columns start with DAYS. Let’s list columns we want to change for ease of change.
day_cols = [i for i in df1 if i.startswith('DAYS')]
day_cols
df1[day_cols]= abs(df1[day_cols])
print(df1['DAYS_BIRTH'].unique())
print(df1['DAYS_EMPLOYED'].unique())
print(df1['DAYS_REGISTRATION'].unique())
print(df1['DAYS_ID_PUBLISH'].unique())
print(df1['DAYS_LAST_PHONE_CHANGE'].unique())
Some columns contain Y/N type of values, let’s make it 1/0 for ease of understanding.
df1['FLAG_OWN_CAR'] = np.where(df1['FLAG_OWN_CAR']=='Y', 1 , 0)
df1['FLAG_OWN_REALTY'] = np.where(df1['FLAG_OWN_REALTY']=='Y', 1 , 0)
df1[['FLAG_OWN_CAR','FLAG_OWN_REALTY']].head()
Let’s check the distribution of the columns with categorical values. After checking all the columns, we find that some contain ‘XNA’ values, which effectively mean null. Let’s impute them accordingly.
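A quick way to confirm which categorical columns contain the ‘XNA’ placeholder is to scan the object-type columns. A minimal sketch:
for col in df1.select_dtypes(include='object').columns:
    xna_count = (df1[col] == 'XNA').sum()   # count 'XNA' placeholders in each categorical column
    if xna_count > 0:
        print(col, xna_count)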
df1.CODE_GENDER.value_counts()
df1.loc[df1.CODE_GENDER == 'XNA','CODE_GENDER'] = 'F'
df1.CODE_GENDER.value_counts()
Similarly,
df1.ORGANIZATION_TYPE.value_counts().head()
Let’s check whether these values are missing at random or whether there is a pattern behind them, and then impute accordingly. You can read more about this here.
df1[['ORGANIZATION_TYPE','NAME_INCOME_TYPE']].head(30)
Here, we observe that wherever NAME_INCOME_TYPE is ‘Pensioner’, the ORGANIZATION_TYPE column holds the ‘XNA’ placeholder. Let’s see the count of pensioners, and then decide whether to impute the ‘XNA’ values in ORGANIZATION_TYPE with ‘Pensioner’.
df1.NAME_INCOME_TYPE.value_counts() #Check the counts for each in NAME_INCOME_TYPE
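To quantify the pattern rather than eyeballing the head() output, we can cross-tabulate income type against the placeholder and check the income-type share among the OCCUPATION_TYPE nulls. A minimal sketch using only the columns already in df1:
xna_mask = df1['ORGANIZATION_TYPE'] == 'XNA'
print(pd.crosstab(df1['NAME_INCOME_TYPE'], xna_mask))   # which income types carry the 'XNA' placeholder
print(df1.loc[df1['OCCUPATION_TYPE'].isnull(), 'NAME_INCOME_TYPE'].value_counts(normalize=True).head())   # income-type share among OCCUPATION_TYPE nulls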
Comparing the ‘XNA’ values of ORGANIZATION_TYPE and the null values of OCCUPATION_TYPE against NAME_INCOME_TYPE, we find that “Pensioner” is by far the most frequent value, accounting for almost 80% of the missing entries. So the values are Missing At Random: the missingness depends on the observed income type rather than being completely random. We therefore impute both columns with ‘Pensioner’.
df1['ORGANIZATION_TYPE'] = df1['ORGANIZATION_TYPE'].replace('XNA', 'Pensioner')
df1['OCCUPATION_TYPE'].fillna('Pensioner', inplace=True)
Some numeric columns, such as AMT_INCOME_TOTAL and AMT_CREDIT, span a wide continuous range, so let’s bin them into categorical ranges. You can read more about this here.
df1['AMT_INCOME_TYPE'] = pd.qcut(df1.AMT_INCOME_TOTAL, q=[0, 0.2, 0.5, 0.8, 0.95, 1], labels=['VERY_LOW', 'LOW', 'MEDIUM', 'HIGH', 'VERY_HIGH'])
df1['AMT_INCOME_TYPE'].head(11)
df1['AMT_CREDIT_TYPE'] = pd.qcut(df1.AMT_CREDIT, q=[0, 0.2, 0.5, 0.8, 0.95, 1], labels=['VERY_LOW', 'LOW', 'MEDIUM', 'HIGH', 'VERY_HIGH'])
df1['AMT_CREDIT_TYPE'].head(11)
Let’s bin the ‘DAYS_BIRTH’ column by converting it to years and assigning an ‘AGE_GROUP’.
df1['DAYS_BIRTH'] = (df1['DAYS_BIRTH']/365).astype(int) # Converting days to years
df1['DAYS_BIRTH'].unique()
df1['AGE_GROUP']=pd.cut(df1['DAYS_BIRTH'],
bins=[19,25,35,60,100], labels=['Very_Young','Young', 'Middle_Age', 'Senior_Citizen']) #Binning
df1[['DAYS_BIRTH','AGE_GROUP']].head()
Again check the datatypes for all the columns and change them accordingly.
df1.info()
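One plausible way to adjust the dtypes at this point is to treat the remaining text columns as categorical, which is consistent with the category-dtype selection used later for the categorical plots. A minimal, hedged sketch; adjust the conversions to whatever df1.info() actually shows:
for col in df1.select_dtypes(include='object').columns:
    df1[col] = df1[col].astype('category')   # low-cardinality text columns are more natural as 'category'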
After observing all the columns, we found some that don’t add any value to our analysis, so we simply drop them to keep the data clean.
unwanted = ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
            'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
            'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
            'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
            'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
            'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
df1.drop(labels=unwanted,axis=1,inplace=True)
Outlier detection is a very important part of any data science process. Sometimes removing outliers improves the model, while at other times outliers can point you toward a very different perspective on the analysis.
So let’s make a list of all the numeric columns and plot boxplots to understand the outliers in the data.
numerical_col = df1.select_dtypes(include='number').columns
len(numerical_col)
fig , axes = plt.subplots(nrows=7, ncols=5, constrained_layout=True) # Plot Configuration
fig.subplots_adjust(left= 0, bottom=0, right=3, top=12, wspace=0.09, hspace=0.3)
for ax, column in zip(axes.flatten(), numerical_col): # Loop over the numeric columns
    sns.boxplot(x=df1[column], ax=ax) # Plotting one boxplot per column
You will get a 7×5 boxplot matrix. Let’s have a look at a tiny portion.
Observe the plot and try to make your own insights.
Insights
CNT_CHILDREN has outlier values: some clients have more than 5 children.
AMT_INCOME_TOTAL has a very slim box and a large number of outliers.
AMT_CREDIT: the third quartile is larger than the first, meaning most customers’ credit amounts sit in the third quartile; there are a large number of outliers in AMT_CREDIT.
AMT_ANNUITY: the third quartile is slightly larger than the first, and there are a large number of outliers.
AMT_GOODS_PRICE, DAYS_REGISTRATION and DAYS_LAST_PHONE_CHANGE: the third quartile is larger than the first, and all of them have many outliers.
DAYS_EMPLOYED is very slim; most of the outliers sit below 25000, and one outlier sits around 375000.
CNT_FAM_MEMBERS: most of the clients have around four family members; some outliers are present.
DAYS_BIRTH, DAYS_ID_PUBLISH, EXT_SOURCE_2 and EXT_SOURCE_3 don’t have any outliers.
OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE, OBS_60_CNT_SOCIAL_CIRCLE, DEF_60_CNT_SOCIAL_CIRCLE, AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_MON, AMT_REQ_CREDIT_BUREAU_QRT and AMT_REQ_CREDIT_BUREAU_YEAR are very slim, and a large number of values lie outside the quartiles as outliers.
FLAG_OWN_CAR: it has no distinct first and third quartiles and the values lie within the IQR, so we can conclude that most of the clients own a car.
FLAG_OWN_REALTY: it has no distinct first and third quartiles and the values lie within the IQR, so we can conclude that most of the clients own a house/flat.
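If you later decide to treat some of these outliers (this case study only observes them), a common option is IQR-based capping. A minimal, hedged sketch for a single column such as AMT_INCOME_TOTAL:
q1, q3 = df1['AMT_INCOME_TOTAL'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # the usual 1.5*IQR whiskers
capped = df1['AMT_INCOME_TOTAL'].clip(lower, upper)       # winsorise instead of dropping rows
print((df1['AMT_INCOME_TOTAL'] != capped).sum(), "values would be capped")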
Before we start analysing the data further, let’s check the class imbalance. Checking the distribution of the target variable is a very important step in any machine learning or deep learning workflow; here the imbalance ratio works out to roughly 11.39.
Let’s check the distribution of the target variable visually.
count1 = 0
count0 = 0
for i in df1['TARGET'].values:
    if i == 1:
        count1 += 1
    else:
        count0 += 1
count1 = (count1/len(df1['TARGET']))*100
count0 = (count0/len(df1['TARGET']))*100
x = ['Defaulted Population(TARGET=1)','Non-Defaulted Population(TARGET=0)']
y = [count1, count0]
explode = (0.1, 0) # only "explode" the 1st slice
fig1, ax1 = plt.subplots()
ax1.pie(y, explode=explode, labels=x, autopct='%1.1f%%',
shadow=True, startangle=110)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Data imbalance',fontsize=25)
plt.show()
Result:
Insights
The defaulted population is 8.1% and the non-defaulted population is 91.9%.
The imbalance ratio is about 11.3.
We will separately analyse the data based on the target variable for a better understanding.
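The subsets Target0 and Target1 used in the plots below are not shown being created in this excerpt; the straightforward split, assuming they are simply the non-defaulter and defaulter slices of df1:
Target0 = df1[df1['TARGET'] == 0]   # clients with no payment difficulties (non-defaulters)
Target1 = df1[df1['TARGET'] == 1]   # clients with payment difficulties (defaulters)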
plt.figure(figsize=(15,8))
plt.subplot(121)
sns.countplot(x='TARGET',hue='CODE_GENDER',data=Target0, palette = 'Set2')
plt.title("Gender Distribution in Target0")
plt.subplot(122)
sns.countplot(x='TARGET',hue='CODE_GENDER',data=Target1, palette = 'Set2')
plt.title("Gender Distribution in Target1")
plt.show()
Insights
66.6% of female clients are non-defaulters, while 33.4% of male clients are non-defaulters.
57% of female clients are defaulters, while 42% of male clients are defaulters.
Next, let’s look at the age-group distribution for Target 0 and Target 1.
plt.figure(figsize=(15,7))
plt.subplot(121)
sns.countplot(x='TARGET',hue='AGE_GROUP',data=Target0,palette='Set2')
plt.subplot(122)
sns.countplot(x='TARGET',hue='AGE_GROUP',data=Target1,palette='Set2')
plt.show()
Insights
The Middle Age (35-60) group applies for loans more than any other age group, among both defaulters and non-defaulters.
The Middle Age group also faces payment difficulties the most.
The Senior Citizen (60-100) and Very Young (19-25) age groups face payment difficulties less than the other age groups.
Organization Distribution Based on Target 0 and Target 1
plt.figure(figsize=(40,5))
plt.rcParams["axes.labelsize"] = 80
plt.rcParams['axes.titlesize'] = 80 # Plot Configuration
plt.rcParams['axes.titlepad'] = 50
fig, axes = plt.subplots(nrows=1,ncols=2)
sns.set_context('talk')
fig.subplots_adjust(left= 0.09,bottom=1,right=3,top= 12,wspace=0.5,hspace=0.3)
plt.subplot(121)
plt.xscale('log') # For Target0
sns.countplot(data=Target0,y='ORGANIZATION_TYPE',
order=df1['ORGANIZATION_TYPE'].value_counts().index,palette='Set3',hue = 'TARGET')
plt.title("ORGANIZATION_TYPE Vs Target 0")
plt.subplot(122)
plt.xscale('log') # For Target1
sns.countplot(data=Target1,y='ORGANIZATION_TYPE',
order=df1['ORGANIZATION_TYPE'].value_counts().index,palette='Set1',hue = 'TARGET')
plt.title("ORGANIZATION_TYPE Vs Target 1")
plt.show();
Insights
Business Entity Type 3, Self-employed, Other, Medicine, Government and Business Entity Type 2 organizations applied for loans the most compared to the others.
Industry: type 13, Trade: type 4, Trade: type 5 and Industry: type 8 applied for loans the least.
Creating a plot for each feature manually becomes too tedious, so we will define a function and use a loop to iterate through each categorical column.
def categorical_plot(var):
    plt.figure(figsize=(40,20))
    plt.rcParams['axes.labelpad'] = 50
    plt.subplot(1, 2, 1)
    sns.countplot(x=var, data=Target0, palette='Set3', hue='TARGET') # Target 0
    plt.xlabel(var, fontsize=30, fontweight="bold")
    plt.ylabel('Non Payment Difficulties', fontsize=30, fontweight="bold")
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(rotation=360, fontsize=30)
    plt.rcParams['axes.labelpad'] = 50
    plt.subplot(1, 2, 2)
    sns.countplot(x=var, data=Target1, palette='Set1', hue='TARGET') # Target 1
    plt.xlabel(var, fontsize=30, fontweight="bold")
    plt.ylabel('Payment Difficulties', fontsize=30, fontweight="bold")
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(rotation=360, fontsize=30)
    plt.show()
Let’s create a list for all categorical columns.
categorical_col = list(df1.select_dtypes(include= 'category').columns)
# Removing 'ORGANIZATION_TYPE','CODE_GENDER','AGE_GROUP' because we have already taken the insights from the plots above
categorical_col.remove('ORGANIZATION_TYPE')
categorical_col.remove('CODE_GENDER')
categorical_col.remove('AGE_GROUP')
categorical_col #Checking after removing columns
for cat in categorical_col:
    categorical_plot(cat)
Result:
Insights
NAME_CONTRACT_TYPE: Most of the clients applied for a Cash Loan, while a very small proportion applied for a Revolving Loan, among both defaulters and non-defaulters.
NAME_TYPE_SUITE: Most of the clients were unaccompanied while applying for the loan; for a few clients a family member accompanied them, among both defaulters and non-defaulters.
NAME_INCOME_TYPE: Clients who applied for loans earn their income as Working, Commercial associate, or Pensioner, with the Working class being the largest category. Businessmen, students and the unemployed also apply for loans. The Working category carries a high risk of default, while State servants are at minimal risk of default.
Next, let’s define a function for univariate analysis of the numeric columns, plotted separately for each target.
def uni(col):
    sns.set(style="darkgrid")
    plt.figure(figsize=(40,20))
    plt.subplot(1,2,1)
    sns.distplot(Target0[col], color="g") # Target 0
    plt.yscale('linear')
    plt.xlabel(col, fontsize=30, fontweight="bold")
    plt.ylabel('Non Payment Difficulties', fontsize=30, fontweight="bold")
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(rotation=360, fontsize=30)
    plt.subplot(1,2,2)
    sns.distplot(Target1[col], color="r") # Target 1
    plt.yscale('linear')
    plt.xlabel(col, fontsize=30, fontweight="bold")
    plt.ylabel('Payment Difficulties', fontsize=30, fontweight="bold")
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(rotation=360, fontsize=30)
    plt.show()
uni(col='AMT_ANNUITY')
Result:
uni(col='AMT_CREDIT')
uni(col='AMT_GOODS_PRICE')
uni(col='AMT_INCOME_TOTAL')
Insights
Let’s check the required columns for analysis.
df1[["TARGET","AMT_INCOME_TOTAL","NAME_EDUCATION_TYPE","NAME_FAMILY_STATUS"]]
For Target 0
plt.figure(figsize=(35,14))
plt.yscale('log') # As the values are too large, it is convenient to use a log scale for better analysis
plt.xticks(rotation=90)
sns.boxplot(data=Target0, x='NAME_EDUCATION_TYPE', y='AMT_INCOME_TOTAL', # Boxplot w.r.t. Target 0
            hue='NAME_FAMILY_STATUS', orient='v', palette='Set2')
plt.legend(loc='upper right') # Adjusting legend position
plt.title('Income amount vs Education Status', fontsize=35)
plt.xlabel("NAME_EDUCATION_TYPE", fontsize=30, fontweight="bold")
plt.ylabel("AMT_INCOME_TOTAL", fontsize=30, fontweight="bold")
plt.xticks(rotation=90, fontsize=30)
plt.yticks(rotation=360, fontsize=30)
plt.show()
Insights
plt.figure(figsize=(25,10))
plt.yscale('log') # As the values are too large, it is convenient to use a log scale for better analysis
plt.xticks(rotation = 90)
sns.boxplot(data =Target0, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT', #Boxplot w.r.t Data Target 0
hue ='NAME_FAMILY_STATUS',orient='v',palette='Set2')
plt.legend( bbox_to_anchor=(1.5, 1),loc = 'upper right') #Adjusting legend position
plt.title('Credit V/s Education',fontsize=35 )
plt.xlabel("NAME_EDUCATION_TYPE",fontsize= 30, fontweight="bold")
plt.ylabel("AMT_CREDIT",fontsize= 30, fontweight="bold")
plt.xticks(rotation=90, fontsize=30)
plt.yticks(rotation=360, fontsize=30)
plt.show()
Insights
plt.figure(figsize=(30,12))
plt.yscale('log') # As the values are too large, it is convenient to use a log scale for better analysis
plt.xticks(rotation = 90)
sns.boxplot(data =Target1, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL', #Boxplot w.r.t Data Target 1
hue ='NAME_FAMILY_STATUS',orient='v',palette='Set2')
plt.legend( loc = 'upper right') #Adjusting legend position
plt.title('Income amount vs Education Status',fontsize= 35)
plt.xlabel("NAME_EDUCATION_TYPE",fontsize= 30, fontweight="bold")
plt.ylabel("AMT_INCOME_TOTAL",fontsize= 30, fontweight="bold")
plt.xticks(rotation=90, fontsize=30)
plt.yticks(rotation=360, fontsize=30)
plt.show()
Insights
plt.figure(figsize=(30,12))
plt.yscale('log') # As the values are too large, it is convenient to use a log scale for better analysis
plt.xticks(rotation = 90)
sns.boxplot(data =Target1, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT', #Boxplot w.r.t Data Target 1
hue ='NAME_FAMILY_STATUS',orient='v',palette='Set2')
plt.legend( bbox_to_anchor=(1.5, 1),loc = 'upper right') #Adjusting legend position
plt.title('Credit V/s Education',fontsize=50 )
plt.xlabel("NAME_EDUCATION_TYPE",fontsize= 30, fontweight="bold")
plt.ylabel("AMT_CREDIT",fontsize= 30, fontweight="bold")
plt.xticks(rotation=90, fontsize=30)
plt.yticks(rotation=360, fontsize=30)
plt.show()
Insights
Define a function for bivariate plots.
def biplot(df, feature, title):
    temp = df[feature].value_counts()
    # Calculate the percentage of TARGET=1 per category value
    perc = df[[feature, 'TARGET']].groupby([feature], as_index=False).mean()
    perc.sort_values(by='TARGET', ascending=False, inplace=True)
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=("Count of " + title, "% of Loan Payment difficulties within each category"))
    fig.add_trace(go.Bar(x=temp.index, y=temp.values), row=1, col=1)
    fig.add_trace(go.Bar(x=perc[feature].to_list(), y=perc['TARGET'].to_list()), row=1, col=2)
    fig['layout']['xaxis']['title'] = feature
    fig['layout']['xaxis2']['title'] = feature
    fig['layout']['yaxis']['title'] = 'Count'
    fig['layout']['yaxis2']['title'] = '% of Loan Payment Difficulties'
    fig.update_layout(height=600, width=1000, title_text=title, showlegend=False)
    fig.show()
Distribution of Amount Income Range and the category with maximum % Loan-Payment Difficulties
biplot(df1 ,'AMT_INCOME_TYPE','Income range')
Distribution of Type of Income and the category with maximum Loan-Payment Difficulties
biplot(df1 ,'NAME_INCOME_TYPE','Income type')
Distribution of Contract Type and the category with maximum Loan-Payment Difficulties
biplot(df1 ,'NAME_CONTRACT_TYPE','Contract type')
Distribution of Education Type and the category with maximum Loan-Payment Difficulties
biplot(df1 ,'NAME_EDUCATION_TYPE','Education type')
Distribution of Housing Type and the category with maximum Loan-Payment Difficulties
biplot(df1 ,'NAME_HOUSING_TYPE','Housing type')
Distribution of Occupation Type and the category with maximum Loan-Payment Difficulties
biplot(df1 ,'OCCUPATION_TYPE','Occupation type')
You may be wondering why I haven’t attached screenshots. Well, plot the charts and try to give insights based on that. That’s the best way to learn.
Distribution of CODE_GENDER with respect to AMT_INCOME_TYPE and NAME_EDUCATION_TYPE to find the maximum % of Loan-Payment Difficulties using a pivot table
table= pd.pivot_table(df1, values='TARGET', index=['CODE_GENDER','AMT_INCOME_TYPE'],
columns=['NAME_EDUCATION_TYPE'], aggfunc=np.mean)
table
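To spot the gender/income-range/education combinations with the highest default share at a glance, the pivot table can also be rendered as a heatmap. A minimal sketch:
plt.figure(figsize=(12, 8))
sns.heatmap(table, annot=True, fmt='.2f', cmap='RdYlGn_r')   # darker red = higher mean TARGET (more payment difficulties)
plt.title('Mean TARGET by gender, income range and education')
plt.show()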
Insights
Let’s visually check the correlations in the data. For that, make a list of all numeric features.
numerical_col = df1.select_dtypes(include='number').columns
numerical_col
len(numerical_col)
Let’s use pairplot to get the required charts.
pair = Target0[['TARGET','AMT_CREDIT', 'AMT_ANNUITY', 'AMT_INCOME_TOTAL', 'AMT_GOODS_PRICE', 'DAYS_BIRTH','CNT_CHILDREN','DAYS_EMPLOYED']].fillna(0)
sns.pairplot(pair)
plt.show()
pair = Target1[['TARGET','AMT_CREDIT', 'AMT_ANNUITY', 'AMT_INCOME_TOTAL', 'AMT_GOODS_PRICE', 'DAYS_BIRTH','CNT_CHILDREN','DAYS_EMPLOYED']].fillna(0)
sns.pairplot(pair)
plt.show()
Insights
AMT_CREDIT and AMT_GOODS_PRICE are highly correlated for both defaulters and non-defaulters, so as the goods price increases, the loan amount also increases.
AMT_CREDIT and AMT_ANNUITY (EMI) are highly correlated for both defaulters and non-defaulters, so as the credit amount increases, the EMI also increases, which is logical.
AMT_CREDIT, AMT_GOODS_PRICE and AMT_ANNUITY are highly correlated for both defaulters and non-defaulters, so individually they may not be strong differentiators.
Let us now check the correlations with heatmaps.
corr0 = Target0.iloc[0:, 2:]
corr1 = Target1.iloc[0:, 2:]
t0 = corr0.corr(method='spearman') # t0 - rank-based (Spearman) correlations for Target 0
t1 = corr1.corr(method='spearman') # t1 - rank-based (Spearman) correlations for Target 1
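The helper targets_corr is not defined anywhere in this excerpt; a minimal sketch of what such a function plausibly does, namely draw a heatmap of the correlation matrix passed to it:
def targets_corr(data, title):
    plt.figure(figsize=(15, 10))
    sns.heatmap(data, cmap='RdYlGn', annot=False)   # correlation matrix computed above
    plt.title(title, fontsize=20)
    plt.show()
Called analogously, targets_corr(data=t1, title='Correlation for Target 1') would produce the Target 1 heatmap referred to in the observations below.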
targets_corr(data=t0,title='Correlation for Target 0')
Insights
AMT_CREDIT is inversely proportional to DAYS_BIRTH: clients in the lower age groups take higher credit amounts, and vice versa.
AMT_CREDIT is inversely proportional to CNT_CHILDREN: the credit amount is higher for clients with fewer children, and vice versa.
AMT_INCOME_TOTAL is inversely proportional to CNT_CHILDREN: clients with fewer children have higher incomes, and vice versa.
AMT_CREDIT is higher in densely populated areas.
AMT_INCOME_TOTAL is also higher in densely populated areas.
The heatmap for Target 1 shows largely the same patterns as Target 0, with only a few points differing.
This completes the analysis of the current application data. We also have data on the previous applications, which should be analysed in the same way. Take that data, repeat the analysis, and try to draw your own insights.
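As a starting point for that exercise, here is a hedged sketch of loading the previous-application data (df2) and joining it to the current applications. The file name previous_application.csv and the SK_ID_CURR key are assumptions based on the usual layout of this dataset, so adjust them to whatever your copy uses:
df2 = pd.read_csv("previous_application.csv")   # assumed file name for the previous-application data
print(df2.shape)
merged = df1.merge(df2, on='SK_ID_CURR', how='inner', suffixes=('_CURR', '_PREV'))   # assumed common key
print(merged.shape)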
Find the link to the source code here.
In conclusion, the Exploratory Data Analysis (EDA) conducted for the credit card fraud detection case study highlights the critical importance of thorough data examination in identifying fraudulent activities. We gained valuable insights into transaction patterns and anomalies by utilising various visualisation techniques and statistical methods. This analysis not only aids in developing more effective fraud detection algorithms but also enhances the understanding of the underlying factors contributing to fraudulent behavior. Implementing these findings can significantly improve the accuracy of detection systems, ultimately leading to better risk management and protection for both financial institutions and customers.
The media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.