Exploratory Data Analysis (EDA): Credit Card Fraud Detection Case Study

KAVITA Last Updated : 28 Oct, 2024

17 min read

There are many financial problems yearly due to credit card fraud transactions. The financial industry has switched from a designation approach to an a priori predictive approach with the design of a fraud detection algorithm for applying instigators.

This case study is focused on giving you an idea of applying Exploratory Data Analysis (EDA) in a real business scenario. In this case study, apart from using the various Exploratory Data Analysis (EDA) techniques, you will also develop a basic understanding of risk analytics and how data can be utilized to minimise the risk of losing loan-providing to customers.

This article was published as a part of the Data Science Blogathon.

Business Problem Understanding

Loan-providing companies find it hard to give loans to people due to their inadequate or missing credit. Some consumers use this to their advantage by becoming defaulters. Let us consider your work for a consumer finance company that specialises in lending various types of loans to customers. It would be best if you used Exploratory Data Analysis (EDA) to analyse the patterns present in the data which will make sure that the loans are not rejected for the applicants capable of repaythegright to company receives a loan application, the company has to rights for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s or company’s decision:

If the aspirant is likely to repay the loan, then the loan tends to be a business loss to the company.
If the applicant is aspirant or committed and not likely to repay the loan, i.e., he/she is likely to default or commit fraud, then approving the loan may lead to a financial loss for the company.

The data contains information about the loan application.

When a client applies for a loan, four types of decisions could be taken by the bank/company:

Approved: The loan application meets all requirements and is granted to the client.
Cancelled: The loan application process is stopped, often at the client’s request, before final approval.
Refused: The loan application is denied due to unmet criteria or risk factors.
Unused offer: The loan was approved, but the client did not proceed to utilize the funds.

In this case study, you will use Exploratory Data Analysis(EDA) to understand how consumer and loan attributes impact the default tendency.

Business Goal

This case study aims to identify patterns that indicate whether an applicant will repay their instalments. These patterns may be used to take further actions, such as denying the loan, reducing the amount of the loan, lending at a higher interest rate, etc. This will ensure that the applicants capable of repaying the loan are not rejected. Recognition of such aspirants using Exploratory Data Analysis (EDA) techniques is the main focus of this case study.

Data

You can get access to data here.

Importing Necessary Packages

# Filtering Warnings
import warnings
warnings.filterwarnings('ignore')

#Other's
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
import plotly.graph_objects as go
pd.set_option('display.max_columns', 300) #Setting column display limit
plt.style.use('ggplot') #Applying style to graphs

Here, we will use two datasets for our analysis as follows,

application_data.csv as df1
previous_application.csv as df2

Let’s start with reading those files, we’ll begin with df1,

df1 = pd.read_csv("application_data.csv")
df1.head()

Data Inspection

We will begin with you giving the data for Inspiratory Data Analysis (EDA) procedures.

df1.shape

df1.info(verbose = True)

Here, if you give verbose = True, it will give all the information about all the columns. Try it and see the results.

df1.describe()

By describing (), you will get all the statistical information for the numeric columns and know their distribution inspection outliers.

Handling Null Values

After all the data inspecting, let’s check for the null values,

(df1.isnull().sum()/len(df1)*100).sort_values(ascending = False).head(50)

As you can see we are getting lots of null values. Let’s analyse this further.

null_col = df1.isnull().sum().sort_values(ascending = False)
null_col = null_col[null_col.values >(0.35*len(df1))]


#Plotting Bar Graph for null values greater than 35%

plt.figure(figsize=(20,4))
null_col.plot(kind='bar', color="#4CB391")                           
plt.title('List of Columns & null counts where null values are more than 35%') 

plt.xlabel("Null Columns",fontdict={"fontsize":12,"fontweight":5})                  #Setting X-label and Y-label
plt.ylabel("Count of null values",fontdict={"fontsize":12,"fontweight":5})
plt.show()

Theoretically, 25 to 30% of the variable’s missing values are allowed, beyond which we might want to drop it from analysis. Practically, we get variables with ~50% missing values, but still, the customer insists on having it for analysis. In those cases, we have to treat them accordingly. Here, we will remove columns with null values of more than 35% after observing those columns.

Let’s check how many columns are with null values greater than 35%. And remove those.

len(null_col)

label = list(null_col.index.values) #Making list of column names having null values greater than 35%
df1.drop(labels = label,axis=1,inplace = True) #Droping those columns
df1.shape

After removing null values, check the percentage of null values for each column again.

null = (df1.isnull().sum()/len(df1)*100).sort_values(ascending = False).head(50)
null.head(30)

Let’s handle these missing values by observing them.

df1.AMT_REQ_CREDIT_BUREAU_YEAR.fillna(df1.AMT_REQ_CREDIT_BUREAU_YEAR.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_YEAR

df1.AMT_REQ_CREDIT_BUREAU_MON.fillna(df1.AMT_REQ_CREDIT_BUREAU_MON.mode()[0],inplace = True)   #AMT_REQ_CREDIT_BUREAU_MON  

df1.AMT_REQ_CREDIT_BUREAU_WEEK.fillna(df1.AMT_REQ_CREDIT_BUREAU_WEEK.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_WEEK

df1.AMT_REQ_CREDIT_BUREAU_DAY.fillna(df1.AMT_REQ_CREDIT_BUREAU_DAY.mode()[0],inplace = True)   #AMT_REQ_CREDIT_BUREAU_DAY

df1.AMT_REQ_CREDIT_BUREAU_HOUR.fillna(df1.AMT_REQ_CREDIT_BUREAU_HOUR.mode()[0],inplace = True) #AMT_REQ_CREDIT_BUREAU_HOUR

df1.AMT_REQ_CREDIT_BUREAU_QRT.fillna(df1.AMT_REQ_CREDIT_BUREAU_QRT.mode()[0],inplace = True)   #AMT_REQ_CREDIT_BUREAU_QRT

df1.NAME_TYPE_SUITE.fillna(df1.NAME_TYPE_SUITE.mode()[0],inplace = True) #NAME_TYPE_SUITE

df1.OBS_30_CNT_SOCIAL_CIRCLE.fillna( df1.OBS_30_CNT_SOCIAL_CIRCLE.mode()[0],inplace = True) #OBS_30_CNT_SOCIAL_CIRCLE

df1.DEF_30_CNT_SOCIAL_CIRCLE.fillna( df1.DEF_30_CNT_SOCIAL_CIRCLE.mode()[0],inplace = True) #DEF_30_CNT_SOCIAL_CIRCLE

df1.OBS_60_CNT_SOCIAL_CIRCLE.fillna( df1.OBS_60_CNT_SOCIAL_CIRCLE.mode()[0],inplace = True) #OBS_60_CNT_SOCIAL_CIRCLE

df1.DEF_60_CNT_SOCIAL_CIRCLE.fillna( df1.DEF_60_CNT_SOCIAL_CIRCLE.mode()[0],inplace = True) #DEF_60_CNT_SOCIAL_CIRCLE

df1.CNT_FAM_MEMBERS.fillna(df1.CNT_FAM_MEMBERS.mode() , inplace = True) #CNT_FAM_MEMBERS

df1.DAYS_LAST_PHONE_CHANGE.fillna(df1.DAYS_LAST_PHONE_CHANGE.mode()[0],inplace = True) #DAYS_LAST_PHONE_CHANGE

df1.EXT_SOURCE_2.fillna(df1.EXT_SOURCE_2.median() , inplace = True) #EXT_SOURCE_2

df1.EXT_SOURCE_3.fillna(df1.EXT_SOURCE_3.median() , inplace = True) #EXT_SOURCE_3

Rechecking null values after imputing null values.

(df1.isnull().sum()/len(df1)*100).sort_values(ascending=False)

We didn’t impute OCCUPATION_TYPE because it may contain some hul information, so impOCCUATION_TYPEean or median doesn’t make any sense.

We’ll impute ‘OCCUOATION_TYPE” later by analyzing it.

If you observe the columns carefully, you will find that some columns contain errors. So, let’s make some changes.

df1[["DAYS_BIRTH","DAYS_EMPLOYED","DAYS_REGISTRATION","DAYS_ID_PUBLISH","DAYS_LAST_PHONE_CHANGE"]]

If you look at the data carefully, you will see that, though these are days, they contain negative values, which are not valid. So, let’s make changes accordingly.

As you can see, all the columns start with DAYS. Let’s list columns we want to change for ease of change.

day_cols = [i for i in df1 if i.startswith('DAYS')]
day_cols

df1[day_cols]= abs(df1[day_cols])
print(df1['DAYS_BIRTH'].unique()) print(df1['DAYS_EMPLOYED'].unique()) print(df1['DAYS_REGISTRATION'].unique()) print(df1['DAYS_ID_PUBLISH'].unique()) print(df1['DAYS_LAST_PHONE_CHANGE'].unique())

Some columns contain Y/N type of values, let’s make it 1/0 for ease of understanding.

df1['FLAG_OWN_CAR'] = np.where(df1['FLAG_OWN_CAR']=='Y', 1 , 0)  df1['FLAG_OWN_REALTY'] = np.where(df1['FLAG_OWN_REALTY']=='Y', 1 , 0)

df1[['FLAG_OWN_CAR','FLAG_OWN_REALTY']].head()

Let’s check the distribution for columns having categorical values. After checking all the columns, we find that some contain ‘XNA’ values, meaning null. Let’s impute it accordingly.

df1.CODE_GENDER.value_counts()
df1.loc[df1.CODE_GENDER == 'XNA','CODE_GENDER'] = 'F' 
df1.CODE_GENDER.value_counts()

Similarly,

df1.ORGANIZATION_TYPE.value_counts().head()

Let’s impute these values and check whether they are missing at random or if there is a pattern between missing values. You can read more about this here.

df1[['ORGANIZATION_TYPE','NAME_INCOME_TYPE']].head(30)

Here, we observe that wherever NAME_INCOME_TYPE There is a pensioner, and we only have null values in ORGANIZATON_TYPE in the column. Let’s see the count of pensioners, and then we’ll decide whether to impute null values ORGANIZATION_TYPE with Pensioner.

df1.NAME_INCOME_TYPE.value_counts()    #Check the counts for each in NAME_INCOME_TYPE

So, from these data, we can conclude that Pensioner value is approximately equal to null values in ORGANIZATION_TYPE column. So, the value is Missing At Random
Similarly, imputing null values of OCCUPATION_TYPE with Pensioner as most of the null values for OCCUPATION_TYPE compared to Income type variable values, we found that “Pensioner” is the most frequent value, almost 80% of the null values of OCCUPATION_TYPE

df1['ORGANIZATION_TYPE'] = df1['ORGANIZATION_TYPE'].replace('XNA', 'Pensioner')
df1['OCCUPATION_TYPE'].fillna('Pensioner' , inplace = True)

Some columns have nominal categorical values, so let’s impute them accordingly. You can the read more about this here.

df1['AMT_INCOME_TYPE'] = pd.qcut(df1.AMT_INCOME_TOTAL, q=[0, 0.2, 0.5, 0.8, 0.95, 1], labels=['VERY_LOW', 'LOW', "MEDIUM", 'HIGH', 'VERY_HIGH']) df1['AMT_INCOME_TYPE'].head(11)

df1['AMT_CREDIT_TYPE'] = pd.qcut(df1.AMT_CREDIT, q=[0, 0.2, 0.5, 0.8, 0.95, 1], labels=['VERY_LOW', 'LOW', "MEDIUM", 'HIGH', 'VERY_HIGH'])
df1['AMT_CREDIT_TYPE'].head(11)

Let’s Bin ‘DAYS_BIRTH’ column by converting it to years based on various “AGE_GROUP.”

df1['DAYS_BIRTH']= (df1['DAYS_BIRTH']/365).astype(int)    # Converting 
df1['DAYS_BIRTH'].unique()

df1['AGE_GROUP']=pd.cut(df1['DAYS_BIRTH'],                 
                         bins=[19,25,35,60,100], labels=['Very_Young','Young', 'Middle_Age', 'Senior_Citizen'])   #Binning

df1[['DAYS_BIRTH','AGE_GROUP']].head()

Again check the datatypes for all the columns and change them accordingly.

df1.info()

After observing all the columns, we found some that didn’t add any value to our analysis, so we simply dropped them so that the data looked clear.

unwanted=['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL','REGION_RATING_CLIENT','REGION_RATING_CLIENT_W_CITY','FLAG_EMAIL', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3','FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']

df1.drop(labels=unwanted,axis=1,inplace=True)

Outlier Analysis

Outlier detection is very important for any data science process. Sometimes, removing outliers tends to improve our model, while other times, outliers may give you a very different approach to your analysis.

So let’s make a list of all the numeric columns and plot boxplots to understand the outliers in the data.

numerical_col = df1.select_dtypes(include='number').columns
len(numerical_col)

fig , axes = plt.subplots(nrows=7, ncols=5, constrained_layout=True)                 # Plot Configuration 
fig.subplots_adjust(left= 0, bottom=0, right=3, top=12, wspace=0.09, hspace=0.3)


for ax, column in zip(axes.flatten(),numerical_col):        #Using For loop 

    sns.boxplot(df1[column],ax=ax)   #Ploting

You will get a 7×5 boxplot matrix. Let’s have a look at a tiny portion.

Observe the plot and try to make your own insights.

Insights

CNT_CHILDREN have outlier values of having children more than 5.
IQR for AMT_INCOME_TOTAL is very slim, and it has a large number of outliers.
Third quartile of AMT_CREDIT is larger than the first quartile, which means that most of the credit amount for the loan of customers is present in the third quartile. There are a large number of outliers present ,in AMT_CREDIT.
The third quartile AMT_ANNUITY is slightly larger than the First quartile and there is a large number of outliers.
Third quartile of AMT_GOODS_PRICE,DAYS_REGISTRATION AND DAYS_LAST_PHONE_CHANGE is larger than the first quartile, and all have many outliers.
IQR for DAYS EMPLOYED is very slim. Most of the outliers are present below 25000. And an outlier is present 375000.
From boxplot of CNT_FAM_MEMBERS , we can say that most of the clients have four family members. There are some outliers present.
DAYS_BIRTH ,DAYS_ID_PUBLISH and EXT_SOURCE_2,EXT_SOURCE_3 Don’t have any outliers.
Boxplot for DAYS_EMPLOYED ,OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE, DEF_60_CNT_SOCIAL_CIRCLE,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON, AMT_REQ_CREDIT_BUREAU_QRT and AMT_REQ_CREDIT_BUREAU_YEARare very slim and quantiles,arge number lieutliers.
FLAG_OWN_CAR : It doesn’t have First and Third quantile and values lies within IQR, So we can conclude that most of the clients own a car
FLAG_OWN_REALTY : It doesn’t have First and Third quantiles, and values lie within IQR, So we can conclude that most of the clients own a House/Flat

Before we start analysing your data, let’s check the imbalance in the data. It’s a very important to step in any machine learning or deep learning to visually check the target variable’s distribution is "11.39"

Let’s check the distribution of the target variable visually.

count1 = 0 
count0 = 0
for i in df1['TARGET'].values:
    if i == 1:
        count1 += 1
    else:
        count0 += 1
        
count1 = (count1/len(df1['TARGET']))*100
count0 = (count0/len(df1['TARGET']))*100

x = ['Defaulted Population(TARGET=1)','Non-Defauted Population(TARGET=0)']
y = [count1, count0]

explode = (0.1, 0)  # only "explode" the 1st slice

fig1, ax1 = plt.subplots()
ax1.pie(y, explode=explode, labels=x, autopct='%1.1f%%',
        shadow=True, startangle=110)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Data imbalance',fontsize=25)
plt.show()

Result:

Data Imbalance | Exploratory Data Analysis

Insights

df1 dataframe that is application data is highly imbalanced. Defaulted population is 8.1 % and non- defaulted population is 91.9% .Ratio is 11.3

We will separately analyse the data based on the target variable for a better understanding.

plt.figure(figsize=(15,8))
plt.subplot(121)
sns.countplot(x='TARGET',hue='CODE_GENDER',data=Target0, palette = 'Set2')
plt.title("Gender Distribution in Target0")
plt.subplot(122)
sns.countplot(x='TARGET',hue='CODE_GENDER',data=Target1, palette = 'Set2')
plt.title("Gender Distribution in Target1")

plt.show()

Insights

It seems like Female clients applied more than male clients for loan
66.6% Female clients are non-defaulters while 33.4% male clients are non-defaulters.
57% Female clients are defaulters, while 42% male can apply are defaulters.

plt.figure(figsize=(15,7)) 
plt.subplot(121)
sns.countplot(x='TARGET',hue='AGE_GROUP',data=Target0,palette='Set2')
plt.subplot(122)
sns.countplot(x='TARGET',hue='AGE_GROUP',data=Target1,palette='Set2')
plt.show()

Age Distribution | Exploratory Data Analysis

Insights

Middle Age(35-60) The group seems to apply to any other age groups for loans in the case of defaulters and non-defaulters.
Also, Middle Age group facing paying difficulties the most.
While Senior Citizens(60-100) and Very young(19-25) age group faces paying difficulties less than other age groups.

The organization’s Distribution Based on Target 0 and Target 1

plt.figure(figsize=(40,5))
plt.rcParams["axes.labelsize"] = 80
plt.rcParams['axes.titlesize'] = 80                                                           # Plot Configuration 
plt.rcParams['axes.titlepad'] = 50
fig, axes = plt.subplots(nrows=1,ncols=2)                  
sns.set_context('talk')
fig.subplots_adjust(left= 0.09,bottom=1,right=3,top= 12,wspace=0.5,hspace=0.3) 


plt.subplot(121)
plt.xscale('log')                                                                             # For Target0      
sns.countplot(data=Target0,y='ORGANIZATION_TYPE',
              order=df1['ORGANIZATION_TYPE'].value_counts().index,palette='Set3',hue = 'TARGET')
plt.title("ORGANIZATION_TYPE Vs Target 0")



plt.subplot(122)
plt.xscale('log')                                                                              # For Target1
sns.countplot(data=Target1,y='ORGANIZATION_TYPE',
              order=df1['ORGANIZATION_TYPE'].value_counts().index,palette='Set1',hue = 'TARGET')
plt.title("ORGANIZATION_TYPE Vs Target 1")



plt.show();

Insights

(Defaulters as well as Non-defaulters) Clients with ORGANIZATION_TYPE Business Entity Type 3, Self-employed, Other ,Medicine, Government,Business Entity Type 2 applied the most for the loan as compared to others
(Defaulters as well as Non-defaulters) Clients having ORGANIZATION_TYPE Industry: type 13, Trade: type 4, Trade: type 5, Industry: type 8 applied lower for the loan as compared to others.

Creating a plot for each feature manually becomes too tedious. So, we will define a function and use a loop to iterate through each categorical column.

def categorical_plot(var):
    plt.figure(figsize=(40,20))
    
    plt.rcParams['axes.labelpad'] = 50
    plt.subplot(1, 2, 1)
    sns.countplot(var, data=Target0, palette = 'Set3', hue='TARGET') 
    plt.xlabel(var, fontsize= 30, fontweight="bold")                                                         #Target 0
    plt.ylabel('Non Payment Difficulties', fontsize= 30, fontweight="bold")
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(rotation=360, fontsize=30)
    
    
    plt.rcParams['axes.labelpad'] = 50
    plt.subplot(1, 2, 2)
    sns.countplot(var, data=Target1, palette = 'Set1', hue='TARGET')                                           # Target 1
    plt.xlabel(var, fontsize= 30, fontweight="bold")
    plt.ylabel('Payments Difficulties', fontsize= 30, fontweight="bold")
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(rotation=360, fontsize=30)
    plt.show()

Let’s create a list for all categorical columns.

categorical_col = list(df1.select_dtypes(include= 'category').columns) 


# Removing 'ORGANIZATION_TYPE','CODE_GENDER','AGE_GROUP' because we have already taken up the isights from  above plots

categorical_col.remove('ORGANIZATION_TYPE') 
categorical_col.remove('CODE_GENDER')
categorical_col.remove('AGE_GROUP')

categorical_col #Checking after removing columns

for cat in categorical_col:
    categorical_plot(cat)

Result:

Insights

NAME_CONTRACT_TYPE : Most of the clients have applied for Cash Loan while very small proportion have applied for Revolving loan for both Defaulters as well as Non-defaulters.
NAME_TYPE_SUIT : Most of the clients were accompanied while applying for the loan. For a few clients, a family member was accompanying both Defaulters and Non-Defaulters.
However, who was accompanying the client while applying for the loan doesn’t impact the default. Also, both populations have the same proportions.
NAME_INCOME_TYPE: Clients who applied for loans were earning income from Work, Commercial associates, and Pensioners. The highest category was the Working class. Businessmen, students, and the unemployed were also likely to apply for loans. The working categories have a high risk of default. The State Servant is at minimal risk of default.
NAME_EDUCATION_TYPE: Clients with secondary or special education are likelier to apply for the loan. Clients with secondary or secondary special education have a higher risk of defaulting. Other education types have minimal risk.
NAME_FAMILY_STATUS : Married Clients are the most frequent applicants for the loan, regardless of default status. Interestingly, single clients are less risky when it comes to default. Widows, on the other hand, show minimal risk.
NAME_HOUSING_TYPE: The majority of clients, both Defaulters and Non-Defaulters, either own a house or live in an apartment. This suggests a stable housing situation, which could be a positive factor for loan approval.
OCCUPATION_TYPE: Pensioners have applied the most for loans in Defaulters and Non-Defaulters. Pensioners are the highest risk of default, followed by laborers.
WEEKDAY_APPR_PROCESS_START: There is no considerable difference in days for both Defaulters and Non-defaulters.
AMT_INCOME_TYPE: Clients with a Medium salary range are more likely to apply for the loan, both for Defaulters and Non-defaulters. Clients with low and medium incomes are at high risk of default.
AMT_CREDIT_TYPE: Most clients applied for a Medium Credit Amount of the loan for both Defaulters and Non-defaulters. Clients applying for high and low credit are at high risk of default.

Univariate Analysis of Numerical Columns W.R.T Target Variable

def uni(col):
    sns.set(style="darkgrid")
    plt.figure(figsize=(40,20))
    
   
    plt.subplot(1,2,1)                                   
    sns.distplot(Target0[col], color="g" )
    plt.yscale('linear') 
    plt.xlabel(col, fontsize= 30, fontweight="bold")
    plt.ylabel('Non Payment Difficulties', fontsize= 30, fontweight="bold")                    #Target 0
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(rotation=360, fontsize=30)
     
    
    
    
    plt.subplot(1,2,2)                                                                                                      
    sns.distplot(Target1[col], color="r")
    plt.yscale('linear')    
    plt.xlabel(col, fontsize= 30, fontweight="bold")
    plt.ylabel('Payment Difficulties', fontsize= 30, fontweight="bold")                       # Target 1
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(rotation=360, fontsize=30)
    
    plt.show();

uni(col='AMT_ANNUITY')

Result:

uni(col='AMT_CREDIT')

uni(col='AMT_GOODS_PRICE')

uni(col='AMT_INCOME_TOTAL')

Insights

People with target one have largely staggered incomes compared to those with target zero. The dist. Plot clearly shows that the shape in Income total, Annuity, Credit, and Good Price is similar for Target 0 and Target 1.

The plots also highlight people who have difficulty paying back loans with respect to their income, loan amount, price of goods against which the loan is procured, and Annuity.
The dist plot highlights the curve shape, which is wider for Target 1 than Target 0, which is narrower with well-defined edges.

Bivariate Analysis: Numerical & Categorical W.R.T Target variables

Let’s check the required columns for analysis.

df1[["TARGET","AMT_INCOME_TOTAL","NAME_EDUCATION_TYPE","NAME_FAMILY_STATUS"]]

For Target 0

plt.figure(figsize=(35,14)) 
plt.yscale('log')                     #As the values are too large, it is convinient to use log for better analysis
plt.xticks(rotation = 90)


sns.boxplot(data =Target0, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL',   #Boxplot w.r.t Data Target 0
            hue ='NAME_FAMILY_STATUS',orient='v',palette='Set2')


plt.legend( loc = 'upper right')                                              #Adjusting legend position
plt.title('Income amount vs Education Status',fontsize=35 )
plt.xlabel("NAME_EDUCATION_TYPE",fontsize= 30, fontweight="bold")
plt.ylabel("AMT_INCOME_TOTAL",fontsize= 30, fontweight="bold")
plt.xticks(rotation=90, fontsize=30)
plt.yticks(rotation=360, fontsize=30)

plt.show()

Insights

Widow Clients with an Academic degree have very few outliers and don’t have the First and Third quartiles. Also, Clients with all types of family statuses having academic degrees have very few outliers compared to other kinds of education.
The income of clients with all types of family statuses and the rest of the education type lies below the First quartile, i.e. 25%.
Clients with higher education, incomplete higher education, Lower Secondary Education, and Secondary/Secondary Special education have many outliers.
From the above figure, we can say that clients with higher education tend to have the highest incomes.
Some clients who haven’t completed their Higher Education tend to have higher incomes.
Some clients with Secondary/Secondary Special Education tend to have higher incomes.

plt.figure(figsize=(25,10))
plt.yscale('log')                      #As the values are too large, it is convinient to use log for better analysis
plt.xticks(rotation = 90)


sns.boxplot(data =Target0, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT',           #Boxplot w.r.t Data Target 0
            hue ='NAME_FAMILY_STATUS',orient='v',palette='Set2')


plt.legend( bbox_to_anchor=(1.5, 1),loc = 'upper right')            #Adjusting legend position
plt.title('Credit V/s Education',fontsize=35 )
plt.xlabel("NAME_EDUCATION_TYPE",fontsize= 30, fontweight="bold")
plt.ylabel("AMT_CREDIT",fontsize= 30, fontweight="bold")
plt.xticks(rotation=90, fontsize=30)
plt.yticks(rotation=360, fontsize=30)

plt.show()

Insights

Clients with different Education types except Academic degrees have a large number of outliers**
Most of the population, i.e. clients’ credit, lie below 25%.
Clients with a college degree and who are widows tend to take higher credit loans.**
Some of the clients with higher education, incomplete higher education, Lower Secondary Education, and Secondary/Secondary Special Education are more likely to take leditmarried

plt.figure(figsize=(30,12)) 
plt.yscale('log')                     #As the values are too large, it is convinient to use log for better analysis
plt.xticks(rotation = 90)


sns.boxplot(data =Target1, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL',   #Boxplot w.r.t Data Target 1
            hue ='NAME_FAMILY_STATUS',orient='v',palette='Set2')


plt.legend( loc = 'upper right')                                              #Adjusting legend position
plt.title('Income amount vs Education Status',fontsize= 35)
plt.xlabel("NAME_EDUCATION_TYPE",fontsize= 30, fontweight="bold")
plt.ylabel("AMT_INCOME_TOTAL",fontsize= 30, fontweight="bold")
plt.xticks(rotation=90, fontsize=30)
plt.yticks(rotation=360, fontsize=30)

plt.show()

Insights

The income from an academic degree is much lower than that of others.
(Defaulter) Clients have relatively less income than non-defaulters.

plt.figure(figsize=(30,12))               #As the values are too large, it is convinient to use log for better analysis
plt.yscale('log')                       
plt.xticks(rotation = 90)


sns.boxplot(data =Target1, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT',      #Boxplot w.r.t Data Target 1
            hue ='NAME_FAMILY_STATUS',orient='v',palette='Set2')


 
plt.legend( bbox_to_anchor=(1.5, 1),loc = 'upper right')              #Adjusting legend position
plt.title('Credit V/s Education',fontsize=50 )
plt.xlabel("NAME_EDUCATION_TYPE",fontsize= 30, fontweight="bold")
plt.ylabel("AMT_CREDIT",fontsize= 30, fontweight="bold")
plt.xticks(rotation=90, fontsize=30)
plt.yticks(rotation=360, fontsize=30)

plt.show()

Insights

A married client with academics applied for a higher credit loan. And doesn’t have outliers. Single clients with academic degrees have a very slim boxplot with no outliers.
Some of the clients with higher education, incomplete higher education, Lower Secondary Education, and Secondary/Secondary Special Education are more likely to take large credit loans.

Bivariate Analysis of Categorical-Categorical to Find the Maximum % Clients with Loan-Payment Difficulties

Define a function for bivariate plots.

def biplot(df,feature,title):
    temp = df[feature].value_counts()
    
# Calculate the percentage of target=1 per category value    

    perc = df[[feature, 'TARGET']].groupby([feature],as_index=False).mean() 
    perc.sort_values(by='TARGET', ascending=False, inplace=True)
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=("Count of "+ title,"% of Loan Payment difficulties within each category"))
    fig.add_trace(go.Bar(x=temp.index, y=temp.values),row=1, col=1)
    fig.add_trace(go.Bar(x=perc[feature].to_list(), y=perc['TARGET'].to_list()),row=1, col=2)
    fig['layout']['xaxis']['title']=feature
    fig['layout']['xaxis2']['title']=feature
    fig['layout']['yaxis']['title']='Count'
    fig['layout']['yaxis2']['title']='% of Loan Payment Difficulties'
    fig.update_layout(height=600, width=1000, title_text=title, showlegend=False)
    fig.show()

Distribution of Amount Income Range and the category with maximum % Loan-Payment Difficulties

biplot(df1 ,'AMT_INCOME_TYPE','Income range')

Distribution of Type of Income and the category with maximum Loan-Payment Difficulties

biplot(df1 ,'NAME_INCOME_TYPE','Income type')

Distribution of Contract Type and the category with maximum Loan-Payment Difficulties

biplot(df1 ,'NAME_CONTRACT_TYPE','Contract type')

Distribution of Education Type and the category with maximum Loan-Payment Difficulties

biplot(df1 ,'NAME_EDUCATION_TYPE','Education type')

Distribution of Housing Type and the category with maximum Loan-Payment Difficulties

biplot(df1 ,'NAME_HOUSING_TYPE','Housing type')

Distribution of Occupation Type and the category with maximum Loan-Payment Difficulties

biplot(df1 ,'OCCUPATION_TYPE','Occupation type')

You may be wondering why I haven’t attached screenshots. Well, plot the charts and try to give insights based on that. That’s the best way to learn.

Distribution of CODE_GENDER with respect to AMT_INCOME_RANGE to find maximum % Loan-Payment Difficulties using pivot table

table= pd.pivot_table(df1, values='TARGET', index=['CODE_GENDER','AMT_INCOME_TYPE'],
                      columns=['NAME_EDUCATION_TYPE'], aggfunc=np.mean)

table

Insights

Female clients with an Academic degree and high-income type have a higher risk of default.
Male clients with Secondary/Secondary Special Education having all types of salaries have a higher risk of default.
Male clients with incomplete education and meager salaries have a high risk of default.
Male Clients with Lower Secondary Education having very low or medium have a high risk of defaulting.

Let’s visually check correlations in the data. For tha at, make a list of all numeric features.

numerical_col = df1.select_dtypes(include='number').columns
numerical_col
len(numerical_col)

Correlations between numerical variables

Let’s use pairplot to get the required charts.

pair = Target0[['TARGET','AMT_CREDIT', 'AMT_ANNUITY', 'AMT_INCOME_TOTAL', 'AMT_GOODS_PRICE', 'DAYS_BIRTH','CNT_CHILDREN','DAYS_EMPLOYED']].fillna(0)
sns.pairplot(pair)

plt.show()

pair = Target1[['TARGET','AMT_CREDIT', 'AMT_ANNUITY', 'AMT_INCOME_TOTAL', 'AMT_GOODS_PRICE', 'DAYS_BIRTH','CNT_CHILDREN','DAYS_EMPLOYED']].fillna(0)
sns.pairplot(pair)

plt.show()

Insights

AMT_CREDIT and AMT_GOODS_PRICE are highly correlated variables for both defaulters and non-defaulters. So, as the home price increases, the loan amount also increases
AMT_CREDIT and AMT_ANNUITY (EMI) are highly correlated variables for both defaulters and non-defaulters. So, as the home price increases, the EMI amount also increases, which is logical.
All three variables AMT_CREDIT, AMT_GOODS_PRICE and AMT_ANNUITY are highly correlated for both defaulters and non-defaulters, which might not give a good indicator for heat maps detection

Let us now check the correlations in heatmaps.

corr0=df1.iloc[0:,2:]
corr1=df1.iloc[0:,2:]

t0=corr0.corr(method='spearman')   # t0 - Corelations distibuted according rank wise for target 0
t1=corr1.corr(method='spearman')   # t1 - Corelations distibuted according rank wise for target 1

Source: Author

targets_corr(data=t0,title='Correlation for Target 0')

Source: Author

Insights

AMT_CREDIT is inversely proportional to the DAYS_BIRTH , people belong to the low-age group taking high Credit amounts and vice-versa
AMT_CREDIT is inversely proportional to the CNT_CHILDREN, means the Credit amount is higher for fewer children count clients have and vice-versa.
AMT_INCOME_TOTAL is inversely proportional to the CNT_CHILDREN, means more income for fewer children clients have and vice-versa.
Fewer children clients are in densely populated areas.
AMT_CREDIT Is higher in a densely populated area.
AMT_INCOME_TOTAL Is also higher in a densely populated area

InsightsObservations: The map for Target 1 is the same observation as Target 0, but a few points are different. They are listed below.

The client’s permanent address does not match the contact address, and they have fewer children.
The client’s permanent address does not match the work address, and they have fewer children.

This is the analysis of current application data. We have one more piece of data for the previous applications and have to analyse that also. Consider that data and do the analysis. Try to give insights.

Find the link to the source code here.

Conclusion

In conclusion, the Exploratory Data Analysis (EDA) conducted for the credit card fraud detection case study highlights the critical importance of thorough data examination in identifying fraudulent activities. We gained valuable insights into transaction patterns and anomalies by utilising various visualisation techniques and statistical methods. This analysis not only aids in developing more effective fraud detection algorithms but also enhances the understanding of the underlying factors contributing to fraudulent behavior. Implementing these findings can significantly improve the accuracy of detection systems, ultimately leading to better risk management and protection for both financial institutions and customers.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

KAVITA

A Mathematics student turned Data Scientist. I am an aspiring data scientist who aims at learning all the necessary concepts in Data Science in detail. I am passionate about Data Science knowing data manipulation, data visualization, data analysis, EDA, Machine Learning, etc which will help to find valuable insights from the data.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Exploratory Data Analysis (EDA): Credit Card Fraud Detection Case Study

Business Problem Understanding

Business Goal

Data

Importing Necessary Packages

Data Inspection

Handling Null Values

Outlier Analysis

Univariate Analysis of Numerical Columns W.R.T Target Variable

Bivariate Analysis: Numerical & Categorical W.R.T Target variables

Bivariate Analysis of Categorical-Categorical to Find the Maximum % Clients with Loan-Payment Difficulties

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid