We live in a world that’s filled with data. Think about it – you probably use multiple online accounts every day, from email to social media to online shopping. But have you ever had a moment where you were using one of these accounts and received a notification on your phone asking if it’s really you trying to access your account? Well, this is where anomaly detection comes in.
For example, let’s say you’re at work and you accidentally mistype your password while trying to log in to your Google account. Suddenly, you receive a message on your phone from Google asking if the login attempt is really you. This might make you wonder how Google is able to know that it’s not actually you trying to log in, especially if it’s just a simple mistake.
Anomaly detection is a fancy way of saying that computers are really good at finding patterns and noticing when something is out of the ordinary. In the case of Google’s login security, they use machine learning algorithms to create a “normal” pattern of your login behavior. This means that they learn what kind of device you usually use, what time of day you usually log in, and other similar details that make up your typical login behavior. Google’s system is then able to recognize when some activity is unusual and flag it as potentially suspicious. It’s kind of like how your bank might call you if they see a purchase on your account from a different country – they’re just making sure that it’s really you and not someone else using your account.
Google collects data from various sources such as browser cookies, device fingerprints, IP addresses, and user account information. They use this data to create a baseline of your normal login behavior, including your usual device, location, and time of day for logging in to your Google account. This baseline is then used to train machine learning algorithms that can detect unusual login patterns. These algorithms analyze data points such as the device used to log in, the location, the time of day, and your behavior after logging in, to identify anomalies that deviate significantly from your normal behavior.
For instance, if someone tries to log in to your Google account from a different device, location, or time of day, the algorithm may flag this as a potentially unauthorized login attempt. Similarly, if your behavior after logging in is significantly different from your normal behavior, the algorithm may flag this as a potential security threat.
Google’s anomaly detection algorithms are constantly evolving and improving as they learn from new data. They use a feedback loop to continuously train and update their algorithms based on new login data and user feedback.
To accomplish this, Google needs a large server infrastructure to collect, store, and process the vast amounts of data generated by their users. They also need powerful machine learning algorithms to analyze the data and detect anomalies accurately.
Anomaly detection is widely used across various fields, industries, and platforms. One such application of anomaly detection is in detecting credit card fraud. Our project will focus on implementing anomaly detection techniques to identify potentially fraudulent credit card transactions.
In our credit card fraud detection project, we will use anomaly detection techniques to identify potentially fraudulent transactions. We will use a dataset of credit card transactions to create a baseline of the customer’s normal spending behavior. Then we will apply machine learning algorithms to identify any transactions that deviate significantly from this baseline and flag them as potentially fraudulent.
We will also explore various techniques to improve the accuracy of our anomaly detection model, such as feature engineering, data preprocessing, and hyperparameter tuning. Our ultimate goal is to create a model that can accurately identify fraudulent transactions and minimize false positives, so that cardholders can be alerted in a timely manner and take appropriate actions to protect their accounts.
Here is the link to the GitHub repository containing the credit card fraud detection code.
You can download the credit card fraud dataset used in the project from the Kaggle website. Here is the link to the dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
In this section, we will provide an end-to-end guide to implementing anomaly detection using Python. We will use the Credit Card Fraud Detection dataset from Kaggle, which contains transactions made by credit cards in September 2013 by European cardholders. The dataset contains 284,807 transactions, out of which 492 are frauds. The dataset is highly imbalanced, with fraud transactions accounting for only 0.17% of the total transactions.
The first step is to prepare the data for anomaly detection. We will start by importing the necessary libraries and loading the dataset into a Pandas DataFrame.
import pandas as pd
# Load the dataset
df = pd.read_csv("creditcard.csv")
# Check the shape of the dataset
print("Shape of the dataset:", df.shape)
# Check the first few rows of the dataset
print(df.head())
Output :
Shape of the dataset: (284807, 31)
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 ... V21 V22 V23 V24 V25 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170
2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642
3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376
4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010
V26 V27 V28 Amount Class
0 -0.189115 0.133558 -0.021053 149.62 0
1 0.125895 -0.008983 0.014724 2.69 0
2 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.221929 0.062723 0.061458 123.50 0
4 0.502292 0.219422 0.215153 69.99 0
[5 rows x 31 columns]
The dataset contains 31 columns, including the Time, Amount, and Class columns. The Class column indicates whether a transaction is fraudulent or not, where 1 indicates fraud and 0 indicates non-fraud.
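Before preprocessing, it is worth confirming the class imbalance described above. As a quick optional check (not part of the original walkthrough), we can count the values in the Class column:
# Count fraudulent (1) vs. non-fraudulent (0) transactions
print(df['Class'].value_counts())
# Fraction of fraudulent transactions, as a percentage (should be about 0.17%)
print(df['Class'].mean() * 100)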
Before applying any anomaly detection algorithm, it is essential to preprocess the data to ensure that it is in a suitable format for the algorithm. Here are some steps that we can follow to preprocess the dataset:
Missing values can affect the performance of the anomaly detection algorithm. Therefore, it is essential to check whether there are any missing values in the dataset and take appropriate action.
# Check the total number of missing values in the dataset
print(df.isnull().sum().sum())
Output :
0
The output shows that there are no missing values in the dataset.
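If the dataset did contain missing values, a common remedy is to either drop the affected rows or impute the gaps before modeling. Here is a minimal sketch of what that could look like (purely illustrative, since this dataset needs neither):
# Option 1: drop rows that contain any missing values
# df = df.dropna()
# Option 2: fill missing values in a numeric column with its median
# df['Amount'] = df['Amount'].fillna(df['Amount'].median())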
Anomaly detection algorithms can be sensitive to the scale of the data. Therefore, it is important to scale the data before applying the algorithm. We can use the StandardScaler class from the sklearn.preprocessing module to scale the data.
from sklearn.preprocessing import StandardScaler
# Scale the Amount column
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
# Scale the Time column
df['Time'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1))
# Check the first few rows of the dataset after scaling
print(df.head())
Output :
Time V1 V2 V3 V4 V5 V6 \
0 -1.996583 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388
1 -1.996583 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361
2 -1.996562 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499
3 -1.996562 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203
4 -1.996541 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921
         V7        V8        V9  ...       V21       V22       V23       V24
0  0.239599  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928
1 -0.078803  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846
2  0.791461  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281
3  0.237609  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575
4  0.592941 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267
Note that only the Time and Amount columns are affected by the scaling; the V1–V28 features are left unchanged.
There are various anomaly detection algorithms available. In this section, we will discuss some popular algorithms along with their implementation in Python.
Isolation Forest is a popular algorithm for anomaly detection that is based on the concept of decision trees. It works by creating random decision trees for the given data and isolating the anomalies by creating shorter paths for them.
Let’s implement the Isolation Forest algorithm on our credit card fraud dataset.
from sklearn.ensemble import IsolationForest
# Use only the feature columns; the Class label is kept aside for evaluation
X_features = df.drop('Class', axis=1)
# Create the Isolation Forest object
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.01,
                      max_features=1.0, random_state=42)
# Fit the data and tag the outliers (-1 = outlier, 1 = inlier)
clf.fit(X_features)
# Get the predictions
y_pred = clf.predict(X_features)
# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())
Output :
Number of outliers: 2848
The Isolation Forest algorithm has detected 2848 anomalies in the dataset.
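Because this dataset happens to include ground-truth labels, we can also run a quick sanity check (not part of the original walkthrough) to see how many of the flagged transactions are actually labelled as fraud:
# Compare the Isolation Forest flags against the ground-truth Class column
flagged = (y_pred == -1)
print("Flagged transactions that are labelled as fraud:", df.loc[flagged, 'Class'].sum())
Since the detector flags about 1% of transactions while only 0.17% are fraudulent, many of the flagged transactions will inevitably be false positives.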
Local Outlier Factor (LOF) is another popular algorithm for anomaly detection that is based on the concept of local density. It works by calculating the density of a data point relative to its neighbors and identifying points that have a much lower density than their neighbors as outliers.
Let’s implement the LOF algorithm on our credit card fraud dataset.
from sklearn.neighbors import LocalOutlierFactor
# Create the LOF object
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
# Fit the data and tag the outliers (-1 = outlier, 1 = inlier)
y_pred = clf.fit_predict(X_features)
# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())
Output :
Number of outliers: 2848
The LOF algorithm has also flagged 2848 anomalies, the same number as the Isolation Forest. This is expected: both detectors were configured with contamination=0.01, so each labels roughly 1% of the 284,807 transactions as outliers, although the two methods will not necessarily agree on which transactions those are.
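LOF also exposes its raw outlier scores through the negative_outlier_factor_ attribute, which is handy if we would rather rank transactions by how anomalous they look than rely on a fixed contamination threshold. A brief optional sketch of that idea:
import numpy as np
# The lower (more negative) the score, the more anomalous the point
scores = clf.negative_outlier_factor_
# Indices of the 10 most anomalous transactions according to LOF
top_anomalies = np.argsort(scores)[:10]
print(df.iloc[top_anomalies][['Time', 'Amount', 'Class']])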
One-class SVM is another popular algorithm for anomaly detection that is based on the concept of maximum margin hyperplanes. It works by creating a hyperplane that separates the normal data points from the anomalies and identifying points that lie on the wrong side of the hyperplane as anomalies.
Let’s implement the One-class SVM algorithm on our credit card fraud dataset.
from sklearn.svm import OneClassSVM
# Create the One-class SVM object
clf = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.01)
# Fit the data and tag the outliers (-1 = outlier, 1 = inlier)
clf.fit(X_features)
# Get the predictions
y_pred = clf.predict(X_features)
# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())
Output :
Number of outliers: 492
The One-class SVM algorithm has detected 492 anomalies in the dataset.
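A practical caveat: a kernel One-class SVM scales poorly with the number of samples, so fitting it on all 284,807 transactions can take a very long time. A common workaround (an assumption on our part, not something done in the original analysis) is to fit on a random subsample and then predict on the full dataset:
# Fit on a random subsample to keep training time manageable (the sample size is illustrative)
sample = X_features.sample(n=20000, random_state=42)
clf_sampled = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.01).fit(sample)
# Predict on the full dataset
y_pred_sampled = clf_sampled.predict(X_features)
print("Number of outliers (subsampled fit):", (y_pred_sampled == -1).sum())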
In this step, we evaluate the performance of several supervised models using cross-validation and select the best performing one. We use stratified 5-fold cross-validation (the default behaviour of GridSearchCV for classifiers), which keeps the proportion of fraud cases roughly the same in each fold. We then train and tune three models – Logistic Regression, Decision Tree, and Random Forest – and score them with the average precision metric, which is a more suitable evaluation metric for imbalanced datasets than plain accuracy.
from sklearn.model_selection import train_test_split
# Define X and y
X = df.drop('Class', axis=1)
y = df['Class']
# Split the dataset into training and testing sets, preserving the fraud ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Create a list of classifiers to evaluate
# (the liblinear solver is used so that both l1 and l2 penalties are supported)
classifiers = [LogisticRegression(solver='liblinear'), DecisionTreeClassifier(),
               RandomForestClassifier(random_state=42)]
# Create parameter grids for each classifier
lr_params = {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]}
dt_params = {'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 7]}
rf_params = {'n_estimators': [100, 300, 500], 'max_depth': [3, 5, 7]}
param_grids = [lr_params, dt_params, rf_params]
# Loop over classifiers and parameter grids to find the best model
for i, classifier in enumerate(classifiers):
    clf = GridSearchCV(classifier, param_grids[i], cv=5, scoring='average_precision')
    clf.fit(X_train, y_train)
    print(classifier.__class__.__name__)
    print(clf.best_params_)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))
    # Keep the tuned Random Forest so it can be reused in the deployment step below
    if isinstance(classifier, RandomForestClassifier):
        rf_model = clf.best_estimator_
The final step in the machine learning pipeline is to deploy the selected model to make predictions on new data. In this step, we will use the tuned Random Forest model (rf_model) from the previous step to make predictions on the test dataset and evaluate its performance using classification metrics.
We will use the predict method of the trained model to make predictions on the test data, and then evaluate the model’s performance using the accuracy_score, precision_score, recall_score, and f1_score metrics from the sklearn.metrics module.
The code for this step is as follows:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# make predictions on the test set using the tuned Random Forest from the previous step
y_pred = rf_model.predict(X_test)
# evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# print the classification metrics
print(f"Accuracy: {acc}")
print(f"Precision: {prec}")
print(f"Recall: {rec}")
print(f"F1 Score: {f1}")
Output :
Accuracy: 0.9995669627705019
Precision: 0.9090909090909091
Recall: 0.8088235294117647
F1 Score: 0.8560311284046692
In this code, we first use the predict method of the trained rf_model to make predictions on the test set X_test. We then evaluate the model’s performance using the accuracy_score, precision_score, recall_score, and f1_score metrics. Finally, we print the classification metrics to the console.
Note that we have imported the required metrics from the sklearn.metrics module. These metrics help us to evaluate the performance of the model and make informed decisions about its suitability for deployment.
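With such a skewed class distribution, accuracy on its own can be misleading, so it can also help to look at the confusion matrix and a precision-recall-based summary. Here is a short optional sketch (not part of the original pipeline):
from sklearn.metrics import confusion_matrix, average_precision_score
# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
# Average precision summarises the precision-recall curve in a single number
y_scores = rf_model.predict_proba(X_test)[:, 1]
print("Average precision:", average_precision_score(y_test, y_scores))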
In this article, we have discussed the concept of anomaly detection and various algorithms that can be used to detect anomalies in a dataset. We have also implemented some of these algorithms in Python and applied them to a credit card fraud dataset to detect anomalies. It is important to note that the choice of algorithm and the preprocessing techniques depend on the nature of the data and the problem at hand.
Overall, anomaly detection is a powerful tool that can provide valuable insights and help detect abnormalities in various datasets. As the amount of data continues to grow, the need for effective anomaly detection techniques becomes increasingly important.