Because anomaly detection can spot departures from expected behavior in data, it is an essential tool in many industries, such as banking, cybersecurity, and healthcare. Among the many anomaly detection techniques available, Principal Component Analysis (PCA) is an effective way to surface anomalies concealed in datasets. PCA is a dimensionality reduction method that transforms complicated data into a lower-dimensional space while keeping the most important information, and it exploits the data's inherent structure to detect outliers by examining the residual (reconstruction) errors after the transformation.
An anomaly, also known as an outlier, is a data point that significantly deviates from the expected or normal behavior within a dataset. In simpler terms, it stands out as unusual or different compared to most data. Anomalies can occur for various reasons, such as errors in data collection, sensor malfunctions, fraudulent activities, or genuine rare events.
For example, consider a dataset containing daily temperatures recorded over a year in a city. Most of the temperatures follow a typical pattern, with warmer temperatures in summer and cooler temperatures in winter. However, if there is a day in the dataset where the temperature is exceptionally high during the winter season, significantly deviating from the typical range for that time of year, it would be considered an anomaly. This anomaly could be caused by a recording error, an unusual weather event, or a malfunctioning temperature sensor. Identifying such anomalies is important for ensuring the accuracy and reliability of the data and for taking appropriate action if necessary, such as investigating the cause of the anomaly or correcting errors in the data collection process.
In the rest of this article, we will focus on using PCA for anomaly detection.
Principal Component Analysis (PCA) is a widely used technique in data analysis and machine learning for dimensionality reduction and feature extraction. It aims to transform high-dimensional data into a lower-dimensional space while preserving most of the variance in the original data.
PCA finds the eigenvectors and eigenvalues of the data's covariance matrix. Eigenvectors represent the directions of maximum variance in the data, while eigenvalues indicate the magnitude of variance along those directions. The principal components are the eigenvectors associated with the largest eigenvalues, and together they form a new orthogonal basis for the data. By selecting a subset of these components, PCA effectively reduces the dimensionality of the data while retaining as much variance as possible.
The principal components (PCs) are linear combinations of the original features, chosen to capture the maximum variance present in the data. They are the eigenvectors of the covariance matrix of the original data and represent the directions in feature space along which the data varies the most. The first principal component captures the largest share of the variance, and each subsequent component captures progressively less.
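To make this concrete, here is a small NumPy sketch of the computation described above, using a made-up random matrix in place of real data; it recovers the principal directions and explained-variance ratios directly from the eigendecomposition of the covariance matrix:
# Toy sketch: principal components from the covariance matrix (illustrative data)
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))            # 100 samples, 5 features (synthetic)
X_centered = X_demo - X_demo.mean(axis=0)     # center each feature

cov = np.cov(X_centered, rowvar=False)        # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh suits symmetric matrices

order = np.argsort(eigvals)[::-1]             # sort directions by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()     # analogue of pca.explained_variance_ratio_
print(explained_ratio)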
This method is particularly useful when the dataset is imbalanced, for example when we have plenty of data for normal transactions but very little for fraudulent ones. PCA-based anomaly detection handles this by modeling the available features of normal transactions and flagging points that do not fit that profile.
Reconstruction error is central to PCA-based anomaly detection. After identifying the PCs, we can reconstruct the original data from the PCA-transformed data using only the first few principal components without losing important information, provided those components account for most of the variance. The discrepancy between the original data and its reconstruction is called the reconstruction error. Anomalous data points tend to have large reconstruction errors.
On the historical data, we fit PCA, compute the reconstruction errors, and derive a normalized reconstruction error that serves as a threshold for newly ingested data points. Each new point is projected onto the computed principal components, reconstructed, and its reconstruction error is measured; if this error exceeds the threshold, the point is flagged as anomalous.
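A minimal sketch of this scoring step might look like the following. It assumes a StandardScaler (scaler), a fitted PCA object (pca), and a reconstruction-error threshold (threshold) have already been computed from the historical data, and new_point is a hypothetical incoming observation with the same features:
# Sketch: flag a newly ingested point by its reconstruction error
# (scaler, pca, and threshold are assumed to come from the historical data)
import numpy as np

def is_anomalous(new_point, scaler, pca, threshold):
    x = scaler.transform(np.asarray(new_point).reshape(1, -1))  # scale like the training data
    x_proj = pca.transform(x)                                   # project onto the principal components
    x_rec = pca.inverse_transform(x_proj)                       # map back to the original feature space
    error = np.sum(np.square(x - x_rec))                        # reconstruction error for this point
    return error > threshold                                    # flag if the error exceeds the cutoff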
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
# Load the credit card transactions dataset
data = pd.read_csv("creditcard.csv")
data.head()
# Class distribution: 0 = normal, 1 = fraudulent
s = data["Class"].value_counts()
s.iloc[1], s.iloc[0]   # number of fraudulent and normal transactions
X = data.copy()
y = data["Class"]
# Standardize the features before applying PCA
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
std.fit(X)
X = std.transform(X)
# Applying PCA
pca = PCA()
X_pca = pca.fit_transform(X)
# Variance explained by each component
variance_explained = pca.explained_variance_ratio_
# Plotting the variance explained by each component
plt.figure(figsize=(20, 8))
plt.bar(range(1, len(variance_explained) + 1), variance_explained, alpha=0.7, align='center')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Variance Explained by Each Principal Component')
plt.xticks(range(1, len(variance_explained) + 1))
plt.grid(True)
plt.show()
cum_sum = np.cumsum(pca.explained_variance_ratio_) * 100
comp = list(range(len(cum_sum)))
plt.figure(figsize=(20, 8))
plt.plot(comp, cum_sum, marker='o', markersize=10)
plt.xlabel('PCA Components')
plt.ylabel('Cumulative Explained Variance (%)')
plt.title('PCA')
plt.show()
# Summing the variance explained by the first 28 components
variance_explained_first_28 = sum(variance_explained[:28])
print("Variance explained by the first 28 components:", variance_explained_first_28)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
dataX = data.copy().drop(['Class'],axis=1)
dataY = data['Class'].copy()
featuresToScale = dataX.columns
sX = StandardScaler(copy=True)
dataX.loc[:,featuresToScale] = sX.fit_transform(dataX[featuresToScale])
X_train, X_test, y_train, y_test = \
train_test_split(dataX, dataY, test_size=0.33, \
random_state=2018, stratify=dataY)
def scatterPlot(xDF, yDF, algoName):
    # Plot the first two components, colored by class label
    tempDF = pd.DataFrame(data=xDF.loc[:, 0:1], index=xDF.index)
    tempDF = pd.concat((tempDF, yDF), axis=1, join="inner")
    tempDF.columns = ["First Vector", "Second Vector", "Label"]
    sns.lmplot(x="First Vector", y="Second Vector", hue="Label", data=tempDF, fit_reg=False, legend=False)
    ax = plt.gca()
    ax.set_title("Separation of Observations using " + algoName)
    ax.legend(loc="upper right")
X_train_PCA = pca.fit_transform(X_train)
X_train_PCA = pd.DataFrame(data=X_train_PCA, index=X_train.index)
X_train_PCA_inverse = pca.inverse_transform(X_train_PCA)
X_train_PCA_inverse = pd.DataFrame(data=X_train_PCA_inverse, \
index=X_train.index)
scatterPlot(X_train_PCA, y_train, "PCA")
# Applying PCA with 28 components
pca = PCA(n_components=28)  # retain the 28 components that explain most of the variance
X_pca = pca.fit_transform(X)
# Reconstructing the dataset
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_error = np.sum(np.square(X - X_reconstructed), axis=1)
# Visualizing the reconstruction error
plt.figure(figsize=(20, 8))
counts, bins, _ = plt.hist(reconstruction_error, bins=20, color='skyblue', edgecolor='black', alpha=0.7)
plt.xlabel('Reconstruction Error')
plt.ylabel('Frequency')
plt.title('Distribution of Reconstruction Error')
plt.grid(True)
# Annotate each bin with the count
for i in range(len(counts)):
    plt.text(bins[i], counts[i], str(int(counts[i])), ha='center', va='bottom', fontsize=18)
plt.show()
# Finding anomalies
threshold = np.percentile(reconstruction_error, 99.8) # Adjust percentile as needed
anomalies = X[reconstruction_error > threshold]
print("Number of anomalies:", len(anomalies))
print("Anomalies:")
print(anomalies)
# Identifying anomalies
anomalies_indices = np.where(reconstruction_error > threshold)[0]
anomalies_indices
# Check how many of the flagged points are actually fraudulent
normal = 0
fraud = 0
for i in anomalies_indices:
    if data.iloc[i]["Class"] == 0:
        normal = normal + 1
    else:
        fraud = fraud + 1
normal, fraud
Precision of our PCA-based detector:
# Precision = true positives / all flagged points
Precision = fraud / (normal + fraud)
Precision * 100
Percentage of fraud transactions detected:
# Recall: fraction of all fraudulent transactions that were detected
Fraud_detected = fraud / s.iloc[1]
Fraud_detected
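As a cross-check (not part of the original walkthrough), the same precision and recall can be reproduced with scikit-learn's metrics by turning the flagged indices into a prediction vector and comparing it with the true Class labels:
# Sketch: cross-check precision and recall with scikit-learn metrics
from sklearn.metrics import precision_score, recall_score

y_pred = np.zeros(len(data), dtype=int)
y_pred[anomalies_indices] = 1                      # points flagged by reconstruction error
print("Precision:", precision_score(data["Class"], y_pred))
print("Recall:", recall_score(data["Class"], y_pred))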
We have 284,807 data points in our dataset, of which 492 transactions are fraudulent; we consider these 492 transactions anomalous. Using Principal Component Analysis (PCA) and its reconstruction error, we flagged 570 records as anomalous. Of those 570 data points, 410 were actually fraudulent (true positives) and 160 were normal (false positives). On this highly imbalanced data, and using a purely unsupervised technique, we achieved a precision of about 71.9% and detected almost 83% of the fraudulent transactions.
PCA is more effective for local anomalies that exhibit linear relationships with the principal components of the data. It can be useful when anomalies are small deviations from the normal data’s distribution and are related to the underlying structure captured by PCA. It’s often used as a preprocessing step for anomaly detection when dealing with high-dimensional data.
For certain types of anomalies, such as those with non-linear relationships or when the anomalies are significantly different from the normal data, other techniques like isolation forests, one-class SVMs, or autoencoders might be more suitable.
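For comparison, one such alternative can be run on the same scaled features in just a few lines; the sketch below uses an Isolation Forest, and the contamination value is an illustrative assumption rather than a tuned parameter:
# Sketch: Isolation Forest on the same scaled features, for comparison
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.002, random_state=42)   # contamination is illustrative
iso_labels = iso.fit_predict(dataX)        # -1 marks predicted anomalies, 1 marks normal points
iso_anomalies = np.where(iso_labels == -1)[0]
print("Isolation Forest flagged:", len(iso_anomalies), "points")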
In summary, while PCA can be used for anomaly detection, it’s important to consider the characteristics of your data and the types of anomalies you are trying to detect. PCA might work well in some cases but might not be the best choice for all anomaly detection scenarios.
Q1. How does PCA help in anomaly detection?
Ans. PCA aids in anomaly detection by reducing the dimensionality of high-dimensional data while retaining most of its variance. This reduction simplifies the dataset’s representation and highlights the most significant features. Anomalies often manifest as deviations from the normal patterns captured by PCA, resulting in noticeable reconstruction errors when projecting data back to the original space.
Q2. What are the advantages of using PCA for anomaly detection?
Ans. PCA offers several advantages for anomaly detection. Firstly, it provides a compact representation of the data, making it easier to visualize and interpret anomalies. Secondly, PCA can capture complex relationships between variables, effectively identifying anomalies even in datasets with correlated features. PCA-based anomaly detection is also computationally efficient, making it suitable for analyzing large-scale datasets.
Q3. How are anomalies detected with PCA interpreted?
Ans. Anomalies detected using PCA are data points that exhibit significant reconstruction errors when projected back to the original feature space. These anomalies represent instances that deviate substantially from the normal patterns captured by PCA. Interpreting anomalies involves examining their characteristics and understanding the underlying reasons for their divergence from the norm. This process may involve domain knowledge and further investigation to determine whether anomalies are indicative of genuine outliers or errors in the data.
Q4. Can PCA be combined with other anomaly detection methods?
Ans. Yes, PCA can be combined with other anomaly detection methods, such as One-Class SVM or Isolation Forest, to enhance performance. PCA’s dimensionality reduction capabilities complement other techniques by improving feature selection, visualization, and computational efficiency. By reducing the dataset’s dimensionality, PCA simplifies the data representation and makes it easier for other anomaly detection algorithms to identify meaningful patterns and anomalies.
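A rough illustration of such a combination is sketched below, chaining PCA with a One-Class SVM on the scaled features from earlier; the n_components, nu, and subsample size are assumptions chosen only to keep the example small and fast:
# Sketch: PCA for dimensionality reduction followed by a One-Class SVM
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

combo = make_pipeline(PCA(n_components=10), OneClassSVM(nu=0.002))   # illustrative values
sample = dataX.sample(20000, random_state=0)    # subsample for speed; illustrative
combo_labels = combo.fit_predict(sample)        # -1 marks predicted anomalies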
Q5. How does PCA differ between unsupervised and supervised anomaly detection?
Ans. In unsupervised anomaly detection, PCA simplifies anomaly detection tasks by identifying anomalies without prior knowledge of their labels. However, it may overlook subtle anomalies that require labeled examples for training. In supervised anomaly detection, PCA can still be used for feature extraction, but its effectiveness depends on the availability and quality of labeled data. Additionally, class imbalance and data distribution may impact PCA’s performance differently in unsupervised versus supervised settings.
Q6. How does PCA help with anomaly detection on imbalanced datasets?
Ans. PCA helps in anomaly detection on imbalanced datasets by emphasizing variations that differentiate anomalies from normal instances. By reducing dimensionality and focusing on principal components capturing significant variations, PCA enhances sensitivity to subtle anomalies. This aids in detecting rare anomalies amidst a majority of normal instances, improving overall anomaly detection performance.