Because anomaly detection can spot departures from expected behavior in data, it is an essential tool in many industries, such as banking, cybersecurity, and healthcare. Among the many anomaly detection techniques available, Principal Component Analysis (PCA) is an effective way to surface anomalies concealed in datasets. PCA is a dimensionality reduction method that transforms complicated data into a lower-dimensional space while keeping the most important information, and it exploits the data's inherent structure to detect outliers by examining the residual (reconstruction) errors after the transformation.
An anomaly, also known as an outlier, is a data point that significantly deviates from the expected or normal behavior within a dataset. In simpler terms, it stands out as unusual or different compared to most data. Anomalies can occur for various reasons, such as errors in data collection, sensor malfunctions, fraudulent activities, or genuine rare events.
For example, consider a dataset containing daily temperatures recorded over a year in a city. Most of the temperatures follow a typical pattern, with warmer temperatures in summer and cooler temperatures in winter. However, if there is a day in the dataset where the temperature is exceptionally high during the winter season, significantly deviating from the typical range for that time of year, it would be considered an anomaly. This anomaly could be caused by a recording error, an unusual weather event, or a malfunctioning temperature sensor. Identifying such anomalies is important for ensuring the accuracy and reliability of the data and for taking appropriate action if necessary, such as investigating the cause of the anomaly or correcting errors in the data collection process.
In the rest of this article, we will focus on using PCA for anomaly detection.
Principal Component Analysis (PCA) is a widely used technique in data analysis and machine learning for dimensionality reduction and feature extraction. It aims to transform high-dimensional data into a lower-dimensional space while preserving most of the variance in the original data.
PCA finds the eigenvectors and eigenvalues of the data's covariance matrix. Eigenvectors represent the directions of maximum variance in the data, while eigenvalues indicate the magnitude of variance along those directions. The principal components are the eigenvectors associated with the largest eigenvalues, and together they form a new orthogonal basis for the data. By selecting a subset of these components, PCA effectively reduces the dimensionality of the data while retaining as much variance as possible.
The principal components (PCs) are linear combinations of the original features, chosen to capture the maximum variance present in the data. They are the eigenvectors of the covariance matrix of the original data and represent the directions in feature space along which the data varies the most. The first principal component captures the largest share of the variance, and each subsequent component captures progressively less.
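To make this concrete, here is a small NumPy sketch of the computation described above, using a made-up random matrix in place of real data; it recovers the principal directions and explained-variance ratios directly from the eigendecomposition of the covariance matrix:
# Toy sketch: principal components from the covariance matrix (illustrative data)
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))            # 100 samples, 5 features (synthetic)
X_centered = X_demo - X_demo.mean(axis=0)     # center each feature

cov = np.cov(X_centered, rowvar=False)        # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh suits symmetric matrices

order = np.argsort(eigvals)[::-1]             # sort directions by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()     # analogue of pca.explained_variance_ratio_
print(explained_ratio)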
This method is particularly useful when the dataset is imbalanced, for example when we have plenty of data for normal transactions but very little for fraudulent ones. PCA-based anomaly detection handles this by modeling the available features of normal transactions and flagging points that do not fit that profile.
Reconstruction error is central to PCA-based anomaly detection. After identifying the PCs, we can reconstruct the original data from the PCA-transformed data using only the first few principal components without losing important information, provided those components account for most of the variance. The discrepancy between the original data and its reconstruction is called the reconstruction error. Anomalous data points tend to have large reconstruction errors.
On the historical data, we fit PCA, compute the reconstruction errors, and derive a normalized reconstruction error that serves as a threshold for newly ingested data points. Each new point is projected onto the computed principal components, reconstructed, and its reconstruction error is measured; if this error exceeds the threshold, the point is flagged as anomalous.
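A minimal sketch of this scoring step might look like the following. It assumes a StandardScaler (scaler), a fitted PCA object (pca), and a reconstruction-error threshold (threshold) have already been computed from the historical data, and new_point is a hypothetical incoming observation with the same features:
# Sketch: flag a newly ingested point by its reconstruction error
# (scaler, pca, and threshold are assumed to come from the historical data)
import numpy as np

def is_anomalous(new_point, scaler, pca, threshold):
    x = scaler.transform(np.asarray(new_point).reshape(1, -1))  # scale like the training data
    x_proj = pca.transform(x)                                   # project onto the principal components
    x_rec = pca.inverse_transform(x_proj)                       # map back to the original feature space
    error = np.sum(np.square(x - x_rec))                        # reconstruction error for this point
    return error > threshold                                    # flag if the error exceeds the cutoff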
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
# Load the credit card transactions dataset
data = pd.read_csv("creditcard.csv")
data.head()
# Class distribution: 0 = normal, 1 = fraudulent
s = data["Class"].value_counts()
s.iloc[1], s.iloc[0]   # number of fraudulent and normal transactions
X = data.copy()
y = data["Class"]
# Standardize the features before applying PCA
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
std.fit(X)
X = std.transform(X)
# Applying PCA
pca = PCA()
X_pca = pca.fit_transform(X)
# Variance explained by each component
variance_explained = pca.explained_variance_ratio_
# Plotting the variance explained by each component
plt.figure(figsize=(20, 8))
plt.bar(range(1, len(variance_explained) + 1), variance_explained, alpha=0.7, align='center')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Variance Explained by Each Principal Component')
plt.xticks(range(1, len(variance_explained) + 1))
plt.grid(True)
plt.show()
cum_sum = np.cumsum(pca.explained_variance_ratio_) * 100
comp = list(range(len(cum_sum)))
plt.figure(figsize=(20, 8))
plt.plot(comp, cum_sum, marker='o', markersize=10)
plt.xlabel('PCA Components')
plt.ylabel('Cumulative Explained Variance (%)')
plt.title('PCA')
plt.show()
# Summing the variance explained by the first 28 components
variance_explained_first_28 = sum(variance_explained[:28])
print("Variance explained by the first 28 components:", variance_explained_first_28)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
dataX = data.copy().drop(['Class'],axis=1)
dataY = data['Class'].copy()
featuresToScale = dataX.columns
sX = StandardScaler(copy=True)
dataX.loc[:,featuresToScale] = sX.fit_transform(dataX[featuresToScale])
X_train, X_test, y_train, y_test = \
train_test_split(dataX, dataY, test_size=0.33, \
random_state=2018, stratify=dataY)
def scatterPlot(xDF, yDF, algoName):
    # Plot the first two components, colored by class label
    tempDF = pd.DataFrame(data=xDF.loc[:, 0:1], index=xDF.index)
    tempDF = pd.concat((tempDF, yDF), axis=1, join="inner")
    tempDF.columns = ["First Vector", "Second Vector", "Label"]
    sns.lmplot(x="First Vector", y="Second Vector", hue="Label", data=tempDF, fit_reg=False, legend=False)
    ax = plt.gca()
    ax.set_title("Separation of Observations using " + algoName)
    ax.legend(loc="upper right")
X_train_PCA = pca.fit_transform(X_train)
X_train_PCA = pd.DataFrame(data=X_train_PCA, index=X_train.index)
X_train_PCA_inverse = pca.inverse_transform(X_train_PCA)
X_train_PCA_inverse = pd.DataFrame(data=X_train_PCA_inverse, \
index=X_train.index)
scatterPlot(X_train_PCA, y_train, "PCA")
# Applying PCA with 28 components
pca = PCA(n_components=28)  # retain the 28 components that explain most of the variance
X_pca = pca.fit_transform(X)
# Reconstructing the dataset
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_error = np.sum(np.square(X - X_reconstructed), axis=1)
# Visualizing the reconstruction error
plt.figure(figsize=(20, 8))
counts, bins, _ = plt.hist(reconstruction_error, bins=20, color='skyblue', edgecolor='black', alpha=0.7)
plt.xlabel('Reconstruction Error')
plt.ylabel('Frequency')
plt.title('Distribution of Reconstruction Error')
plt.grid(True)
# Annotate each bin with the count
for i in range(len(counts)):
    plt.text(bins[i], counts[i], str(int(counts[i])), ha='center', va='bottom', fontsize=18)
plt.show()
# Finding anomalies
threshold = np.percentile(reconstruction_error, 99.8) # Adjust percentile as needed
anomalies = X[reconstruction_error > threshold]
print("Number of anomalies:", len(anomalies))
print("Anomalies:")
print(anomalies)
# Identifying anomalies
anomalies_indices = np.where(reconstruction_error > threshold)[0]
anomalies_indices
# Check how many of the flagged points are actually fraudulent
normal = 0
fraud = 0
for i in anomalies_indices:
    if data.iloc[i]["Class"] == 0:
        normal = normal + 1
    else:
        fraud = fraud + 1
normal, fraud
Precision of our PCA-based detector:
# Precision = true positives / all flagged points
Precision = fraud / (normal + fraud)
Precision * 100
Percentage of fraud transactions detected:
# Recall: fraction of all fraudulent transactions that were detected
Fraud_detected = fraud / s.iloc[1]
Fraud_detected
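As a cross-check (not part of the original walkthrough), the same precision and recall can be reproduced with scikit-learn's metrics by turning the flagged indices into a prediction vector and comparing it with the true Class labels:
# Sketch: cross-check precision and recall with scikit-learn metrics
from sklearn.metrics import precision_score, recall_score

y_pred = np.zeros(len(data), dtype=int)
y_pred[anomalies_indices] = 1                      # points flagged by reconstruction error
print("Precision:", precision_score(data["Class"], y_pred))
print("Recall:", recall_score(data["Class"], y_pred))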
We have 284,807 data points in our dataset, of which 492 transactions are fraudulent; we consider these 492 transactions anomalous. Using Principal Component Analysis (PCA) and its reconstruction error, we flagged 570 records as anomalous. Of those 570 data points, 410 were actually fraudulent (true positives) and 160 were normal (false positives). On this highly imbalanced data, and using a purely unsupervised technique, we achieved a precision of about 71.9% and detected almost 83% of the fraudulent transactions.
PCA is more effective for local anomalies that exhibit linear relationships with the principal components of the data. It can be useful when anomalies are small deviations from the normal data’s distribution and are related to the underlying structure captured by PCA. It’s often used as a preprocessing step for anomaly detection when dealing with high-dimensional data.
For certain types of anomalies, such as those with non-linear relationships or when the anomalies are significantly different from the normal data, other techniques like isolation forests, one-class SVMs, or autoencoders might be more suitable.
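For comparison, one such alternative can be run on the same scaled features in just a few lines; the sketch below uses an Isolation Forest, and the contamination value is an illustrative assumption rather than a tuned parameter:
# Sketch: Isolation Forest on the same scaled features, for comparison
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.002, random_state=42)   # contamination is illustrative
iso_labels = iso.fit_predict(dataX)        # -1 marks predicted anomalies, 1 marks normal points
iso_anomalies = np.where(iso_labels == -1)[0]
print("Isolation Forest flagged:", len(iso_anomalies), "points")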
In summary, while PCA can be used for anomaly detection, it’s important to consider the characteristics of your data and the types of anomalies you are trying to detect. PCA might work well in some cases but might not be the best choice for all anomaly detection scenarios.
Q1. How does PCA help in anomaly detection?
Ans. PCA aids in anomaly detection by reducing the dimensionality of high-dimensional data while retaining most of its variance. This reduction simplifies the dataset’s representation and highlights the most significant features. Anomalies often manifest as deviations from the normal patterns captured by PCA, resulting in noticeable reconstruction errors when projecting data back to the original space.
Q2. What are the advantages of using PCA for anomaly detection?
Ans. PCA offers several advantages for anomaly detection. Firstly, it provides a compact representation of the data, making it easier to visualize and interpret anomalies. Secondly, PCA can capture complex relationships between variables, effectively identifying anomalies even in datasets with correlated features. PCA-based anomaly detection is also computationally efficient, making it suitable for analyzing large-scale datasets.
Q3. How are anomalies detected with PCA interpreted?
Ans. Anomalies detected using PCA are data points that exhibit significant reconstruction errors when projected back to the original feature space. These anomalies represent instances that deviate substantially from the normal patterns captured by PCA. Interpreting anomalies involves examining their characteristics and understanding the underlying reasons for their divergence from the norm. This process may involve domain knowledge and further investigation to determine whether anomalies are indicative of genuine outliers or errors in the data.
Q4. Can PCA be combined with other anomaly detection methods?
Ans. Yes, PCA can be combined with other anomaly detection methods, such as One-Class SVM or Isolation Forest, to enhance performance. PCA’s dimensionality reduction capabilities complement other techniques by improving feature selection, visualization, and computational efficiency. By reducing the dataset’s dimensionality, PCA simplifies the data representation and makes it easier for other anomaly detection algorithms to identify meaningful patterns and anomalies.
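A rough illustration of such a combination is sketched below, chaining PCA with a One-Class SVM on the scaled features from earlier; the n_components, nu, and subsample size are assumptions chosen only to keep the example small and fast:
# Sketch: PCA for dimensionality reduction followed by a One-Class SVM
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

combo = make_pipeline(PCA(n_components=10), OneClassSVM(nu=0.002))   # illustrative values
sample = dataX.sample(20000, random_state=0)    # subsample for speed; illustrative
combo_labels = combo.fit_predict(sample)        # -1 marks predicted anomalies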
Q5. How does PCA differ between unsupervised and supervised anomaly detection?
Ans. In unsupervised anomaly detection, PCA simplifies anomaly detection tasks by identifying anomalies without prior knowledge of their labels. However, it may overlook subtle anomalies that require labeled examples for training. In supervised anomaly detection, PCA can still be used for feature extraction, but its effectiveness depends on the availability and quality of labeled data. Additionally, class imbalance and data distribution may impact PCA’s performance differently in unsupervised versus supervised settings.
Q6. How does PCA help with anomaly detection on imbalanced datasets?
Ans. PCA helps in anomaly detection on imbalanced datasets by emphasizing variations that differentiate anomalies from normal instances. By reducing dimensionality and focusing on principal components capturing significant variations, PCA enhances sensitivity to subtle anomalies. This aids in detecting rare anomalies amidst a majority of normal instances, improving overall anomaly detection performance.