Principal Component Analysis in Machine Learning | PCA in ML

Premanand S Last Updated : 06 Nov, 2024
11 min read

PCA, or Principal Component Analysis, is a familiar term in machine learning, best known as a remedy for the curse of dimensionality. Beyond that core problem, this article covers several other important aspects of PCA. So let's start with the basics: alongside the key theory and a layman-friendly intuition, I've included my handwritten manual walkthrough of PCA and a Python approach.

Overview:

  • Learn about Principal Component Analysis (PCA) as a fundamental tool for dimensionality reduction in machine learning.
  • Understand how PCA tackles the curse of dimensionality by transforming correlated features into independent principal components.
  • Explore the step-by-step manual and Python-based approach for applying PCA to datasets.
  • Gain insights into the key advantages and limitations of PCA in real-time applications.
  • Discover the practical applications of PCA in fields like computer vision, bioinformatics, and data visualization.

This article was published as a part of the Data Science Blogathon.

What is Principal Component Analysis (PCA) in Machine Learning?

  • Principal Component Analysis is commonly abbreviated as PCA
  • PCA falls under the Unsupervised Machine Learning category
  • The main goal of PCA is to reduce the number of variables in a dataset while retaining as much information as possible. In machine learning, PCA is mainly used for dimensionality reduction and for selecting the most important features.
  • It transforms correlated features into independent (uncorrelated) principal components

Technically, PCA in machine learning explains the structure of variance and covariance in the data through a set of linear combinations of the original variables. It can also be used to analyse how the observations scatter, revealing distribution-related properties of the data.
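
To make this concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available; the two-feature matrix is randomly generated illustrative data, not something from this article. Two strongly correlated features go in, and the principal components that come out are uncorrelated:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)      # strongly correlated with x1
X = np.column_stack([x1, x2])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

print(np.corrcoef(X, rowvar=False).round(2))    # off-diagonal close to 1: correlated features
print(np.corrcoef(Z, rowvar=False).round(2))    # off-diagonal close to 0: independent components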


Why Do We Need PCA in Machine Learning?

Machine learning tends to work best when a model is trained on a large, well-organized dataset: more data usually means a more accurate predictive model. However, huge datasets come with drawbacks of their own, and the ultimate trap is the curse of dimensionality. Principal component analysis (PCA) is one of the standard techniques for handling it.

Despite the sound of it, the curse of dimensionality is not the title of an unreleased Harry Potter novel; it is what happens when your data has too many features and perhaps not enough data points. Dimensionality reduction is the way to escape it: we can reduce 50 variables to 40, 20, or even 10, and this is where its effects are strongest.

Working with high-dimensional data also tends to cause overfitting, which dimensionality reduction helps address. It increases interpretability while minimizing information loss, helps locate the most important features, and uncovers meaningful linear combinations of the original variables.

When should Principal Component Analysis be used in ML?

  • Whenever we need our features to be independent (uncorrelated) of each other
  • Whenever we need fewer features than the original high-dimensional set

How Dimensionality Reduction Works in a Real-Time Application

Assume a survey has 50 questions, and the following three are among them. Respondents rate each statement between 1 and 5:

  • I feel comfortable around people
  • I easily make friends
  • I like going out

These questions may look different at first, but there is a catch: generally speaking, they aren't. They all gauge how extroverted you are, so combining them makes sense, right? That is where linear algebra and dimensionality reduction methods come in. Since we have far too many variables that are not really distinct, we reduce their number to lessen the complexity of the problem. That is the main idea behind dimensionality reduction, and principal component analysis in machine learning happens to be one of the most straightforward and popular techniques in this field. As a general guideline, keep enough components to explain at least 70-80 percent of the variance; the sketch below shows how that guideline plays out on data like this.
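
Here is a small sketch of that idea using synthetic responses rather than a real survey (scikit-learn is assumed, and the three columns merely stand in for the three statements above): the three correlated ratings collapse into essentially one component, comfortably above the 70-80 percent guideline.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
extroversion = rng.normal(size=200)                    # hidden trait driving all three answers
ratings = np.column_stack([
    extroversion + 0.3 * rng.normal(size=200),         # "comfortable around people"
    extroversion + 0.3 * rng.normal(size=200),         # "easily make friends"
    extroversion + 0.3 * rng.normal(size=200),         # "like going out"
])

X = StandardScaler().fit_transform(ratings)
pca = PCA().fit(X)
print(pca.explained_variance_ratio_.cumsum())          # first component already exceeds 0.8

In other words, a single "extroversion" component summarizes all three questions.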

Intuition behind PCA

Let's play a small mind game. Consider the table below:

Person    Height (cm)
A         145
B         160
C         185

From the table above, we need to find the tallest person.


Just by looking, I can tell that person C is the tallest. Now let's change the scenario:

Person    Height (cm)
D         172
E         173
F         171

Can you guess who’s who? It’s tough when they are very similar in height.


Because their heights varied so much, we had no trouble telling the 185 cm person apart from the 160 cm and 145 cm ones. In the same way, data carries more information when its variance is larger, which is why PCA and maximum variance are so frequently mentioned together.
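
A quick numeric check of that intuition, using plain NumPy and the heights from the two tables above:

import numpy as np

group_1 = np.array([145, 160, 185])   # persons A, B, C
group_2 = np.array([172, 173, 171])   # persons D, E, F

print(np.var(group_1))   # ~272.2: large spread, easy to rank
print(np.var(group_2))   # ~0.67:  tiny spread, hard to rank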

Basic Terminologies of PCA in Machine Learning

Before getting into PCA in machine learning, we need to understand some basic terminology (a short NumPy sketch after the list ties these terms together):

  • Variance: measures how spread out the data is along each dimension of the feature space
  • Covariance: measures the dependency and relationship between pairs of features
  • Standardizing data: scaling the features to a common range (typically mean 0, variance 1) so that no single feature biases the output
  • Covariance matrix: collects the pairwise covariances between all features or variables, and is the starting point for identifying (and reducing) their interdependencies to improve performance
  • Eigenvalues and Eigenvectors: the eigenvectors of the covariance matrix point in the directions of largest variance and are used to compute the principal components, while each eigenvalue gives the magnitude of the variance along its eigenvector. In other words, the eigenvector fixes a direction that the transformation does not alter, and the eigenvalue tells how much the data is stretched or shrunk along that direction.

In the classic shear-mapping illustration, a blue arrow changes direction while a pink arrow does not; the pink arrow is an eigenvector precisely because its orientation is preserved, and since its length is also unchanged its eigenvalue is 1. Technically, a principal component (PC) is a straight line that captures the data's maximum variance (information); it has both a direction and a magnitude, and successive PCs are perpendicular to each other.

  • Dimensionality Reduction: project the data onto the selected eigenvectors (multiply the standardized data by the chosen feature-vector matrix), reducing the number of features while retaining as much information as possible.
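
Here is the promised NumPy sketch that ties these terms together on two made-up, correlated features; the numbers are purely illustrative:

import numpy as np

rng = np.random.RandomState(0)
height = rng.normal(170, 10, size=100)
weight = 0.5 * height + rng.normal(0, 5, size=100)    # correlated with height
X = np.column_stack([height, weight])

print(np.var(X, axis=0))                              # variance: spread of each feature
print(np.cov(X, rowvar=False))                        # covariance matrix: dependencies

X_std = (X - X.mean(axis=0)) / X.std(axis=0)          # standardizing: mean 0, variance 1
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
print(eigvals)                                        # eigenvalues: variance along each direction
print(eigvecs)                                        # eigenvectors: the directions themselves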

How Does PCA Work?

The steps involved in PCA are as follows:

  • Start with the original data
  • Normalize the original data (mean = 0, variance = 1)
  • Calculate the covariance matrix
  • Calculate the eigenvalues, eigenvectors, and normalized eigenvectors
  • Calculate the Principal Components (PCs)
  • Plot the graph to check the orthogonality between PCs

I have worked through these steps manually; the value of hand-written notes is that they reveal the crux behind the coding concepts.


We calculate the means and then the covariance matrix between features.

[Handwritten notes: calculating the covariance matrix]

After finding the covariance matrix, we calculate the eigenvalues, eigenvectors, and normalized eigenvectors.

[Handwritten notes: calculating the eigenvalues]

The steps involved in computing the eigenvalues and eigenvectors in the manual approach:

[Handwritten notes: steps for the eigenvalues and eigenvectors]

From this, we are going to calculate the PCs.


Then we calculate the normalized eigenvectors.


Hence, PCA has been calculated, and plotting the PCs shows that they are orthogonal to each other.
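
For readers who prefer code to handwriting, the same sequence of steps can be sketched in NumPy and cross-checked against scikit-learn. The small two-feature matrix below is illustrative and is not the data from the notes:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)       # normalize: mean 0, variance 1
cov = np.cov(X_std, rowvar=False)                  # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues and unit eigenvectors
order = np.argsort(eigvals)[::-1]                  # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pcs = X_std @ eigvecs                              # principal components (projections)

print(eigvals)                                              # variance captured by each PC
print(PCA(n_components=2).fit(X_std).explained_variance_)   # scikit-learn gives the same values
print(np.round(pcs.T @ pcs, 4))                             # off-diagonal ~ 0: PCs are orthogonal

Both routes give the same variances; scikit-learn simply automates the linear algebra.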

How Many Principal Components are Needed for Any Data?

The principal components that capture the maximum variance (information) are the ones worth keeping.

Eigenvalues tell us how much variance each principal component carries, so they are used to decide which components, and how many of them, to retain, as the short sketch below shows.

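As a rough sketch of that selection rule, using scikit-learn's built-in iris dataset purely for illustration, we can keep the smallest number of components whose cumulative explained variance reaches a chosen threshold, say 80 percent:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = np.argmax(cumulative >= 0.80) + 1        # first component count reaching the threshold

print(cumulative)    # roughly [0.73, 0.96, 0.99, 1.00]
print(n_keep)        # 2 components already retain at least 80% of the variance

scikit-learn can also do this directly by passing a float to n_components, e.g. PCA(n_components=0.80), as the breast cancer example later in this article does with 0.85.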

Advantages of PCA in ML

  • Used for dimensionality reduction
  • PCA helps eliminate correlated features, a problem otherwise known as multicollinearity
  • Training time is substantially shorter because PCA reduces the number of features
  • PCA helps overcome overfitting by eliminating extraneous features from the dataset

Disadvantages of PCA in Machine Learning

  • Useful for quantitative (numeric) data, but not effective with qualitative (categorical) data
  • Principal components are difficult to interpret in terms of the original features

Applications of Principal Component Analysis (PCA)

  • Computer vision
  • Bioinformatics applications
  • Image compression and resizing
  • Discovering patterns in high-dimensional data
  • Dimensionality reduction
  • Visualization of multidimensional data

Python Code for Principal Component Analysis in Machine Learning

Before working with any dataset, let’s try it with some randomly generated data:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# generate 200 correlated two-dimensional points
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_)
print(pca.explained_variance_)

def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');


These vectors represent the data’s principal axes, and the vector’s length indicates how “important” that axis is in describing the data distribution. More precisely, it is a measure of the variance of the data when projected onto that axis. The projection of each data point onto the principal axes is the “principal component” of the data.

Plotting these principal components beside the original data, as the code above does, shows how they align with the directions of greatest spread. Using PCA for dimensionality reduction means keeping only the most important of these components:

pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)

The transformed data now has a single dimension. To visualize the effect of this dimensionality reduction, we can run the inverse transform on the reduced data and display it alongside the original data:

X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal');

The light points are the original data, while the dark points are their projection. This makes clear what dimensionality reduction with PCA means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the largest variance. The proportion of variance that is cut out is roughly a measure of how much "information" is discarded in this reduction of dimensionality.
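
Continuing the random-data example, the retained and discarded variance can be read off directly from explained_variance_ratio_ (a small addition that assumes X and the PCA import from the snippets above are still in scope):

pca = PCA(n_components=1)
pca.fit(X)
retained = pca.explained_variance_ratio_.sum()
print("variance retained:  {:.1f}%".format(100 * retained))         # share of variance kept by PC1
print("variance discarded: {:.1f}%".format(100 * (1 - retained)))   # share of variance thrown away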

For a better understanding, let's now work with scikit-learn's built-in breast cancer dataset.

from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()

print(breast_cancer.feature_names)
print(len(breast_cancer.feature_names))

import numpy as np
print(breast_cancer.target) 
print(breast_cancer.target_names) 
print(np.array(np.unique(breast_cancer.target, return_counts=True)))

 

import numpy as np
import matplotlib.pyplot as plt
_, axes = plt.subplots(6,5, figsize=(15, 15))
malignant = breast_cancer.data[breast_cancer.target==0]
benign = breast_cancer.data[breast_cancer.target==1]
ax = axes.ravel()                     # flatten the 2D array
for i in range(30):                   # for each of the 30 features
    bins = 40

    #---plot histogram for each feature---
    ax[i].hist(malignant[:,i], bins=bins, color='r', alpha=.5)
    ax[i].hist(benign[:,i], bins=bins, color='b', alpha=0.3)
    #---set the title---
    ax[i].set_title(breast_cancer.feature_names[i], fontsize=12)

    #---display the legend---
    ax[i].legend(['malignant','benign'], loc='best', fontsize=8)
plt.tight_layout()
plt.show()
import pandas as pd
df = pd.DataFrame(breast_cancer.data, 
                  columns = breast_cancer.feature_names)
df['diagnosis'] = breast_cancer.target
df

#Training the Model using all the Features
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = df.iloc[:,:-1]      
y = df.iloc[:,-1]
#---perform a split---
random_state = 12
X_train, X_test, y_train, y_test = \
    train_test_split(X, y,
                     test_size = 0.3,
                     shuffle = True,
                     random_state=random_state)

#---train the model using Logistic Regression---
log_reg = LogisticRegression(max_iter = 5000)
log_reg.fit(X_train, y_train)
#---evaluate the model---
log_reg.score(X_test,y_test)

#Training the Model using Reduced Features
df_corr = df.corr()['diagnosis'].abs().sort_values(ascending=False)
df_corr
# get all the features that has at least 0.6 in correlation to the 
# target
features = df_corr[df_corr > 0.6].index.to_list()[1:]
features                          # without the 'diagnosis' column

#Checking for MultiCollinearity
import pandas as pd
from sklearn.linear_model import LinearRegression
def calculate_vif(df, features):    
    vif, tolerance = {}, {}
    # all the features that you want to examine
    for feature in features:
        # extract all the other features you will regress against
        X = [f for f in features if f != feature]        
        X, y = df[X], df[feature]

        # extract r-squared from the fit
        r2 = LinearRegression().fit(X, y).score(X, y)                
        # calculate tolerance
        tolerance[feature] = 1 - r2
        # calculate VIF
        vif[feature] = 1/(tolerance[feature])

    # return VIF DataFrame
    return pd.DataFrame({'VIF': vif, 'Tolerance': tolerance})
calculate_vif(df,features)
# try to reduce those feature that has high VIF until each feature 
# has VIF less than 5
features = [
    'worst concave points',
    'mean radius',
    'mean concavity',
]
calculate_vif(df,features)
#Training the Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = df.loc[:,features]            # get the reduced features in the 
                                  # dataframe
y = df.loc[:,'diagnosis']

# perform a split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y,
                     test_size = 0.3,
                     shuffle = True,
                     random_state=random_state)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_test,y_test)

#Training the Model using Reduced Features (PCA)
#Performing Standard Scaling
from sklearn.preprocessing import StandardScaler
# get the features and label from the original dataframe
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# performing standardization
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

#Applying Principal Component Analysis (PCA)

from sklearn.decomposition import PCA
components = None
pca = PCA(n_components = components)
# perform PCA on the scaled data
pca.fit(X_scaled)

# print the explained variances
print("Variances (Percentage):")
print(pca.explained_variance_ratio_ * 100)
print()
print("Cumulative Variances (Percentage):")
print(pca.explained_variance_ratio_.cumsum() * 100)
print()

# plot a scree plot
components = (len(pca.explained_variance_ratio_)
              if components is None else components)
plt.plot(range(1,components+1), 
         np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of components")
plt.ylabel("Explained variance (%)")
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.85)
pca.fit(X_scaled)
print("Cumulative Variances (Percentage):")
print(np.cumsum(pca.explained_variance_ratio_ * 100))
components = len(pca.explained_variance_ratio_)
print(f'Number of components: {components}')

# Make the scree plot
plt.plot(range(1, components + 1), np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of components")
plt.ylabel("Explained variance (%)")
pca_components = abs(pca.components_)
print(pca_components)

print('Top 4 most important features in each component')
print('===============================================')
for row in range(pca_components.shape[0]):
    # get the indices of the top 4 values in each row
    temp = np.argpartition(-(pca_components[row]), 4)
    # sort the indices in descending order
    indices = temp[np.argsort((-pca_components[row])[temp])][:4]
    # print the top 4 feature names
    print(f'Component {row}: {df.columns[indices].to_list()}')

#Transforming all the 30 Columns to the 6 Principal Components
X_pca = pca.transform(X_scaled)
print(X_pca.shape)
print(X_pca)

#Creating a Machine Learning Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
_sc = StandardScaler()
_pca = PCA(n_components = components)
_model = LogisticRegression()
log_regress_model = Pipeline([
    ('std_scaler', _sc),
    ('pca', _pca),
    ('regressor', _model)
])

# perform a split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y,
                     test_size=0.3,
                     shuffle=True,
                     random_state=random_state)

# train the model using the PCA components
log_regress_model.fit(X_train,y_train)
log_regress_model.score(X_test,y_test)

Conclusion

I hope learners now have a working understanding of Principal Component Analysis (PCA), one of the most important techniques in unsupervised machine learning. We use the PCA algorithm in machine learning not only for dimensionality reduction but also to identify key features and address multicollinearity. The material covered here is important and directly useful for the projects ahead, but there is still much more to learn, and I'll cover more in upcoming columns. Neither coding nor theory alone makes these concepts easy to grasp; you need both.

Frequently Asked Questions

Q1. What is PCA used for?

A. Principal Component Analysis (PCA) reduces the dimensionality of data while preserving as much variability as possible. It helps simplify data, making it easier to visualize and analyze.

Q2. Why is the PCA algorithm used?

A. The PCA algorithm is used to:
1. Reduce the number of variables in a dataset (dimensionality reduction).
2. Remove noise and redundant information.
3. Identify patterns and simplify the complexity of high-dimensional data.
4. Facilitate visualization of high-dimensional data in 2D or 3D.

Q3. Is PCA supervised or unsupervised?

A. PCA is an unsupervised learning algorithm. It does not require labeled data and focuses on identifying patterns based on the data’s inherent structure.

Q4. What is the goal of PCA?

A. The goal of PCA is to transform a large set of variables into a smaller one that still contains most of the information in the large set. This is achieved by identifying the principal components and the directions (axes) in which the data varies the most.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Premanand S is a dedicated academic with over a decade of research experience, specializing in Bio-signal Processing, Machine Learning, and Deep Learning. He completed his B.Tech in 2009 from Amrita Vishwa Vidyapeetham, Bangalore, and his M.E. in 2011 from Rajalakshmi Engineering College, Chennai, where his thesis focused on Deep Learning for ECG Signal Processing.

He is pursuing his Ph.D. at VIT-Chennai, with a tentative research title of "Deep Learning Approaches for Enhanced ECG Signal Processing and Arrhythmia Classification." His research aims to leverage cutting-edge deep learning techniques to improve the accuracy and efficiency of ECG signal analysis, contributing significantly to cardiac health monitoring.

A recipient of the prestigious TCS-RSP (Research Scholarship) in 2014, Cycle 9, Premanand has become a recognized figure in the academic community. He has delivered several invited talks on Data Science, Machine Learning, and Deep Learning at prominent institutions across India.

In his role as an Assistant Professor at VIT-Chennai, he continues to inspire the next generation of researchers while advancing the boundaries of knowledge in his field.

Responses From Readers


Prabakaran Kandasamy

nice to see your handwritten notes... its is very easy understandable....thanks...i am searching machine learning job...so its helpful lot

College Brawl

Great breakdown of Principal Component Analysis (PCA) in machine learning! As a data scientist, I often find myself struggling to explain PCA to non-technical colleagues, and this post did a great job of simplifying the concept. The examples and illustrations were particularly helpful in illustrating how PCA can be used to visualize and reduce the dimensionality of complex datasets. Thanks for sharing!

Praveencse

Explained very well! Thanks
