Principal Component Analysis (PCA) is one of the best-known techniques in machine learning, most notably used to tackle the curse of dimensionality. Beyond that core problem, this article covers several other important aspects of PCA. So, let's start with the basics: I've included a layman's explanation, some key theory, my handwritten manual walkthrough of PCA, and a Python implementation.
Overview:
This article was published as a part of the Data Science Blogathon.
Technically, PCA in machine learning explains the structure of the variance and covariance in a dataset through a set of linear combinations of the original variables. It can also be used to analyse how the observations are scattered, revealing properties of the underlying distribution.
Machine learning generally excels when a model is trained on a large, well-organized dataset: having more data to train on typically lets us build a more accurate prediction model. However, working with a huge data collection has its drawbacks, and the ultimate trap is the curse of dimensionality. Principal component analysis (PCA) is one of the techniques used to handle it.
The curse of dimensionality is not the title of an unreleased Harry Potter novel; it refers to what happens when your data has too many features and perhaps not enough data points. The way out is dimensionality reduction: reducing, say, 50 variables to 40, 20, or even 10 is where we see its strongest effects.
Working with high-dimensional data also causes overfitting, and dimensionality reduction helps address that. It increases interpretability while minimizing information loss, helps locate the important characteristics, and aids in discovering useful linear combinations of the original variables.
Assume a survey contains 50 questions, three of which ask you to rate a statement between 1 and 5.
These questions may look different, but there's a catch: generally speaking, they aren't. They all gauge how extroverted you are, so combining them makes sense, right? That's where linear algebra and dimensionality reduction techniques come in. Since we have far too many variables that aren't really that different, we want to reduce the complexity of the problem by reducing the number of variables. That is the main idea behind dimensionality reduction, and principal component analysis in machine learning happens to be one of the most straightforward and popular techniques in this field. As a general guideline, keep enough components to retain at least 70–80 percent of the explained variance. The small sketch below shows this idea on made-up survey data.
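Here is a minimal sketch of that intuition, assuming nothing about the real survey: three hypothetical question columns are all driven by one latent "extroversion" trait, and PCA finds that a single component explains almost all of their variance.
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
# hypothetical responses: one latent "extroversion" trait drives all three questions
extroversion = rng.normal(size=500)
q1 = extroversion + 0.2 * rng.normal(size=500)
q2 = extroversion + 0.2 * rng.normal(size=500)
q3 = extroversion + 0.2 * rng.normal(size=500)
X_survey = np.column_stack([q1, q2, q3])
pca_survey = PCA()
pca_survey.fit(X_survey)
print(pca_survey.explained_variance_ratio_)
# the first component captures well over 90% of the variance here,
# so one combined score can stand in for all three questions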
Let's assume we are playing a little mind game here:

| Person | Height (cm) |
|--------|-------------|
| A      | 145         |
| B      | 160         |
| C      | 185         |
From the table above, we need to find the tallest person. That's easy: person C, at 185 cm, is clearly the tallest. Now change the scenario:
| Person | Height (cm) |
|--------|-------------|
| D      | 172         |
| E      | 173         |
| F      | 171         |
Can you tell who is the tallest now? It's much harder when the heights are this similar. Because their heights varied so much, we previously had no trouble telling the 185 cm person apart from the 160 cm and 145 cm ones. In the same way, data carries more information when its variance is larger, which is why PCA and "maximum variance" are so often mentioned together.
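As a quick numeric check of that intuition, here is a tiny sketch that computes the variance of the two groups of heights from the tables above:
import numpy as np
group_1 = np.array([145, 160, 185])  # heights of A, B and C
group_2 = np.array([172, 173, 171])  # heights of D, E and F
print(np.var(group_1))  # large variance: easy to tell the people apart
print(np.var(group_2))  # tiny variance: the heights carry little distinguishing information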
Before getting into PCA in machine learning, we need to understand some basic terminology.
In the shear-mapping example, the blue arrow changes direction, whereas the pink arrow does not. The pink arrow is therefore an eigenvector, because its orientation stays constant; since its length is also unchanged, its eigenvalue is 1. Technically, a principal component (PC) is a straight line that captures the maximum variance (information) in the data. A PC has both direction and magnitude, and the PCs are perpendicular to each other.
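A minimal numeric sketch of the same idea: for a simple shear matrix, the horizontal direction is an eigenvector with eigenvalue 1, so a vector pointing that way is neither rotated nor stretched.
import numpy as np
S = np.array([[1.0, 1.0],
              [0.0, 1.0]])  # a horizontal shear mapping
eigenvalues, eigenvectors = np.linalg.eig(S)
print(eigenvalues)   # both eigenvalues equal 1 for this shear
print(eigenvectors)  # the direction [1, 0] is an eigenvector
# applying the shear to [1, 0] leaves it unchanged, i.e. S @ v == 1 * v
v = np.array([1.0, 0.0])
print(S @ v)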
The steps involved in PCA are as follows. I have worked through them by hand, because handwritten notes are great for getting at the crux behind the code:
We calculate the means and then the covariance matrix between features.
After finding the covariance matrix, we calculate its eigenvalues, eigenvectors, and normalized eigenvectors. The manual steps for the eigenvalues and eigenvectors are shown in the handwritten notes.
From this, we are going to calculate the PCs.
We are going to calculate the normalized eigenvector
Hence, PCA is calculated, and visually, we can see how PCs are orthogonal to each other.
The principal component with the maximum variance carries the most information and is the one worth keeping. The eigenvalues tell us which component that is: the larger the eigenvalue, the larger the variance along that component.
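The same manual steps can be verified in a few lines of NumPy. This is only a sketch on an arbitrary toy matrix (not the exact numbers from my handwritten notes): centre the data, compute the covariance matrix, take its eigenvalues and eigenvectors, and project the data onto the components with the largest eigenvalues.
import numpy as np
# toy data: 5 samples, 2 features (arbitrary numbers, purely for illustration)
X_toy = np.array([[2.5, 2.4],
                  [0.5, 0.7],
                  [2.2, 2.9],
                  [1.9, 2.2],
                  [3.1, 3.0]])
# step 1: centre each feature around its mean
X_centered = X_toy - X_toy.mean(axis=0)
# step 2: covariance matrix between the features
cov = np.cov(X_centered, rowvar=False)
# step 3: eigenvalues and (already normalized) eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# step 4: sort in descending order so the first PC has the maximum variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# step 5: project the centred data onto the principal components
pcs = X_centered @ eigenvectors
print(eigenvalues)  # variance captured along each PC
print(pcs)          # the data expressed in the new (orthogonal) axes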
Before working with any dataset, let’s try it with some randomly generated data:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_)
print(pca.explained_variance_)
def draw_vector(v0, v1, ax=None):
    # draw an arrow from point v0 to point v1 on the given axes
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->',
                      linewidth=2,
                      shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
# draw each principal axis, scaled by the spread of the data along it
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');
These vectors represent the data’s principal axes, and the vector’s length indicates how “important” that axis is in describing the data distribution. More precisely, it is a measure of the variance of the data when projected onto that axis. The projection of each data point onto the principal axes is the “principal component” of the data.
If we plot these principal components beside the original data, we see the plots shown here:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape)
print("transformed shape:", X_pca.shape)
The transformed data has been reduced to a single dimension. To visualize the impact of this dimensionality reduction, we can run the inverse transform on the reduced data and plot it next to the original data:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal');
The light points are the original data, while the dark points are their projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance. The fraction of variance that is cut out is a rough measure of how much "information" is discarded in this reduction of dimensionality.
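For instance, with the one-component PCA fitted above we can check exactly what fraction of the variance was kept (a short sketch reusing the pca object from the previous snippet):
# fraction of the total variance captured by the single retained component;
# for this synthetic dataset it should come out close to 1, so very little is lost
print(pca.explained_variance_ratio_)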
For a better understanding, let's now work with a real dataset: scikit-learn's built-in breast cancer dataset.
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
print(breast_cancer.feature_names)
print(len(breast_cancer.feature_names))
import numpy as np
print(breast_cancer.target)
print(breast_cancer.target_names)
print(np.array(np.unique(breast_cancer.target, return_counts=True)))
import numpy as np
import matplotlib.pyplot as plt
_, axes = plt.subplots(6,5, figsize=(15, 15))
malignant = breast_cancer.data[breast_cancer.target==0]
benign = breast_cancer.data[breast_cancer.target==1]
ax = axes.ravel() # flatten the 2D array
for i in range(30):  # for each of the 30 features
    bins = 40
    #---plot histogram for each feature---
    ax[i].hist(malignant[:, i], bins=bins, color='r', alpha=.5)
    ax[i].hist(benign[:, i], bins=bins, color='b', alpha=0.3)
    #---set the title---
    ax[i].set_title(breast_cancer.feature_names[i], fontsize=12)
    #---display the legend---
    ax[i].legend(['malignant', 'benign'], loc='best', fontsize=8)
plt.tight_layout()
plt.show()
import pandas as pd
df = pd.DataFrame(breast_cancer.data,
columns = breast_cancer.feature_names)
df['diagnosis'] = breast_cancer.target
df
#Training the Model using all the Features
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
#---perform a split---
random_state = 12
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    shuffle=True,
                                                    random_state=random_state)
#---train the model using Logistic Regression---
log_reg = LogisticRegression(max_iter = 5000)
log_reg.fit(X_train, y_train)
#---evaluate the model---
log_reg.score(X_test,y_test)
#Training the Model using Reduced Features
df_corr = df.corr()['diagnosis'].abs().sort_values(ascending=False)
df_corr
# get all the features that has at least 0.6 in correlation to the
# target
features = df_corr[df_corr > 0.6].index.to_list()[1:]
features # without the 'diagnosis' column
#Checking for MultiCollinearity
import pandas as pd
from sklearn.linear_model import LinearRegression
def calculate_vif(df, features):
    vif, tolerance = {}, {}
    # all the features that you want to examine
    for feature in features:
        # extract all the other features you will regress against
        X = [f for f in features if f != feature]
        X, y = df[X], df[feature]
        # extract r-squared from the fit
        r2 = LinearRegression().fit(X, y).score(X, y)
        # calculate tolerance
        tolerance[feature] = 1 - r2
        # calculate VIF
        vif[feature] = 1 / tolerance[feature]
    # return VIF DataFrame
    return pd.DataFrame({'VIF': vif, 'Tolerance': tolerance})
calculate_vif(df,features)
# try to reduce those feature that has high VIF until each feature
# has VIF less than 5
features = [
'worst concave points',
'mean radius',
'mean concavity',
]
calculate_vif(df,features)
#Training the Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = df.loc[:, features]  # get the reduced features in the dataframe
y = df.loc[:,'diagnosis']
# perform a split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    shuffle=True,
                                                    random_state=random_state)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_test,y_test)
#Training the Model using Reduced Features (PCA)
#Performing Standard Scaling
from sklearn.preprocessing import StandardScaler
# get the features and label from the original dataframe
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# performing standardization
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
#Applying Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
components = None
pca = PCA(n_components = components)
# perform PCA on the scaled data
pca.fit(X_scaled)
# print the explained variances
print("Variances (Percentage):")
print(pca.explained_variance_ratio_ * 100)
print()
print("Cumulative Variances (Percentage):")
print(pca.explained_variance_ratio_.cumsum() * 100)
print()
# plot a scree plot
components = len(pca.explained_variance_ratio_) if components is None else components
plt.plot(range(1, components + 1),
         np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of components")
plt.ylabel("Explained variance (%)")
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.85)
pca.fit(X_scaled)
print("Cumulative Variances (Percentage):")
print(np.cumsum(pca.explained_variance_ratio_ * 100))
components = len(pca.explained_variance_ratio_)
print(f'Number of components: {components}')
# Make the scree plot
plt.plot(range(1, components + 1), np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of components")
plt.ylabel("Explained variance (%)")
pca_components = abs(pca.components_)
print(pca_components)
print('Top 4 most important features in each component')
print('===============================================')
for row in range(pca_components.shape[0]):
    # get the indices of the top 4 values in each row
    temp = np.argpartition(-(pca_components[row]), 4)
    # sort the indices in descending order
    indices = temp[np.argsort((-pca_components[row])[temp])][:4]
    # print the top 4 feature names
    print(f'Component {row}: {df.columns[indices].to_list()}')
#Transforming all the 30 Columns to the 6 Principal Components
X_pca = pca.transform(X_scaled)
print(X_pca.shape)
print(X_pca)
#Creating a Machine Learning Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
_sc = StandardScaler()
_pca = PCA(n_components = components)
_model = LogisticRegression()
log_regress_model = Pipeline([
('std_scaler', _sc),
('pca', _pca),
('regressor', _model)
])
# perform a split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    shuffle=True,
                                                    random_state=random_state)
# train the model using the PCA components
log_regress_model.fit(X_train,y_train)
log_regress_model.score(X_test,y_test)
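Once fitted, the whole pipeline (scaler, PCA, and logistic regression) behaves like a single estimator, so new samples are scaled and projected automatically. Here is a small sketch of how it could be used, reusing the X_test split from above:
# predict diagnoses for the held-out test set; the pipeline applies
# standardization and PCA before the logistic regression step
y_pred = log_regress_model.predict(X_test)
print(y_pred[:10])
# number of principal components the fitted PCA step actually kept
print(log_regress_model.named_steps['pca'].n_components_)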
I hope learners now have a solid understanding of Principal Component Analysis (PCA), one of the most important techniques in unsupervised machine learning. We use the PCA algorithm not only for dimensionality reduction but also to identify key characteristics and to address multicollinearity. The material covered here is important and useful for real projects, but there is still much more to learn, and I'll share more in upcoming articles. Neither the code nor the theory alone is enough; it is the combination of both that makes these problems easier to understand.
Frequently Asked Questions
Q1. What is Principal Component Analysis (PCA)?
A. Principal Component Analysis (PCA) reduces the dimensionality of data while preserving as much variability as possible. It helps simplify data, making it easier to visualize and analyze.
Q2. What is the PCA algorithm used for?
A. The PCA algorithm is used to:
1. Reduce the number of variables in a dataset (dimensionality reduction).
2. Remove noise and redundant information.
3. Identify patterns and simplify the complexity of high-dimensional data.
4. Facilitate visualization of high-dimensional data in 2D or 3D.
Q3. Is PCA a supervised or unsupervised algorithm?
A. PCA is an unsupervised learning algorithm. It does not require labeled data and focuses on identifying patterns based on the data’s inherent structure.
Q4. What is the goal of PCA?
A. The goal of PCA is to transform a large set of variables into a smaller one that still contains most of the information in the large set. This is achieved by identifying the principal components, the directions (axes) in which the data varies the most.