Clustering is an unsupervised machine-learning technique. It is the process of division of the dataset into groups in which the members in the same group possess similarities in features. The commonly used clustering techniques are K-Means clustering, Hierarchical clustering, Density-based clustering, Model-based clustering, etc. It can even handle large datasets. We can implement the K-Means clustering machine learning algorithm in the elbow method using the scikit-learn library in Python.
In this article, you will learn about the elbow method in machine learning, specifically how it applies to k-means clustering. We will explore the elbow method formula, demonstrating its use in both Python and R. Additionally, you will see an elbow method example that illustrates how to identify the optimal number of clusters for your data analysis.
This article was published as a part of the Data Science Blogathon.
Quiz Time
Are you up for a challenge in the realm of the Elbow Method for Finding the Optimal Number of Clusters in K-Means? Challenge yourself here!
The Elbow Method is a technique used in data analysis and machine learning for determining the optimal number of clusters in a dataset. It involves plotting the variance explained by different numbers of clusters and identifying the “elbow” point, where the rate of variance decreases sharply levels off, suggesting an appropriate cluster count for analysis or model training.
This method is a visual technique used to determine the best K value for a k-means clustering algorithm. In this method, a graph known as the elbow graph plots the within-cluster-sum-of-square (WCSS) values against various K values. The optimal K value is identified at the point where the graph bends like an elbow.
The elbow method is a graphical representation of finding the optimal ‘K’ in a K-means clustering. It works by finding WCSS (Within-Cluster Sum of Square) i.e. the sum of the square distance between points in a cluster and the cluster centroid.
Let’s go through the steps involved in K-means clustering for a better understanding:
Finding the optimal number of clusters is an important part of this algorithm. A commonly used method for finding the optimum K value is Elbow Method.
In the Elbow method, we are actually varying the number of clusters (K) from 1 – 10. For each value of K, we are calculating WCSS (Within-Cluster Sum of Square). WCSS is the sum of the squared distance between each point and the centroid in a cluster. When we plot the WCSS with the K value, the plot looks like an Elbow. As the number of clusters increases, the WCSS value will start to decrease. WCSS value is largest when K = 1. When we analyze the graph, we can see that the graph will rapidly change at a point and thus creating an elbow shape. From this point, the graph moves almost parallel to the X-axis. The K value corresponding to this point is the optimal value of K or an optimal number of clusters.
Now let’s implement K-Means clustering using Python.
The elbow method is a common technique used to determine the optimal number of clusters (k) in k-means clustering. It’s a graphical approach that relies on the idea that as you increase the number of clusters, the sum of squared distances between points and their cluster centers (WCSS) will continue to decrease. This is because you’re essentially splitting the data into increasingly finer groups.
Here’s how it works:
Here are some things to keep in mind about the elbow method:
tunesharemore_vert
The dataset we are using here is the Mall Customers data (Download here). It’s unlabeled data that contains the details of customers in a mall (features like genre, age, annual income(k$), and spending score). Our aim is to cluster the customers based on the relevant features of annual income and spending score.
First of all, we have to import the essential libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
Now let’s import the given dataset and slice the important features.
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values
We have to find the optimal K value for clustering the data. Now we are using the Elbow Method to find the optimal K value.
from sklearn.cluster import KMeans
wcss = [] for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
The “init” argument is the method for initializing the centroid. We calculated the WCSS value for each K value. Now we have to plot the WCSS with the K value.
The graph will be like this:
The point at which the elbow shape is created is 5; that is, our K value or an optimal number of clusters is 5. Now let’s train the model on the input data with a number of clusters 5.
kmeans = KMeans(n_clusters = 5, init = "k-means++", random_state = 42)
y_kmeans = kmeans.fit_predict(X)
y_kmeans will be:
array([3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0,
3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 1,
3, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 2, 1, 2, 4, 2, 4, 2,
1, 2, 4, 2, 4, 2, 4, 2, 4, 2, 1, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2])
y_kmeans gives us different clusters corresponding to X. Now, let’s plot all the clusters using matplotlib.
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 60, c = 'red', label = 'Cluster1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 60, c = 'blue', label = 'Cluster2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 60, c = 'green', label = 'Cluster3)
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 60, c = 'violet', label = 'Cluster4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 60, c = 'yellow', label = 'Cluster5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'black', label = 'Centroids')
plt.xlabel('Annual Income (k$)') plt.ylabel('Spending Score (1-100)') plt.legend()
plt.show()
Graph:
Now we will visualize the clusters using the scatter plot. As you can see, there are 5 clusters in total that are visualized in different colors, and the centroid of each cluster is visualized in black color.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd # Importing the dataset
X = dataset.iloc[:, [3, 4]].values
dataset = pd.read_csv('Mall_Customers.csv')
from sklearn.cluster import KMeans
# Using the elbow method to find the optimal number of clusters wcss = [] for i in range(1, 11):
wcss.append(kmeans.inertia_)
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42) kmeans.fit(X) plt.plot(range(1, 11), wcss) plt.xlabel('Number of clusters')
y_kmeans = kmeans.fit_predict(X)
plt.ylabel('WCSS') plt.show() # Training the K-Means model on the dataset kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42) y_kmeans = kmeans.fit_predict(X)
# Visualising the clusters
plt.scatter( X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 60, c = 'blue', label = 'Cluster2')
plt.scatter( X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 60, c = 'red', label = 'Cluster1') plt.scatter( X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 60, c = 'green', label = 'Cluster3')
plt.scatter( kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'black', label = 'Centroids')
plt.scatter( X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 60, c = 'violet', label = 'Cluster4') plt.scatter( X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 60, c = 'yellow', label = 'Cluster5') plt.xlabel('Annual Income (k$)') plt.ylabel('Spending Score (1-100)') plt.legend()
plt.show()
The elbow method, while a useful tool for determining the optimal number of clusters in K-means clustering, has some drawbacks:
Despite these drawbacks, the elbow method remains a valuable starting point for selecting the number of clusters, and it often provides useful insights into the data’s underlying structure. However, it’s essential to complement it with other validation techniques when working with more complex datasets or different clustering algorithms.
In this article, we covered the basic concepts of the K-Means Clustering algorithm in Machine Learning. We used the Elbow method to find the optimal K value for clustering the data in our sample data set. We then used the matplotlib Python library to visualize the clusters as a scatterplot graph. In the upcoming articles, we can learn more about different ML Algorithms.
Key Takeaways
A. The elbow method is a technique used in clustering analysis to determine the optimal number of clusters. It involves plotting the within-cluster sum of squares (WCSS) for different cluster numbers and identifying the “elbow” point where WCSS starts to level off.
A. The elbow method is a clustering validation technique. It helps find the optimal number of clusters by plotting WCSS for a range of cluster counts and selecting the point where the plot forms an “elbow” or a significant change in the rate of decrease.
A. The elbow method finds the best number of clusters by plotting the variance explained against the number of clusters. The elbow point, where the slope changes, indicates the optimal number.
A. It’s a valuable technique in machine learning for selecting the appropriate number of clusters in various applications.
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.