With the rising use of the Internet in today’s society, the quantity of data created is enormous. Even though each individual piece of data may be simple, the sheer amount of data to be analyzed makes processing difficult even for computers.
To manage such workloads, we need tools built for analyzing large data sets. Data mining methods, in conjunction with machine learning algorithms, enable us to analyze large data sets in an intelligible manner. k-means is a data clustering technique used in unsupervised machine learning. It groups unlabeled data into a predetermined number of clusters, k.
This article was published as a part of the Data Science Blogathon.
The K-means clustering algorithm computes centroids and repeats the process until the centroids stop changing. The number of clusters is assumed to be known in advance. It is also known as a flat clustering algorithm. The letter ‘K’ in K-means denotes the number of clusters the method is asked to find.
In this method, each data point is assigned to a cluster so that the sum of the squared distances between the data points and their cluster’s centroid is as small as possible. Note that less variation within a cluster means the data points in that cluster are more similar to one another.
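The quantity being minimized is often called the within-cluster sum of squares. As a sketch, using notation that does not appear in the article ($C_j$ for the set of points assigned to cluster $j$, $\mu_j$ for its centroid, $x_i$ for a data point), it can be written as:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller $J$ means the points sit closer to their centroids. In scikit-learn, the fitted value of this sum is available as the `inertia_` attribute of a `KMeans` object.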
The following stages will help us understand how the K-Means clustering technique works:
K-means solves the problem with an Expectation-Maximization style procedure. The Expectation step assigns each data point to the nearest centroid, and the Maximization step recomputes each centroid as the mean of the points assigned to it. The two steps repeat until the assignments stop changing.
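The two steps can be sketched in plain NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans_sketch`, its parameters, the random initialization from data points, and the toy data are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain NumPy k-means (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Expectation step: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Maximization step: move each centroid to the mean of its points.
        # A cluster that received no points keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids


# Two well-separated groups of 2D points (made-up toy data).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_labels)
print(toy_centroids)
```

On data this cleanly separated, the first three points should end up in one cluster and the last three in the other. On harder data the result depends on the starting centroids, which is why library implementations usually try several initializations.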
The final Cluster is as follows:
To further understand K-Means clustering, let’s look at two real-world situations.
Example 1
This is a simple example of how k-means works. In this example, we will first construct a 2D dataset with four distinct blobs and then use the k-means algorithm to observe the results.
To begin, we will import the essential packages.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
The code below will build a 2D dataset with four blobs.
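# make_blobs is imported from sklearn.datasets; the older
# sklearn.datasets.samples_generator module is gone from current releases.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)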
Following that, the code below will assist us in visualizing the dataset.
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()
Next, create a KMeans object while specifying the number of clusters, fit the model, and predict the cluster labels as follows:
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Now, using the code below, we can plot the clusters and their centers as determined by the k-means estimator:
# Color each point by its predicted cluster, then overlay the centers.
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9)
plt.show()
Example 2
Consider another example in which we will use K-means clustering on a simple digits dataset. Without relying on the original label information, K-means will try to identify digits that are similar.
To begin, we will import the essential packages:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
Load the digits dataset from sklearn and create an object out of it. We can also get the total number of rows and columns in this dataset as follows:
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
Output
(1797, 64)
According to the result, this dataset has 1797 samples with 64 features.
We may cluster the data in the same way that we did in Example 1 above.
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
Output
(10, 64)
The output above indicates that K-means generated 10 cluster centers, each with 64 features.
# Reshape each 64-feature center back into an 8x8 image and display it.
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
Output
As a result, we get the picture below, which shows the cluster centers learned by k-means.
The code below matches each learned cluster label with the most common true label among the points in that cluster.
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

Following that, we can check the accuracy as follows:
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)
Output
0.7935447968836951
The above output indicates that the accuracy is roughly 79%.
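A single accuracy figure hides which digits get mixed up. As a rough sketch, a confusion matrix can show this. The block below repeats the clustering so that it runs on its own, sets `n_init=10` explicitly (an assumption, since the default differs between scikit-learn versions), and uses `np.bincount(...).argmax()` in place of `scipy.stats.mode` to find the majority label. Exact counts may differ from the output shown above depending on the installed version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix

digits = load_digits()
# n_init is fixed here only so the run does not depend on a version-specific default.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# Relabel each cluster with the most frequent true digit among its members.
labels = np.zeros_like(clusters)
for i in range(10):
    mask = clusters == i
    labels[mask] = np.bincount(digits.target[mask]).argmax()

# Rows are the true digits, columns are the digits implied by the clusters.
cm = confusion_matrix(digits.target, labels, labels=list(range(10)))
print(cm)
```

Large off-diagonal entries point to pairs of digits that k-means tends to place in the same cluster.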
K-means clustering performs well enough for many practical goals. It is useful in the following scenarios:
Below are some of the features of K-Means clustering algorithms:
Some of the drawbacks of K-Means clustering techniques are as follows:
Every machine learning engineer wants their algorithm to make accurate predictions. Learning algorithms are often classified as supervised or unsupervised. K-means clustering is an unsupervised learning algorithm: it requires no labeled response for the given input data.
K-means clustering is a widely used approach for clustering. Generally, practitioners begin by learning about the structure of the dataset. K-means clusters data points into unique, non-overlapping groups. It works very well when the clusters have a roughly spherical form. However, it performs poorly when the clusters’ geometric forms depart from spherical shapes.
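The effect of cluster shape can be checked with a small experiment. The sketch below compares k-means on the blob data from Example 1 with k-means on two interleaved crescents produced by `make_moons`. The crescent dataset, the `n_init=10` setting, and the adjusted Rand index used as an agreement score are choices made for this illustration and are not part of the article’s examples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Roughly spherical clusters: the setting k-means handles well.
X_blobs, y_blobs = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
blob_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_blobs)

# Two interleaved crescents: clearly non-spherical clusters.
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Adjusted Rand index: 1.0 means perfect agreement with the generating labels.
ari_blobs = adjusted_rand_score(y_blobs, blob_labels)
ari_moons = adjusted_rand_score(y_moons, moon_labels)
print(f"blobs ARI: {ari_blobs:.2f}")
print(f"moons ARI: {ari_moons:.2f}")
```

The score on the crescents should come out clearly lower than on the blobs, because k-means splits the plane with a straight boundary between two centroids and cannot follow a curved cluster.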
Additionally, it does not learn the number of clusters from the data and requires that number to be specified beforehand. It’s always beneficial to understand the assumptions behind an algorithm in order to have a better understanding of its strengths and drawbacks. This will help you determine when and under what conditions to use each method.
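One common heuristic for choosing the number of clusters, which the article does not cover, is the elbow method: fit k-means for several values of k and look for the point where the inertia stops dropping sharply. The sketch below applies it to the blob data from Example 1. The range of k values and the `n_init=10` setting are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

# Fit k-means for several candidate values of k and record the inertia
# (sum of squared distances from each point to its assigned centroid).
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Inertia keeps falling as k grows, so the lowest value is not the target. The useful signal is the k after which each extra cluster buys only a small improvement, and reading that point off a curve remains a judgment call.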
A. The K-means clustering algorithm is a popular unsupervised machine learning technique used for cluster analysis. It aims to partition a dataset into K distinct clusters, where each data point belongs to the cluster with the nearest mean.
A. The K-means clustering algorithm, implemented in libraries like scikit-learn, partitions a dataset into K distinct clusters by iteratively assigning data points to the nearest cluster centroid and updating the centroids until convergence. K-means models find wide application across various domains for tasks such as customer segmentation and image compression.
A. Boosting in machine learning is a meta-algorithm that aims to improve the performance of weak learners by combining them into a strong learner. It works iteratively, sequentially training models to correct the errors of the previous ones. Key concepts include AdaBoost, Gradient Boosting, and XGBoost. Boosting tutorials often cover techniques to append weak models, initialize weights, optimize learning rates, and set the maximum number of iterations.
A. K Means Clustering in Python is a popular unsupervised machine learning algorithm used for cluster analysis. It partitions a dataset into K distinct clusters based on similarities between data points. Tutorials on K Means in Python typically cover initialization of centroids, optimization of the algorithm, setting labels, and plotting graphs with xlabel and ylabel. Finding the optimal value of K and defining a maximum number of iterations are important considerations in K Means clustering.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.