Machine Learning techniques are broadly categorized into two main types: Supervised Machine Learning and Unsupervised Machine Learning. In Supervised Machine Learning, researchers provide labeled data, and the algorithm learns from this labeled training data to make predictions or decisions. Common examples of this approach include Classification and Regression. On the other hand, Unsupervised Machine Learning does not require labeled data; instead, it works with unlabelled data to uncover hidden patterns or structures within the dataset. This method is particularly useful when labeled data is scarce or difficult to obtain, as it allows us to analyze and categorize data without prior supervision. One common application of unsupervised learning is Clustering, where the algorithm processes the data and groups it into distinct “clusters” based on similarities or patterns. Overall, unsupervised machine learning is a powerful tool for discovering unknown insights and organizing data effectively.
This article was published as a part of the Data Science Blogathon
Clustering algorithms try to find natural groups in data, and you can tune various aspects of how they do so. Clustering rests on the principle that items within the same cluster should be more similar to each other than to items in other clusters, so related elements end up grouped close together.
In this way, a large and diverse dataset is subdivided into smaller, more homogeneous groups.
In the field of marketing, businesses can use clustering to identify various customer groups based on existing customer data. Based on that, customers can be provided with discounts, offers, promo codes etc.
Clustering can be used to understand and divide property locations based on value and importance. Clustering algorithms can process the data and identify groups of properties on the basis of probable price.
Libraries and Bookstores can use Clustering to better manage the book database. With proper book ordering, better operations can be implemented.
Often, we need to group research texts and documents according to similarity, and in such cases we don't have any labels. Manually labelling large amounts of data is usually impractical. Using clustering, the algorithm can process the text and group it into different themes.
These are some of the interesting use cases of clustering.
K-Means clustering is an unsupervised machine learning algorithm that divides the given data into the given number of clusters. Here, the “K” represents the predefined number of clusters that you need to create.
It is a centroid based algorithm in which each cluster is associated with a centroid. The main idea is to reduce the distance between the data points and their respective cluster centroid.
The algorithm takes raw unlabelled data as input, assigns each data point to its nearest centroid, recomputes each centroid as the mean of its assigned points, and repeats these two steps until the clusters stop changing.
K-Means is simple and easy to implement. It is highly scalable and can be applied to both small and large datasets. There is, however, the problem of choosing the number of clusters, K. Also, as the number of dimensions increases, the results tend to become less stable. Overall, though, K-Means is a simple and robust algorithm that makes clustering very easy.
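To make the assign-and-update loop concrete, here is a minimal NumPy sketch of the idea. It is an illustration under simplifying assumptions (random initial centroids, Euclidean distance, made-up toy data), not the scikit-learn implementation used later in this article; the names kmeans_sketch and nearest_centroid are hypothetical.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid for every point (Euclidean distance)."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Plain K-Means loop on an array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        labels = nearest_centroid(points, centroids)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the clusters are stable
        centroids = new_centroids
    # Final assignment so the labels match the returned centroids
    return nearest_centroid(points, centroids), centroids

# Toy data: two well-separated blobs (made up for illustration)
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
toy_labels, toy_centroids = kmeans_sketch(toy, k=2)
print(toy_centroids)
```

On data this cleanly separated, the loop settles within a few iterations. Real datasets usually benefit from the smarter initialization and multiple restarts that library implementations provide.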
Mall Customer data is an interesting dataset that has hypothetical customer data. It puts you in the shoes of the owner of a supermarket. You have customer data, and on the basis of this data, you have to divide the customers into various groups.
The data includes the following features: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100).
Let us proceed with the code.
#Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
The necessary libraries are imported.
#Reading the excel file
data=pd.read_excel("Mall_Customers.xlsx")
The data is read. I will share a link to the entire code and excel data at the end of the article.
The data has 200 entries, one for each of 200 customers.
data.head()
So let us have a look at the data.
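#Gender is a text column, so the correlation is restricted to numeric columns (numeric_only needs pandas 1.5 or later)
data.corr(numeric_only=True)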
The data seems to be interesting. Let us look at the data distribution.
Annual Income Distribution:
#Distribution of Annual Income
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
#histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Annual Income (k$)'], kde=True)
plt.title('Distribution of Annual Income (k$)', fontsize = 20)
plt.xlabel('Range of Annual Income (k$)')
plt.ylabel('Count')
Most annual incomes fall between 50K and 85K.
Age Distribution:
#Distribution of age
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
#histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Age'], kde=True)
plt.title('Distribution of Age', fontsize = 20)
plt.xlabel('Range of Age')
plt.ylabel('Count')
There are customers of a wide variety of ages.
Spending Score Distribution:
#Distribution of spending score
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
#histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Spending Score (1-100)'], kde=True)
plt.title('Distribution of Spending Score (1-100)', fontsize = 20)
plt.xlabel('Range of Spending Score (1-100)')
plt.ylabel('Count')
Most spending scores fall in the range of 40 to 60.
genders = data.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values)
plt.show()
There are more female customers than male customers.
I have made more visualizations. Do have a look at the GitHub link at the end to understand the data analysis and overall data exploration.
First, we work with two features only, annual income and spending score.
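#We take just the Annual Income and Spending score
#.copy() gives an independent dataframe, so adding a label column later does not trigger a SettingWithCopyWarning
df1=data[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]].copy()
X=df1[["Annual Income (k$)","Spending Score (1-100)"]]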
#The input data
X.head()
#Scatterplot of the input data
plt.figure(figsize=(10,6))
sns.scatterplot(x = 'Annual Income (k$)',y = 'Spending Score (1-100)', data = X ,s = 60 )
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()
The data does seem to hold some patterns.
#Importing KMeans from sklearn
from sklearn.cluster import KMeans
Now we calculate the Within-Cluster Sum of Squares (WCSS) for different values of k, which scikit-learn exposes as inertia_. We then choose the k at which the decrease in WCSS starts to level off. This value of k gives a good number of clusters for the data.
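For reference, the quantity being tracked can be reproduced by hand: it is the sum, over all points, of the squared distance to the centroid of the cluster the point was assigned to. The sketch below checks this against the inertia_ attribute on synthetic data; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points, only used to illustrate the calculation
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(60, 2))

demo_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_points)

# Look up, for every point, the centroid of the cluster it was assigned to
assigned_centroids = demo_km.cluster_centers_[demo_km.labels_]
# WCSS: squared distance from each point to its own centroid, summed over all points
manual_wcss = ((demo_points - assigned_centroids) ** 2).sum()

print(manual_wcss, demo_km.inertia_)
```

The two printed numbers should agree up to floating-point rounding.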
wcss=[]
#Fit K-Means for k = 1..10 and record the WCSS (inertia_) of each fit
for i in range(1,11):
    km=KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)
#The elbow curve
plt.figure(figsize=(12,6))
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")
plt.show()
The plot:
This is known as the elbow graph. The x-axis shows the number of clusters, and the number of clusters to use is read off at the elbow point, where WCSS stops dropping sharply and adding more clusters brings only small improvements. Here, the drop after 5 is minimal, so we take 5 to be the number of clusters.
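Reading the elbow off a plot is a judgment call. One rough way to automate it, shown here only as a heuristic sketch, is to pick the k where the curve bends most sharply, that is, where the second difference of the WCSS values is largest. The function name and the WCSS values below are made up for illustration, and the plot should remain the final judge.

```python
import numpy as np

def elbow_by_second_difference(wcss_values, first_k=1):
    """Return the k at which the WCSS curve bends most sharply.

    Heuristic only: wcss_values[i] is assumed to be the WCSS for k = first_k + i.
    """
    wcss_values = np.asarray(wcss_values, dtype=float)
    # Second difference at each interior point: w[i-1] - 2*w[i] + w[i+1]
    second_diff = np.diff(wcss_values, n=2)
    # second_diff[i] describes the bend at entry i + 1, i.e. k = first_k + i + 1
    return first_k + int(second_diff.argmax()) + 1

# Made-up WCSS values for k = 1..8 with a visible bend at k = 3
example_wcss = [1000, 600, 250, 220, 200, 185, 175, 168]
suggested_k = elbow_by_second_difference(example_wcss)
print(suggested_k)
```

With the list computed in the loop above, the call would be elbow_by_second_difference(wcss). When the curve has no clear bend, this heuristic can be misleading.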
#Taking 5 clusters
km1=KMeans(n_clusters=5)
#Fitting the input data
km1.fit(X)
#predicting the labels of the input data
y=km1.predict(X)
#adding the labels to a column named label
df1["label"] = y
#The new dataframe with the clustering done
df1.head()
The labels are now added to the data.
#Scatterplot of the clusters
plt.figure(figsize=(10,6))
sns.scatterplot(x = 'Annual Income (k$)',y = 'Spending Score (1-100)',hue="label",
palette=['green','orange','brown','dodgerblue','red'], legend='full',data = df1 ,s = 60 )
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()
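We can clearly see that 5 different clusters have been formed from the data. In this run, the red cluster holds the customers with the lowest income and lowest spending score, while the blue cluster holds the customers with the highest income and highest spending score. Because K-Means numbers its clusters arbitrarily, the colour assigned to each group can change between runs.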
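Beyond reading the scatterplot, a per-cluster summary table makes it easier to describe each segment in numbers. The sketch below uses a synthetic stand-in for the mall data (random values, only the column names match) and a fixed random_state so the result is repeatable; both are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the mall data: values are made up, only the column names match
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})

demo_model = KMeans(n_clusters=5, n_init=10, random_state=42)
demo["label"] = demo_model.fit_predict(demo)

# Average income and spending score per cluster, plus the number of customers in it
summary = demo.groupby("label").agg(
    income_mean=("Annual Income (k$)", "mean"),
    spending_mean=("Spending Score (1-100)", "mean"),
    customers=("Annual Income (k$)", "size"),
)
print(summary.round(1))
```

On the real data, the same groupby could be run on df1 after the label column has been added.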
Now, we shall work with three features. Apart from the spending score and annual income of the customers, we shall also take their age into account.
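#Taking the features
#df2 was not defined earlier, so we build it from the original data (as an independent copy)
df2=data[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]].copy()
X2=df2[["Age","Annual Income (k$)","Spending Score (1-100)"]]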
#Now we calculate the Within-Cluster Sum of Squares (WCSS) for different values of k.
wcss = []
for k in range(1,11):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(12,6))
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")
plt.show()
The WCSS curve.
Here we can assume that K=5 will be a good value.
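The elbow is a visual judgment. A complementary check, which is not part of the original analysis, is the silhouette score from scikit-learn: it ranges from -1 to 1, and higher values indicate tighter, better-separated clusters. The sketch below runs it on synthetic blobs with four known centres; the data is made up purely to show the mechanics.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated blobs at the corners of a square (illustrative only)
blob_centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
blob_points, _ = make_blobs(n_samples=200, centers=blob_centers, cluster_std=0.5, random_state=0)

# The silhouette score needs at least two clusters, so k starts at 2
silhouette_by_k = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(blob_points)
    silhouette_by_k[k] = silhouette_score(blob_points, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, round(silhouette_by_k[best_k], 3))
```

On the mall data, the same loop could be run on X2 to see whether the silhouette score also favours a value near 5.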
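#We choose the k at which the drop in WCSS levels off
km2 = KMeans(n_clusters=5)
#Fit the 5-cluster model km2 (not km, which is the last model left over from the earlier elbow loop)
y2 = km2.fit_predict(X2)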
df2["label"] = y2
#The data with labels
df2.head()
The data:
Now we plot it.
#3D Plot as we did the clustering on the basis of 3 input features
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
#One scatter call per cluster label, each in its own colour
cluster_colors = ['purple', 'red', 'blue', 'green', 'yellow']
for cluster_label, color in enumerate(cluster_colors):
    cluster = df2[df2.label == cluster_label]
    ax.scatter(cluster["Age"], cluster["Annual Income (k$)"], cluster["Spending Score (1-100)"], c=color, s=60)
ax.view_init(35, 185)
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
ax.set_zlabel('Spending Score (1-100)')
plt.show()
Output:
What we get is a 3D plot. Now, if we want to know the customer IDs, we can do that too.
cust1=df2[df2["label"]==1]
print('Number of customer in 1st group=', len(cust1))
print('They are -', cust1["CustomerID"].values)
print("--------------------------------------------")
cust2=df2[df2["label"]==2]
print('Number of customer in 2nd group=', len(cust2))
print('They are -', cust2["CustomerID"].values)
print("--------------------------------------------")
cust3=df2[df2["label"]==0]
print('Number of customer in 3rd group=', len(cust3))
print('They are -', cust3["CustomerID"].values)
print("--------------------------------------------")
cust4=df2[df2["label"]==3]
print('Number of customer in 4th group=', len(cust4))
print('They are -', cust4["CustomerID"].values)
print("--------------------------------------------")
cust5=df2[df2["label"]==4]
print('Number of customer in 5th group=', len(cust5))
print('They are -', cust5["CustomerID"].values)
print("--------------------------------------------")
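The output we get is shown below. Note that the five groups listed add up to only 115 of the 200 customers. That is what happens when the labels come from a model fitted with more than five clusters, such as the last model from the elbow loop (km) instead of km2. With a five-cluster model, every customer falls into one of the five groups, and the exact sizes and IDs will differ from run to run.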
Number of customer in 1st group= 24
They are - [129 131 135 137 139 141 145 147 149 151 153 155 157 159 161 163 165 167
169 171 173 175 177 179]
--------------------------------------------
Number of customer in 2nd group= 29
They are - [ 47 51 55 56 57 60 67 72 77 78 80 82 84 86 90 93 94 97
99 102 105 108 113 118 119 120 122 123 127]
--------------------------------------------
Number of customer in 3rd group= 28
They are - [124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158
160 162 164 166 168 170 172 174 176 178]
--------------------------------------------
Number of customer in 4th group= 22
They are - [ 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 46]
--------------------------------------------
Number of customer in 5th group= 12
They are - [ 3 7 9 11 13 15 23 25 31 33 35 37]
--------------------------------------------
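The five near-identical print blocks can also be written as a single loop over the groups, and summing the group sizes gives a quick sanity check that every customer was counted exactly once. The sketch below uses a synthetic stand-in for the mall data (random values, only the column names match) and a fixed random_state, both assumed for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the mall data: values are made up, only the column names match
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "CustomerID": np.arange(1, 201),
    "Age": rng.integers(18, 71, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})
features = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
demo["label"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(demo[features])

# One pass over the groups instead of one hand-written block per label
group_sizes = {}
for label, group in demo.groupby("label"):
    group_sizes[label] = len(group)
    print(f"Cluster {label}: {len(group)} customers")
    print("They are -", group["CustomerID"].values)
    print("--------------------------------------------")

# Every customer receives exactly one label, so the sizes must add up to the row count
total_assigned = sum(group_sizes.values())
print("Total customers assigned:", total_assigned)
```

If the total is smaller than the number of rows, some labels were left out of the loop or the labels came from a model with a different number of clusters.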
So, we used K-Means clustering to segment the customer data. K-Means works well here: the clusters are compact and of broadly similar density, and the algorithm is fast and computationally cheap.
Machine Learning techniques, whether supervised or unsupervised, play a crucial role in extracting insights from data. Supervised learning excels in predictive tasks, while unsupervised learning is invaluable for exploring and organizing unlabelled data. Clustering, particularly through algorithms like K-Means, is a powerful tool for uncovering hidden patterns and enabling data-driven decision-making.
By understanding these techniques and their applications, businesses and researchers can harness the full potential of machine learning to solve complex problems and drive innovation.
Q. What is the difference between supervised and unsupervised learning?
A. Supervised learning uses labeled data to train models for prediction, while unsupervised learning works with unlabelled data to discover hidden patterns.
Q. When should unsupervised learning be used?
A. Unsupervised learning is ideal when you have unlabelled data and want to explore its structure, such as grouping similar data points or reducing dimensionality.
Q. What is clustering?
A. Clustering is an unsupervised learning technique that groups data into clusters based on similarities. It's used in customer segmentation, document analysis, and more.
Q. What are the limitations of K-Means?
A. K-Means requires the number of clusters (K) to be predefined, and it may perform poorly with high-dimensional or irregularly shaped data.
The author uses the media shown in this article at their discretion, and Analytics Vidhya does not own it.