K-Means clustering with Mall Customer Segmentation Data

Prateek Majumder Last Updated : 19 Jan, 2025
9 min read

Machine Learning techniques are broadly categorized into two main types: Supervised Machine Learning and Unsupervised Machine Learning. In Supervised Machine Learning, the algorithm learns from labeled training data to make predictions or decisions; common examples include Classification and Regression. Unsupervised Machine Learning, on the other hand, works with unlabelled data to uncover hidden patterns or structures within the dataset. This approach is particularly useful when labeled data is scarce or difficult to obtain, as it allows us to analyze and organize data without prior supervision. One common application of unsupervised learning is Clustering, where the algorithm groups the data into distinct “clusters” based on similarities or patterns. Overall, unsupervised machine learning is a powerful tool for discovering unknown insights and organizing data effectively.

This article was published as a part of the Data Science Blogathon

Clustering Explained

Clustering algorithms try to find natural clusters in data, and you can tune and modify various aspects of how these algorithms cluster data. Clustering is based on the principle that items within the same cluster should be more similar to each other than to items in other clusters, so related elements end up grouped close together.

(Figure: diverse and different types of data are subdivided into smaller groups by unsupervised clustering.)

Uses of Clustering

Marketing

In the field of marketing, businesses can use clustering to identify distinct customer groups from existing customer data. Based on these groups, customers can be offered targeted discounts, offers, promo codes, etc.

Real Estate

Clustering can be used to understand and divide property locations based on value and importance. Clustering algorithms can process the data and identify groups of properties on the basis of their probable price.

Bookstore and Library Management

Libraries and bookstores can use clustering to better manage their book databases: with books grouped and ordered properly, day-to-day operations run more smoothly.

Document Analysis

Often, we need to group research texts and documents by similarity, and in such cases we don’t have any labels; manually labelling large amounts of data is not feasible either. Using clustering, the algorithm can process the text and group it into different themes.

These are some of the interesting use cases of clustering.

K-Means Clustering

K-Means clustering is an unsupervised machine learning algorithm that partitions the given data into a chosen number of clusters. Here, the “K” represents the predefined number of clusters that you need to create.

It is a centroid-based algorithm in which each cluster is associated with a centroid. The main idea is to minimize the total distance between the data points and their respective cluster centroids.

The algorithm takes raw unlabelled data as input, assigns each point to a cluster, recomputes the cluster centroids, and repeats this process until the assignments stop changing.
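To make this loop concrete, here is a minimal NumPy sketch of a single K-Means iteration. It only illustrates the assignment and update steps; it is not the scikit-learn implementation we use later in this article.

#A minimal sketch of one K-Means iteration (illustrative only)
import numpy as np

def kmeans_step(X, centroids):
    #Assignment step: attach each point to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    #Update step: move each centroid to the mean of its assigned points
    #(library implementations also handle the empty-cluster case)
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

Repeating this step until the assignments stop changing is essentially the whole algorithm; implementations mainly differ in how the initial centroids are chosen.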

K-Means is very easy and simple to implement. It is highly scalable and can be applied to both small and large datasets. There is, however, the problem of choosing the number of clusters, K. Also, as the number of dimensions increases, the stability of the clusters decreases. But overall, K-Means is a simple and robust algorithm that makes clustering very easy.

Mall Customer Data: Implementation of K-Means in Python

Kaggle Link

Mall Customer data is an interesting dataset of hypothetical customer records. It puts you in the shoes of a supermarket owner: you have customer data, and on the basis of that data, you have to divide the customers into various groups.

(Image Source: https://www.newindianexpress.com/business/2019/nov/24/virtual-shopping-mall-from-2020-2066176.html)

The data includes the following features:

  1. Customer ID
  2. Customer Gender
  3. Customer Age
  4. Annual Income of the customer (in Thousand Dollars)
  5. Spending score of the customer (based on customer behaviour and spending nature)

Let us proceed with the code.

#Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

The necessary libraries are imported.

#Reading the excel file
data=pd.read_excel("Mall_Customers.xlsx")

The data is read. I will share a link to the entire code and excel data at the end of the article.

The data has 200 entries; that is, data from 200 customers.
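A quick check confirms the size (assuming the file loaded as expected):

#Rows and columns in the dataset
data.shape
#Expected output: (200, 5)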

data.head()

So let us have a look at the data.

#Correlation between the numeric columns
data.corr(numeric_only=True)

The data seems to be interesting. Let us look at the data distribution.

Annual Income Distribution:

#Distribution of Annual Income
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.histplot(data['Annual Income (k$)'], kde=True)
plt.title('Distribution of Annual Income (k$)', fontsize = 20)
plt.xlabel('Range of Annual Income (k$)')
plt.ylabel('Count')

Most of the annual incomes fall between 50K and 85K.

Age Distribution:

#Distribution of age
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.histplot(data['Age'], kde=True)
plt.title('Distribution of Age', fontsize = 20)
plt.xlabel('Range of Age')
plt.ylabel('Count')

There are customers of a wide variety of ages.

Spending Score Distribution:

#Distribution of spending score
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.histplot(data['Spending Score (1-100)'], kde=True)
plt.title('Distribution of Spending Score (1-100)', fontsize = 20)
plt.xlabel('Range of Spending Score (1-100)')
plt.ylabel('Count')

Most of the spending scores fall in the range of 40 to 60.

Gender Analysis:

genders = data.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values)
plt.show()

There are more female customers than male.

I have made more visualizations. Do have a look at the GitHub link at the end to understand the data analysis and overall data exploration.

Clustering based on 2 features

First, we work with two features only, annual income and spending score.

#We keep a copy of the full data, then take just the Annual Income and Spending Score as features
df1=data[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]].copy()
X=df1[["Annual Income (k$)","Spending Score (1-100)"]]
#The input data
X.head()
#Scatterplot of the input data
plt.figure(figsize=(10,6))
sns.scatterplot(x = 'Annual Income (k$)',y = 'Spending Score (1-100)',  data = X  ,s = 60 )
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)') 
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()

The data does seem to hold some patterns.

#Importing KMeans from sklearn
from sklearn.cluster import KMeans

Now we calculate the Within-Cluster Sum of Squares (WCSS) for different values of k. We then choose the k at which the WCSS stops decreasing sharply; this value of K gives us a good number of clusters to make from the raw data.

wcss=[]
for i in range(1,11):
    km=KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)
#The elbow curve
plt.figure(figsize=(12,6))
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")
plt.show()

The plot:


This is known as the elbow graph. The x-axis shows the number of clusters, and the number of clusters is taken at the elbow point: the point where the WCSS stops decreasing sharply, so that making more clusters adds little. Here in the graph, after 5 the drop is minimal, so we take 5 to be the number of clusters.
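The elbow is a visual judgment call; as an optional cross-check (an extra step beyond the original walkthrough), scikit-learn’s silhouette score can be computed for a few candidate values of k, where higher values indicate better-separated clusters:

#Optional cross-check of the elbow choice with silhouette scores
from sklearn.metrics import silhouette_score
for k in range(2, 8):
    labels = KMeans(n_clusters=k).fit_predict(X)
    print(k, silhouette_score(X, labels))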

#Taking 5 clusters
km1=KMeans(n_clusters=5)
#Fitting the input data
km1.fit(X)
#predicting the labels of the input data
y=km1.predict(X)
#adding the labels to a column named label
df1["label"] = y
#The new dataframe with the clustering done
df1.head()

The labels are added to the data.

#Scatterplot of the clusters
plt.figure(figsize=(10,6))
sns.scatterplot(x = 'Annual Income (k$)',y = 'Spending Score (1-100)',hue="label",  
                 palette=['green','orange','brown','dodgerblue','red'], legend='full',data = df1  ,s = 60 )
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)') 
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()

We can clearly see that 5 different clusters have been formed from the data. The red cluster contains the customers with the lowest income and lowest spending scores, while the blue cluster contains the customers with the highest income and highest spending scores.
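To put numbers behind these descriptions, one option (again an addition beyond the original code) is to inspect the cluster centroids, or equivalently the per-cluster feature means:

#Centroids of the 5 clusters (mean income and spending score)
print(km1.cluster_centers_)
#The same information, computed from the labelled dataframe
df1.groupby("label")[["Annual Income (k$)","Spending Score (1-100)"]].mean()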

K-Means Clustering on the Basis of 3 Features

Now, we shall work with three features: apart from the spending score and annual income of the customers, we shall also take their age into account.

#Taking the features (df2 is a fresh copy of the data)
df2=data.copy()
X2=df2[["Age","Annual Income (k$)","Spending Score (1-100)"]]
#Now we calculate the Within Cluster Sum of Squared Errors (WSS) for different values of k.
wcss = []
for k in range(1,11):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(12,6))    
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")
plt.show()

The WCSS curve.

Here we can assume that K=5 will be a good value.

#We choose k=5, where the WCSS stops decreasing sharply
km2 = KMeans(n_clusters=5)
y2 = km2.fit_predict(X2)
df2["label"] = y2
#The data with labels
df2.head()

The data:


Now we plot it.

#3D Plot as we did the clustering on the basis of 3 input features
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
#One colour per cluster label
colors = ['purple', 'red', 'blue', 'green', 'yellow']
for label, color in enumerate(colors):
    cluster = df2[df2.label == label]
    ax.scatter(cluster.Age, cluster["Annual Income (k$)"], cluster["Spending Score (1-100)"], c=color, s=60)
ax.view_init(35, 185)
ax.set_xlabel("Age")
ax.set_ylabel("Annual Income (k$)")
ax.set_zlabel('Spending Score (1-100)')
plt.show()

Output:

What we get is a 3D plot. Now, if we want to know the customer IDs, we can do that too.

cust1=df2[df2["label"]==1]
print('Number of customer in 1st group=', len(cust1))
print('They are -', cust1["CustomerID"].values)
print("--------------------------------------------")
cust2=df2[df2["label"]==2]
print('Number of customer in 2nd group=', len(cust2))
print('They are -', cust2["CustomerID"].values)
print("--------------------------------------------")
cust3=df2[df2["label"]==0]
print('Number of customer in 3rd group=', len(cust3))
print('They are -', cust3["CustomerID"].values)
print("--------------------------------------------")
cust4=df2[df2["label"]==3]
print('Number of customer in 4th group=', len(cust4))
print('They are -', cust4["CustomerID"].values)
print("--------------------------------------------")
cust5=df2[df2["label"]==4]
print('Number of customer in 5th group=', len(cust5))
print('They are -', cust5["CustomerID"].values)
print("--------------------------------------------")

The output we get:

Number of customer in 1st group= 24
They are - [129 131 135 137 139 141 145 147 149 151 153 155 157 159 161 163 165 167
 169 171 173 175 177 179]
--------------------------------------------
Number of customer in 2nd group= 29
They are - [ 47 51 55 56 57 60 67 72 77 78 80 82 84 86 90 93 94 97
 99 102 105 108 113 118 119 120 122 123 127]
--------------------------------------------
Number of customer in 3rd group= 28
They are - [124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158
 160 162 164 166 168 170 172 174 176 178]
--------------------------------------------
Number of customer in 4th group= 22
They are - [ 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 46]
--------------------------------------------
Number of customer in 5th group= 12
They are - [ 3 7 9 11 13 15 23 25 31 33 35 37]
--------------------------------------------

So, we used K-Means clustering to understand customer data. K-Means is a good clustering algorithm: here, almost all the clusters turn out to have a similar density of points, and the algorithm is fast and efficient in terms of computational cost.
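As a quick check on that balance (an extra step, not in the original code), the cluster sizes can be read off directly:

#Number of customers assigned to each of the 5 clusters
df2["label"].value_counts()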

Code on Github

Conclusion

Machine Learning techniques, whether supervised or unsupervised, play a crucial role in extracting insights from data. Supervised learning excels in predictive tasks, while unsupervised learning is invaluable for exploring and organizing unlabelled data. Clustering, particularly through algorithms like K-Means, is a powerful tool for uncovering hidden patterns and enabling data-driven decision-making.

By understanding these techniques and their applications, businesses and researchers can harness the full potential of machine learning to solve complex problems and drive innovation.

Frequently Asked Questions

Q1. What is the main difference between supervised and unsupervised learning?

A. Supervised learning uses labeled data to train models for prediction, while unsupervised learning works with unlabelled data to discover hidden patterns.

Q2. When should I use unsupervised learning?

A. Unsupervised learning is ideal when you have unlabelled data and want to explore its structure, such as grouping similar data points or reducing dimensionality.

Q3. What is clustering, and how is it used?

A. Clustering is an unsupervised learning technique that groups data into clusters based on similarities. It’s used in customer segmentation, document analysis, and more.

Q4. What are the limitations of K-Means clustering?

A. K-Means requires the number of clusters (K) to be predefined, and it may perform poorly with high-dimensional or irregularly shaped data.
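For instance, because K-Means is distance-based, a feature with a much larger numeric range can dominate the distance calculation. A common remedy (not needed in the walkthrough above, where the feature ranges are comparable, so this is purely illustrative) is to standardize the features first:

#Illustrative: standardizing features before K-Means
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
X_scaled = StandardScaler().fit_transform(X2)
labels_scaled = KMeans(n_clusters=5).fit_predict(X_scaled)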

The author uses the media shown in this article at their discretion, and Analytics Vidhya does not own it.

Prateek is a dynamic professional with a strong foundation in Artificial Intelligence and Data Science, currently pursuing his PGP at Jio Institute. He holds a Bachelor's degree in Electrical Engineering and has hands-on experience as a System Engineer at TCS Digital, where he excelled in API management and data integration. Prateek also has a background in product marketing and analytics from his time with start-ups like AppleX and Milkie Way, Inc., where he was involved in growth campaigns and technical blog management. Recognized for his structured thinking and problem-solving abilities, he has received accolades like the Dr. Sudarshan Chakraborty Award for Best Student Performance. Fluent in multiple languages and passionate about technology, Prateek continues to expand his expertise in the rapidly evolving AI and tech landscape.

Responses From Readers


G_Nidhoggr

Fabulous! Very clear and detailed.

Rudhresh

Wonderful post and clear explanation. My doubt is that the data has 200 customers, but in the final output after grouping, the sum of customers doesn't get to 200. Any explanation?
