K Means Clustering in Python | Step-by-Step Tutorials for Clustering in Data Analysis

Pranshu Sharma Last Updated : 23 Dec, 2024


K-Means is one of the most popular unsupervised machine learning algorithms for solving clustering problems in data science, which makes it a valuable skill for aspiring data scientists. K-Means segregates unlabeled data into groups, known as clusters, by identifying similar features and common patterns within the dataset. This tutorial explains clustering in general, with a specific focus on the K-Means clustering algorithm and its implementation in Python. It also shows how to determine the optimal number of clusters for a dataset, so that you can apply K-Means clustering in practical scenarios.

Learning Objectives

Understand what the K-means clustering algorithm is.
Develop a good understanding of the steps involved in implementing the K-Means algorithm and finding the optimal number of clusters.
Implement K-Means clustering in Python with the scikit-learn library.

This article was published as a part of the Data Science Blogathon.

What Is Clustering?
What Is K-Means Clustering Algorithm?
What is K-Means clustering method in Python?
How K Means Clustering in Python Works?
Diagrammatic Implementation of K-Means Clustering
Choosing the Optimal Number of Clusters
Python Code for K-Means Clustering
WCSS and Elbow Method
Conclusion
Frequently Asked Questions

What Is Clustering?

Suppose we have N unlabeled multivariate data points describing various animals, such as dogs, cats, and birds. The technique that segregates these data points into groups based on similar features and characteristics is called clustering.

The groups being formed are known as clusters. Clustering techniques find applications in various fields, such as image recognition and spam filtering. They also play a crucial role in unsupervised learning algorithms in machine learning by segregating multivariate data into different groups based on common patterns hidden within the datasets.

What Is K-Means Clustering Algorithm?

The k-means clustering algorithm is an iterative algorithm that divides a set of n data points into k clusters. Each data point is assigned to the cluster whose centroid (the mean of the points in that cluster) is nearest to it.

Here, K is the predefined number of clusters to be formed by the algorithm. If K=3, the algorithm forms 3 clusters from the dataset.

Implementation of the K-Means Algorithm

The implementation and working of the K-Means algorithm are explained in the steps below:

Step 1: Select the value of K to decide the number of clusters (n_clusters) to be formed.

Step 2: Select random K points that will act as cluster centroids (cluster_centers).

Step 3: Assign each data point to its nearest centroid, based on its distance from the K selected points. These assignments form the K clusters.

Step 4: Compute a new centroid for each cluster by taking the mean of the data points assigned to it.

Step 5: Repeat Step 3, reassigning each data point to the closest of the new centroids.

Step 6: If any reassignment occurs, then go to step 4; else, go to step 7.

Step 7: Finish
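The steps above can be sketched in plain NumPy. This is a minimal illustration, not how scikit-learn implements K-Means: the function name kmeans_sketch, the max_iter cap, the seed parameter, and the handling of empty clusters are all assumptions made for this example.

```python
import numpy as np


def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain NumPy k-means following Steps 1-7 (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Steps 6 and 7: stop once no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Step 1: choose K (here K=2) for a small made-up dataset.
data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
centroids, labels = kmeans_sketch(data, k=2)
print(centroids)
print(labels)
```

The result depends on the random starting points, which is why library implementations usually run the algorithm several times and use a smarter initialization (scikit-learn defaults to k-means++).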

What is K-Means clustering method in Python?

K-Means clustering groups a set of data points into distinct clusters. The goal is to partition the data in such a way that points in the same cluster are more similar to each other than to points in other clusters. Here's a breakdown of how to use K-Means clustering in Python:

Import Libraries:

First, you need to import the necessary libraries. In Python, the popular scikit-learn library provides an implementation of K-Means.

from sklearn.cluster import KMeans

Prepare Your Data:

Organize your data into a format that the algorithm can understand. In many cases, you’ll have a 2D array or a pandas DataFrame.

import numpy as np
data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

Choose the Number of Clusters (K):

Decide on the number of clusters you want the algorithm to find. This is often based on your understanding of the data or through techniques like the elbow method.

kmeans = KMeans(n_clusters=2)

Fit the Model:

Train the K-Means model on your data.

kmeans.fit(data)

Get Results:

Once the model is trained, you can get information about the clusters.

# Get the cluster centers
centroids = kmeans.cluster_centers_

# Get the labels (cluster assignments for each data point)
labels = kmeans.labels_

In this example, n_clusters=2 indicates that we want the algorithm to find two clusters. The fit method trains the model, and then you can access information about the clusters, such as the cluster centers and labels. Visualizing the results can be helpful to see how well the algorithm grouped your data points.

How K Means Clustering in Python Works?

Here is a step-by-step explanation of how K-Means clustering works:

Initialize Centroids:

Randomly choose K data points from the dataset to be the initial centroids. K is the number of clusters you want to create.

Assign Data Points to Nearest Centroid:

For each data point in the dataset, calculate the distance to each centroid.
Assign the data point to the cluster whose centroid is the closest (usually using Euclidean distance).

Update Centroids:

Recalculate the centroids of the clusters by taking the mean of all the data points assigned to each cluster.

Repeat:

Repeat the assignment and update steps until convergence. Convergence occurs when the centroids no longer change significantly or after a predefined number of iterations.

Final Result:

Once the algorithm has converged, each data point belongs to exactly one of the K clusters.

Here’s a simple example using Python with the popular machine learning library, scikit-learn:

from sklearn.cluster import KMeans
import numpy as np

# Sample data
data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

# Specify the number of clusters (K)
kmeans = KMeans(n_clusters=2)

# Fit the data to the algorithm
kmeans.fit(data)

# Get the cluster centroids and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print("Centroids:")
print(centroids)
print("Labels:")
print(labels)

Diagrammatic Implementation of K-Means Clustering

Step 1: Choose the number of clusters, here K=2, into which the dataset will be segregated. We then pick 2 random points to act as the initial centroids.

Step 2: Assign each data point on the scatter plot to its closest centroid. With two centroids, this can be done by drawing a median line between them, that is, the perpendicular bisector of the segment joining the two centroids.

Step 3: Points on the left side of the line are nearer the blue centroid, and points on the right side are nearer the yellow centroid. The left points form a cluster with the blue centroid, and the right points form a cluster with the yellow centroid.

Step 4: Repeat the process with new centroids. To choose them, compute the center of gravity (the mean) of the points in each cluster.

Step 5: Next, reassign each data point to its nearest new centroid by repeating the process above with a new median line. A yellow data point that now falls on the blue side of the median line joins the blue cluster.

Step 6: Because a reassignment has occurred, repeat the step of finding new centroids.

Step 7: Again compute the center of gravity of the points in each cluster to obtain the new centroids.

Step 8: After finding the new centroids, draw the median line again and reassign the data points as in the steps above.

Step 9: Finally, segregate the points based on the median line. When no point changes sides, the two groups are final, and each group contains only points that are similar to one another.

Figure: the final clusters formed.
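For two centroids, splitting the plane with the median line gives the same result as assigning each point to its nearest centroid. The sketch below checks this on made-up coordinates; the centroid positions, the sample points, and the variable names are all hypothetical.

```python
import numpy as np

# Two hypothetical centroids and a few sample points (made-up values).
blue = np.array([2.0, 2.0])
yellow = np.array([8.0, 6.0])
points = np.array([[1.0, 1.0], [3.0, 4.0], [6.0, 3.0], [9.0, 7.0], [4.0, 5.0]])

# Rule 1: assign each point to the nearest centroid (0 = blue, 1 = yellow).
dist_blue = np.linalg.norm(points - blue, axis=1)
dist_yellow = np.linalg.norm(points - yellow, axis=1)
nearest = (dist_yellow < dist_blue).astype(int)

# Rule 2: check which side of the perpendicular bisector each point lies on.
midpoint = (blue + yellow) / 2
direction = yellow - blue
side = ((points - midpoint) @ direction > 0).astype(int)

print(nearest)  # [0 0 1 1 0]
print(side)     # [0 0 1 1 0]
```

Both rules agree, which is why the diagrams can use a single line to show the assignment when K=2. With more than two centroids, the nearest-centroid rule is the one that generalizes.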

Choosing the Optimal Number of Clusters

The number of clusters that we choose for the algorithm shouldn't be arbitrary. The quality of a clustering depends on how close the data points in each cluster are to their centroid.

We can choose a suitable number of clusters with the help of the Within-Cluster Sum of Squares (WCSS). WCSS is the sum of the squared distances between each data point and the centroid of its cluster, taken over all clusters.

The main idea is to minimize the distance (e.g., Euclidean distance) between the data points and the centroids of their clusters. For a fixed K, the algorithm iterates until this sum stops decreasing. Because WCSS keeps shrinking as K grows (it reaches 0 when every point is its own cluster), we cannot simply pick the K with the lowest WCSS. Instead, we look for the K beyond which the improvement levels off.
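As a small check of the WCSS definition, the sketch below computes the sum of squared distances by hand and compares it with the inertia_ attribute that scikit-learn reports after fitting. The sample data reuses the toy array from earlier in this article; n_init and random_state are set explicitly here only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

# n_init and random_state are set explicitly so the run is repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Sum of squared distances from each point to the centroid of its own cluster.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
wcss = np.sum((data - assigned_centroids) ** 2)

print(wcss)
print(kmeans.inertia_)
```

The two printed values should match up to floating-point rounding, so inertia_ can be used directly as the WCSS value.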

Elbow Method

Here are the steps to follow in order to find the optimal number of clusters using the elbow method:

Step 1: Run K-Means clustering on the dataset for a range of K values (for example, 1 to 10).

Step 2: For each value of K, calculate the WCSS value.

Step 3: Plot the WCSS values against the respective number of clusters K.

Step 4: The point where the curve bends sharply, like the elbow of an arm, is taken as the best value of K.
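The first two steps can be tried out without any data file. The sketch below is only an illustration on synthetic data: make_blobs generates 4 artificial groups (an assumption made for this example, unrelated to the country dataset used later), and the WCSS values are printed rather than plotted.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 4 groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

wcss = []
k_values = list(range(1, 11))
for k in k_values:
    # Fit one model per K and record its WCSS (inertia_).
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)

for k, value in zip(k_values, wcss):
    print(k, round(value, 1))
```

In the printed values, WCSS typically drops steeply up to the true number of groups and only slowly afterwards. That change in slope is the elbow that Steps 3 and 4 look for on the plot.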

Python Implementation:

Importing relevant libraries

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

Loading the data

data = pd.read_csv('Countryclusters.csv')
data

Plotting the data

Python Code for K-Means Clustering

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('Countryclusters.csv')

print(data.head())

plt.scatter(data['Longitude'],data['Latitude'])
plt.xlim(-180,180)
plt.ylim(-90,90)
plt.show()

Selecting the feature

x = data.iloc[:, 1:3]  # first slice selects all rows, second selects the columns at positions 1 and 2
x

Clustering

kmeans = KMeans(3)
kmeans.fit(x)

Clustering results

identified_clusters = kmeans.fit_predict(x)
identified_clusters

array([1, 1, 0, 0, 0, 2])

data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters 
plt.scatter(data_with_clusters['Longitude'],data_with_clusters['Latitude'],c=data_with_clusters['Clusters'],cmap='rainbow')

WCSS and Elbow Method

wcss = []
for i in range(1, 7):
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_  # WCSS for this value of K
    wcss.append(wcss_iter)

number_clusters = range(1, 7)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

This method shows that 3 is a good number of clusters.

Conclusion

To summarize, k-means clustering is a widely used unsupervised machine learning technique that groups data into clusters based on similarity. This simple algorithm applies to various domains and data types, including image and text data. We can also use it for dimensionality reduction, where each transformed feature is the distance of the point from a cluster center.
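The dimensionality reduction use can be sketched with the transform method of scikit-learn's KMeans, which returns the distance from each sample to every cluster center. The data below is random and purely hypothetical, and the choice of 5 input features and 3 clusters is an assumption made to show the shape change.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 100 samples with 5 features each (random values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() returns one column per cluster: the distance to that cluster's centroid.
X_reduced = kmeans.transform(X)
print(X.shape, "->", X_reduced.shape)  # (100, 5) -> (100, 3)

# The same distances computed by hand, for comparison.
manual = np.linalg.norm(X[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
print(np.allclose(X_reduced, manual))  # True
```

Each sample is now described by 3 distances instead of 5 raw features. Whether this representation is useful depends on the data and on the downstream task.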

Key Takeaways

K-means serves as a widely used unsupervised machine learning algorithm that clusters data into groups, also known as clusters, of similar objects.
The objective is to minimize the sum of squared distances between the objects and their respective cluster centroids.
K-means clustering has limitations: it works best when clusters are roughly spherical and of similar size, and it struggles with complex, non-linear cluster shapes.

Frequently Asked Questions

Q1. What is meant by n_init in k-means clustering?

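A. The parameter n_init sets the number of times the k-means algorithm is run independently, each time with different initial centroids. The run with the lowest inertia (WCSS) is kept as the final result. It is not the number of iterations within a single run, which is controlled by max_iter.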

Q2. What are the advantages and disadvantages of K-means Clustering?

A. K-means clustering offers advantages such as simplicity, scalability, and versatility, allowing you to apply it to a wide range of data types. Disadvantages include its sensitivity to the initial placement of centroids and its limitations in handling complex, non-linear data. k-means is also sensitive to outliers.

Q3. What is meant by random_state in k-means clustering?

A. In K-Means, random_state controls the random number generation used for centroid initialization. Passing an integer makes the randomness deterministic, so the algorithm produces the same clusters on every run.
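The effect of random_state can be checked with a short sketch. The values 42 and 10 below are arbitrary choices for illustration, and the data is the toy array used earlier in this article.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

# Same random_state: both models start from the same initial centroids.
first = KMeans(n_clusters=2, n_init=10, random_state=42).fit(data)
second = KMeans(n_clusters=2, n_init=10, random_state=42).fit(data)

print(np.array_equal(first.labels_, second.labels_))  # True
print(np.allclose(first.cluster_centers_, second.cluster_centers_))  # True
```

Without a fixed random_state, repeated runs may number the clusters differently or, on harder datasets, settle on different clusters.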

Analytics Vidhya does not own the media shown in this article, and the author uses it at their discretion.

Pranshu Sharma

Aspiring Data Scientist | M.TECH, CSE at NIT DURGAPUR

Norman

In part 2 - "What is K Means Algorithm" - you forgot a very important step, which is to determine the new cluster center by computing the average of the assigned points

imran hussain

hi it is a very good article. Please can you also write an article with coding how to choose cluster head based on battery power in each cluster?

Hosein

thanks for your helpful article I just wanted to say there are some parts that needs to be corrected : first one : 5. Python Implementation >> Clustering >> means -> kmeans second one : in section "Trying different method ( to find no .of clusters to be selected) WCSS and Elbow Method" after "for" loop indent is forgotten
