Guide to K-Nearest Neighbors Algorithm in Machine Learning

Tavish Srivastava Last Updated : 25 Feb, 2025

In the four years of my data science career, more than 80% of the models I have built have been classification models, and just 15-20% have been regression models. These ratios can be more or less generalized throughout the industry. The reason behind this bias towards classification models is that most analytical problems involve making decisions. In this article, we will talk about one such widely used machine learning classification technique called the k-nearest neighbors (KNN) algorithm. Our focus will primarily be on how the algorithm works on new data and how the input parameter K affects the output/prediction.

Learning Objectives

Understand how KNN works and how to implement it in Python and R.
Get to know how to choose the right value of k for KNN
Understand the difference between training error rate and validation error rate

What is KNN (K-Nearest Neighbor) Algorithm in Machine Learning?
When Do We Use the KNN Algorithm?
How Does the KNN Algorithm Work?
How Do We Choose the Factor K?
Breaking It Down – Pseudo Code of KNN
Implementation in Python From Scratch
Comparing Our Model With Scikit-learn
Implementation of KNN in R
Comparing Our KNN Predictor Function With “Class” Library
Conclusion

What is KNN (K-Nearest Neighbor) Algorithm in Machine Learning?

The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning method that makes predictions based on how close a data point is to others. It is widely used for both classification and regression tasks because it is simple to understand and implement.

To make a prediction, the algorithm first computes the distance between the input data point and every point in the training data. Next, it identifies the K nearest neighbors to the input data point based on those distances. In the case of classification, the algorithm assigns the most common class label among the K neighbors as the predicted label for the input data point. For regression, it calculates the average or weighted average of the target values of the K neighbors to predict the value for the input data point.

The KNN algorithm is straightforward and easy to understand, making it a popular choice in various domains. However, its performance can be affected by the choice of K and the distance metric, so careful parameter tuning is necessary for optimal results.
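
To illustrate why the distance metric matters, here is a minimal sketch in plain Python. The two candidate points and the query are hypothetical values picked only to show that Euclidean and Manhattan distance can disagree about which neighbor is nearest.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical points, chosen only to show that the metric can change the ranking.
query = (0.0, 0.0)
candidates = {"p": (3.0, 3.0), "q": (0.0, 5.0)}

nearest_euclidean = min(candidates, key=lambda name: euclidean(query, candidates[name]))
nearest_manhattan = min(candidates, key=lambda name: manhattan(query, candidates[name]))
print(nearest_euclidean, nearest_manhattan)  # prints: p q
```

Point p is closer in a straight line (about 4.24 versus 5), while q is closer when distances are summed along each axis (5 versus 6), so the same K can select different neighbors under different metrics.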

When Do We Use the KNN Algorithm?

The KNN algorithm can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. To evaluate any technique, we generally look at three important aspects:

1. Ease of interpreting output

2. Calculation time

3. Predictive Power

Let us take a few examples to place KNN on this scale:

The KNN classifier fares well across all three parameters. It is commonly used for its ease of interpretation and low calculation time (there is no separate training step, although each prediction has to scan the training data).

How Does the KNN Algorithm Work?

Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green squares (GS):

You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take the vote from. Let’s say K = 3. Hence, we will now draw a circle with BS as the center, just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:

The three closest points to BS are all RC. Hence, with a good confidence level, we can say that BS should belong to the class RC. Here, the choice became obvious as all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm. Next, we will look at the factors to consider when choosing the best K.
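
The vote described above can be sketched in a few lines of Python. The coordinates below are hypothetical, since the original plot is only illustrative; they simply place three red circles near the blue star and three green squares farther away.

```python
import math
from collections import Counter

# Hypothetical coordinates: three red circles (RC) and three green squares (GS).
training = [
    ((1.0, 1.0), "RC"),
    ((1.5, 2.0), "RC"),
    ((2.0, 1.0), "RC"),
    ((5.0, 5.0), "GS"),
    ((6.0, 5.5), "GS"),
    ((5.5, 6.5), "GS"),
]
blue_star = (1.8, 1.6)  # the point whose class we want (assumed position)
k = 3

# Rank every training point by its Euclidean distance to the blue star.
ranked = sorted(training, key=lambda item: math.dist(item[0], blue_star))

# Count the class labels of the k closest points and take the majority.
votes = Counter(label for _, label in ranked[:k])
predicted = votes.most_common(1)[0][0]
print(predicted, dict(votes))  # prints: RC {'RC': 3}
```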

How Do We Choose the Factor K?

First, let us try to understand the influence of K on the algorithm. If we consider the last example, keeping all 6 training observations constant, a given value of K lets us draw boundaries for each class. These decision boundaries separate, for instance, RC from GS. Let us now examine how the value of K affects these class boundaries. The following illustrates the boundaries that separate the two classes for different values of K.

If you look carefully, you can see that the boundary becomes smoother as K increases. As K grows toward the total number of training points, the whole plane finally becomes all blue or all red, depending on which class holds the overall majority. The training error rate and the validation error rate are the two quantities we need in order to assess different values of K. Following is the curve for the training error rate with a varying value of K:

As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself. Hence, the prediction is always accurate with K=1. If the validation error curve had been similar, our choice of K would have been 1. Following is the validation error curve with a varying value of K:

This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the validation error rate initially decreases and reaches a minimum. After the minimum, it increases with increasing K. To get the optimal value of K, you can split the initial dataset into training and validation sets. Then plot the validation error curve and pick the K at its minimum. This value of K should be used for all predictions.
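
The procedure above can be sketched in plain Python as follows. The data is synthetic (two overlapping Gaussian blobs, an assumption made purely for illustration), and the helper names `knn_predict` and `error_rate` are hypothetical. The sketch computes both error rates for a range of K values and picks the K with the lowest validation error.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, rows, k):
    """Fraction of `rows` whose predicted label differs from the true label."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in rows)
    return wrong / len(rows)

# Synthetic two-class data (assumed for illustration): two overlapping blobs.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "RC") for _ in range(60)]
data += [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), "GS") for _ in range(60)]
rng.shuffle(data)
train, valid = data[:80], data[80:]

# Odd values of K avoid tied votes in a two-class problem.
results = {k: (error_rate(train, train, k), error_rate(train, valid, k))
           for k in range(1, 40, 2)}
for k, (train_err, valid_err) in results.items():
    print(f"K={k:2d}  training error={train_err:.3f}  validation error={valid_err:.3f}")

best_k = min(results, key=lambda k: results[k][1])
print("K with the lowest validation error:", best_k)
```

The training error at K=1 comes out as zero because each training point is its own nearest neighbor, while the validation error is the number that should drive the choice of K.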

Breaking It Down – Pseudo Code of KNN

We can implement a KNN model by following the below steps:

Load the data
Initialise the value of k
For getting the predicted class, iterate from 1 to total number of training data points
- Calculate the distance between test data and each row of training dataset. Here we will use Euclidean distance as our distance metric since it’s the most popular method. The other distance functions or metrics that you can use include Manhattan distance, Minkowski distance, Chebyshev, cosine, etc. If you have categorical variables, you can use Hamming distance.
- Sort the calculated distances in ascending order based on distance values
- Get top k rows from the sorted array
- Get the most frequent class of these rows
- Return the predicted class
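
The steps above can be sketched as a single Python function. This is a minimal illustration using Euclidean distance and only the standard library; the function name `knn_classify` and the tiny dataset are hypothetical and are not taken from Iris.

```python
import math
from collections import Counter

def knn_classify(train_rows, train_labels, test_row, k):
    """Return the predicted label and the indices of the k nearest training rows."""
    # Distance from the test row to every training row.
    distances = [
        (math.dist(row, test_row), index) for index, row in enumerate(train_rows)
    ]
    # Sort ascending by distance and keep the top k rows.
    distances.sort()
    neighbors = [index for _, index in distances[:k]]
    # Most frequent class among those rows.
    votes = Counter(train_labels[index] for index in neighbors)
    return votes.most_common(1)[0][0], neighbors

# Tiny made-up dataset (hypothetical values).
rows = [(5.1, 3.5), (4.9, 3.0), (6.7, 3.1), (6.9, 3.2), (5.0, 3.4)]
labels = ["a", "a", "b", "b", "a"]
label, neighbors = knn_classify(rows, labels, (5.0, 3.3), k=3)
print(label, neighbors)  # prints: a [4, 0, 1]
```

Returning the neighbor indices alongside the label makes it easy to compare the result against a library implementation, which is the check performed later in this article.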

Implementation in Python From Scratch

We will be using the popular Iris dataset for building our KNN model. You can download it from here.

import pandas as pd
data = pd.read_csv("iris.csv")
data.head()

Comparing Our Model With Scikit-learn

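from sklearn.neighbors import KNeighborsClassifier

# `test` holds the single observation to classify. The values below are an
# assumed example query; its original definition is not shown in this section.
test = pd.DataFrame([[7.2, 3.6, 5.1, 2.5]], columns=data.columns[0:4])

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(data.iloc[:,0:4], data['Name'])

# Predicted class
print(neigh.predict(test))

-> ['Iris-virginica']

# 3 nearest neighbors
print(neigh.kneighbors(test)[1])
-> [[141 139 120]]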

We can see that both the models predicted the same class (‘Iris-virginica’) and the same nearest neighbors ( [141 139 120] ). Hence we can conclude that our model runs as expected.

Implementation of KNN in R

Step 1: Importing the data
Step 2: Checking the data and calculating the data summary

data<-read.table(file.choose(), header = T, sep = ",", dec = ".")#Importing the data 
head(data)  #Top observations present in the data
dim(data)   #Check the dimensions of the data
summary(data) #Summarise the data

Output

#Top observations present in the data
SepalLength SepalWidth PetalLength PetalWidth Name
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa

#Check the dimensions of the data
[1] 150 5

#Summarise the data
SepalLength SepalWidth PetalLength PetalWidth Name
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 Iris-setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 Iris-versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 Iris-virginica :50
Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

Step 3: Splitting the Data

#Splitting the data set into train and test
set.seed(2)

part <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))

train<- data[part == 1,]

test<- data[part == 2,]

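Step 4: Calculating the Euclidean Distance

#Calculating the euclidean distance between two rows
#(the last column is the label, so it is skipped)
ED<-function(data1,data2){
  distance=0
  for (i in (1:(length(data1)-1))){
    #[[i]] extracts the numeric value, so the result is a plain number
    distance=distance+(data1[[i]]-data2[[i]])^2
  }
  return(sqrt(distance))
}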

Step 5: Writing the function to predict kNN
Step 6: Calculating the label (Name) for K=1

#Writing the function to predict kNN
knn_predict <- function(test, train, k_value){
  pred <- c()
  #LOOP-1: looping over test data
  for(i in c(1:nrow(test))){
    dist = c()   #distances from test row i to every training row
    char = c()   #class label of every training row

    #LOOP-2: looping over train data
    for(j in c(1:nrow(train))){
      dist <- c(dist, ED(test[i,], train[j,]))
      char <- c(char, as.character(train[j,][[5]]))
    }

    df <- data.frame(char, dist)
    df <- df[order(df$dist),]       #sorting dataframe by distance
    df <- df[1:k_value,]            #keeping the k nearest neighbors

    #Majority vote: the most frequent class among the k neighbors
    n <- table(df$char)
    pred <- c(pred, names(n)[which.max(n)])
  }
  return(pred) #return prediction vector
}

#Predicting the value for K=1
K=1
predictions <- knn_predict(test, train, K)

Output

For K=1
[1] "Iris-virginica"

In the same way, you can compute for other values of K.

Comparing Our KNN Predictor Function With “Class” Library

install.packages("class")
library(class)

#Normalization
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x))) }
norm <- as.data.frame(lapply(data[,1:4], normalize))

set.seed(123)
data_spl <- sample(1:nrow(norm),size=nrow(norm)*0.7,replace = FALSE) 

train2 <- norm[data_spl,] # 70% training data (normalized features only)
test2 <- norm[-data_spl,] # remaining 30% test data (normalized features only)

train_labels <- data[data_spl,5]
test_labels <-data[-data_spl,5]
knn_pred <- knn(train=train2, test=test2, cl=train_labels, k=1)

Output

For K=1
[1] "Iris-virginica"

We can see that both models predicted the same class (‘Iris-virginica’).

Conclusion

The KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it can give highly competitive results. The KNN algorithm can also be used for regression problems. The only difference from the methodology discussed here is that the prediction is the average of the nearest neighbors' target values rather than a vote among them. KNN can be coded in a single line in R. I am yet to explore how we can use the KNN algorithm in SAS.
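
As a minimal sketch of the regression variant mentioned above, the following Python function replaces the vote with a plain average. The function name `knn_regress` and the one-feature dataset are hypothetical and exist only for illustration.

```python
import math

def knn_regress(train_rows, train_targets, test_row, k):
    """Predict a numeric target as the plain average over the k nearest rows."""
    # Indices of the training rows, ordered by distance to the test row.
    order = sorted(range(len(train_rows)),
                   key=lambda i: math.dist(train_rows[i], test_row))
    nearest = order[:k]
    return sum(train_targets[i] for i in nearest) / k

# Hypothetical one-feature data where the target is roughly twice the feature.
xs = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
prediction = knn_regress(xs, ys, (2.9,), k=3)
print(round(prediction, 2))  # prints: 6.07
```

The three nearest feature values to 2.9 are 3.0, 2.0 and 4.0, so the prediction is the mean of 6.2, 3.9 and 8.1.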

Key Takeaways

KNN classifier operates by finding the k nearest neighbors to a given data point, and it takes the majority vote to classify the data point.
The value of k is crucial, and one needs to choose it wisely to prevent overfitting or underfitting the model.
One can use cross-validation to select the optimal value of k for the k-NN algorithm, which helps improve its performance and prevent overfitting or underfitting.
The above article provides implementations of KNN in Python and R, and it compares the result with scikit-learn and the “Class” library in R.
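
To make the cross-validation takeaway concrete, here is a hedged sketch of choosing k with 5-fold cross-validation in plain Python. The synthetic data, the candidate list and the helper `cross_val_error` are assumptions made for illustration; in practice a library routine would usually do this job.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_error(data, k, folds=5):
    """Average validation error of k-NN over `folds` disjoint held-out folds."""
    errors = []
    for f in range(folds):
        valid = data[f::folds]  # every folds-th row, starting at position f
        train = [row for i, row in enumerate(data) if i % folds != f]
        wrong = sum(knn_predict(train, x, k) != y for x, y in valid)
        errors.append(wrong / len(valid))
    return sum(errors) / folds

# Synthetic two-class data (assumed for illustration).
rng = random.Random(1)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "RC") for _ in range(50)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), "GS") for _ in range(50)]
rng.shuffle(data)

candidates = [1, 3, 5, 7, 9, 11, 15]
scores = {k: cross_val_error(data, k) for k in candidates}
best_k = min(scores, key=scores.get)
for k in candidates:
    print(f"k={k:2d}  mean validation error={scores[k]:.3f}")
print("selected k:", best_k)
```

Averaging over several folds makes the chosen k less dependent on one particular train/validation split than a single hold-out set would be.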

Q1. What is K nearest neighbors algorithm?

A. The KNN classifier is a machine learning algorithm used for classification and regression problems. It works by finding the K nearest points in the training dataset and using their classes or values to predict the class or value of a new data point. It makes no assumption about the shape of the decision boundary and is easy to implement, which is why KNN has become a popular tool in the field of artificial intelligence.

Q2. What is KNN algorithm used for?

A. KNN algorithm is most commonly used for:
1. Disease prediction – Predicting the likelihood of diseases based on symptoms.
2. Handwriting recognition – Recognizing handwritten characters.
3. Image classification – Categorizing and recognizing images.

Q3. What is the difference between KNN and Artificial Neural Networks?

A. K-nearest neighbors (KNN) mainly handles classification and regression problems, while Artificial Neural Networks (ANN) tackle complex function approximation and pattern recognition problems. Moreover, an ANN typically has a higher training cost than KNN, which has no training phase and instead pays its computational cost at prediction time.

Tavish Srivastava

Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.

Harshal

Useful article. Can you share similar article for randomforest ? What are limitations with data size for accuracy?

Harshal, We have already published many articles on random forest. Here is the link of the article on random forest on similar lines http://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/. You can also subscribe to analyticsvidhya to get access to weekly updates on such articles.

saurabh

Good one please share the value of Red circle and green square

Saurabh, The first graph is for illustrating purposes. You can create a random dataset to check the algorithm.

DEBASHIS ROUT

I am currently doing part time MS in BI & Data Mining. I found this article is really helpful to understand in more detail and expecting to utilize in my upcoming project work. I need to know do you have any article on importance of Data quality in BI , Classification & Decision Tree.

Tavish

Debashish, We have published many articles on CART models before. Here is a link which will give you a kick start http://www.analyticsvidhya.com/blog/2014/06/comparing-random-forest-simple-cart-model/.
