Heart Disease Prediction using KNN - The K-Nearest Neighbours Algorithm

Siddharth M Last Updated : 10 Jul, 2021


This article was published as a part of the Data Science Blogathon

Introduction:

Supervised learning is one of the main types of machine learning. In it, we already have the correct output for each example, along with the set of features associated with that output. We train an algorithm on this existing data and then use it to predict the output for new data for which only the features are known. This is like a teacher who teaches students a topic and tells them what is correct; in the exam, the students must apply what they have learnt to give the correct answers. K-Nearest Neighbours (KNN) is a supervised learning algorithm that is used for both classification and regression tasks.

About KNN:

KNN predicts the output for a new point from the training points that are most similar to it, that is, closest to it in feature space.
KNN is a non-parametric method: it does not estimate the parameters of any fixed functional form.
It makes no assumptions about the distribution of the features or the output, beyond the idea that nearby points tend to have similar outputs.
KNN is also called a lazy learner because it simply stores the training data instead of learning a set of weights. Hence most of the computation happens at prediction time rather than at training time.
For classification, KNN finds the training points nearest to the new point and assigns it to the class that is most common among them.

Working of KNN Algorithm:

First, we select a value for K.
Next, we choose a distance measure. Let’s consider Euclidean distance here. We compute the Euclidean distance from the new point to every point in the training data.
We then sort the training points by distance and keep only the K nearest neighbours of the new point.
We count how many of these K neighbours belong to each class. The class with the most neighbours wins the vote, and we assign the new point to that class.
This is how the KNN algorithm classifies a new point.
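
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in the article: the function name knn_predict and the toy data are made up for the example, and ties in the vote are broken arbitrarily.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest neighbours."""
    x_train = np.asarray(x_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the class labels of those neighbours and return the most common one
    votes = Counter(y_train[int(i)] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration: two well-separated groups
toy_x = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.5], [6.5, 8.0]]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_x, toy_y, [2.0, 2.0], k=3))  # prints 0
print(knn_predict(toy_x, toy_y, [6.0, 7.0], k=3))  # prints 1
```

Note that all the work happens inside the prediction call; there is no separate training step, which is what "lazy" refers to.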

The ideal value of K in KNN:

For binary classification we usually choose an odd value of K, so that the vote among the neighbours cannot end in a tie and the new point can be assigned to the class with more votes.

If K is too small, there is a good chance of overfitting: the algorithm may perform well on the training data but poorly on the test data. A small K also makes predictions sensitive to noise, because a single mislabelled or unusual neighbour can decide the vote. If K is too large, on the other hand, the model becomes too smooth and may underfit.

One way to determine K is to run a loop over a set of candidate values, plot the error for each, and pick the K with the lowest error. I will perform these steps during the implementation on the heart disease data.

Pros and Cons of KNN algorithm:

Pros:

The algorithm is easy to implement.
With a sufficiently large K, it is fairly robust to noisy data, because the vote averages over the K nearest neighbours.
It works well when a large amount of representative training data is available.
The decision boundaries it forms can be of arbitrary shape.

Cons:

Curse of dimensionality: with many features, distances become dominated by irrelevant attributes.
Finding a good value of K can be time-consuming.
Prediction is computationally expensive, because the distance from the new point to every training point has to be computed.
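
To make the first point concrete, here is a small sketch of how a single irrelevant attribute can change which neighbour is nearest. The numbers and the helper nearest_index are invented for illustration only.

```python
import numpy as np


def nearest_index(query, candidates):
    """Return the index of the candidate closest to the query (Euclidean)."""
    diffs = np.asarray(candidates, dtype=float) - np.asarray(query, dtype=float)
    return int(np.argmin(np.linalg.norm(diffs, axis=1)))


# One relevant feature: candidate 0 (value 0.1) is clearly closest to the query
relevant_only = nearest_index([0.0], [[0.1], [1.0]])

# Same points plus an irrelevant second feature with a large spread:
# the irrelevant column now dominates the distance and candidate 1 wins
with_irrelevant = nearest_index([0.0, 0.0], [[0.1, 5.0], [1.0, 0.0]])

print(relevant_only, with_irrelevant)  # prints 0 1
```

Feature selection and feature scaling are the usual ways to limit this effect.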

Implementation of K-Nearest Neighbours on the Heart Disease dataset:

I have used the Heart Disease UCI dataset for this task, which is available on Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci

1. Importing all Libraries:

import pandas as pd
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.metrics import accuracy_score

As we can see, we have imported KNeighborsClassifier for our classification task from the sklearn (scikit-learn) library. Scikit-learn implements most of the commonly used machine learning classifiers, so we can simply import them and apply them to our problem.

2. Read the heart disease dataset:

df = pd.read_csv('heart.csv')
df.head()

As we can see, the target column tells us whether the person is suffering from heart disease or not.

sns.countplot(x='target', data=df)

[Figure: count plot of the target column]

We will proceed with the data as it is, since the target classes are not badly imbalanced.

3. Performing KNN by splitting to train and test set:

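from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The first 13 columns are the features; 'target' is the label
x = df.iloc[:, 0:13].values
y = df['target'].values

# Hold out 25% of the rows as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)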

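This step is common to most ML tasks: I have split the dataset into training and test sets and standardised the features. Scaling matters for KNN because it relies on distances, so features with large ranges would otherwise dominate. Note that the scaler is fitted on the training set only and then applied to the test set.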

4. Checking for the best value of k:

error = []
# Calculating the test error for K values from 1 to 29
for i in range(1, 30):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    # Fraction of test points that were misclassified
    error.append(np.mean(pred_i != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 30), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()
# error[0] corresponds to K = 1, hence the +1
print("Minimum error:-",min(error),"at K =",error.index(min(error))+1)

# Output => Minimum error:- 0.13157894736842105 at K = 7
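
The loop above picks K by looking at the test set, so the accuracy reported afterwards is likely to be somewhat optimistic. A common alternative is to choose K by cross-validation on the training data only. The following is a sketch of that idea rather than part of the original code: it uses a synthetic dataset from make_classification as a stand-in for the training split, and the names x_demo, y_demo and cv_error are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption: NOT the heart data)
x_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

cv_error = {}
for k in range(1, 30, 2):  # odd values of K only, to avoid tied votes
    # Putting the scaler in the pipeline re-fits it inside every fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, x_demo, y_demo, cv=5)  # accuracy per fold
    cv_error[k] = 1 - scores.mean()

best_k = min(cv_error, key=cv_error.get)
print("Best K by 5-fold cross-validation:", best_k)
```

With the real data, x_demo and y_demo would be replaced by the unscaled training features and labels, and the test set would be touched only once, for the final evaluation.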

5. Apply K-NN Algorithm:

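from sklearn.metrics import confusion_matrix

# Train with the K found above and predict on the test set
classifier = KNeighborsClassifier(n_neighbors=7)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
cm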

# Output =>array([[26,  7],
                 [ 3, 40]], dtype=int64)

This gives us the confusion matrix. I set K to 7 because that value gave the lowest mean error.

6. Accuracy:

accuracy_score(y_test, y_pred)

# Output => 0.868421052631579

We got about 87% accuracy on the test set (25% of the dataset), which is a good sign. We could improve the result further with more hyperparameter tuning.
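
The accuracy can be cross-checked against the confusion matrix reported above, and the same matrix also gives precision and recall. The sketch below hard-codes the matrix from the output and assumes scikit-learn's layout (rows are true classes, columns are predicted classes) with class 1 treated as the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[26, 7],
               [3, 40]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)        # how many predicted positives were right
recall = tp / (tp + fn)           # how many actual positives were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.868 precision=0.851 recall=0.930
```

The accuracy of 66/76 matches the value returned by accuracy_score.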

References:

1. https://www.dataminingbook.com/book/python-edition

2. https://www.kaggle.com/ronitf/heart-disease-uci

3. Image: https://unsplash.com/photos/KgLtFCgfC28

Applications of KNN:

Now we know about KNN and how to implement it. Let’s see some scenarios where KNN is used.

1. Music recommendation systems (and probably any recommendation system): A music service receives a large amount of new music, and there is a high chance that different versions of the same track get recommended. KNN can be used to detect such near-duplicates. We could even use it to find music that matches a person’s taste.

2. Outlier detection: KNN can identify outliers, since a point that lies far from its nearest neighbours is unlikely to belong with the rest of the data.

3. Similar document detection: Documents that are similar to each other can be identified using the KNN algorithm.
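
As an illustration of the outlier detection use case, one simple approach is to score each point by the distance to its k-th nearest neighbour and flag the points with the largest scores. This is only a sketch under that assumption; the toy points and the choice of k = 2 are made up for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four points close together and one far away (invented data)
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2], [8.0, 8.0]])

k = 2
# Ask for k + 1 neighbours because each point is its own nearest neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
distances, _ = nn.kneighbors(points)

# Outlier score: distance to the k-th nearest neighbour (last column)
kth_distance = distances[:, -1]
outlier_index = int(np.argmax(kth_distance))

print("Most isolated point:", points[outlier_index])
# prints: Most isolated point: [8. 8.]
```

The same NearestNeighbors lookup can be used to find similar documents once each document has been turned into a numeric vector.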

Conclusion

You can find the complete code here:

https://github.com/Siddharth1698/Machine-Learning-Codes/tree/main/knn_heart_disease

Feel free to connect with me on:

1. https://www.linkedin.com/in/siddharth-m-426a9614a/

2. https://github.com/Siddharth1698

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

