Get Started With Naive Bayes Algorithm: Theory & Implementation

Surabhi Last Updated : 15 Oct, 2024

7 min read

Introduction

Naive Bayes is a machine learning algorithm that is used by data scientists for classification. The naive Bayes algorithm works based on the Bayes theorem. Before explaining Naive Bayes, first, we should discuss Bayes Theorem. Bayes theorem is used to find the probability of a hypothesis with given evidence. This beginner-level article intends to introduce you to the Naive Bayes algorithm and explain its underlying concept and implementation.

conditional probability | naive bayes algorithm

In this equation, using Bayes theorem, we can find the probability of A, given that B occurred. A is the hypothesis, and B is the evidence.

P(B|A) is the probability of B given that A is True.

P(A) and P(B) are the independent probabilities of A and B.

Learning Objectives

Learn the concept behind the Naive Bayes algorithm.
See the steps involved in the naive Bayes algorithm
Practice the step-by-step implementation of the algorithm.

This article was published as a part of the Data Science Blogathon.

Introduction
What Is the Naive Bayes Classifier Algorithm?
Naive Bayes Theorem: The Concept Behind the Algorithm
Steps Involved in the Naive Bayes Classifier Algorithm
Implementation of Naive Bayes in Python Programming
What Are the Assumptions Made by the Naive Bayes Algorithm?
Conclusion
Frequently Asked Questions

What Is the Naive Bayes Classifier Algorithm?

The Naive Bayes classifier algorithm is a machine learning technique used for classification tasks. It is based on Bayes’ theorem and assumes that features are conditionally independent of each other given the class label. The algorithm calculates the probability of a data point belonging to each class and assigns it to the class with the highest probability.

Naive Bayes is known for its simplicity, efficiency, and effectiveness in handling high-dimensional data. It is commonly used in various applications, including text classification, spam detection, and sentiment analysis.

Naive Bayes Theorem: The Concept Behind the Algorithm

Let’s understand the concept of the Naive Bayes Theorem and how it works through an example. We are taking a case study in which we have the dataset of employees in a company, our aim is to create a model to find whether a person is going to the office by driving or walking using the salary and age of the person.

In the above image, we can see 30 data points in which red points belong to those who are walking and green belong to those who are driving. Now let’s add a new data point to it. Our aim is to find the category that the new point belongs to

Note that we are taking age on the X-axis and Salary on the Y-axis. We are using the Naive Bayes algorithm to find the category of the new data point. For this, we have to find the posterior probability of walking and driving for this data point. After comparing, the point belongs to the category having a higher probability.

The posterior probability of walking for the new data point is:

posterior probability | naive bayes algorithm

and that for the driving is:

Steps Involved in the Naive Bayes Classifier Algorithm

Step 1: We have to find all the probabilities required for the Bayes theorem for the calculation of posterior probability.

P(Walks) is simply the probability of those who walk among all.

In order to find the marginal likelihood, P(X), we have to consider a circle around the new data point of any radii, including some red and green points.

P(X|Walks) can be found by:

Now we can find the posterior probability using the Bayes theorem,

Step 2: Similarly, we can find the posterior probability of Driving, and it is 0.25

Step 3: Compare both posterior probabilities. When comparing the posterior probability, we can find that P(walks|X) has greater values, and the new point belongs to the walking category.

Source: Unsplash

Implementation of Naive Bayes in Python Programming

Now let’s implement Naive Bayes step by step using the python programming language

We are using the Social network ad dataset. The dataset contains the details of users on a social networking site to find whether a user buys a product by clicking the ad on the site based on their salary, age, and gender.

Step 1: Importing the libraries

Let’s start the programming by importing the essential libraries required.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

Step 2: Importing the dataset
Python Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [1, 2, 3]].values
y = dataset.iloc[:, -1].values

print(X)

Since our dataset contains character variables, we have to encode it using LabelEncoder.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:,0] = le.fit_transform(X[:,0])

Step 3: Train test splitting

We are splitting our data into train and test datasets using the scikit-learn library. We are providing the test size as 0.20, which means our training data contains 320 training sets, and the test sample contains 80 test sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

Step 4: Feature scaling

Next, we are doing feature scaling to the training and test set of independent variables.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Step 5: Training the Naive Bayes model on the training set

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

Let’s predict the test results

y_pred  =  classifier.predict(X_test)

Predicted and actual value

 y_pred

y_test

For the first 8 values, both are the same. We can evaluate our matrix using the confusion matrix and accuracy score by comparing the predicted and actual test values

from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test,y_pred)

confusion matrix

ac – 0.9125

Accuracy is good. Note that you can achieve better results for this problem using different algorithms.

Full Python Tutorial

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
ac = accuracy_score(y_test,y_pred)
cm = confusion_matrix(y_test, y_pred)

What Are the Assumptions Made by the Naive Bayes Algorithm?

There are several variants of Naive Bayes, such as Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes. Each variant has its own assumptions and is suited for different types of data. Here are some assumptions that the Naive Bayers algorithm makes:

The main assumption is that it assumes that the features are conditionally independent of each other.
Each of the features is equal in terms of weightage and importance.
The algorithm assumes that the features follow a normal distribution.
The algorithm also assumes that there is no or almost no correlation among features.

Conclusion

The naive Bayes algorithm is a powerful and widely-used machine learning algorithm that is particularly useful for classification tasks. This article explains the basic math behind the Naive Bayes algorithm and how it works for binary classification problems. Its simplicity and efficiency make it a popular choice for many data science applications. we have covered most concepts of the algorithm and how to implement it in Python. Hope you liked the article, and do not forget to practice algorithms.

Key Takeaways

Naive Bayes is a probabilistic classification algorithm(binary o multi-class) that is based on Bayes’ theorem.
There are different variants of Naive Bayes, which can be used for different tasks and can even be used for regression problems.
Naive Bayes can be used for a variety of applications, such as spam filtering, sentiment analysis, and recommendation systems.

Frequently Asked Questions

Q1. When should we use a naive Bayes classifier?

A. The naive Bayes classifier is a good choice when you want to solve a binary or multi-class classification problem when the dataset is relatively small and the features are conditionally independent. It is a fast and efficient algorithm that can often perform well, even when the assumptions of conditional independence do not strictly hold. Due to its high speed, it is well-suited for real-time applications. However, it may not be the best choice when the features are highly correlated or when the data is highly imbalanced.

Q2. What is the difference between Bayes Theorem and Naive Bayes Algorithm?

A. Bayes theorem provides a way to calculate the conditional probability of an event based on prior knowledge of related conditions. The naive Bayes algorithm, on the other hand, is a machine learning algorithm that is based on Bayes’ theorem, which is used for classification problems.

Q3. Is Naive Bayes a regression technique or classification technique?

It is not a regression technique, although one of the three types of Naive Bayes, called Gaussian Naive Bayes, can be used for regression problems.

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.

Surabhi

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Onkar

Awesome explanation ☺ This will clear all the doubts and its very helful for newbies. Keep up the good work 👍

Ryan

Hi, Great post;) I would like to ask when estimating the marginal likelihood P(X), we need to draw a circle around the new data, how should we choose the radius in order to increase accuracy of the estimation? And how is the radius or metric used going too affect the accuracy? Is there any book you can recommend for this topic? Thank you so much.

Sania

Thanks Surbhi! Easy to understand.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Get Started With Naive Bayes Algorithm: Theory & Implementation

Introduction

Table of contents

What Is the Naive Bayes Classifier Algorithm?

Naive Bayes Theorem: The Concept Behind the Algorithm

Steps Involved in the Naive Bayes Classifier Algorithm

Implementation of Naive Bayes in Python Programming

Step 5: Training the Naive Bayes model on the training set

Full Python Tutorial

What Are the Assumptions Made by the Naive Bayes Algorithm?

Conclusion

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid