Let’s start with a practical example of using the Naive Bayes Algorithm.
Assume this is a situation you’ve got into in your data science project:
You are working on a classification problem and have generated your set of hypotheses, created features, and discussed the importance of variables. Within an hour, stakeholders want to see the first cut of the model.
What will you do? You have hundreds of thousands of data points and several variables in your training data set. In such a situation, if I were in your place, I would use the ‘Naive Bayes Classifier’, which can be extremely fast relative to other classification algorithms. It applies Bayes’ theorem of probability to predict the class of unknown data.
In this article, you will explore the Naive Bayes classifier, a fundamental technique in machine learning. We will discuss the Naive Bayes algorithm, its applications, and how to implement the Naive Bayes classifier in Python for efficient data classification.
The Naïve Bayes classifier belongs to a family of generative learning algorithms, which aim to model the distribution of inputs within a specific class or category. Unlike discriminative classifiers such as logistic regression, it doesn’t learn which features are most crucial for distinguishing between classes. It’s widely used in text classification, spam filtering, and recommendation systems.
It is a classification technique based on Bayes’ Theorem with an independence assumption among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
The Naive Bayes classifier is a popular supervised machine learning algorithm used for classification tasks such as text classification. It belongs to the family of generative learning algorithms, which means that it models the distribution of inputs for a given class or category. This approach is based on the assumption that the features of the input data are conditionally independent given the class, allowing the algorithm to make predictions quickly and accurately.
In statistics, naive Bayes classifiers are simple probabilistic classifiers that apply Bayes’ theorem. This theorem gives the probability of a hypothesis given the data and some prior knowledge. The naive Bayes classifier assumes that all features in the input data are independent of each other given the class, which is often not true in real-world scenarios. However, despite this simplifying assumption, the naive Bayes classifier is widely used because of its efficiency and good performance in many real-world applications.
Moreover, it is worth noting that naive Bayes classifiers are among the simplest Bayesian network models, yet they can achieve high accuracy levels when coupled with kernel density estimation. This technique involves using a kernel function to estimate the probability density function of the input data, allowing the classifier to improve its performance in complex scenarios where the data distribution is not well-defined. As a result, the naive Bayes classifier is a powerful tool in machine learning, particularly in text classification, spam filtering, and sentiment analysis, among others.
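To make the kernel density idea concrete, here is a minimal sketch, assuming scikit-learn's `KernelDensity` estimator with a Gaussian kernel. The class name `KDENaiveBayes`, the bandwidth of 0.5, and the synthetic data are all illustrative assumptions and not part of any library.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


class KDENaiveBayes:
    """Naive Bayes with one 1-D kernel density estimate per (class, feature)."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth  # assumed value; tune it for real data

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.log_priors_ = []
        self.kdes_ = []
        for c in self.classes_:
            Xc = X[y == c]
            self.log_priors_.append(np.log(len(Xc) / len(X)))
            # Independence assumption: each feature gets its own density model.
            self.kdes_.append([
                KernelDensity(kernel="gaussian", bandwidth=self.bandwidth).fit(Xc[:, [j]])
                for j in range(X.shape[1])
            ])
        return self

    def predict(self, X):
        # Log prior plus the sum of per-feature log densities, one column per class.
        scores = np.column_stack([
            log_prior + sum(kde.score_samples(X[:, [j]]) for j, kde in enumerate(kdes))
            for log_prior, kdes in zip(self.log_priors_, self.kdes_)
        ])
        return self.classes_[np.argmax(scores, axis=1)]


# Synthetic two-class data (hypothetical), well separated on purpose.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), rng.normal(5.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

kde_nb = KDENaiveBayes().fit(X, y)
kde_accuracy = float(np.mean(kde_nb.predict(X) == y))
print("training accuracy:", kde_accuracy)
```

The only change from a Gaussian Naive Bayes model is that each per-feature density is estimated from the data instead of being assumed normal, which can help when a feature's distribution is skewed or multimodal.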
For example, a fruit may be considered an apple if it is red, round, and about 3 inches wide. Even if these features depend on each other, the classifier treats each one as contributing independently to the probability that the fruit is an apple. That’s why it’s called ‘Naive’.
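The fruit example can be sketched in a few lines of Python. All the probabilities below are invented purely for illustration, and the second class ("cherry") is a hypothetical addition so that there is something to compare against.

```python
from math import prod

# Hypothetical numbers, invented purely for illustration.
priors = {"apple": 0.5, "cherry": 0.5}
likelihoods = {
    "apple": {"red": 0.7, "round": 0.9, "about 3 inches": 0.8},
    "cherry": {"red": 0.9, "round": 0.9, "about 3 inches": 0.01},
}


def naive_score(label, features):
    # Naive assumption: multiply the prior by each feature's likelihood,
    # as if the features were independent given the class.
    return priors[label] * prod(likelihoods[label][f] for f in features)


observed = ["red", "round", "about 3 inches"]
scores = {label: naive_score(label, observed) for label in priors}

# Normalise the scores so they sum to 1 and can be read as posteriors.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

Each feature contributes one multiplicative factor, so a single very unlikely feature (here, a 3-inch cherry) is enough to rule a class out.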
An NB model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods.
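Bayes’ theorem provides a way of computing the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
Here P(c|x) is the posterior probability of the class (c, the target) given the predictor (x, the attribute), P(c) is the prior probability of the class, P(x|c) is the likelihood, that is, the probability of the predictor given the class, and P(x) is the prior probability of the predictor.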
Let’s understand it using an example. Below I have a training data set of weather and the corresponding target variable ‘Play’ (indicating whether a game was played). Now, we need to classify whether players will play or not based on the weather conditions. Let’s follow the steps below.
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities, for example P(Overcast) = 0.29 and P(Yes), the probability of playing, = 0.64.
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
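The first two steps can be sketched with the standard library alone. The Sunny counts and the totals below follow the numbers quoted in this article (14 days, 9 "Yes", 5 Sunny of which 3 are "Yes", and P(Overcast) = 0.29). How the remaining days split between Overcast and Rainy for each class is an assumption made for illustration.

```python
from collections import Counter

# Rows of (weather, play). Sunny counts and totals follow the text;
# the Overcast/Rainy split between Yes and No is an assumption.
rows = (
    [("Sunny", "Yes")] * 3 + [("Sunny", "No")] * 2
    + [("Overcast", "Yes")] * 4
    + [("Rainy", "Yes")] * 2 + [("Rainy", "No")] * 3
)

# Step 1: frequency table of (weather, play) pairs and the marginal counts.
frequency = Counter(rows)
weather_totals = Counter(weather for weather, _ in rows)
play_totals = Counter(play for _, play in rows)
n = len(rows)

# Step 2: likelihood table P(weather | play), plus the marginal probabilities.
likelihood = {
    (weather, play): frequency[(weather, play)] / play_totals[play]
    for weather in weather_totals
    for play in play_totals
}
p_weather = {w: count / n for w, count in weather_totals.items()}
p_play = {p: count / n for p, count in play_totals.items()}

print("P(Overcast) =", round(p_weather["Overcast"], 2))
print("P(Yes) =", round(p_play["Yes"], 2))
print("P(Sunny | Yes) =", round(likelihood[("Sunny", "Yes")], 2))
```

With these tables in hand, step 3 is a matter of plugging the likelihood, prior, and evidence into Bayes' theorem for each class.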
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the above-discussed method of posterior probability.
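P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here P(Sunny | Yes) * P(Yes) is in the numerator, and P(Sunny) is in the denominator.
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60 (using the exact fractions). The same calculation for the other class gives P(No | Sunny) = 0.40 * 0.36 / 0.36 = 0.40, so ‘Yes’ has the higher posterior probability and the statement is correct.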
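The arithmetic above can be checked with exact fractions. This sketch uses only the counts quoted in the example; the helper name `posterior` is an illustrative choice.

```python
from fractions import Fraction

# Counts from the worked example: 14 days, 9 "Yes", 5 "Sunny",
# of which 3 are "Yes" (so 2 are "No").
n_days, n_yes, n_sunny, n_sunny_yes = 14, 9, 5, 3
n_no = n_days - n_yes
n_sunny_no = n_sunny - n_sunny_yes


def posterior(n_sunny_and_class, n_class):
    likelihood = Fraction(n_sunny_and_class, n_class)  # P(Sunny | class)
    prior = Fraction(n_class, n_days)                  # P(class)
    evidence = Fraction(n_sunny, n_days)               # P(Sunny)
    return likelihood * prior / evidence


p_yes_given_sunny = posterior(n_sunny_yes, n_yes)
p_no_given_sunny = posterior(n_sunny_no, n_no)
print(float(p_yes_given_sunny), float(p_no_given_sunny))
```

Working with fractions avoids the small discrepancy that appears when the rounded values 0.33, 0.64, and 0.36 are multiplied by hand.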
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification (NLP) and in problems with multiple class labels.
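As a small illustration of the text classification use case, here is a sketch that combines scikit-learn's `CountVectorizer` and `MultinomialNB` in a pipeline. The six messages and their labels are made up for this example; a real spam filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only.
messages = [
    "win free prize now",
    "free money offer",
    "claim your free prize",
    "meeting at noon tomorrow",
    "project report attached",
    "lunch with the team tomorrow",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Word counts become the features; MultinomialNB models them per class.
text_model = make_pipeline(CountVectorizer(), MultinomialNB())
text_model.fit(messages, labels)

predictions = text_model.predict(["free prize offer", "team meeting tomorrow"])
print(predictions)
```

Each word in a message acts as one feature, and the classifier multiplies the per-word likelihoods for each class, exactly as in the weather example.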
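Again, scikit-learn (a Python library) will help here to build a Naive Bayes model in Python. There are five types of NB models under the scikit-learn library:
Gaussian (GaussianNB): for continuous features that are assumed to follow a normal distribution within each class.
Multinomial (MultinomialNB): for discrete counts, such as word counts in text classification.
Bernoulli (BernoulliNB): for binary features, such as whether a word occurs in a document or not.
Complement (ComplementNB): an adaptation of the multinomial model intended for imbalanced data sets.
Categorical (CategoricalNB): for features that take discrete categorical values.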
The Python code below trains a Gaussian Naive Bayes model with scikit-learn and reports its accuracy on the training and test data.
# importing required libraries
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)
# Now, we need to predict the missing target variable in the test data
# target variable - Survived
# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'])
train_y = train_data['Survived']
# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'])
test_y = test_data['Survived']
# Create the object of the Naive Bayes model
# You can also add other parameters and test your code here
# Some parameters are : var_smoothing
# (see the sklearn GaussianNB documentation for details)
model = GaussianNB()
# fit the model with the training data
model.fit(train_x, train_y)
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train)
# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test)
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)
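If the CSV files used above are not at hand, the same workflow can be tried on generated data. This is a minimal sketch, assuming `make_classification` as a synthetic stand-in for the real data set; the variable names and the 75/25 split are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the CSV files (an assumption of this sketch).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit on the training split only, then score both splits.
demo_model = GaussianNB()
demo_model.fit(X_train, y_train)

demo_train_acc = accuracy_score(y_train, demo_model.predict(X_train))
demo_test_acc = accuracy_score(y_test, demo_model.predict(X_test))
print("accuracy_score on train dataset :", demo_train_acc)
print("accuracy_score on test dataset :", demo_test_acc)
```

Comparing the two scores gives a quick indication of overfitting: a training accuracy far above the test accuracy suggests the model does not generalise well.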
# R code: the e1071 package holds the Naive Bayes classifier
library(e1071)
Train <- read.csv(file.choose())
Test <- read.csv(file.choose())
# Make sure the target variable is of a two-class classification problem only
model <- naiveBayes(Item_Fat_Content ~ ., data = Train)
pred <- predict(model, Test)
Above, we looked at the basic NB Model. You can improve the power of this basic model by tuning parameters and handling assumptions intelligently. Let’s look at the methods to improve the performance of this model. I recommend you go through this document for more details on Text classification using Naive Bayes.
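As one concrete instance of parameter tuning, `GaussianNB` exposes `var_smoothing`, which can be searched over with cross-validation. The sketch below assumes synthetic data from `make_classification` and a logarithmic grid chosen for illustration; on real data the useful range may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)

# Candidate values for var_smoothing on a log scale (an assumed grid).
param_grid = {"var_smoothing": np.logspace(-12, 0, 13)}

# 5-fold cross-validated search, scored by accuracy.
search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Naive Bayes has few parameters to tune, so gains from a search like this are often modest compared with what better pre-processing and feature selection can bring.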
Here are some tips for improving the power of a Naive Bayes model:
In this article, we looked at one of the supervised machine learning algorithms, the “Naive Bayes Classifier”, which is mainly used for classification. Congrats, if you’ve thoroughly read and understood this article, you’ve already taken your first step toward mastering this algorithm. From here, all you need is practice.
Further, I would suggest you focus more on data pre-processing and feature selection before applying the algorithm. In a future post, I will discuss text and document classification using Naive Bayes in more detail.
Hope you like the article! The Naive Bayes classifier is a powerful tool in machine learning, utilizing the Naive Bayes algorithm for efficient classification tasks. Implementing the Naive Bayes classifier in Python enhances its accessibility and usability for various applications.
A. The Naive Bayes classifier assumes independence among features, a rarity in real-life data, earning it the label ‘naive’.
A. Email spam detection, where the algorithm classifies emails as spam or not spam based on word frequency.
A. It’s “naive” because it assumes all features are independent, which is rarely true in real-world data, and “Bayes” because it’s based on Bayes’ Theorem.
A. No, probabilities are always between 0 and 1. If the calculated value exceeds 1, it indicates an error in computation or input.
A. Examples include spam filtering and sentiment analysis.