Bayesian decision theory is a statistical approach that quantifies the trade-offs among classification decisions using probability, specifically Bayes’ theorem, together with the costs associated with those decisions. At its core, it is a classification technique that uses Bayes’ theorem to compute conditional (posterior) probabilities. In statistical pattern recognition, the focus is on the statistical properties of patterns, typically expressed through probability density functions (pdfs) and probability mass functions (pmfs). This article concentrates on these aspects, aiming to build a foundational understanding of Bayesian decision theory.
Random Variable
A random variable is a function that maps a set of possible outcomes to numerical values. For example, when tossing a coin we can map heads (H) to 1 and tails (T) to 0, so the random variable takes the value 0 or 1.
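As a quick illustration (a hypothetical snippet, not from the original article), the coin-toss random variable can be written as a small Python function that maps the outcomes H and T to 1 and 0:

```python
import random

# A random variable is a mapping from outcomes to numbers:
# here the coin-toss outcomes 'H' and 'T' are mapped to 1 and 0.
def coin_toss_rv():
    outcome = random.choice(['H', 'T'])  # sample an outcome from the sample space
    return 1 if outcome == 'H' else 0    # map the outcome to a numeric value

print([coin_toss_rv() for _ in range(10)])  # e.g. [1, 0, 0, 1, 1, 0, ...]
```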
Bayes Theorem
The conditional probability of A given B, written P(A|B), is the probability that A occurs given that B has occurred.
Using the chain rule, this can also be written as:
P(A|B) = P(B|A) P(A) / P(B)    … (1)
where P(B) = P(B, A) + P(B, A') = P(B|A) P(A) + P(B|A') P(A').
Equation (1) is known as Bayes’ theorem of probability.
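To make the formula concrete, here is a minimal Python sketch (the probability values are illustrative assumptions, not from the article) that computes P(A|B) from P(B|A), P(A), and P(B|A'), expanding the evidence P(B) by the law of total probability:

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded over A and A'."""
    p_not_a = 1.0 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a  # P(B) = P(B|A)P(A) + P(B|A')P(A')
    return p_b_given_a * p_a / p_b

# Illustrative numbers: P(B|A) = 0.9, P(A) = 0.3, P(B|A') = 0.2
print(bayes_posterior(0.9, 0.3, 0.2))  # ≈ 0.659
```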
Our aim is to understand each of the components in this theorem. Let’s go through them step by step:
(a) Prior or State of Nature: P(ωj), the probability of each class (state of nature) before any feature is observed.
(b) Class Conditional Probabilities: p(x|ωj), the probability density of observing feature value x given that the state of nature is ωj.
(c) Evidence: p(x), the overall probability of observing x, obtained by summing p(x|ωj)P(ωj) over all classes; it acts as a normalizing scale factor.
(d) Posterior Probabilities: P(ωj|x), the probability of class ωj after the feature value x has been observed.
For a better understanding of the above theory, let’s consider an example.
Suppose we have a classification problem in which we have to classify between object-1 and object-2 given a set of features X = [x1, x2, …, xn]^T.
The main objective of designing such a classifier is to suggest actions when presented with unseen features, i.e., objects not present in the training data.
In this example, let ω denote the state of nature, with ω = ω1 for object-1 and ω = ω2 for object-2. In reality, the state of nature is so unpredictable that we treat it as a random variable described probabilistically.
Generally, we assume that there is some prior probability P(ω1) that the next object is object-1 and P(ω2) that the next object is object-2. If there are no other kinds of object, as in this problem, then these priors sum to 1, i.e., the priors are exhaustive.
The prior probabilities reflect our prior knowledge of how likely we are to encounter object-1 or object-2. They are domain-dependent; for instance, the priors may change with the time of year the objects are being caught.
If we were forced to decide using the priors alone, the natural rule would be: decide ω1 if P(ω1) > P(ω2); otherwise decide ω2. This sounds somewhat strange, and when judging multiple objects (as in a more realistic scenario) it makes the decision rule look poor, since we always make the same decision based on the largest prior, even though we know the other type of object may also appear, governed by the leftover prior probability (the priors being exhaustive). The short sketch below illustrates this.
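A tiny sketch of this prior-only rule (the prior values are assumptions for illustration only) shows why it always returns the same class:

```python
# Assumed priors for illustration; they sum to 1 (exhaustive).
priors = {'object-1': 0.7, 'object-2': 0.3}

# Deciding from the priors alone: always pick the class with the larger prior,
# no matter which object actually appears.
decision = max(priors, key=priors.get)
print(decision)  # always 'object-1'
```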
Consider the following scenarios: if P(ω1) is much larger than P(ω2), our decision in favour of ω1 will be right most of the time; if P(ω1) = P(ω2), we have only a 50% chance of being right. To do better, we measure the objects themselves:
Feature extraction process (extract features from the images)
A suggested set of features: length, width, shape of an object, etc.
In our example, we use the width x, which is more discriminatory, to improve the decision rule of our classifier. Different objects will yield different width readings, and we usually express this variability in probabilistic terms: we consider x to be a continuous random variable whose distribution depends on the type of object ωj. This distribution, written p(x|ωj) (a probability density function, since x is continuous), is known as the class-conditional probability density function.
Check out this article about the probability distribution function.
The pdf p(x|ω1) is the probability density function for the feature x given that the state of nature is ω1, and p(x|ω2) has the same interpretation for ω2.
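As a sketch, assuming Gaussian class-conditional densities for the width x (the Gaussian form and its parameters are illustrative assumptions, not given in the article), p(x|ω1) and p(x|ω2) can be evaluated as follows:

```python
from scipy.stats import norm

# Assumed Gaussian class-conditional densities for the width feature x:
# p(x|w1) ~ N(mean=10, std=2) and p(x|w2) ~ N(mean=14, std=2).
p_x_given_w1 = norm(loc=10.0, scale=2.0)
p_x_given_w2 = norm(loc=14.0, scale=2.0)

x = 12.0  # a measured width
print(p_x_given_w1.pdf(x), p_x_given_w2.pdf(x))  # density of x under each class
```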
Suppose that we know both the prior probabilities P(ωj) and the conditional densities p(x|ωj). Then we can arrive at Bayes’ formula for the posterior probabilities:
P(ωj|x) = p(x|ωj) P(ωj) / p(x), where p(x) = p(x|ω1) P(ω1) + p(x|ω2) P(ω2)
Bayes’ formula tells us that, by observing the measurement x, we can convert the prior P(ωj) into the posterior P(ωj|x), the probability of ωj given that the feature value x has been measured.
The evidence factor, p(x), works as merely a scale factor that guarantees that the posterior probabilities sum up to one for all the classes.
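Putting the pieces together, here is a minimal sketch (the priors and Gaussian densities are the same illustrative assumptions as above) that converts priors into posteriors and shows that the evidence p(x) merely normalizes them to sum to one:

```python
from scipy.stats import norm

priors = {'w1': 0.6, 'w2': 0.4}                             # assumed priors
densities = {'w1': norm(10.0, 2.0), 'w2': norm(14.0, 2.0)}  # assumed p(x|wj)

def posteriors(x):
    joint = {w: densities[w].pdf(x) * priors[w] for w in priors}  # p(x|wj) P(wj)
    evidence = sum(joint.values())                                # p(x), the scale factor
    return {w: v / evidence for w, v in joint.items()}

post = posteriors(12.0)
print(post, sum(post.values()))  # the posteriors sum to 1
```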
Bayes’ Decision Rule
Probability of Error
To justify our decision we look at the probability of error. Whenever we observe a particular x, we have:
P(error|x) = P(ω1|x) if we decide ω2, and P(error|x) = P(ω2|x) if we decide ω1.
Since the classes are exhaustive, if we decide on the correct class with probability P, the leftover probability (1 − P) tells us how likely it is that the object is not the one we decided on.
We can minimize the probability of error by deciding the class with the greater posterior, so that the leftover posterior, which is the probability of error, is as small as possible. So we finally get:
P(error|x) = min[P(ω1|x), P(ω2|x)]
And our Bayes decision rule becomes:
Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
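Continuing the sketch, the decision and its error probability can be read directly off the posteriors (here passed in as a plain dictionary of assumed values):

```python
def bayes_decision(post):
    decided = max(post, key=post.get)  # decide the class with the larger posterior
    p_error = min(post.values())       # P(error|x) = min[P(w1|x), P(w2|x)]
    return decided, p_error

print(bayes_decision({'w1': 0.8, 'w2': 0.2}))  # ('w1', 0.2)
```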
This form of the decision rule highlights the role of the posterior probabilities. With the help of Bayes’ theorem, we can also express the rule in terms of the class-conditional and prior probabilities.
The evidence is unimportant as far as the decision is concerned. As discussed earlier, it works merely as a scale factor that expresses how frequently we will measure a feature with value x; it ensures that P(ω1|x) + P(ω2|x) = 1.
So, by eliminating this unneeded scale factor from our decision rule, we obtain the equivalent decision rule via Bayes’ theorem:
Decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2
Now, let’s consider two special cases: if the class-conditional densities are equal, i.e., p(x|ω1) = p(x|ω2), the decision depends entirely on the priors; and if the priors are equal, i.e., P(ω1) = P(ω2), the decision depends entirely on the likelihoods p(x|ωj).
This completes our example formulation!
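The same rule, written on the unnormalized products p(x|ωj)P(ωj) (again with the assumed Gaussian densities and priors from the earlier sketch), makes the two special cases easy to check: with equal densities the priors decide, and with equal priors the likelihoods decide.

```python
from scipy.stats import norm

priors = {'w1': 0.6, 'w2': 0.4}                             # assumed priors
densities = {'w1': norm(10.0, 2.0), 'w2': norm(14.0, 2.0)}  # assumed p(x|wj)

def decide(x, priors, densities):
    # Decide w1 if p(x|w1)P(w1) > p(x|w2)P(w2); otherwise decide w2.
    score = {w: densities[w].pdf(x) * priors[w] for w in priors}
    return max(score, key=score.get)

print(decide(12.0, priors, densities))                  # equal densities at x = 12: the priors decide -> 'w1'
print(decide(13.0, {'w1': 0.5, 'w2': 0.5}, densities))  # equal priors: the likelihoods decide -> 'w2'
```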
Bayes classification: Posterior, likelihood, prior, and evidence
We now discuss the case with multiple features as well as multiple classes.
Let the multiple features be X1, X2, …, Xn and the multiple classes be ω1, ω2, …, ωk. Then:
P(ωi | X1, …, Xn) = P(X1, …, Xn | ωi) P(ωi) / P(X1, …, Xn)
where P(X1, …, Xn | ωi) is the class-conditional likelihood, P(ωi) is the prior, and the evidence is P(X1, …, Xn) = Σj P(X1, …, Xn | ωj) P(ωj).
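A short sketch of this multi-feature, multi-class version (the three Gaussian class densities and the priors are purely illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

priors = np.array([0.5, 0.3, 0.2])  # assumed P(wi) for three classes
densities = [                        # assumed P(X1, X2 | wi), here two features per pattern
    multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    multivariate_normal(mean=[2.0, 2.0], cov=np.eye(2)),
    multivariate_normal(mean=[4.0, 0.0], cov=np.eye(2)),
]

def posterior(x):
    joint = np.array([d.pdf(x) for d in densities]) * priors  # likelihood * prior per class
    return joint / joint.sum()                                # divide by the evidence

p = posterior([1.0, 1.0])
print(p, p.argmax())  # posterior over the three classes and the decided class index
```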
For the same incoming patterns, we might need to use a drastically different cost function, which will lead to different actions altogether. In general, different decision tasks may require different features and yield decision boundaries quite different from those useful for our original categorization problem.
So, in later articles, we will discuss the cost function, risk analysis, and deciding on actions, which will further help in understanding Bayes decision theory.
Bayesian Decision Theory provides a systematic framework for making optimal decisions under uncertainty by incorporating prior knowledge and observed data. It leverages Bayes’ Theorem to update probabilities, allowing for informed choices that minimize expected loss or maximize utility. This approach is widely applicable in fields like machine learning, statistics, and artificial intelligence, offering a principled way to handle uncertainty and variability. By balancing prior beliefs with new evidence, Bayesian Decision Theory ensures robust and adaptive decision-making. Its flexibility and mathematical foundation make it a powerful tool for solving complex real-world problems, emphasizing the importance of probabilistic reasoning in achieving optimal outcomes.
Bayesian theory refers to Bayesian statistics and inference, which involves updating probabilities based on new evidence using Bayes’ theorem. It combines prior knowledge with observed data to make predictions or inferences about a hypothesis.
Bayesian decision theory perception involves using Bayesian methods to make decisions under uncertainty. It perceives the world by updating beliefs based on new data, allowing for more informed decision-making by quantifying uncertainty and risk.
Bayesian decision theory utility refers to the application of Bayesian methods to maximize expected utility in decision-making. It involves calculating the expected outcomes of different actions and choosing the one that maximizes utility or value, considering both the probability of outcomes and their utility.
Bayesian classification theory uses Bayesian methods for classification problems, such as predicting a class label based on features. It calculates the probability of an item belonging to a particular class given its features, often using Naive Bayes or more complex Bayesian models.