Interpretation of Performance Measures to Evaluate Models

Sitara Last Updated : 30 Mar, 2021

This article was published as a part of the Data Science Blogathon.

Introduction

In the last year of my bachelor’s degree, I learned an interesting fact during a customs class:

If a person appears suspicious, customs officers detain him and deliberately prolong the interrogation. He is asked very simple but annoying questions. One officer asks the questions while another stands a short distance away and observes, as if busy with something else. If the person is really carrying, selling, or importing a prohibited, harmful, or dangerous item across the border, he becomes overly irritable and emotional. Such a person is then considered quite suspicious.

That is, the probability P(Person = Criminal | Reaction = Emotional) is high. However, this does not mean that all emotional individuals are criminals; a person may seem suspicious simply because of a nervous disposition or other health problems, their current mood, or because it is their first trip.
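
For reference, the conditional probability in the customs example can be written out with Bayes' theorem. The event names below are shorthand chosen for this illustration, and no numerical values are implied:

```latex
P(\text{Criminal} \mid \text{Emotional})
  = \frac{P(\text{Emotional} \mid \text{Criminal}) \, P(\text{Criminal})}
         {P(\text{Emotional})}
```

The denominator counts every emotional traveller, criminal or not, which is why a strong reaction alone does not prove guilt.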

There is a saying, “Where there is no fire, there is no smoke.” That is, if I see smoke, then there is a fire. I wonder how right I am in this opinion?

Let’s say I’m making a fire detector. Thus, this detector should detect fire cases as accurately as possible and not cause a disturbance by giving a false fire signal. That is, the detector should not endanger me or disturb me unnecessarily.

Here, a fire that the detector correctly warns about is a “True Positive” result. There is a fire (the event is positive) and the detector has identified it correctly (the result is true).

An alarm in the absence of fire is a “False Positive”. There is no fire, but the detector gives a positive signal by mistake. Note that “positive” here does not mean that the outcome is good, only that the detector affirms that the event took place.

The detector staying silent in the absence of fire is a “True Negative”. No incident occurred and the detector, correctly, did not signal.

Failure to signal in case of fire (this is the most dangerous situation) is a “False Negative”. An accident occurs, but the detector does not see it.
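
The four outcomes can be summarized as a small lookup. This is only a minimal sketch; the function name `classify_outcome` and its boolean arguments are made up for illustration:

```python
def classify_outcome(fire: bool, alarm: bool) -> str:
    """Label one (reality, detector) pair with its confusion-matrix cell."""
    if fire and alarm:
        return "TP"  # fire, and the detector signalled
    if not fire and alarm:
        return "FP"  # no fire, but the detector signalled
    if not fire and not alarm:
        return "TN"  # no fire, and the detector stayed silent
    return "FN"      # fire, but the detector stayed silent


# Print all four combinations of reality and detector response.
for fire in (True, False):
    for alarm in (True, False):
        print(f"fire={fire!s:5} alarm={alarm!s:5} -> {classify_outcome(fire, alarm)}")
```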

We show these 4 cases in the form of the table below.

Evaluate model with confusion matrix - Source

We have two types of errors.

A Type 1 error is believing in something that is not there (a false positive), and a Type 2 error is not believing in something that is there (a false negative); the latter is not the same as believing that it does not exist.

In a Type 1 error we reject a true null hypothesis, whereas in a Type 2 error we fail to reject a false null hypothesis (we can never accept the null hypothesis; we can only fail to reject it because of insufficient data, a limited time interval, or other impediments).

Generally, a Type 1 error is considered more dangerous, because we continue to believe in something that does not exist and do not re-investigate, whereas after a Type 2 error the question stays open and the investigation can continue. However, this comparison varies from situation to situation. In matters related to human life (fire, health problems, accidents, etc.), a Type 1 error is often the safer one: even if I am worried by mistake, I will protect myself (call the fire brigade, take vitamins, stay at home, etc.), whereas a Type 2 error gives me false peace of mind and I am caught unprepared. With a false alarm, on the other hand, I will search everywhere until I am sure there is no problem. As a result, my life is not put in danger.

To make such a detector, I have to program it (in the language of Data Science, train it), and finally, test it.

Here, our model transforms the prior probability into a posterior with the data we give it. That is, before there was any historical data on fire cases, the probability of a fire was taken to be 50%. Now the model has data to learn from. Based on this data, a new result is obtained according to Bayes’ theorem: P(event occurs | data). During the test, we construct a matrix similar to the one above and record the number of TP, TN, FP, and FN cases in it. These numbers show how accurate our model is. Such a matrix is called a confusion matrix.

Let’s say our model detects 80% of fires. At first glance, the result is not bad. But when we look at the sentence again, we see that it only covers fire cases. What about when there is no fire? Maybe our detector often raises an alarm in the absence of fire and causes additional inconvenience?
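
Recording the test results might look like the sketch below. The two lists are invented test data (1 means fire or alarm, 0 means no fire or no alarm), not results from a real detector:

```python
from collections import Counter

# Hypothetical test log: what really happened vs. what the detector reported.
actual    = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

# Count how often each (actual, predicted) pair occurs.
counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]  # fire, alarm
fn = counts[(1, 0)]  # fire, no alarm
fp = counts[(0, 1)]  # no fire, alarm
tn = counts[(0, 0)]  # no fire, no alarm

# Print the counts laid out as a 2x2 confusion matrix.
print("            alarm  no alarm")
print(f"fire        {tp:5d}  {fn:8d}")
print(f"no fire     {fp:5d}  {tn:8d}")
```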

Here, 80% is called the “sensitivity rate”. That is, our model is 80% sensitive to fires and detects 80% of fires. This is only for fires. Taking this as the core value of our model leads to a “base rate fallacy”.

“Base rate” is the rate at which fires actually occur (the ratio of the number of fires to the total time, for example, once in 5 years).

We also need to know the “specificity rate”. Specificity is the correct detection of non-fire cases by the detector, i.e., how reliably the detector stays silent when there is no fire.
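
To see why sensitivity alone is misleading, the sketch below applies Bayes' theorem to the probability that there really is a fire once the alarm sounds. The 80% sensitivity comes from the example above; the 90% specificity and the 1% base rate are assumed values chosen only for illustration:

```python
def prob_fire_given_alarm(sensitivity: float, specificity: float, base_rate: float) -> float:
    """P(fire | alarm) via Bayes' theorem."""
    true_alarms = sensitivity * base_rate                    # P(alarm and fire)
    false_alarms = (1.0 - specificity) * (1.0 - base_rate)   # P(alarm and no fire)
    return true_alarms / (true_alarms + false_alarms)


# Sensitivity is taken from the text; specificity and base rate are assumptions.
p = prob_fire_given_alarm(sensitivity=0.80, specificity=0.90, base_rate=0.01)
print(f"P(fire | alarm) = {p:.3f}")
```

With these assumed numbers, only about 7.5% of alarms correspond to a real fire, even though the detector catches 80% of fires.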

But how can we evaluate our model?

We use the following ratios to evaluate the model correctly.

In the confusion matrix, the TP / (TP + FP) ratio is called precision. That is, the ratio of the number of times the signal is activated during a fire to the total number of times the signal is activated (what percentage of alarms correspond to a real fire).

The TP / (TP + FN) ratio is called recall. That is, the ratio of the number of fires in which the signal is activated to all fires (what percentage of fires trigger the alarm). Recall and sensitivity are the same concept.

Besides, there is the accuracy rate. It is equal to the ratio (TP + TN) / (TP + FP + TN + FN). That is, the ratio of the cases in which the detector behaves correctly (activated in case of fire, not activated when there is no fire) to all cases (note that for highly imbalanced data we rely on the F1 score rather than the accuracy rate).
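
As a rough sketch, the ratios can be computed directly from the four counts. The counts below are hypothetical and deliberately imbalanced (10 fires in 1,000 observations), and the F1 score is computed as the harmonic mean of precision and recall:

```python
def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the ratios from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one alarm and one fire).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 10 fires and 990 non-fire cases.
scores = evaluate(tp=8, fp=40, tn=950, fn=2)
for name, value in scores.items():
    print(f"{name:9s} {value:.3f}")
```

In this made-up case accuracy is close to 96% while precision and F1 stay low, which illustrates why accuracy alone can flatter a detector on imbalanced data.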

These 3 ratios are very important for evaluating the model. It should also be noted that relative values are more important than absolute values in any assessment. For example, the absolute grade you get in an exam (say, 85 points) means little when we do not know the maximum score. But when it is 85/100, the score becomes meaningful. The evaluation of the model is based on the same simple logic.

I hope this blog has expanded your understanding of Bayes’ theorem a bit.

Source for image: https://challengersdeep.wixsite.com/website/post/od-olmayan-yerdən-tüstü-çıxmaz

Sitara
