Understanding The Concept Of Hypothesis In Data Science!

Mustafa Last Updated : 16 Jun, 2021

This article was published as a part of the Data Science Blogathon

Greetings! I am Mustafa Sidhpuri, a Computer Science and Engineering student. Recently, I was learning about hypothesis testing. At first it was a little tough for me to understand, but after reading a lot of blogs and watching videos on the topic, I was able to understand it. I would like to share a summary of what I learned with you all.

In this blog, I will try to explain what a hypothesis is and what its types are.

What is Hypothesis?

Suppose we have a huge amount of data. We take a sample from the dataset and make some claims about the whole dataset (the population). These claims are not always valid; they are just assumptions or guesses. This type of claim or assumption is called a hypothesis.

Example 1:

Let us take an example to understand it more clearly. According to the law, food manufacturing companies should not put more than 2.5 ppm (parts per million) of lead in food. So, let us take a company XYZ and claim that the average amount of lead in the food that XYZ manufactures is more than 2.5 ppm.

This is just a claim based on a limited amount of data, and it is not necessarily valid for the whole population. Hypothesis testing helps us verify such a claim using sample statistics.
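As a rough illustration of how such a claim could be checked, here is a minimal sketch. The sample size, sample mean, and sample standard deviation are made-up numbers, not real measurements, and the test uses a normal approximation (a z-test), which is only one of several possible choices.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics for lead measurements (ppm) in XYZ's products.
LIMIT_PPM = 2.5      # legal limit from the example
n = 36               # assumed number of sampled products
sample_mean = 2.62   # assumed sample mean
sample_std = 0.30    # assumed sample standard deviation

# H0: mean <= 2.5 ppm    H1: mean > 2.5 ppm (the claim being tested)
standard_error = sample_std / sqrt(n)
z = (sample_mean - LIMIT_PPM) / standard_error

# Right-tailed p-value under the normal approximation.
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these assumed numbers the p-value is small, so the sample would count as evidence against H0. With different numbers the conclusion could easily go the other way.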

Example 2:

Let us take another example. Suppose a person is on trial and the jury has to decide whether the person is innocent or guilty.

This can be expressed as two hypotheses:

Hypothesis 1: Defendant is innocent.

Hypothesis 2: Defendant is guilty.

These two opposing hypotheses are called the null hypothesis and alternative hypothesis.

Null Hypothesis

The null hypothesis is a prevailing belief about the population. It states that there is no change or no difference in the situation.

It assumes the status quo (the existing state of affairs) is true.

In Example 2, the defendant is a member of society, which is why he is considered innocent until proven guilty. So our null hypothesis claims that the defendant is innocent, just as he was before the charge.

The null hypothesis is represented as H0.

Remember that the null hypothesis always contains one of these signs:

= ≤ ≥

Alternative Hypothesis

In simple words, we can define the alternative hypothesis as the opposite of the null hypothesis.

Continuing with Example 2, our alternative hypothesis is that the defendant is guilty.

The alternative hypothesis is represented as H1.

Remember that the alternative hypothesis always contains one of these signs:

!= > <

Important points to remember:

H0 and H1 cannot be true at the same time.
We only reject or fail to reject the null hypothesis; we never accept it. Failing to reject H0 does not mean that H0 has been proven true. It only means that the sample did not provide enough evidence against it, and there might be other possibilities.

Let us take some examples so that you can easily understand null and alternate hypotheses.

Situation 1: Flipkart claimed that its total valuation in December 2016 was at least $14 billion. Here the claim contains a ≥ sign, so the null hypothesis is the original claim.

The hypothesis, in this case, can be formulated as:

Total valuation ≥ $14 billion → Null Hypothesis

Total valuation < $14 billion → Alternate Hypothesis

Situation 2: Flipkart claimed that its total valuation in December 2016 was greater than $14 billion. Here the claim contains a > sign, so the null hypothesis is the complement of the original claim.

The hypothesis, in this case, can be formulated as:

Total valuation ≤ $14 billion → Null Hypothesis

Total valuation > $14 billion → Alternate Hypothesis
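The rule used in the two situations above can be sketched as a small helper. The function name `formulate` and the ASCII spellings `<=`, `>=`, and `!=` (for ≤, ≥, and ≠) are choices made here for illustration only.

```python
# Map the comparison sign in a claim to the signs used in H0 and H1.
# Signs that include equality (=, <=, >=) belong to H0; the rest belong to H1.
COMPLEMENT = {"=": "!=", "<=": ">", ">=": "<", "!=": "=", ">": "<=", "<": ">="}
NULL_SIGNS = {"=", "<=", ">="}

def formulate(claim_sign):
    """Return (h0_sign, h1_sign) for a claim written with claim_sign."""
    if claim_sign not in COMPLEMENT:
        raise ValueError(f"unsupported sign: {claim_sign}")
    if claim_sign in NULL_SIGNS:
        # The claim itself is the null hypothesis.
        return claim_sign, COMPLEMENT[claim_sign]
    # The claim is the alternate hypothesis; its complement is the null.
    return COMPLEMENT[claim_sign], claim_sign

print(formulate(">="))  # Situation 1: the claim is H0
print(formulate(">"))   # Situation 2: the claim is H1
```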

Making a decision

We have now seen what a hypothesis is, what hypothesis testing is, and how it appears in our daily lives. Once we know our null and alternate hypotheses, we have to decide whether or not to reject the null hypothesis.

Suppose your friend brags that his average archery score is 70. You don't believe him, so you ask him to play 5 games of archery with you and see what he scores. Unfortunately, his average score is 20, so you will not believe him. If his average score had been 65, you would have believed him.

Here H0: mean = 70 and H1: mean != 70.

The 5 games that you played were a sample, while the average score of 70 that he told you was based on all of his games (the population). Here we need a critical value that tells us whether we can reject H0 or cannot reject H0 (we never accept H0).

image source: Hands-On Machine Learning with Scikit–Learn and TensorFlow 2e

The shaded part on the left side of the graph is the critical region below the LCV (Lower Critical Value), and the shaded part on the right side is the critical region above the HCV (Higher Critical Value).

In the above figure, we see that a critical region appears on both sides, but this is not the case every time. It depends on the form of the alternate hypothesis.

There are generally two types of alternative hypotheses:

Non-directional
Directional

Non-Directional alternate hypothesis

Taking the same example that we discussed above, our hypotheses are H0: mean = 70 and H1: mean != 70. H1 does not say specifically whether the mean is more than 70 or less than 70.

The mean can be either less than or greater than 70, so no direction is mentioned. This type is called a non-directional alternate hypothesis, and the corresponding test is called a two-tailed test.

The non-directional alternate hypothesis is generally used when checking the consistency of products, especially in pharmaceuticals.

Directional Alternate Hypothesis

Taking the same example of archery, now your friend says that his average score is at least 70 (≥ 70). So our hypotheses will be:

H0: mean ≥ 70

H1: mean < 70

As H1 clearly shows, our critical region will lie on the left side, that is, in a specific direction. If the critical region lies on the left side, the test is called a left-tailed test.

Similarly, if we have H1: mean > 70, our critical region will lie on the right side. If the critical region lies on the right side, the test is called a right-tailed test.

Points to remember:

!= in H1 → Two-tail test
< in H1 → Left-tail test
> in H1 → Right-tail test
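The mapping in the points above can be sketched in code. This is only an illustration: it assumes the test statistic follows a standard normal distribution and uses a significance level of 0.05, and the function name `critical_values` is made up for this example.

```python
from statistics import NormalDist

def critical_values(h1_sign, alpha=0.05):
    """Return (lower, upper) critical z-values; None means that tail is unused."""
    z = NormalDist()  # standard normal distribution
    if h1_sign == "!=":   # two-tailed: alpha is split between both tails
        return z.inv_cdf(alpha / 2), z.inv_cdf(1 - alpha / 2)
    if h1_sign == "<":    # left-tailed: the whole alpha sits in the left tail
        return z.inv_cdf(alpha), None
    if h1_sign == ">":    # right-tailed: the whole alpha sits in the right tail
        return None, z.inv_cdf(1 - alpha)
    raise ValueError("H1 must use one of: !=, <, >")

for sign in ("!=", "<", ">"):
    print(sign, critical_values(sign))
```

In the two-tailed case the critical values are further from zero than in the one-tailed cases, because each tail only gets half of alpha.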

The figure below illustrates directional and non-directional alternate hypotheses.

image source: https://towardsdatascience.com/everything-you-need-to-know-about-hypothesis-testing-part-i-4de9abebbc8a

How to calculate critical value?

We now know about the critical region, so we need to know how to decide whether our sample falls inside it. There are several methods for making this decision. Two of them are mentioned below, which you can explore:

Critical Value Method.
P-Value Method.

Note that there are other methods also available.
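To give a feel for how the two methods relate, here is a minimal sketch using the archery example. The individual scores and the population standard deviation are assumptions invented for illustration (the article only gives the sample average of 20), and a z-test is used for simplicity.

```python
from math import sqrt
from statistics import NormalDist, mean

scores = [18, 22, 25, 15, 20]  # hypothetical scores from the 5 games (mean 20)
mu0 = 70                       # H0: mean = 70, H1: mean != 70
sigma = 10                     # assumed known population standard deviation
alpha = 0.05                   # assumed significance level

std_normal = NormalDist()
z = (mean(scores) - mu0) / (sigma / sqrt(len(scores)))

# Critical value method: reject H0 if z falls beyond either critical value.
upper_cv = std_normal.inv_cdf(1 - alpha / 2)
reject_by_cv = abs(z) > upper_cv

# P-value method: reject H0 if the two-tailed p-value is below alpha.
p_value = 2 * (1 - std_normal.cdf(abs(z)))
reject_by_p = p_value < alpha

print(f"z = {z:.2f}, critical values = ±{upper_cv:.2f}, p-value = {p_value:.3g}")
print("Reject H0:", reject_by_cv, reject_by_p)
```

Both methods compare the same test statistic against the same significance level, so they lead to the same decision.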

About me

I am Mustafa Sidhpuri, a motivated data scientist with freelance experience. I am passionate about building models that fix problems. My relevant skills include machine learning, problem-solving, programming, and creative thinking.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Responses From Readers

balaji

this article very help full to understand about the hypothesis testing :)

Krishna Bhadke

such a good article easy to understand

Aashi

What does the p value for each variable represent?

Ajay Nain

A p-value is the sum of three probabilities:
1. The probability of getting the observed value in a distribution (e.g. a PDF, PMF, or histogram).
2. The probability of getting a value that is equally rare in that distribution.
3. The probability of getting values that are rarer than the observed value in that distribution.

For example, if we toss a coin 5 times and want to know the p-value for 4 heads and 1 tail, the p-value is calculated as follows. Total possible outcomes = 2^5 = 32.
1. Probability of getting 4 heads and 1 tail (HHHHT, HHHTH, HHTHH, HTHHH, THHHH): P1 = 5/32
2. The equally rare event is 4 tails and 1 head: P2 = 5/32
3. The more extreme events are 5 heads or 5 tails (HHHHH, TTTTT): P3 = 1/32 + 1/32 = 2/32

Finally, p-value = P1 + P2 + P3 = 5/32 + 5/32 + 2/32 = 12/32 = 0.375.

Furthermore, this example can be used to understand a hypothesis test. Suppose we use an alpha value of 0.05 to decide whether the coin is biased or not. H0: the coin is fair. H1: the coin is biased. We reject H0 only if the p-value is less than alpha. Since the p-value of 0.375 is greater than 0.05, we fail to reject H0: getting 4 heads and 1 tail in 5 tosses is not enough evidence that the coin is biased.

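The coin-toss arithmetic in the comment above can be checked with a short script. This is a sketch of one common way to compute an exact two-sided p-value for a fair coin; the function name `two_sided_p_value` is made up for this example.

```python
from fractions import Fraction
from math import comb

def two_sided_p_value(n, heads):
    """Exact p-value for a fair coin: sum the probabilities of all outcomes
    that are as likely as, or less likely than, the observed head count."""
    probs = [Fraction(comb(n, k), 2 ** n) for k in range(n + 1)]
    observed = probs[heads]
    return sum(p for p in probs if p <= observed)

p = two_sided_p_value(5, 4)
print(p, float(p))  # 3/8 0.375
```

Since 0.375 is greater than 0.05, this sample does not give enough evidence to reject the hypothesis that the coin is fair.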
