This article was published as a part of the Data Science Blogathon
Greetings, I am Mustafa Sidhpuri, a Computer Science and Engineering student. Recently, I was learning about hypothesis testing. At first it felt a little tough to understand, but after reading many blogs and watching videos on the topic, I was able to grasp it. I would like to share a summary of what I learned with you all.
In this blog, I will try to explain what a hypothesis is and what its types are.
Suppose we have a huge amount of data. We take a sample from the dataset and make some claims about the whole dataset. These claims are not always valid; they are just assumptions or guesses. A claim or assumption of this kind is called a hypothesis.
Example 1:
Let us take an example to understand it more clearly. By law, food manufacturing companies must not put more than 2.5 ppm (parts per million) of lead in food. So, let us take a company XYZ and claim that the average amount of lead in the food XYZ manufactures is more than 2.5 ppm.
This is just a claim based on a limited amount of data, and it may not hold for the whole population. Hypothesis testing helps us verify such a claim using statistics computed from a sample.
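To see why a claim based on a sample may not hold for the whole population, here is a minimal sketch. Everything in it is an assumption made for illustration: the population of 10,000 lead measurements, its mean of 2.4 ppm, its spread of 0.3 ppm, and the sample size of 20 do not come from any real data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: lead content (ppm) of 10,000 food packets.
population = [random.gauss(2.4, 0.3) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw a few small samples and compare their means with the population mean.
sample_means = []
for _ in range(5):
    sample = random.sample(population, 20)
    sample_means.append(statistics.mean(sample))

print(f"population mean: {population_mean:.3f} ppm")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean:   {m:.3f} ppm")
```

Each sample gives a slightly different mean, so a single sample alone cannot settle whether the true average is above 2.5 ppm. That is the gap hypothesis testing is meant to fill.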
Example 2:
Let us take another example. Suppose a person is on trial, and the jury has to decide whether the person is innocent or guilty.
This can be converted into 2 hypotheses:
Hypothesis 1: Defendant is innocent.
Hypothesis 2: Defendant is guilty.
These two opposing hypotheses are called the null hypothesis and alternative hypothesis.
The null hypothesis is a prevailing belief about the population. It states that there is no change or no difference in the situation.
It assumes the status quo (the existing state of affairs) is true.
In Example 2, the defendant is a member of society, which is why he is considered innocent until proven guilty. So our null hypothesis claims the defendant is innocent, just as he was before the charge.
The Null hypothesis is represented as H0.
Remember that the null hypothesis always contains one of these signs: =, ≥, or ≤.
In simple words, we can define the alternative hypothesis as the opposite of the null hypothesis.
Continuing with Example 2, our alternative hypothesis is that the defendant is guilty.
The Alternative hypothesis is represented as H1.
Remember that the alternative hypothesis always contains one of these signs: ≠, <, or >.
Important points to remember:
Let us take some examples so that you can easily understand null and alternate hypotheses.
Situation 1: Flipkart claimed that its total valuation in December 2016 was at least $14 billion. Here the claim contains a ≥ sign, so the null hypothesis is the original claim.
The hypotheses, in this case, can be formulated as:
Total valuation ≥ $14 billion → Null Hypothesis
Total valuation < $14 billion → Alternate Hypothesis
Situation 2: Flipkart claimed that its total valuation in December 2016 was greater than $14 billion. Here the claim contains a > sign, so the null hypothesis is the complement of the original claim.
The hypotheses, in this case, can be formulated as:
Total valuation ≤ $14 billion → Null Hypothesis
Total valuation > $14 billion → Alternate Hypothesis
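The rule used in the two situations above can be sketched in a few lines of Python. The function name `formulate` and the two lookup tables are hypothetical helpers written for this illustration; they only encode the rule that a claim containing an equality sign (=, ≥, ≤) becomes the null hypothesis, and otherwise its complement does.

```python
# Complement of each comparison sign.
COMPLEMENT = {"=": "≠", "≠": "=", "≥": "<", "<": "≥", "≤": ">", ">": "≤"}
# Signs that include equality, so they can appear in a null hypothesis.
NULL_SIGNS = {"=", "≥", "≤"}

def formulate(quantity, sign, value):
    """Return (H0, H1) strings for a claim such as 'Total valuation ≥ $14 billion'."""
    claim = f"{quantity} {sign} {value}"
    opposite = f"{quantity} {COMPLEMENT[sign]} {value}"
    if sign in NULL_SIGNS:
        return claim, opposite   # the claim itself is H0
    return opposite, claim       # the claim is H1; its complement is H0

print(formulate("Total valuation", "≥", "$14 billion"))  # Situation 1
print(formulate("Total valuation", ">", "$14 billion"))  # Situation 2
```

In both calls the first element of the returned pair is the null hypothesis and the second is the alternate hypothesis.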
We have understood what a hypothesis is, what hypothesis testing is, and how it is used in our daily lives. Once we know our null and alternate hypotheses, we have to decide whether or not to reject the null hypothesis.
Suppose your friend brags that his average archery score is 70. You don’t believe him, so you ask him to play 5 games of archery with you and see what he scores. Unfortunately, his average score is 20, so you will not believe him. If his average had been 65, you would have believed him.
Here your H0: mean = 70 and H1: mean ≠ 70.
The 5 games that you played were a sample, while the average score of 70 that he told you was based on all of his games. Here we need a critical value that tells us whether we can reject H0 or fail to reject H0 (we never accept H0).
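As a rough sketch of how a critical value is used in the archery example, the snippet below runs a one-sample t-test by hand. The article does not say which test to use, so the t-test is an assumption, and the five individual scores are made up so that they average 20. The critical value 2.776 is the standard t-table value for a two-tailed test at alpha = 0.05 with 4 degrees of freedom.

```python
import math
import statistics

claimed_mean = 70                      # H0: mean = 70
scores = [18, 22, 20, 19, 21]          # hypothetical scores from the 5 games (average 20)

n = len(scores)
sample_mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)   # sample standard deviation (n - 1 in the denominator)

# One-sample t statistic: distance of the sample mean from the claim, in standard errors.
t_stat = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))

# Two-tailed critical value for alpha = 0.05 and n - 1 = 4 degrees of freedom,
# taken from a standard t-table.
t_critical = 2.776

reject_h0 = abs(t_stat) > t_critical
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

With these made-up scores the t statistic lies far beyond the critical value, so H0 (mean = 70) is rejected, which matches the intuition of not believing your friend.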
[Figure: distribution curve with shaded critical regions on both sides. Image source: Hands-On Machine Learning with Scikit-Learn and TensorFlow 2e]
The shaded part on the left side of the graph is called the LCV (Lower Critical Values), and the shaded part on the right side is called the HCV (Higher Critical Values).
In the above figure, we see that a critical region appears on both sides, but this is not the case every time. It depends on the behavior of the alternate hypothesis.
There are generally two types of alternative hypothesis: non-directional and directional.
Taking the same example we discussed above, our hypotheses are mean = 70 and mean ≠ 70, so we do not know specifically whether the mean is more than 70 or less than 70.
The mean can be either less than or greater than 70, so no direction is specified. This type is called a non-directional alternate hypothesis, and the corresponding test is called a two-tailed test.
The non-directional alternate hypothesis is generally used when checking the consistency of products, especially in pharmaceuticals.
Taking the same archery example, suppose your friend now says that he scores ≥ 70. So our hypotheses will be:
H0: mean ≥ 70
H1: mean < 70
As H1 shows, our critical region will lie on the left side, that is, in a specific direction. This type is called a directional alternate hypothesis. If the critical region lies on the left side, the test is called a left-tailed test.
Similarly, if we have H1: mean > 70, our critical region will lie on the right side. If the critical region lies on the right side, the test is called a right-tailed test.
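The link between the sign in H1 and the position of the critical region can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard normal (z) distribution and alpha = 0.05, neither of which is specified in the article, and `critical_region` is a hypothetical helper name.

```python
from statistics import NormalDist

def critical_region(h1_sign, alpha=0.05):
    """Return (tail name, (lower critical z, upper critical z)) for a given H1 sign."""
    z = NormalDist()  # standard normal distribution
    if h1_sign == "≠":
        # alpha is split equally between the two tails
        upper = z.inv_cdf(1 - alpha / 2)
        return "two-tailed", (-upper, upper)
    if h1_sign == "<":
        return "left-tailed", (z.inv_cdf(alpha), None)
    if h1_sign == ">":
        return "right-tailed", (None, z.inv_cdf(1 - alpha))
    raise ValueError("H1 must use one of: ≠, <, >")

for sign in ("≠", "<", ">"):
    print(sign, critical_region(sign))
```

For the two-tailed case the critical values come out near -1.96 and +1.96, while each one-tailed case has a single critical value near 1.645 in magnitude, because the whole of alpha sits in one tail.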
Points to remember:
The figure below illustrates directional and non-directional alternate hypotheses.
image source: https://towardsdatascience.com/everything-you-need-to-know-about-hypothesis-testing-part-i-4de9abebbc8a
Now that we know about the critical region, we need to know how to calculate it. There are several methods used to find critical regions or critical values. Two of them are mentioned below which you can explore:
Note that other methods are also available.
I am Mustafa Sidhpuri, a motivated Data Scientist with experience as a freelance data scientist. I am passionate about building models that fix problems. Relevant skills include machine learning, problem-solving, programming, and creative thinking.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
What does the p value for each variable represent?
A p-value is the sum of three probabilities:
1. The probability of getting the observed value in a distribution (e.g. a pdf, pmf, or histogram).
2. The probability of getting a value that is equally rare in that distribution.
3. The probability of getting values that are rarer (more extreme) than the observed value in that distribution.
For example, suppose we toss a coin 5 times and want to know the p-value for getting 4 heads and 1 tail. Total possible outcomes = 2^5 = 32.
1. Probability of getting 4 heads and 1 tail (HHHHT, HHHTH, HHTHH, HTHHH, THHHH): P1 = 5/32
2. The equally rare event is 4 tails and 1 head: P2 = 5/32
3. The more extreme events are 5 heads or 5 tails (HHHHH, TTTTT): P3 = 1/32 + 1/32 = 2/32
p-value = P1 + P2 + P3 = 5/32 + 5/32 + 2/32 = 12/32 = 0.375
We can also use this example to understand a hypothesis test. Say we use a 95% confidence level to check whether the coin is biased, so alpha = 1 - 0.95 = 0.05.
H0: the coin is fair.
H1: the coin is biased.
Since the p-value (0.375) is greater than 0.05, we fail to reject H0. In other words, getting 4 heads and 1 tail in 5 tosses does not mean that the coin is special or biased, which also matches our intuition. This is how the p-value works. Hope this explanation makes sense.
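The coin-toss arithmetic above can be checked with a short script. This is a minimal sketch that assumes a fair coin under H0 and defines "equally rare or rarer" as any number of heads whose probability is no larger than that of the observed count. The function name `two_sided_binomial_p` is a hypothetical helper.

```python
from fractions import Fraction
from math import comb

def two_sided_binomial_p(heads, tosses):
    """Exact p-value for a fair coin: sum the probabilities of every outcome
    that is as rare as, or rarer than, the observed number of heads."""
    total = 2 ** tosses
    prob = [Fraction(comb(tosses, k), total) for k in range(tosses + 1)]
    observed = prob[heads]
    return sum(p for p in prob if p <= observed)

p_value = two_sided_binomial_p(heads=4, tosses=5)
print(p_value, float(p_value))   # 3/8 0.375

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Using `Fraction` keeps the result exact, so 12/32 reduces to 3/8 without any rounding.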