As the British mathematician Karl Pearson once stated, statistics is the grammar of science, and this holds especially true for computer and information sciences, physics, and biology. When you start a journey in Data Science or Data Analytics, statistical knowledge gives you the leverage to extract better insights from data.
“Statistics is the grammar of science.” (Karl Pearson)
The importance of statistics in Data Science and Data Analytics cannot be overstated. Statistics provides tools and methods to find structure and to offer deeper insights into data. Knowing statistics well will allow you to think critically and be creative when using data to solve business problems and make data-driven decisions. In this article, we will try to cover the following statistics topics for data science and data analytics:
A random variable is simply a way to map the outcomes of random processes, such as flipping a coin, rolling dice, or selecting a card from a pack of cards, to numbers. For example, we can model the random process of flipping a coin with a random variable X that takes the value 1 if the outcome is heads and 0 if the outcome is tails.
There are two types of random variables: discrete and continuous.
A discrete random variable takes a countable number of distinct values (the set of values may be countably infinite), e.g. 1, 2, 3, …, the number of days in a week, or the number of students in a school. A continuous random variable can take infinitely many values within a range, e.g. heights, weights, temperatures, distances, etc.
The set of possible outcomes is called the sample space of the random variable. Each repetition of the random process is called a trial, and a set of one or more outcomes is called an event. The probability of an event is the chance or likelihood that the random variable takes a specific value x, written P(X = x).
Returning to the coin-flipping example above: the likelihood of getting heads or tails is the same, 0.5 or 50%. So we have the following setting: P(X = 1) = P(X = 0) = 0.5.
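As a quick sketch, we can simulate this random variable in Python and check that the empirical frequency of heads comes out close to 0.5 (the variable names are illustrative):

```python
import random

random.seed(42)  # fix the seed so the simulation is reproducible

# X maps the coin-flip outcome to a number: 1 for heads, 0 for tails
flips = [1 if random.random() < 0.5 else 0 for _ in range(10_000)]

# the empirical probability of heads should be close to 0.5
print(sum(flips) / len(flips))
```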
The mean is the arithmetic average of all data points or observations in the given data set.
Calculating the mean is very simple: add up all the values and divide by the total number of values in the dataset.

μ = (x1 + x2 + … + xN) / N

where N is the number of data points or observations in the population. The sample mean, denoted x̄ and computed the same way over the n points of a sample, is very often used to approximate the population mean μ expressed above.
Simply, the median is the middle value in the dataset. The value that splits the data in half.
The mode is the most frequent value that occurs in the dataset.
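All three measures are available in Python's standard library; here is a quick sketch using a small illustrative dataset:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # illustrative sample

print(statistics.mean(data))    # arithmetic average: 30 / 6 = 5
print(statistics.median(data))  # middle value: (3 + 5) / 2 = 4.0
print(statistics.mode(data))    # most frequent value: 3
```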
Simply put, the variance is a statistical measure of the spread of the variables in the dataset.
More specifically, it measures how far each value in the dataset is from the mean. The population variance is denoted by σ² and defined as

σ² = Σ (xi − μ)² / N
Example: Find the variance of the numbers 3, 8, 6, 10, 12, 9, 11, 10, 12, 7.
Solution:
Given,
3, 8, 6, 10, 12, 9, 11, 10, 12, 7
Step 1: Compute the mean of the 10 values given.
Mean = (3+8+6+10+12+9+11+10+12+7) / 10
= 88 / 10
= 8.8
Step 2: Make a table with three columns: the x values, the deviations from the mean (x − μ), and the squared deviations (x − μ)². As the data is not given as sample data, we use the formula for population variance, so the mean is denoted by μ.

x | x − μ | (x − μ)²
3 | −5.8 | 33.64
8 | −0.8 | 0.64
6 | −2.8 | 7.84
10 | 1.2 | 1.44
12 | 3.2 | 10.24
9 | 0.2 | 0.04
11 | 2.2 | 4.84
10 | 1.2 | 1.44
12 | 3.2 | 10.24
7 | −1.8 | 3.24

Sum of squared deviations = 73.60
Step 3: Calculate the variance by substituting the values into the formula:

σ² = Σ (x − μ)² / N = 73.6 / 10 = 7.36
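We can verify this arithmetic in Python; `statistics.pvariance` computes the population variance directly:

```python
import statistics

data = [3, 8, 6, 10, 12, 9, 11, 10, 12, 7]

mu = sum(data) / len(data)                          # 8.8
var = sum((x - mu) ** 2 for x in data) / len(data)  # population variance

print(round(var, 2))               # 7.36
print(statistics.pvariance(data))  # same result via the standard library
```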
Standard deviation is a quantity that expresses how much the values of a dataset differ from the mean. It is denoted by σ and is the square root of the variance:

σ = √( Σ (xi − μ)² / N )
Standard deviation is often preferred over the variance because it has the same unit as the data points, which means you can interpret it more easily.
Example:
A hen lays eight eggs. Each egg was weighed and recorded as follows:
60 g, 56 g, 61 g, 68 g, 51 g, 53 g, 69 g, 54 g. Find the Standard deviation.
First compute the mean: μ = (60 + 56 + 61 + 68 + 51 + 53 + 69 + 54) / 8 = 472 / 8 = 59 g. Then tabulate the deviations and squared deviations:

x | x − μ | (x − μ)²
60 | 1 | 1
56 | −3 | 9
61 | 2 | 4
68 | 9 | 81
51 | −8 | 64
53 | −6 | 36
69 | 10 | 100
54 | −5 | 25

Sum of squared deviations = 320

To calculate the standard deviation, we substitute into the formula:

σ = √(320 / 8) = √40 ≈ 6.32

Therefore, the standard deviation is 6.32 g.
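The same calculation in Python, using `statistics.pstdev` for the population standard deviation:

```python
import statistics

weights = [60, 56, 61, 68, 51, 53, 69, 54]  # egg weights in grams

sigma = statistics.pstdev(weights)  # population standard deviation
print(round(sigma, 2))              # 6.32
```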
Correlation is the statistical term for the degree of association between two random variables: it measures both the strength and the direction of the linear relationship between them. Correlation tells us how well a pair of numeric variables are linearly related, but it does not give us the reason behind that relationship. If a correlation is present between two variables, there is a relationship or pattern between their values, but this does not imply that one variable causes the other.
Correlation coefficients range between -1 and 1. Note that the correlation of a variable with itself is always 1, that is, Cor(X, X) = 1. When interpreting correlation, do not confuse it with causation: even if there is a correlation between two variables, you cannot conclude that one variable causes a change in the other.
The most common measure of linear dependency between two datasets is Pearson's correlation coefficient.
Formula:

r = ( nΣxy − (Σx)(Σy) ) / √( [nΣx² − (Σx)²] [nΣy² − (Σy)²] )
Example: Calculate the correlation coefficient for the following data:
X = 4, 8, 12, 16 and
Y = 5, 10, 15, 20.
Solution:
Given variables are,
X = 4, 8, 12, 16 and
Y = 5, 10, 15, 20
To find the linear correlation coefficient for these data, we first construct a table to get the required values for the formula.

x | y | xy | x² | y²
4 | 5 | 20 | 16 | 25
8 | 10 | 80 | 64 | 100
12 | 15 | 180 | 144 | 225
16 | 20 | 320 | 256 | 400

Σx = 40, Σy = 50, Σxy = 600, Σx² = 480, Σy² = 750

Putting all the values into the formula (with n = 4):

r = (4·600 − 40·50) / √( (4·480 − 40²)(4·750 − 50²) ) = 400 / √(320 · 500) = 400 / 400 = 1
Therefore, the correlation coefficient is 1.
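A quick check of this result in plain Python, computing the sums and then Pearson's formula:

```python
import math

x = [4, 8, 12, 16]
y = [5, 10, 15, 20]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

# Pearson's correlation coefficient
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(r)  # 1.0
```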
Causation means that two variables have a cause-and-effect relationship with one another (e.g. event A causes event B). It may also happen that an observed relationship is coincidental, or that a third factor is causing both variables to change.
The covariance is a statistical measure of the relationship between two random variables. It evaluates how much – to what extent – the variables change together.
Covariance can take negative or positive values as well as zero. A positive value of covariance indicates that two random variables tend to vary within the same direction, whereas a negative value suggests that these variables vary in opposite directions. Finally, the value zero means that they don’t vary together.
Example: The data below give the rate of economic growth (xi) and the rate of return on the S&P 500 (yi). Using the covariance formula, determine whether economic growth and S&P 500 returns have a positive or inverse relationship.
x = 2.1, 2.5, 4.0, and 3.6 (economic growth)
y = 8, 12, 14, and 10 (S&P 500 returns)
Solution:
We first compute the means, x̄ = (2.1 + 2.5 + 4.0 + 3.6) / 4 = 3.05 and ȳ = (8 + 12 + 14 + 10) / 4 = 11, and then construct a table to get the required values for the sample covariance formula Cov(x, y) = Σ (xi − x̄)(yi − ȳ) / (n − 1).

xi | yi | xi − x̄ | yi − ȳ | (xi − x̄)(yi − ȳ)
2.1 | 8 | −0.95 | −3 | 2.85
2.5 | 12 | −0.55 | 1 | −0.55
4.0 | 14 | 0.95 | 3 | 2.85
3.6 | 10 | 0.55 | −1 | −0.55

Sum of products = 4.6

Cov(x, y) = 4.6 / (4 − 1) ≈ 1.53

Since the covariance is positive, economic growth and S&P 500 returns have a positive relationship.
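The same computation sketched in plain Python, using the sample covariance (dividing by n − 1):

```python
x = [2.1, 2.5, 4.0, 3.6]  # economic growth
y = [8, 12, 14, 10]       # S&P 500 returns
n = len(x)

mx = sum(x) / n  # mean of x
my = sum(y) / n  # mean of y

# sample covariance: divide the sum of cross-products by n - 1
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
print(round(cov, 2))  # 1.53 (positive, so the variables move together)
```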
Covariance and correlation both primarily assess the relationship between two variables. But they are not the same.
Covariance measures the variation of two random variables from their expected values. However, it doesn’t indicate the strength of the relationship, nor the dependency between variables.
Correlation, on the other hand, measures both the strength and the direction of the relationship between variables. Simply put, correlation is a scaled measure of covariance.
Relationship between correlation and covariance:

Cor(X, Y) = Cov(X, Y) / (σX σY)

where σX and σY are the standard deviations of X and Y.
A discrete distribution is a probability distribution that gives the probability of each discrete (individually countable) outcome, such as 1, 2, 3, …. There are many discrete probability distributions, each used in different scenarios. We will discuss some of them below:
To be able to apply the binomial formula, the following conditions need to be satisfied:
1. The number of trials (n) is fixed.
2. The trials are independent of each other.
3. Each trial has only two possible outcomes (success or failure).
4. The probability of success (p) is the same for every trial.
The binomial formula is then P(X = k) = nCk × p^k × (1 − p)^(n−k).
Example: A coin is tossed 10 times. What is the probability of getting exactly 6 heads?
Solution: Here we will be using Binomial distribution.
This is a binomial setting: the number of trials is fixed and the trials are independent, each trial is binary (heads or tails), and the probability of success is the same for every trial (P(heads) = 0.5). So all of the above conditions are satisfied.
The number of trials (n) = 10
The probability of success p (“tossing heads”) = 0.5
(1 – p) = 1 – 0.5 = 0.5
X = 6 ( where the random variable X represents the probability of getting exactly 6 heads)
Substituting all the values into the above formula:

P(X = 6) = 10C6 × (0.5)^6 × (0.5)^4
= 210 × 0.015625 × 0.0625
= 0.205078125
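We can reproduce this with `math.comb`, which computes the binomial coefficient nCk:

```python
from math import comb

n, p, k = 10, 0.5, 6

# P(X = 6) = 10C6 * p^6 * (1 - p)^4
prob = comb(n, k) * p ** k * (1 - p) ** (n - k)
print(prob)  # 0.205078125
```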
Example: Robert is a football player. His success rate of goal hitting is 70%. What is the probability that Robert hits his third goal on his fifth attempt?
Solution:
Here the probability of success p is 0.70, the number of trials n is 5, and the number of successes r is 3. The negative binomial distribution formula is P(X = r) = (n−1)C(r−1) × p^r × (1 − p)^(n−r). Let's compute the probability of hitting the third goal on the fifth attempt.
Substituting all the values into the formula, we get
P(X = 3) = (5−1)C(3−1) × (0.7)^3 × (0.3)^(5−3)
= 4C2 × (0.7)^3 × (0.3)^2
= 6 * 0.343 * 0.09
= 0.185
Therefore, the probability that Robert hits his third goal on his fifth attempt is 0.185.
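The same negative binomial calculation sketched in Python:

```python
from math import comb

p = 0.7   # probability of scoring a goal
r = 3     # number of successes needed
n = 5     # attempt on which the r-th success occurs

# P(r-th success on trial n) = (n-1)C(r-1) * p^r * (1 - p)^(n - r)
prob = comb(n - 1, r - 1) * p ** r * (1 - p) ** (n - r)
print(round(prob, 3))  # 0.185
```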
Example: In an amusement fair, a competitor is entitled for a prize if he throws a ring on a peg from a certain distance. It is observed that only 30% of the competitors can do this. If someone is given 5 chances, what is the probability of his winning the prize when he has already missed 4 chances?
Solution:
If someone has already missed four chances and must win on the fifth chance, this is a probability experiment of getting the first success in 5 trials. The problem statement therefore suggests that the probability distribution is geometric.
Here,
p = 30% = 0.3
( 1 – p ) = 1 – 0.3 = 0.7
n = 5
The geometric distribution formula is P(X = n) = (1 − p)^(n−1) × p. Substituting all the values, we get

P(X = 5) = (0.7)^4 × 0.3 = 0.2401 × 0.3 ≈ 0.072
Therefore, the probability of his winning the prize when he has already missed 4 chances is 0.072 i.e. 7.2%
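This one-line geometric probability is easy to check in Python:

```python
p = 0.3  # probability of throwing the ring onto the peg
n = 5    # trial on which the first success occurs

# geometric distribution: P(X = n) = (1 - p)^(n - 1) * p
prob = (1 - p) ** (n - 1) * p
print(round(prob, 3))  # 0.072
```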
Example: On average, 3 students attend the class each day. Find the probability that exactly 4 students attend the class tomorrow.
Solution:
Given,
Average rate (λ) = 3
Poisson random variable (r) = 4
Poisson distribution: P(X = r) = e^(−λ) × λ^r / r!

P(X = 4) = e^(−3) × 3^4 / 4!

P(X = 4) ≈ 0.16803
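The Poisson probability above can be evaluated with nothing more than `exp` and `factorial`:

```python
from math import exp, factorial

lam = 3  # average rate
r = 4    # number of occurrences

# Poisson distribution: P(X = r) = e^(-lam) * lam^r / r!
prob = exp(-lam) * lam ** r / factorial(r)
print(round(prob, 5))  # 0.16803
```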
A Bernoulli distribution has two possible outcomes: success and failure. The simplest example of a Bernoulli distribution is flipping a coin, which has only two possible outcomes, heads or tails.
Let p be the probability of success and 1 − p the probability of failure. Then the PMF (Probability Mass Function) is given by

P(X = x) = p^x × (1 − p)^(1−x), for x ∈ {0, 1}
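A minimal sketch of this PMF (the function name is illustrative):

```python
def bernoulli_pmf(x, p):
    """PMF of a Bernoulli variable: p for success (x = 1), 1 - p for failure (x = 0)."""
    return p ** x * (1 - p) ** (1 - x)

# fair coin: both outcomes are equally likely
print(bernoulli_pmf(1, 0.5), bernoulli_pmf(0, 0.5))  # 0.5 0.5
```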
A continuous distribution is a probability distribution in which the random variable X can take on any value (is continuous). Because there are infinitely many values that X could assume, the probability of X taking on any one specific value is zero.
For continuous random variables, as we discussed, the probability that X takes on any particular value x is 0, so trying to find P(X = x) for a continuous random variable X is not going to work. Instead, we need to find the probability that X falls in some interval (a, b), that is, P(a < X < b). We can achieve this with the probability density function.
In the standard normal distribution, the mean is zero and the standard deviation is one. A normal distribution has zero skew and a kurtosis of 3. Normal distributions are symmetrical, but not all symmetrical distributions are normally distributed.
Normal distribution fits most of the natural phenomena.
The probability density function for the normal distribution is given by

f(x) = ( 1 / (σ√(2π)) ) × e^( −(x − μ)² / (2σ²) )
Any data that is normally distributed follows the 1-2-3 rule (also known as the empirical or 68-95-99.7 rule). This rule states that:
- about 68% of the data falls within 1 standard deviation of the mean,
- about 95% falls within 2 standard deviations, and
- about 99.7% falls within 3 standard deviations.
The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of 1, and its probability density function is given by

f(z) = ( 1 / √(2π) ) × e^( −z² / 2 )
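We can check the 1-2-3 rule numerically: for a standard normal variable Z, P(|Z| ≤ k) = erf(k/√2), which the standard library lets us evaluate directly.

```python
from math import erf, sqrt

def prob_within(k):
    """P(|Z| <= k) for a standard normal variable Z."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```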
Exponential Distribution:
The exponential distribution models the time between events in a Poisson process. Its probability density function is

f(x; λ) = λe^(−λx) for x ≥ 0

where λ > 0 is the rate parameter.
The F distribution is the probability distribution related to the f statistic. The distribution of all possible values of the f statistic is called an F distribution.
If a random variable X has an F-distribution with parameters d1 and d2, then the PDF for X is given by

f(x; d1, d2) = √( (d1·x)^d1 × d2^d2 / (d1·x + d2)^(d1+d2) ) / ( x × B(d1/2, d2/2) )

where B is the beta function.
[Figure: plot of the F-distribution PDF]
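A sketch of this PDF in plain Python, using the gamma function to evaluate the beta function. As a sanity check, for d1 = d2 = 2 the F density simplifies to 1/(1 + x)², which equals 0.25 at x = 1:

```python
from math import gamma, sqrt

def f_pdf(x, d1, d2):
    """PDF of the F distribution with degrees of freedom d1 and d2 (x > 0)."""
    beta = gamma(d1 / 2) * gamma(d2 / 2) / gamma((d1 + d2) / 2)  # B(d1/2, d2/2)
    num = sqrt((d1 * x) ** d1 * d2 ** d2 / (d1 * x + d2) ** (d1 + d2))
    return num / (x * beta)

print(f_pdf(1.0, 2, 2))  # 0.25
```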
Hope you enjoyed the article. Keep reading!