In data science, the ability to derive meaningful insights from data is a crucial skill, and a fundamental understanding of statistical tests underpins it. These tests allow data scientists to validate hypotheses, compare groups, identify relationships, and make predictions with confidence. Whether you're analyzing customer behavior, optimizing algorithms, or conducting scientific research, a solid grasp of statistical tests is indispensable. This article explores the essential statistical tests every data scientist should know.
Why Statistical Tests Matter
Hypothesis validation: Statistical tests allow data scientists to objectively assess whether observed patterns in data are likely to be real or just due to chance.
Decision making: They provide a quantitative basis for making decisions, helping to remove subjectivity and gut feelings from the process.
Comparing groups: Tests enable meaningful comparisons between different groups or conditions in a dataset.
Identifying relationships: Many tests help uncover and quantify relationships between variables.
Model validation: Statistical tests are crucial in assessing the validity and performance of predictive models.
Quality control: They help in detecting anomalies or significant changes in data patterns.
5 Statistical Tests Every Data Scientist Should Know
Z-test
A z-test is a statistical test used to determine whether there is a significant difference between sample and population means or between the means of two samples when the variances are known and the sample size is large (typically n > 30). It is based on the z-distribution (also known as the standard normal distribution), which is a normal distribution with a mean of 0 and a standard deviation of 1.
Formula
For a single sample z-test, the test statistic (z) is calculated as:
z = (x̅ - μ) / (σ / √n)
Where:
x̅ is the sample mean.
μ is the hypothesized population mean.
σ is the population standard deviation (assumed to be known).
n is the sample size.
Steps for Conducting a Z-Test:
Here are the steps for conducting a z-test:
1. State your hypothesis:
Null hypothesis (H₀): This is the default assumption you aim to disprove. In a z-test, it typically states that there’s no significant difference between the means you’re comparing.
Alternative hypothesis (H₁): This is what you believe to be true and what the z-test will help you assess. It can be one-tailed (specifies a direction for the difference) or two-tailed (doesn’t specify a direction).
2. Choose your significance level (α): This value, denoted by alpha (α), represents the probability of rejecting the null hypothesis when it’s actually true (a type I error). Common choices for alpha are 0.05 (5%) or 0.01 (1%). A lower alpha indicates a stricter test, requiring stronger evidence to reject the null hypothesis.
3. Determine the appropriate z-test type: Select the z-test that aligns with your research question:
One-sample z-test: Compares one sample mean to a hypothesized value.
Two-sample z-test: Compares the means of two independent samples.
Z-test for proportions: Compares proportions (e.g., a sample proportion to a hypothesized population proportion).
4. Calculate the test statistic (z-score): Use the appropriate formula. This calculation involves the sample means, hypothesized population mean (for one-sample test), standard deviations (or estimated values), and sample sizes.
5. Find the critical value (z_critical): Look up the z-critical value in a standard normal distribution table based on your chosen significance level (alpha).
6. Interpret the results: Compare the absolute value of your calculated z-statistic (|z|) to the z_critical value. If |z| is greater than the critical value, reject the null hypothesis (evidence of a difference). If not, fail to reject the null hypothesis (insufficient evidence for a difference).
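As a quick sketch, the one-sample z-test steps above can be run in Python with scipy. The sample values, hypothesized mean, and population standard deviation below are invented purely for illustration:

```python
import math
from scipy import stats

# Illustrative data: 36 observations (n > 30, as the z-test assumes).
# These values, mu, and sigma are made up for this example.
sample = [50 + 0.1 * i for i in range(36)]
mu = 50.0     # hypothesized population mean (H0)
sigma = 5.0   # known population standard deviation

n = len(sample)
x_bar = sum(sample) / n

# Step 4: z = (x̄ - μ) / (σ / √n)
z = (x_bar - mu) / (sigma / math.sqrt(n))

# Two-tailed p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))

# Step 6: compare p against alpha (or |z| against z_critical)
alpha = 0.05
reject = p_value < alpha
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```

Equivalently, you can compare |z| with the critical value `stats.norm.ppf(1 - alpha / 2)` (about 1.96 for α = 0.05).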
T-Test
A t-test is a statistical test used to determine whether there is a significant difference between the means of two groups. It helps determine whether the differences observed in sample data are likely to exist in the population from which the samples were drawn.
There are three main types of T-tests:
One-Sample T-test
Independent (Two-Sample) T-test
Paired Sample T-test
Formula:
The formula for a t-test depends on the specific type of t-test you’re performing:
1. One-sample t-test:
This formula compares the mean of one sample (x̅) to a hypothesized population mean (μ). It’s similar to a one-sample z-test but uses the sample standard deviation (s) instead of the population standard deviation.
t = (x̅ - μ) / (s / √n)
Where:
x̅ is the sample mean.
μ is the hypothesized population mean.
s is the sample standard deviation.
n is the sample size.
2. Independent (two-sample) t-test:
This formula compares the means of two independent samples (x̅₁ and x̅₂). It considers the separate sample standard deviations (s₁ and s₂).
t = (x̅₁ - x̅₂) / √(s₁² / n₁ + s₂² / n₂)
Where:
x̅₁ and x̅₂ are the means of the two samples.
s₁² and s₂² are the variances of the two samples (estimated from sample data).
n₁ and n₂ are the sizes of the two samples.
3. Paired t-test:
This formula compares the means of paired differences (d) between two related groups.
t = (d̅) / (s_d / √n)
Where:
d̅ is the mean of the paired differences.
s_d is the standard deviation of the paired differences.
n is the number of pairs.
Steps for Conducting a T-Test:
Here’s a breakdown of the steps to calculate a t-test:
State your hypotheses:
Null hypothesis (H₀): This is the “no difference” scenario you aim to disprove.
Alternative hypothesis (H₁): This is what you believe might be true.
Choose significance level (α): This is the probability of rejecting a true null hypothesis (usually 0.05).
Identify the appropriate t-test type:
One-sample t-test (comparing one sample to a hypothesized mean).
Independent (two-sample) t-test (comparing means of two independent groups).
Paired t-test (comparing means of paired or related samples).
Collect and organize your data: Ensure your data is numerical and ideally follows a normal distribution.
Calculate the relevant statistics:
Depending on the chosen t-test type, calculate the mean, standard deviation, and sample size for each group (or for the single sample).
If using a paired t-test, calculate the mean and standard deviation of the differences between paired samples.
Determine the degrees of freedom (df): This value depends on the sample size(s) and the t-test type: n – 1 for a one-sample test, n₁ + n₂ – 2 for an independent test with pooled variances (Welch's version uses an adjusted value), and n – 1 (number of pairs minus one) for a paired test.
Calculate the t-statistic: Use the appropriate formula (refer to previous explanation of t-test formulas) based on your chosen t-test type.
Find the critical value: Look up the t-value on a t-distribution table corresponding to your chosen significance level (α) and the degrees of freedom (df) you calculated in step 6.
Interpret the results:
If the absolute value of your calculated t-statistic is greater than the critical value from the table, reject the null hypothesis (evidence of a significant difference).
If not, fail to reject the null hypothesis (insufficient evidence for a difference).
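A minimal sketch of an independent (two-sample) t-test using scipy; the two groups below are invented for illustration:

```python
from scipy import stats

# Hypothetical scores for two independent groups (made-up data)
group_a = [23.1, 25.4, 22.8, 24.9, 26.2, 23.7, 25.0, 24.3]
group_b = [27.5, 28.1, 26.9, 29.3, 27.8, 28.6, 26.4, 28.0]

# equal_var=False gives Welch's t-test, which matches the two-sample
# formula above (separate variances s₁², s₂² rather than a pooled one).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05
if p_value < alpha:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: reject H0")
else:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: fail to reject H0")
```

scipy also provides `stats.ttest_1samp` and `stats.ttest_rel` for the one-sample and paired variants.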
ANOVA (Analysis of Variance)
ANOVA, or Analysis of Variance, is a statistical method used to compare the means of three or more groups to determine whether there are any statistically significant differences between them. There are three types of ANOVA tests:
One-Way ANOVA: Compares the means of three or more independent (unrelated) groups based on one factor.
Two-Way ANOVA: Compares the means of groups that are split on two factors and can show interaction effects between the factors.
Repeated Measures ANOVA: Used when the same subjects are used for each treatment.
Steps in Conducting ANOVA
1. Formulate Hypotheses:
Null hypothesis (H₀): All group means are equal (µ₁ = µ₂ = µ₃ = … = µₖ).
Alternative hypothesis (H₁): At least one group mean is different.
2. Calculate Group Means and Overall Mean: Compute the mean of each group and the grand mean (overall mean of all observations).
3. Calculate Sums of Squares:
Total Sum of Squares (SST): Measures the total variation in the data.
Between-Group Sum of Squares (SSB): Measures the variation between the group means.
Within-Group Sum of Squares (SSW): Measures the variation within each group.
4. Calculate Degrees of Freedom (df):
df between groups (df₁): k – 1 (where k is the number of groups).
df within groups (df₂): N – k (where N is the total number of observations).
5. Compute Mean Squares:
Mean Square Between (MSB): SSB / df₁
Mean Square Within (MSW): SSW / df₂
6. Calculate the F-Statistic:
F = MSB / MSW
7. Determine the p-Value:
Compare the calculated F-value with the critical F-value from F-distribution tables based on the degrees of freedom and chosen significance level (usually 0.05).
8. Make a Decision:
If the p-value is less than the significance level, reject the null hypothesis (indicating that there are significant differences between group means).
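The steps above can be sketched with scipy's one-way ANOVA helper, which computes the sums of squares, mean squares, and F-statistic internally. The three groups below are invented for illustration:

```python
from scipy import stats

# Hypothetical measurements for three independent groups (made-up data)
group1 = [5.1, 4.9, 5.3, 5.0, 5.2]
group2 = [5.8, 6.1, 5.9, 6.2, 6.0]
group3 = [4.4, 4.6, 4.3, 4.7, 4.5]

# One-way ANOVA: f_oneway returns F = MSB / MSW and the p-value
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

A significant result only tells you that at least one mean differs; a post-hoc test (e.g., Tukey's HSD) is needed to identify which groups differ.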
F-Test
The F-test is a statistical tool used to compare the variances of two normally distributed populations. It helps determine whether there's a statistically significant difference in how spread out the data are between the two groups.
Formula:
F = σ₁² / σ₂²
Where:
F is the F-statistic (test statistic).
σ₁² is the variance of the first population (in practice, estimated by the sample variance s₁²).
σ₂² is the variance of the second population (in practice, estimated by the sample variance s₂²).
Steps to Conduct F-Test:
State the null and alternative hypotheses:
Null hypothesis (H₀): The variances of the two populations are equal (σ₁² = σ₂²).
Alternative hypothesis (H₁): The variances of the two populations are not equal (σ₁² ≠ σ₂²).
Calculate the sample variances (s₁² and s₂²) for each group.
Compute the F-statistic using the formula F = s₁² / s₂². Place the larger variance in the numerator to ensure a right-tailed test (more common scenario).
Determine the degrees of freedom: This considers the sample sizes of both groups. You’ll need to look up F-critical values in a table based on these degrees of freedom and your chosen significance level (usually 0.05).
Interpret the results:
If the F-statistic is greater than the F-critical value, you reject the null hypothesis and conclude there’s a significant difference in variances between the two populations.
If the F-statistic is less than or equal to the F-critical value, you fail to reject the null hypothesis. There’s not enough evidence to say the variances are statistically different.
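As a sketch of the steps above, here is an F-test in Python; the two samples are invented so that the first is visibly more spread out than the second:

```python
import statistics
from scipy import stats

# Hypothetical samples (made-up data)
sample1 = [12.4, 15.1, 9.8, 14.3, 11.0, 16.2, 10.5, 13.7]  # more spread
sample2 = [12.1, 12.6, 11.9, 12.4, 12.2, 12.7, 12.0, 12.5]  # less spread

var1 = statistics.variance(sample1)   # s₁²
var2 = statistics.variance(sample2)   # s₂²

# Larger variance in the numerator, for a right-tailed test
f_stat = max(var1, var2) / min(var1, var2)
df1 = len(sample1) - 1   # df of the numerator sample (larger variance here)
df2 = len(sample2) - 1   # df of the denominator sample

alpha = 0.05
f_critical = stats.f.ppf(1 - alpha, df1, df2)
reject = f_stat > f_critical
print(f"F = {f_stat:.2f}, F_critical = {f_critical:.2f}, reject H0: {reject}")
```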
Chi-Square Test
The Chi-Square test is a statistical method used to determine if there is a significant association between two categorical variables. It’s widely used in hypothesis testing to assess the goodness of fit or the independence between variables.
There are two types of Chi-Square Tests:
Chi-Square Test for Independence
Chi-Square Test for Goodness of Fit
Chi-Square Test for Independence
The Chi-Square Test for Independence is a statistical test used to determine if there’s a relationship between two categorical variables. Here’s a breakdown of the test and its formula:
Formula:
The Chi-Square test statistic (Χ², chi-squared) is calculated using the following formula:
Χ² = Σ (O − E)² / E
Where:
Σ (sigma) represents summation across all categories (i x j, where i is the number of rows and j is the number of columns in the contingency table).
O = Observed frequency for a particular category combination.
E = Expected frequency for the same category combination (calculated based on the assumption of independence).
Steps to Calculate Chi-Square Test for Independence
Create a contingency table: Fill it with observed frequencies for each combination of variable categories.
Calculate expected frequencies: Consider the row and column totals and the overall sample size to determine what the expected frequencies would be if the variables were independent.
Compute (O-E) for each category: Subtract the expected frequency from the observed frequency for each cell.
Square (O-E) for each category.
Divide (O-E)² by E for each category.
Sum all the values from step 5. This sum is your Chi-Square test statistic (Χ²).
Interpretation:
A higher Chi-Square value indicates stronger evidence against the null hypothesis (that the variables are independent).
You need to compare the Chi-Square statistic to a critical value from the Chi-Square distribution table based on the degrees of freedom (calculated as (number of rows – 1) * (number of columns – 1)) and your chosen significance level (usually 0.05).
If the Chi-Square statistic is greater than the critical value, you reject the null hypothesis and conclude there’s a relationship between the variables.
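The whole procedure above can be sketched with scipy, which builds the expected frequencies and degrees of freedom from the contingency table for you. The 2x2 table below is invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of observed frequencies,
# e.g. preference (rows) by group (columns); the counts are made up.
observed = np.array([[30, 10],
                     [20, 40]])

# chi2_contingency computes expected frequencies under independence,
# df = (rows - 1) * (cols - 1), the statistic, and the p-value
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
print("Expected frequencies under independence:\n", expected)
```

Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default (`correction=True`).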
Chi-Square Test for Goodness of Fit
The Chi-Square Test for Goodness of Fit is a different application of the Chi-Square statistic used to assess how well a sample distribution fits a hypothesized probability distribution.
Formula:
Similar to the Chi-Square Test for Independence, the Goodness of Fit test statistic (Χ², chi-squared) is calculated using the following formula:
Χ² = Σ (O − E)² / E
Where:
Σ (sigma) represents summation across all categories (i, where i is the number of categories).
O = Observed frequency for a particular category.
E = Expected frequency for the same category (calculated based on the hypothesized probability distribution).
Steps to Calculate Chi-Square Test for Goodness of Fit:
Define the expected distribution: Specify the theoretical distribution you’re comparing your data to.
Calculate expected frequencies: Based on the chosen distribution and its parameters, calculate how often each category should occur in your sample size.
Create a table: Organize your observed data frequencies and the calculated expected frequencies.
Compute (O-E) for each category. Subtract the expected frequency from the observed frequency for each category.
Square (O-E) for each category.
Divide (O-E)² by E for each category.
Sum all the values from step 6. This sum is your Chi-Square test statistic (Χ²).
Interpretation:
A higher Chi-Square value indicates a stronger deviation from the hypothesized distribution.
You need to compare the Chi-Square statistic to a critical value from the Chi-Square distribution table based on the degrees of freedom (calculated as the number of categories minus 1) and your chosen significance level (usually 0.05).
If the Chi-Square statistic is greater than the critical value, you reject the null hypothesis (that the data follow the hypothesized distribution) and conclude there's a significant difference between your data and that distribution.
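The steps above can be sketched with scipy's goodness-of-fit helper. The example is an invented one: testing whether a six-sided die is fair from 120 hypothetical rolls, so each face is expected 20 times under the uniform null distribution:

```python
from scipy import stats

# Hypothetical counts of each face over 120 rolls (made-up data)
observed = [18, 22, 16, 25, 21, 18]
expected = [20, 20, 20, 20, 20, 20]  # uniform: 120 rolls / 6 faces

# chisquare sums (O - E)² / E across categories (df = categories - 1)
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Here the p-value comes out well above 0.05, so we fail to reject the null hypothesis: these observed rolls are consistent with a fair die.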
Conclusion
In data science, statistical tests are essential tools for uncovering insights and making informed decisions. The z-test, t-test, ANOVA, F-test, and chi-square test each play a crucial role in analyzing different aspects of data. By mastering these tests, data scientists can confidently validate hypotheses, compare groups, and identify relationships within their data. Remember, the key to success lies not just in knowing how to perform these tests, but in understanding when and why to use each one. Armed with this knowledge, you’ll be well-equipped to tackle complex data challenges and drive data-driven decision-making in any field.
Data Analyst with over 2 years of experience in leveraging data insights to drive informed decisions. Passionate about solving complex problems and exploring new trends in analytics. When not diving deep into data, I enjoy playing chess, singing, and writing shayari.