5 Statistical Tests Every Data Scientist Should Know

Aayush Tyagi Last Updated : 22 Jul, 2024

Introduction

In data science, the ability to derive meaningful insights from data is a crucial skill, and it rests on a fundamental understanding of statistical tests. These tests allow data scientists to validate hypotheses, compare groups, identify relationships, and make predictions with confidence. Whether you’re analyzing customer behavior, optimizing algorithms, or conducting scientific research, a solid grasp of statistical tests is indispensable. This article explores the essential statistical tests every data scientist should know.

Introduction
Role of Statistical Tests in Data Science
5 Statistical Tests Every Data Scientist Should Know
Conclusion

Role of Statistical Tests in Data Science

Hypothesis validation: Statistical tests allow data scientists to objectively assess whether observed patterns in data are likely to be real or just due to chance.
Decision making: They provide a quantitative basis for making decisions, helping to remove subjectivity and gut feelings from the process.
Comparing groups: Tests enable meaningful comparisons between different groups or conditions in a dataset.
Identifying relationships: Many tests help uncover and quantify relationships between variables.
Model validation: Statistical tests are crucial in assessing the validity and performance of predictive models.
Quality control: They help in detecting anomalies or significant changes in data patterns.

5 Statistical Tests Every Data Scientist Should Know

Z-test

A z-test is a statistical test used to determine whether there is a significant difference between sample and population means or between the means of two samples when the variances are known and the sample size is large (typically n > 30). It is based on the z-distribution (also known as the standard normal distribution), which is a normal distribution with a mean of 0 and a standard deviation of 1.

Formula

For a single sample z-test, the test statistic (z) is calculated as:

z = (x̅ - μ) / (σ / √n)

Where:

x̅ is the sample mean.
μ is the hypothesized population mean.
σ is the population standard deviation (assumed to be known).
n is the sample size.

Steps for Conducting a Z-Test:

Here are the steps for conducting a z-test:

1. State your hypothesis:

Null hypothesis (H₀): This is the default assumption you aim to disprove. In a z-test, it typically states that there’s no significant difference between the means you’re comparing.
Alternative hypothesis (H₁): This is what you believe to be true and what the z-test will help you assess. It can be one-tailed (specifies a direction for the difference) or two-tailed (doesn’t specify a direction).

2. Choose your significance level (α): This value, denoted by alpha (α), represents the probability of rejecting the null hypothesis when it’s actually true (a type I error). Common choices for alpha are 0.05 (5%) or 0.01 (1%). A lower alpha indicates a stricter test, requiring stronger evidence to reject the null hypothesis.

3. Determine the appropriate z-test type: Select the z-test that aligns with your research question:

One-sample z-test: Compares one sample mean to a hypothesized value.
Two-sample z-test: Compares the means of two independent samples.
Z-test for proportions: Compares a sample proportion to a hypothesized value, or the proportions of two samples to each other.

4. Calculate the test statistic (z-score): Use the appropriate formula. This calculation involves the sample means, hypothesized population mean (for one-sample test), standard deviations (or estimated values), and sample sizes.

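5. Find the critical value (z_critical): Look up the z-critical value in a standard normal distribution table based on your chosen significance level (alpha) and on whether the test is one-tailed or two-tailed. For example, a two-tailed test at α = 0.05 uses z_critical ≈ 1.96.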

6. Interpret the results: For a two-tailed test, compare the absolute value of your calculated z-statistic (|z|) to the z_critical value. If |z| is greater than the critical value, reject the null hypothesis (evidence of a difference). If not, fail to reject the null hypothesis (insufficient evidence for a difference).
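
The steps above can be sketched in a few lines of Python. This is a minimal illustration of a two-tailed one-sample z-test using only the standard library; the function name one_sample_z_test and the input numbers (sample mean 103, hypothesized mean 100, known σ = 15, n = 100) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test (illustrative helper, not a library API)."""
    # Step 4: test statistic z = (x̅ - μ) / (σ / √n)
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Step 5: two-tailed critical value, replacing the table lookup
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    # Two-tailed p-value, reported alongside the critical-value decision
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # Step 6: reject H0 when |z| exceeds the critical value
    return z, z_critical, p_value, abs(z) > z_critical


# Hypothetical inputs, for illustration only
z, z_crit, p, reject = one_sample_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(f"z = {z:.2f}, z_critical = {z_crit:.2f}, p = {p:.4f}, reject H0: {reject}")
```

With these made-up inputs the statistic is z = 2.0, which is just above the two-tailed critical value of about 1.96, so the null hypothesis would be rejected at α = 0.05.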

T-Test

A t-test is a statistical test used to determine whether there is a significant difference between the means of two groups, or between a sample mean and a hypothesized value, when the population standard deviation is unknown. It helps to determine if the differences observed in sample data are likely to exist in the population from which the samples were drawn.

There are three main types of T-tests:

One-Sample T-test
Independent (Two-Sample) T-test
Paired Sample T-test

Formula:

The formula for a t-test depends on the specific type of t-test you’re performing:

1. One-sample t-test:

This formula compares the mean of one sample (x̅) to a hypothesized population mean (μ). It’s similar to a one-sample z-test but uses the sample standard deviation (s) instead of the population standard deviation.

t = (x̅ - μ) / (s / √n)

Where:

x̅ is the sample mean.
μ is the hypothesized population mean.
s is the sample standard deviation.
n is the sample size.
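
As a rough worked example of this formula, the sketch below computes the one-sample t-statistic with the standard library. The helper name one_sample_t and the five data points are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return (t, df) for a one-sample t-test (illustrative helper)."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation, using the n - 1 denominator
    t = (mean(sample) - mu0) / (s / sqrt(n))
    return t, n - 1  # a one-sample t-test has n - 1 degrees of freedom


# Hypothetical measurements tested against a hypothesized mean of 10
t_stat, df = one_sample_t([12, 14, 15, 11, 13], mu0=10)
print(f"t = {t_stat:.4f}, df = {df}")
```

Here the sample mean is 13 and the sample variance is 2.5, so t = 3 / √0.5 ≈ 4.24 with 4 degrees of freedom. The two-tailed critical value at α = 0.05 for df = 4 is 2.776, so this made-up sample would lead to rejecting the null hypothesis.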

2. Independent (two-sample) t-test:

This formula compares the means of two independent samples (x̅₁ and x̅₂). It uses the separate sample variances (s₁² and s₂²) and does not assume that the two populations have equal variances (this form is known as Welch’s t-test).

t = (x̅₁ - x̅₂) / √(s₁² / n₁ + s₂² / n₂)

Where:

x̅₁ and x̅₂ are the means of the two samples.
s₁² and s₂² are the variances of the two samples (estimated from sample data).
n₁ and n₂ are the sizes of the two samples.
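
The sketch below applies this two-sample formula to two small hypothetical groups. It also computes the Welch-Satterthwaite approximation for the degrees of freedom, which is not given in the text above and is included here as an assumption about how the unequal-variance form is usually completed. The function name two_sample_t is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(sample_1, sample_2):
    """Return (t, df) for the unequal-variance two-sample t-test (illustrative)."""
    n1, n2 = len(sample_1), len(sample_2)
    # Each term is a sample variance divided by its sample size: s² / n
    v1, v2 = variance(sample_1) / n1, variance(sample_2) / n2
    t = (mean(sample_1) - mean(sample_2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation (an addition, not from the article)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical data for two independent groups
t_stat, df = two_sample_t([12, 14, 15, 11, 13], [9, 10, 8, 11, 12])
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Both groups have a sample variance of 2.5, with means 13 and 10, so the denominator is √(0.5 + 0.5) = 1 and t = 3, with approximately 8 degrees of freedom.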

3. Paired t-test:

This formula compares the means of paired differences (d) between two related groups.

t = (d̅) / (s_d / √n)

Where:

d̅ is the mean of the paired differences.
s_d is the standard deviation of the paired differences.
n is the number of pairs.
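
To make the paired formula concrete, here is a small sketch that works on the differences between hypothetical before and after measurements for the same six subjects. The helper name paired_t and the scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t, df) for a paired t-test (illustrative helper)."""
    # The test operates on the per-pair differences only
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical scores for the same subjects before and after a treatment
before = [70, 82, 65, 90, 77, 68]
after = [72, 86, 68, 95, 78, 71]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.4f}, df = {df}")
```

The differences are 2, 4, 3, 5, 1, 3, with mean 3 and sample variance 2, so t = 3 / √(2/6) ≈ 5.20 with 5 degrees of freedom. The two-tailed critical value at α = 0.05 for df = 5 is 2.571.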

Steps for Conducting a T-Test:

Here’s a breakdown of the steps to calculate a t-test:

1. State your hypotheses:
- Null hypothesis (H₀): This is the “no difference” scenario you aim to disprove.
- Alternative hypothesis (H₁): This is what you believe might be true.
2. Choose significance level (α): This is the probability of rejecting a true null hypothesis (usually 0.05).
3. Identify the appropriate t-test type:
- One-sample t-test (comparing one sample to a hypothesized mean).
- Independent (two-sample) t-test (comparing means of two independent groups).
- Paired t-test (comparing means of paired or related samples).
4. Collect and organize your data: Ensure your data is numerical and ideally follows a normal distribution.
5. Calculate the relevant statistics:
- Depending on the chosen t-test type, calculate the mean, standard deviation, and sample size for each group (or for the single sample).
- If using a paired t-test, calculate the mean and standard deviation of the differences between paired samples.
6. Determine the degrees of freedom (df): This value depends on the sample size(s) and varies with the t-test type. It is n – 1 for a one-sample or paired t-test. For an independent two-sample t-test it is n₁ + n₂ – 2 when equal variances are assumed; the unequal-variance form uses an approximation that statistical software typically computes for you.
7. Calculate the t-statistic: Use the appropriate formula from the section above, based on your chosen t-test type.
8. Find the critical value: Look up the t-value on a t-distribution table corresponding to your chosen significance level (α) and the degrees of freedom (df) you calculated in step 6.
9. Interpret the results:
- If the absolute value of your calculated t-statistic is greater than the critical value from the table, reject the null hypothesis (evidence of a significant difference).
- If not, fail to reject the null hypothesis (insufficient evidence for a difference).

ANOVA (Analysis of Variance)

ANOVA, or Analysis of Variance, is a statistical method used to compare the means of three or more groups to determine if there are any statistically significant differences between them. There are 3 types of ANOVA tests:

One-Way ANOVA: Compares the means of three or more independent (unrelated) groups based on one factor.
Two-Way ANOVA: Compares the means of groups that are split on two factors and can show interaction effects between the factors.
Repeated Measures ANOVA: Used when the same subjects are used for each treatment.

Steps in Conducting ANOVA

1. Formulate Hypotheses:

Null hypothesis (H₀): All group means are equal (µ₁ = µ₂ = µ₃ = … = µₖ).
Alternative hypothesis (H₁): At least one group mean is different.

2. Calculate Group Means and Overall Mean: Compute the mean of each group and the grand mean (overall mean of all observations).

3. Calculate Sums of Squares:

Total Sum of Squares (SST): Measures the total variation in the data.
Between-Group Sum of Squares (SSB): Measures the variation between the group means.
Within-Group Sum of Squares (SSW): Measures the variation within each group.

4. Calculate Degrees of Freedom (df):

df between groups (df₁): k – 1 (where k is the number of groups).
df within groups (df₂): N – k (where N is the total number of observations).

5. Compute Mean Squares:

Mean Square Between (MSB): SSB / df₁
Mean Square Within (MSW): SSW / df₂

6. Calculate the F-Statistic:

F = MSB / MSW

7. Determine the p-Value:

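Find the p-value: the probability, under the F-distribution with (df₁, df₂) degrees of freedom, of obtaining an F-value at least as large as the one calculated. Equivalently, compare the calculated F-value with the critical F-value from F-distribution tables for those degrees of freedom and the chosen significance level (usually 0.05).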

8. Make a Decision:

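If the p-value is less than the significance level (or, equivalently, the F-value exceeds the critical F-value), reject the null hypothesis (indicating that at least one group mean differs significantly from the others). Otherwise, fail to reject the null hypothesis.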
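
Steps 2 through 6 can be traced in code. The following sketch computes the sums of squares, degrees of freedom, mean squares, and F-statistic for a one-way ANOVA using only the standard library. The function name one_way_anova and the three tiny groups are hypothetical; the p-value lookup of step 7 is left out because the standard library has no F-distribution.

```python
from statistics import mean


def one_way_anova(groups):
    """Compute the one-way ANOVA table quantities (illustrative helper)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)
    # Between-group variation: group size times squared distance to grand mean
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: squared distance of each value to its group mean
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df1, df2 = k - 1, n_total - k
    msb, msw = ssb / df1, ssw / df2
    return {"SSB": ssb, "SSW": ssw, "df1": df1, "df2": df2, "F": msb / msw}


# Three hypothetical groups of three observations each
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

For this made-up data the group means are 2, 3, and 6, giving SSB = 26, SSW = 6, df₁ = 2, df₂ = 6, and F = 13 / 1 = 13. The critical F-value at α = 0.05 for (2, 6) degrees of freedom is about 5.14, so the null hypothesis of equal means would be rejected.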

F-Test

The F-test is a statistical test used to compare the variances of two normally distributed populations. It helps determine if there’s a statistically significant difference in how spread out the data is between the two groups.

Formula:

F = s₁² / s₂²

Where:

F is the F-statistic (test statistic).
s₁² is the variance of the first sample, used as an estimate of the population variance σ₁².
s₂² is the variance of the second sample, used as an estimate of the population variance σ₂².

Steps to Conduct F-Test:

1. State the null and alternative hypotheses:
- Null hypothesis (H₀): The variances of the two populations are equal (σ₁² = σ₂²).
- Alternative hypothesis (H₁): The variances of the two populations are not equal (σ₁² ≠ σ₂²).
2. Calculate the sample variances (s₁² and s₂²) for each group.
3. Compute the F-statistic using the formula F = s₁² / s₂². Place the larger variance in the numerator so that F ≥ 1 and only the right tail of the distribution needs to be checked.
4. Determine the degrees of freedom: These come from the sample sizes of both groups, with n₁ – 1 for the numerator and n₂ – 1 for the denominator. Look up the F-critical value in a table based on these degrees of freedom and your chosen significance level (usually 0.05).
5. Interpret the results:
- If the F-statistic is greater than the F-critical value, you reject the null hypothesis and conclude there’s a significant difference in variances between the two populations.
- If the F-statistic is less than or equal to the F-critical value, you fail to reject the null hypothesis. There’s not enough evidence to say the variances are statistically different.
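
A minimal sketch of the calculation steps follows, assuming two small hypothetical samples. The helper name variance_ratio_f is illustrative; it returns the F-statistic together with the numerator and denominator degrees of freedom needed for the table lookup.

```python
from statistics import variance


def variance_ratio_f(sample_a, sample_b):
    """Return (F, df_numerator, df_denominator) with the larger variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)  # sample variances
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    # Swap roles so that the larger variance is always the numerator
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


# Hypothetical samples with the same mean but different spread
f_stat, df_num, df_den = variance_ratio_f([15, 17, 18, 19, 21], [10, 14, 18, 22, 26])
print(f"F = {f_stat:.1f}, df = ({df_num}, {df_den})")
```

The sample variances here are 5 and 40, so the function swaps them and returns F = 40 / 5 = 8 with (4, 4) degrees of freedom. That value would then be compared with the F-critical value for those degrees of freedom.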

Chi-Square Test

The Chi-Square test is a statistical method used to determine if there is a significant association between two categorical variables. It’s widely used in hypothesis testing to assess the goodness of fit or the independence between variables.

There are two types of Chi-Square Tests:

Chi-Square Test for Independence
Chi-Square Test for Goodness of Fit

Chi-Square Test for Independence

The Chi-Square Test for Independence is a statistical test used to determine if there’s a relationship between two categorical variables. Here’s a breakdown of the test and its formula:

Formula:

The Chi-Square test statistic (Χ², chi-squared) is calculated using the following formula:

Χ² = Σ ( (O - E)² / E )

Where:

Σ (sigma) represents summation across all cells of the contingency table (i × j cells, where i is the number of rows and j is the number of columns).
O = Observed frequency for a particular category combination.
E = Expected frequency for the same category combination (calculated based on the assumption of independence).

Steps to Calculate Chi-Square Test for Independence

1. Create a contingency table: Fill it with observed frequencies for each combination of variable categories.
2. Calculate expected frequencies: For each cell, the expected frequency under independence is E = (row total × column total) / overall sample size.
3. Compute (O-E) for each cell: Subtract the expected frequency from the observed frequency.
4. Square (O-E) for each cell.
5. Divide (O-E)² by E for each cell.
6. Sum all the values from step 5. This sum is your Chi-Square test statistic (Χ²).

Interpretation:

A higher Chi-Square value indicates stronger evidence against the null hypothesis (that the variables are independent).
You need to compare the Chi-Square statistic to a critical value from the Chi-Square distribution table based on the degrees of freedom (calculated as (number of rows – 1) * (number of columns – 1)) and your chosen significance level (usually 0.05).
If the Chi-Square statistic is greater than the critical value, you reject the null hypothesis and conclude there’s a relationship between the variables.
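
The calculation can be sketched as follows for a contingency table stored as a list of rows. The function name chi_square_independence and the 2 × 2 table of counts are hypothetical.

```python
def chi_square_independence(observed):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two variables were independent
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical 2 x 2 table: rows are groups, columns are outcomes
chi2, df = chi_square_independence([[20, 30], [30, 20]])
print(f"chi2 = {chi2:.2f}, df = {df}")
```

Every expected count in this made-up table is 25, each cell contributes 25 / 25 = 1, and the statistic is 4.0 with 1 degree of freedom. That exceeds the critical value of 3.841 at α = 0.05, so independence would be rejected for this example.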

Chi-Square Test for Goodness of Fit

The Chi-Square Test for Goodness of Fit is a different application of the Chi-Square statistic used to assess how well a sample distribution fits a hypothesized probability distribution.

Formula:

Similar to the Chi-Square Test for Independence, the Goodness of Fit test statistic (Χ², chi-squared) is calculated using the following formula:

Χ² = Σ ( (O - E)² / E )

Where:

Σ (sigma) represents summation across all categories (i, where i is the number of categories).
O = Observed frequency for a particular category.
E = Expected frequency for the same category (calculated based on the hypothesized probability distribution).

Steps to Calculate Chi-Square Test for Goodness of Fit:

1. Define the expected distribution: Specify the theoretical distribution you’re comparing your data to.
2. Calculate expected frequencies: Based on the chosen distribution and its parameters, calculate how often each category should occur in your sample size.
3. Create a table: Organize your observed data frequencies and the calculated expected frequencies.
4. Compute (O-E) for each category: Subtract the expected frequency from the observed frequency for each category.
5. Square (O-E) for each category.
6. Divide (O-E)² by E for each category.
7. Sum all the values from step 6. This sum is your Chi-Square test statistic (Χ²).

Interpretation:

A higher Chi-Square value indicates a stronger deviation from the hypothesized distribution.
You need to compare the Chi-Square statistic to a critical value from the Chi-Square distribution table based on the degrees of freedom (calculated as the number of categories minus 1) and your chosen significance level (usually 0.05).
If the Chi-Square statistic is greater than the critical value, you reject the null hypothesis (that the data follow the hypothesized distribution) and conclude there’s a significant difference between your data and the hypothesized distribution. Otherwise, you fail to reject the null hypothesis.
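
As an illustration, the sketch below tests whether a die looks fair. The function name chi_square_goodness_of_fit and the counts from 60 hypothetical rolls are invented for this example.

```python
def chi_square_goodness_of_fit(observed, expected_probs):
    """Return (chi2, df) comparing observed counts with hypothesized probabilities."""
    n = sum(observed)
    # Expected count per category under the hypothesized distribution
    expected = [p * n for p in expected_probs]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


# Hypothetical counts of faces 1 to 6 over 60 rolls; a fair die expects 10 each
observed_counts = [8, 12, 9, 11, 6, 14]
chi2, df = chi_square_goodness_of_fit(observed_counts, [1 / 6] * 6)
print(f"chi2 = {chi2:.2f}, df = {df}")
```

The squared deviations are 4, 4, 1, 1, 16, and 16, which sum to 42, so the statistic is 42 / 10 = 4.2 with 5 degrees of freedom. That is below the critical value of 11.070 at α = 0.05, so these made-up rolls give no evidence that the die is unfair.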

Conclusion

In data science, statistical tests are essential tools for uncovering insights and making informed decisions. The z-test, t-test, ANOVA, F-test, and chi-square test each play a crucial role in analyzing different aspects of data. By mastering these tests, data scientists can confidently validate hypotheses, compare groups, and identify relationships within their data. Remember, the key to success lies not just in knowing how to perform these tests, but in understanding when and why to use each one. Armed with this knowledge, you’ll be well-equipped to tackle complex data challenges and drive data-driven decision-making in any field.

Aayush Tyagi

Data Analyst with over 2 years of experience in leveraging data insights to drive informed decisions. Passionate about solving complex problems and exploring new trends in analytics. When not diving deep into data, I enjoy playing chess, singing, and writing shayari.
