Let me take you into the world of chi-square tests and how to run them in Python with the scipy library. We'll cover the main types of chi-square tests and work through a practical example of the chi-square test of independence. Whether you are just starting out or are an experienced data analyst, this guide will equip you with practical examples and insights so you can confidently apply chi-square tests in your own work.
By the end of this article, you will understand what a chi-square test is, the different types of chi-square tests, and how to run one in Python using the scipy library.
The Chi-Square test is a statistical procedure used to evaluate the relationship between two categorical variables. The test is quite straightforward: it compares the observed frequencies of the variables with the frequencies we would expect under the assumption that there is no association between them. The Chi-Square test of independence is the most commonly used kind of Chi-Square test. It is applied in situations where we have two categorical variables, such as obesity and the occurrence of heart failure, and we want to investigate whether there is an association between them. By doing this we can determine whether the sample falls into categories in line with our expectations for the variable distribution.
There are several types of Chi-Square Tests, including the chi-square goodness of fit test, the chi-square test of independence, and the chi-square test for homogeneity. The type of test used will depend on the specific research question being addressed and the type of data being analyzed.
Chi-square Goodness of Fit Test: This test is used to determine whether the observed frequencies of a single categorical variable differ significantly from the expected frequencies.
Chi-square Test of Independence: This is a statistical hypothesis test used to determine whether two categorical (nominal) variables are likely to be related or not.
Chi-square Test for Homogeneity: This test is used by statisticians to check whether different rows and/or columns of data in a table come from the same population.
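As a quick orientation, here is a minimal sketch of which scipy functions correspond to these tests, using made-up counts purely for illustration: scipy.stats.chisquare covers the goodness of fit case, while scipy.stats.chi2_contingency handles independence and homogeneity.

from scipy.stats import chisquare, chi2_contingency

# Goodness of fit: do 100 hypothetical die rolls match a fair-die expectation?
observed_rolls = [18, 22, 16, 14, 12, 18]
stat, p = chisquare(observed_rolls)            # expected frequencies default to uniform
print(stat, p)

# Independence / homogeneity: a hypothetical 2x2 table of observed counts
table = [[30, 70],
         [45, 55]]
stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)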
To calculate the Chi-Square statistic, the observed frequencies are compared to the expected frequencies. The formula for the Chi-Square statistic is:
Chi-Square = Σ((Observed – Expected)^2 / Expected)
Where Observed is the observed frequency for each category and Expected is the expected frequency for each category.
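To make the formula concrete, here is a minimal sketch that applies it directly with numpy; the observed and expected counts are hypothetical and only for illustration.

import numpy as np

# Hypothetical observed and expected frequencies for four categories
observed = np.array([50, 30, 15, 5])
expected = np.array([40, 35, 20, 5])

# Chi-Square = sum((Observed - Expected)^2 / Expected)
chi_square = np.sum((observed - expected) ** 2 / expected)
print(chi_square)   # about 4.46 for these numbers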
Let me walk through a real-world example of the Chi-Square test to see how it can help determine whether there is a relationship between obesity and heart failure. For this, I used a sample of patient records containing body mass index (BMI), which I used to categorize each patient as obese or non-obese, along with whether the patient experienced heart failure.
To calculate the Chi-Square statistic, I created a contingency table showing the number of patients in each combination of the obesity (based on BMI) and heart failure categories. I then estimated the expected frequency for each cell of the table under the assumption that there is no association between the two variables. Finally, using the Chi-Square formula, I compared the observed and expected frequencies to check for a significant association between the two variables.
If the calculated Chi-Square statistic is greater than the critical value, I reject the null hypothesis that there is no link between obesity and heart failure, which would suggest that obesity is associated with heart failure. Conducting such tests helps us gain valuable insights into relationships within a sample population and develop preventative measures to improve patient outcomes.
Now that we have seen how the process works in theory, let me show you practically how the calculations are done:
H0: Obesity and heart failure are independent
HA: Obesity and heart failure are not independent
First, we calculate the row, column, and overall totals by summing up the observed frequencies in the contingency table.
To estimate how many obese patients would not have experienced heart failure in our sample purely by chance, we use the expected values. Each expected value is calculated by multiplying the corresponding row total by the corresponding column total and dividing by the overall sample total; the same calculation applies to every cell of the table.
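The actual patient counts appear in the article's contingency table, so the numbers below are hypothetical; this minimal sketch simply shows how each expected cell is obtained as (row total * column total) / grand total.

import numpy as np

# Hypothetical 2x2 table of observed counts (illustration only):
#                heart failure   no heart failure
#   obese              60               40
#   non-obese          30               70
observed = np.array([[60, 40],
                     [30, 70]])

row_totals = observed.sum(axis=1, keepdims=True)   # [[100], [100]]
col_totals = observed.sum(axis=0, keepdims=True)   # [[90, 110]]
grand_total = observed.sum()                       # 200

# expected cell count = row total * column total / grand total
expected = row_totals * col_totals / grand_total
print(expected)   # [[45. 55.]
                  #  [45. 55.]]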
Now, let’s calculate the Chi-Square value using the below formula:
Chi-Square = Σ((Observed – Expected)^2 / Expected)
And here are the results:
Finally, let’s add all the values to find out Chi-Square
Chi-Square = 53.63
Now, we need to choose an alpha level for our test. Let's set the alpha level to 0.05 and find the p-value for our Chi-Square statistic. I used an online Chi-Square calculator to obtain the p-value.
The p-value is less than 0.00001, which is well below 0.05 (our alpha level).
Hence, the result is significant: we reject the null hypothesis and conclude that there is a relationship between obesity and heart failure.
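For completeness, the same critical value and p-value can be obtained from scipy's chi-square distribution instead of an online calculator; this is a minimal sketch assuming the statistic of 53.63 and one degree of freedom for a 2x2 table.

from scipy.stats import chi2

alpha = 0.05
dof = 1                                    # (rows - 1) * (columns - 1) for a 2x2 table
chi_square = 53.63                         # the statistic calculated above

critical_value = chi2.ppf(1 - alpha, dof)  # about 3.84
p_value = chi2.sf(chi_square, dof)         # survival function = 1 - CDF

print(critical_value, p_value)
# 53.63 > 3.84 and p < 0.05, so we reject the null hypothesis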
Now that we have done all the fun calculations manually, let's see if we can do the same using Python. As mentioned earlier, we will use the chi2_contingency function from the scipy package.
I'm using the crosstab() function from the pandas library, with the obesity variable grouping the rows and the heart failure variable grouping the columns. We also set margins to True to add row and column subtotals.
import pandas as pd  # df is assumed to already hold the 'obesity' and 'heart_failure' columns

heartfailure_crosstab = pd.crosstab(df['obesity'], df['heart_failure'],
                                    margins=True, margins_name="subtotal")
It returns a contingency table with the observed counts for each combination of obesity and heart failure, plus the subtotal row and column.
I used the scipy.stats.chi2_contingency function to calculate both my chi-square and p values.
To use this function, I run the following (passing the table without the subtotal row and column, since the test needs only the raw observed counts):
from scipy.stats import chi2_contingency
chi, p, dof, expected = chi2_contingency(heartfailure_crosstab.iloc[:-1, :-1])  # drop the subtotals
On a successful run, the function assigns the chi-square statistic to chi, the p-value to p, the degrees of freedom to dof, and the table of expected frequencies to expected.
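Besides the p-value, it can be helpful to glance at the other return values; the comments below describe what they contain (the exact numbers depend on the underlying dataset, which is not reproduced here).

print(dof)       # degrees of freedom: (rows - 1) * (columns - 1), i.e. 1 for a 2x2 table
print(expected)  # expected frequencies under independence, same shape as the observed table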
To check whether the p-value is less than our alpha level (0.05), we print it:
print(p)
The output of the above command will be:
0.0000000000004257
All I need to do now is compare the p-value with my alpha level to draw a conclusion. The value above is far smaller than 0.05 (our alpha level), so the result is significant. Hence, we reject the null hypothesis and conclude that there is a relationship between obesity and heart failure.
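To tie the pieces together, here is a minimal end-to-end sketch; the DataFrame below is built from made-up counts purely so the snippet runs on its own, whereas in the article df comes from an actual heart failure dataset with 'obesity' and 'heart_failure' columns.

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical patient-level data, illustration only
df = pd.DataFrame({
    'obesity':       ['yes'] * 100 + ['no'] * 100,
    'heart_failure': ['yes'] * 60 + ['no'] * 40 + ['yes'] * 30 + ['no'] * 70,
})

# Contingency table of raw counts (no margins, since the test needs only observed counts)
observed = pd.crosstab(df['obesity'], df['heart_failure'])

chi, p, dof, expected = chi2_contingency(observed)

alpha = 0.05
print(chi, p, dof)
if p < alpha:
    print("Reject the null hypothesis: the variables appear to be associated.")
else:
    print("Fail to reject the null hypothesis: no evidence of association.")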
As you have seen, calculating a chi-square test by hand takes quite a bit of time and effort, whereas Python can do the same work with a single function call, which is much simpler and more efficient. In this article, I discussed what the chi-square test is, the different types of chi-square tests, and how to perform one on sample data. We also saw how to carry out the same computation in Python with a single function, saving time and effort.
Q1. What is a Chi-Square Test and why is it important in data analysis?
A. A Chi-Square Test is a statistical method used to determine if there is a significant association between categorical variables. It's crucial in data analysis as it helps identify relationships and patterns within data sets, aiding decision-making and hypothesis testing.
Q2. How do I perform a Chi-Square Test in Python?
A. Performing a Chi-Square Test in Python involves using libraries like scipy and pandas. You can use functions such as scipy.stats.chisquare() or scipy.stats.chi2_contingency() to conduct the test on categorical data, enabling hands-on analysis of relationships and dependencies.
Q3. What prerequisites do I need before conducting Chi-Square Tests in Python?
A. Before conducting Chi-Square Tests in Python, one should have a basic understanding of Python programming and familiarity with libraries like pandas and scipy. Additionally, knowledge of categorical data and the concepts of statistical hypothesis testing is beneficial for effective utilization.
Q4. What are the steps to perform a Chi-Square Test?
A. To perform a Chi-Square Test: define the hypotheses, collect categorical data, create a contingency table, calculate the expected frequencies, compute the Chi-Square statistic, find the critical value, compare the two, and interpret the result.