A Brief Introduction to Survival Analysis and Kaplan Meier Estimator

DEBYENDU Last Updated : 28 Nov, 2023

7 min read

Introduction

Survival analysis is a statistical method essential for analyzing time-to-event data, widely employed in medical research, economics, and various scientific disciplines. At the core of survival analysis is the Kaplan-Meier estimator, a powerful tool for estimating survival probabilities over time. This article provides a concise introduction to survival analysis, unraveling its significance and applications. We delve into the fundamentals of the Kaplan-Meier estimator, exploring its role in handling censored data and creating survival curves. Whether you’re new to survival analysis or seeking a refresher, this guide navigates through the key concepts, making this statistical approach accessible and comprehensible.

This article was published as a part of the Data Science Blogathon.

Introduction
What is Survival Analysis?
Censoring/ Censored Observation
Kaplan Meier Estimator
Assumptions of Kaplan Meier Survival
Important things to consider for Kaplan Meier Estimator Analysis
Let us take the example in Python
Advantages & Dis-Advantages of Kaplan Meier Estimator
Conclusion
Frequently Asked Questions

What is Survival Analysis?

Survival analysis explores the time until an event occurs, answering questions about failure rates, disease progression, or recovery duration. It’s a crucial statistical field, involving terms like event, time, censoring, and various methods like Kaplan-Meier curves, Cox regression models, and log-rank tests for group comparisons. This branch delves into modeling time-to-event data, offering insights into diverse scenarios, from medical diagnoses to mechanical system failures. Understanding survival analysis requires defining a specific time frame and employing various statistical tools to analyze and interpret data effectively.

Event, when we talk about, is the activity which is going on or is going to happen in the survival analysis study like the Death of a Person from a particular disease, time to get cure by a medical diagnose, time to get cured by vaccines, time of occurrence of failure of machines in the manufacturing shop floor, time for diseases occurrence, etc.

Time

in survival analysis case study is the time from the beginning of the survival analysis observation on the subject matter till the time when the event is going to occur. Like in the case of Mechanical Machine to a failure we need to know the

(a) time of an event when the machine will start
(b) when the machine will fail
(c) loss of machine or the shutdown of the machine from the survival analysis study.

Censoring/ Censored Observation

This terminology is defined as if the subject matter on which we are doing the study of survival analysis doesn’t get affected by the defined event of study, then they are described as censored. The censored subject might also not have an event after the end of the survival analysis observation. The subject is called censored in the sense that nothing was observed out of the subject after the time of censoring.

Censoring Observation are also of 3 types-

1. Right Censored

Right censoring is used in many problems. It happens when we are not certain what happened to people after a certain point in time.

It occurs when the true event time is greater than the censored time when c < t. This happens if either some people cannot be followed the entire time because they died or were lost to follow up or withdrew from the study.

2. Left Censored

Left censoring is when we are not certain what happened to people before some point in time. Left censoring is the opposite, occurring when the true event time is less than the censored time when c > t.

3. Interval Censored

Interval censoring is when we know something has happened in an interval (not before starting time and not after ending time of the study) but we do not know exactly when in the interval it happened.

Interval censoring is a concatenation of the left and right censoring when the time is known to have occurred between two-time points

Survival Function S (t): This is a probability function that depends on the time of the study. The subject survives more than time t. The Survivor function gives the probability that the random variable T exceeds the specified time t.

Here, we will discuss the Kaplan Meier Estimator.

Kaplan Meier Estimator

Kaplan Meier Estimator is used to estimate the survival function for lifetime data. It is a non-parametric statistics technique. It is also known as the product-limit estimator, and the concept lies in estimating the survival time for a certain time of like a major medical trial event, a certain time of death, failure of the machine, or any major significant event.

There are lots of examples like

Failure of machine parts after several hours of operation.
How much time it will take for COVID 19 vaccine to cure the patient.
How much time is required to get a cure from a medical diagnosis etc.
To estimate how many employees will leave the company in a specific period of time.
How many patients will get cured by lung cancer

To Estimate the Kaplan Meier Survival we first need to estimate the Survival Function S (t) is the probability of event time t

Where (d) are the number of death events at the time (t), and (n) is the number of subjects at risk of death just prior to the time (t).

Assumptions of Kaplan Meier Survival

In real-life cases, we do not have an idea of the true survival rate function. So in Kaplan Meier Estimator we estimate and approximate the true survival function from the study data. There are 3 assumptions of Kaplan Meier Survival

Survival Probabilities are the same for all the samples who joined late in the study and those who have joined early. The Survival analysis which can affect is not assumed to change.
Occurrence of Event are done at a specified time.
Censoring of the study does not depend on the outcome. The Kaplan Meier method doesn’t depend on the outcome of interest.

Interpretation of Survival Analysis is Y-axis shows the probability of subject which has not come under the case study. The X-axis shows the representation of the subject’s interest after surviving up to time. Each drop in the survival function (approximated by the Kaplan-Meier estimator) is caused by the event of interest happening for at least one observation.

The plot is often accompanied by confidence intervals, to describe the uncertainty about the point estimates-wider confidence intervals show high uncertainty, this happens when we have a few participants- occurs in both observations dying and being censored.

Important things to consider for Kaplan Meier Estimator Analysis

We need to perform the Log Rank Test to make any kind of inferences.
Kaplan Meier’s results can be easily biased. The Kaplan Meier is a univariate approach to solving the problem
Removal of Censored Data will cause to change in the shape of the curve. This will create biases in model fit-up
Statistical tests and observations become mislead if the Dichotomizing of Continuous Variable is performed.
By dichotomizing means we take statistical measures such as median to create groups but this may lead to problems in the data set.

Let us take the example in Python

Link to Notebook- (https://drive.google.com/file/d/1VGKZNViDbx4rx_7lGMCA6dgU3XuMKGVU/view?usp=sharing)

Let us import the important library required to work in python

First, we are importing different python libraries for our work. Here, we are taking the lung-cancer data set. After the libraries and load, we will read the data using the pandas library. The data set contains different information

Treatment 1=standard, 2=test, Cell type 1=squamous, 2=small
cell, 3=adeno, 4=large, Survival in days, Status 1=dead, 0=censored, Karnofsky score (a measure of general performance, 100=best), Months from Diagnosis, Age in years Prior therapy 0=no, 10=yes, etc.

Here we see the Head &tail.

Now, Here we import the python code for performing the Kaplan Meier Estimator

Here, we perform the analysis on the Karnofsky score it x-axis depicts the timeline and the y-axis shows the score. The best score is 1 it means the subject is fit, a score of 0 means the worst score.

Then we apply the code for Survival, Prior Therapy, the treatment here we will do the Kaplan Meier Estimator Analysis.

Then, we fit up kmf1 = KaplanMeierFitter() for fitting up the Kaplan Meier function and we run the following code for different data related to the lung cancer problems.

Kaplan Meier estimator after running the code shows the plot between Treatment test standard &Treatment test.

In this paper, my key objective was to explain the Survival Analysis with Kaplan Meier Estimator. The things related to it and a problem description in real life.

Advantages & Dis-Advantages of Kaplan Meier Estimator

Advantages

Does not require too many features- time to the survival analysis event is only required.
Provides an average overview related to the event.

Disadvantages

Lots of variables cannot be correlated and monitor simultaneously.
If censoring data is removed the model will get biased at the time of fitting.
The proper estimation of the magnitude of change in the event cannot be predicted.

Conclusion

In conclusion, understanding survival analysis and the Kaplan-Meier estimator opens a gateway to nuanced insights in time-to-event data. The significance of this statistical method transcends disciplines, providing invaluable tools for researchers and analysts. The Kaplan-Meier estimator, with its ability to handle censored data and generate survival curves, stands as a cornerstone in unraveling temporal patterns. As we navigate the diverse applications and implications of survival analysis, this brief guide serves as a stepping stone for both beginners and seasoned analysts, fostering a deeper appreciation for the intricacies of time-dependent phenomena in various fields. Embrace the power of survival analysis and the Kaplan-Meier estimator to decipher the narratives hidden within temporal data.

Frequently Asked Questions

Q1. What are the assumptions of Kaplan-Meier test?

A. The assumptions of Kaplan-Meier test include non-informative censoring, independence of survival times, and no competing risks, ensuring accurate survival estimates in the presence of censored data.

Q2. What are the assumptions for survival analysis?

A. Survival analysis assumes censoring is non-informative, survival times are independent, and the hazard function is constant over time, ensuring reliable results in analyzing time-to-event data.

Q3. What are the requirements for Kaplan-Meier analysis?

A. Kaplan-Meier analysis requires time-to-event data, information on event occurrence, and censored observations. It is suitable for analyzing survival or failure times in medical or experimental studies.

Q4. What are the limitations of Kaplan-Meier survival analysis?

A. Limitations include the inability to handle time-varying covariates, assuming proportional hazards, and sensitivity to the timing of events, affecting its applicability in certain scenarios

DEBYENDU

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

A Brief Introduction to Survival Analysis and Kaplan Meier Estimator

Introduction

Table of contents

What is Survival Analysis?

Time

Censoring/ Censored Observation

1. Right Censored

2. Left Censored

3. Interval Censored

Kaplan Meier Estimator

Assumptions of Kaplan Meier Survival

Important things to consider for Kaplan Meier Estimator Analysis

Let us take the example in Python

Advantages & Dis-Advantages of Kaplan Meier Estimator

Conclusion

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM