SciPy - Cox Proportional Hazards Model Survival Analysis



The Cox Proportional Hazards Model is a popular statistical method used for survival analysis. It helps estimate the effect of various variables on the time it takes for an event such as failure or death to occur. This model is particularly valuable when dealing with censored data where some individuals may not have experienced the event by the end of the study.

Although SciPy doesn't have a built-in Cox Proportional Hazards model, the lifelines library which is based on SciPy, offers comprehensive support for survival analysis, including the Cox Proportional Hazards model.

Steps of Cox Proportional Hazards in Python using lifelines

Following are the steps that need to be followed to implement the Cox Proportional Hazards in Python using lifelines −

Install Lifeline Library

First we have to install the lifeline library by executing the below command in the command prompt, if haven't installed before −

pip install lifelines

Importing the Libraries

After that we have to import all the necessary libraries −

import pandas as pd
import numpy as np
from lifelines import CoxPHFitter

Prepare our Data

For the Cox Proportional Hazards model our dataset should include at least two components which are mentioned as follows −

  • Duration, which is the time until the event or censoring occurs.
  • Event/Censoring Indicator, 1 if the event occurred, 0 if the observation was censored.

Below is the example dataset to implement the Cox Proportional Hazards Model

# Example dataset
data = {
    'age': [60, 65, 70, 80, 85],
    'sex': [1, 0, 1, 1, 0],  # 1 for male, 0 for female
    'duration': [5, 6, 7, 8, 9],  # Duration in years
    'event': [1, 0, 1, 1, 0]  # 1 for event (e.g., death), 0 for censored
}
# Create a DataFrame
df = pd.DataFrame(data)

Fit the ox Proportional Hazards Model

Here we will display a summary of the model with estimated coefficients, standard errors, z-scores and p-values for each predictor variable which helps us to understand their effects on the hazard ratio.

# Instantiate the Cox Proportional Hazards model
cph = CoxPHFitter()

# Fit the model with the dataset
cph.fit(df, duration_col='duration', event_col='event')

# Print the model summary
cph.print_summary()

Making Predictions

Once the model is fitted we can predict survival functions for new data or calculate the cumulative hazard.

# Predict the survival function for a new individual
new_data = pd.DataFrame({
    'age': [75],
    'sex': [1],
})

# Predict survival for the new individual
survival_function = cph.predict_survival_function(new_data)
print(survival_function)

Example

Following is the example which simulates a dataset with different risk factors such as age, gender, smoking status and analyzes their effect on survival time −

import pandas as pd
from lifelines import CoxPHFitter

# Simulated dataset: Survival data with age, sex, and smoking status
data = {
    'age': [60, 62, 65, 70, 72, 75, 80, 85],      # Age of individuals
    'sex': [1, 0, 1, 1, 0, 1, 0, 1],              # 1 for male, 0 for female
    'smoking_status': [1, 0, 1, 0, 1, 0, 1, 0],   # 1 for smoker, 0 for non-smoker
    'duration': [5, 6, 7, 8, 9, 5, 6, 7],         # Survival time in years
    'event': [1, 1, 0, 1, 1, 0, 1, 1]             # 1 for event (e.g., death), 0 for censored
}

# Create DataFrame
df = pd.DataFrame(data)

# Instantiate the Cox Proportional Hazards model
cph = CoxPHFitter()

# Fit the model with the dataset
cph.fit(df, duration_col='duration', event_col='event')

# Print the model summary to analyze the results
cph.print_summary()

# Predict the survival function for a new individual (e.g., 70-year-old smoker)
new_data = pd.DataFrame({
    'age': [70],
    'sex': [1],          # Male
    'smoking_status': [1],  # Smoker
})

# Predict survival for the new individual
survival_function = cph.predict_survival_function(new_data)

# Print the survival function for the new individual
print(survival_function)

Here is the output of the above example −

< lifelines.CoxPHFitter: fitted with 8 total observations, 2 right-censored observations >
             duration col = 'duration'
                event col = 'event'
      baseline estimation = breslow
   number of observations = 8
number of events observed = 6
   partial log-likelihood = -7.39
         time fit was run = 2025-01-31 12:06:31 UTC

---
                coef exp(coef)  se(coef)  coef lower 95%  coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
covariate
age            -0.02      0.98      0.06           -0.14            0.11                0.87                1.12
sex            -0.23      0.79      1.06           -2.31            1.84                0.10                6.32
smoking_status -0.55      0.58      1.03           -2.57            1.47                0.08                4.33

                cmp to     z    p  -log2(p)
covariate
age               0.00 -0.26 0.80      0.33
sex               0.00 -0.22 0.83      0.28
smoking_status    0.00 -0.54 0.59      0.76
---
Concordance = 0.47
Partial AIC = 20.79
log-likelihood ratio test = 0.33 on 3 df
-log2(p) of ll-ratio test = 0.07

            0
5.0  0.918415
6.0  0.734865
7.0  0.610552
8.0  0.435392
9.0  0.191870
Advertisements