Hello

Who am I

Associate Staff Biostatistician at the Cleveland Clinic in the Department of Quantitative Health Sciences and the Taussig Cancer Institute.

Applied cancer biostatistics and methods research in early phase oncology clinical trial design and methods for retrospective data analysis.

Why am I here

Sometimes, despite the actual focus of your career, you end up best known for a website post on survival analysis in R that you made for an internal training back in 2018 🤷.

Full tutorial is available on my website.

What are survival data anyway?

Examples from cancer

  • Time from diagnosis to death
  • Time from surgery to recurrence of disease
  • Time from start of treatment to progression of disease
  • Time from response to recurrence of disease

Examples from other fields

  • Time from HIV infection to development of AIDS
  • Time from diagnosis with heart disease to heart attack
  • Time from discharge from rehabilitation facility to recurrence of substance abuse
  • Time from birth to initiation of sexual activity
  • Time from production to machine malfunction

A rose by any other name…

Because time-to-event data are common in many fields, they also go by names other than survival analysis, including:

  • Reliability analysis
  • Duration analysis
  • Event history analysis
  • Time-to-event analysis

What is censoring?

Censoring occurs when the event of interest is not observed by the end of a subject's follow-up period.

But isn’t this just binary data??

  • Binary data doesn’t have the ability to change depending on the time of analysis, e.g. 5-year survival will have the same value whether it is analyzed at 5 years and 1 day, 5 years and 2 days, 6 years, etc. Either a participant died by 5 years or they didn’t.

  • Time-to-event data may have different values depending on the time of analysis, e.g. overall survival will have different values depending on whether it is analyzed at 5 years and 1 day or at 6 years, since additional participants can die between those two time points.

Right censoring example

Reasons for censoring

A subject may be censored due to:

  • Loss to follow-up
  • Withdrawal from study
  • No event by end of fixed study period

Other types of censoring

  • Left censoring: when the event or censoring occurred before the study started or before data were collected

  • Interval censoring: when the event or censoring occurred between two dates, but the exact time is not known

    • Common in cancer studies where, for example, disease recurrence can only be detected by imaging, but the actual recurrence is known to have developed some time between the prior negative imaging and the current positive imaging

Today we will focus only on right censoring.

Recall this plot

Censoring must be considered in the analysis

How would we compute the proportion who are event-free at 15 years?

  • Subjects 7, 8, and 10 had the event before 15 years
  • Subjects 1 and 9 were censored before 15 years
  • The remaining subjects were event-free and still being followed at 15 years

And how would you compute the median time-to-event when the event time is unknown for some patients?

Additional reasons for survival analysis

  • Distribution of follow-up times is skewed
  • Distribution may differ between censored and event patients
  • Follow-up times are always positive

Data components for survival analysis

To analyze survival data, we need to know the observed time (\(Y_i\)) and the event indicator (\(\delta_i\)). For a subject (denoted by \(i\)):

  • Observed time is the minimum of the event time (\(T_i\)) and censoring time (\(C_i\)) (\(Y_i = \min(T_i, C_i)\))
  • Event indicator (\(\delta_i\)) is 1 if the event is observed (i.e. \(T_i \leq C_i\)) and 0 if censored (i.e. \(T_i > C_i\))
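
To make these definitions concrete, here is a minimal simulated sketch (hypothetical values, not from the example dataset) of how \(Y_i\) and \(\delta_i\) are constructed from latent event and censoring times:

# Simulated sketch: construct observed time and event indicator
set.seed(123)
event_time <- rexp(5, rate = 0.10)   # T_i: latent event times
censor_time <- rexp(5, rate = 0.05)  # C_i: latent censoring times
y <- pmin(event_time, censor_time)             # Y_i = min(T_i, C_i)
delta <- as.numeric(event_time <= censor_time) # delta_i = 1 if event observed, 0 if censored
data.frame(y, delta)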

Load needed packages

library(dplyr)
library(ggplot2)
library(lubridate)
library(survival)
library(ggsurvfit)
library(gtsummary)

Load example data

To access the example data used throughout this talk, install and load the cancersimdata package from my GitHub repo:

# If needed, install the remotes package first
install.packages("remotes")

# Then install the GitHub repository for the dataset
remotes::install_github("zabore/cancersimdata")

# Finally, load the package
library(cancersimdata)

Example data background

bc_rt_data is a synthetic dataset based on real breast cancer data. The dataset contains information on 3000 women with T1-2N1M0 breast cancer who had a mastectomy between 1995 and 2015.

The original study examined the association between post-mastectomy radiation therapy and disease recurrence.

Sittenfeld SMC, Zabor EC, …, Tendulkar RD. A multi-institutional prediction model to estimate the risk of recurrence and mortality after mastectomy for T1-2N1 breast cancer. Cancer. 2022 Aug 15;128(16):3057-3066. doi: 10.1002/cncr.34352. Epub 2022 Jun 17. PMID: 35713598; PMCID: PMC9539507.

Example data contents

Relevant variables include:

  • rt: PMRT indicator, 1 = yes, 0 = no
  • os_event: Death indicator, 1 = dead, 0 = censored
  • date_of_mastectomy: Date of mastectomy
  • date_last_follow_up_death: Date of last follow-up or death

Example data format

bc_rt_data |> 
  select(rt, os_event, date_of_mastectomy, date_last_follow_up_death) |> 
  print(n = 10)
# A tibble: 3,000 × 4
      rt os_event date_of_mastectomy date_last_follow_up_death
   <dbl>    <dbl> <chr>              <chr>                    
 1     0        0 09/11/2004         07/24/2009               
 2     0        0 08/06/2011         11/07/2016               
 3     0        0 06/10/1998         11/04/2005               
 4     1        0 10/06/2013         04/30/2019               
 5     1        0 11/22/2002         01/17/2008               
 6     0        0 01/31/2013         01/12/2019               
 7     1        0 05/03/2014         08/23/2016               
 8     0        0 02/16/1998         11/12/2004               
 9     1        0 03/04/2014         02/07/2020               
10     1        1 03/06/1998         09/08/2003               
# ℹ 2,990 more rows

Event indicator

It is important to pay attention to the format of the event indicator.

The Surv() function in the survival package accepts by default TRUE/FALSE, where TRUE is event and FALSE is censored; 1/0 where 1 is event and 0 is censored; or 2/1 where 2 is event and 1 is censored. Please take care to ensure the event indicator is properly formatted.

Here we see that the documentation stipulates 1 is event (death) and 0 is censored.
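
As a quick illustration (a sketch with made-up values, separate from the example dataset), the three accepted codings produce identical survival objects:

# The same three subjects coded three ways; all print as 2.3, 4.1+, 5.0
library(survival)
time <- c(2.3, 4.1, 5.0)
Surv(time, c(1, 0, 1))            # 1 = event, 0 = censored
Surv(time, c(TRUE, FALSE, TRUE))  # TRUE = event, FALSE = censored
Surv(time, c(2, 1, 2))            # 2 = event, 1 = censored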

Start and end dates

Here, the start and end dates are in the dataset, formatted as character variables.

We need to:

  1. Convert them to date format
  2. Calculate the follow-up times

Formatting dates

The lubridate package offers a more comprehensive and user-friendly suite of functions for date manipulation than base R. See the lubridate website for details.

The mdy() function converts character values ordered as month, day, year:

bc_rt_data <- 
  bc_rt_data |> 
  mutate(
    date_of_mastectomy = mdy(date_of_mastectomy), 
    date_last_follow_up_death = mdy(date_last_follow_up_death)
  )

Formatted dates

bc_rt_data |> 
  select(rt, os_event, date_of_mastectomy, date_last_follow_up_death) |> 
  print(n = 10)
# A tibble: 3,000 × 4
      rt os_event date_of_mastectomy date_last_follow_up_death
   <dbl>    <dbl> <date>             <date>                   
 1     0        0 2004-09-11         2009-07-24               
 2     0        0 2011-08-06         2016-11-07               
 3     0        0 1998-06-10         2005-11-04               
 4     1        0 2013-10-06         2019-04-30               
 5     1        0 2002-11-22         2008-01-17               
 6     0        0 2013-01-31         2019-01-12               
 7     1        0 2014-05-03         2016-08-23               
 8     0        0 1998-02-16         2004-11-12               
 9     1        0 2014-03-04         2020-02-07               
10     1        1 1998-03-06         2003-09-08               
# ℹ 2,990 more rows

Calculating follow-up times

%--% is a special operator that creates an interval from a specific instant to another instant.

Dividing by dyears(1) converts the interval to the years scale.

bc_rt_data <-
  bc_rt_data |> 
  mutate(
    os_years = (date_of_mastectomy %--% date_last_follow_up_death) / dyears(1)
  )


Note that os_years is already a variable in this dataset; we are simply overwriting it here with our own calculation, for demonstration purposes.

Checking follow-up times

In real-world data it is common to encounter errors, such as end dates that come before start dates.

Typically as a quick check, I look at the numeric and visual distribution of follow-up times.

Checking follow-up times

summary(bc_rt_data$os_years)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.02464  4.46749  6.35729  6.56711  8.55852 21.26489 
ggplot(bc_rt_data, aes(x = os_years, fill = factor(os_event))) +
  geom_histogram(bins = 25, alpha = 0.5, position = "identity")
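
As an additional sanity check (a sketch, not part of the original code), one could count follow-up times that are zero or negative, which would flag end dates on or before the start date:

# Count any non-positive follow-up times (should be zero in clean data)
bc_rt_data |> 
  filter(os_years <= 0) |> 
  nrow()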

The survival package

The basis of the survival ecosystem in R.


  • Began development in 1985
  • Total of 11.9M downloads
  • Active development ongoing
  • Many detailed vignettes covering both the basics and advanced topics
  • Includes the essential methods

The gtsummary package

Create highly customizable tables, see https://www.danieldsjoberg.com/gtsummary/ for details.

Use tbl_survfit() to create:

  • Tables of median event-free time
  • Tables of x-time event-free probability

Use tbl_uvregression() to create:

  • Tables of univariate Cox regression results

Use tbl_regression() to create:

  • Tables of multivariable Cox regression results

The ggsurvfit package

Uses ggplot2 as the basis, so familiar ggplot2 customizations can be added with the + operator. See https://www.danieldsjoberg.com/ggsurvfit/ for details.

  • Plot Kaplan-Meier curves using ggsurvfit()
  • Plot cumulative incidence curves for competing risks using ggcuminc()
  • Plot multi-state models using ggsurvfit()
  • Options to add confidence intervals, risk tables, quantiles, and more

Packages, in summary

Survival object

The Surv() function creates a survival object for use as the response in a model formula. There is one value for each subject that is the survival time, followed by a + if the subject was censored.

 [1] 4.865161+ 5.256674+ 7.403149+ 5.563313+ 5.152635+ 5.946612+ 2.308008+
 [8] 6.737851+ 5.930185+ 5.508556 

We see that the first 9 subjects were censored at various times, and the 10th subject had an event at 5.5 years.
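
For reference, a sketch of the call that prints the first ten elements of the survival object for the example data (assuming the os_years and os_event variables created above):

# Print the first 10 elements of the survival object
Surv(bc_rt_data$os_years, bc_rt_data$os_event)[1:10]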

Kaplan-Meier

The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.
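
For reference, the Kaplan-Meier estimate of the survival function at time \(t\) is the product over the observed event times \(t_i \leq t\):

\[\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\]

where \(d_i\) is the number of events at time \(t_i\) and \(n_i\) is the number still at risk just before \(t_i\).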

Survival curves

The survfit() function creates survival curves using the Kaplan-Meier method based on a formula.

s1 <- survfit(Surv(os_years, os_event) ~ 1, data = bc_rt_data)

Some key components of this survfit object that will be used to create survival curves include:

  • time: the timepoints at which the curve has a step, i.e. at least one event occurred
  • surv: the survival probability estimate at the corresponding time
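
A quick sketch of peeking at these components directly from the fitted object:

# First few step times and the corresponding survival estimates
head(s1$time)
head(s1$surv)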

Kaplan-Meier curves with ggsurvfit

The ggsurvfit package works best if you create the survfit object using the included survfit2() function, which uses the same syntax as what we saw previously with survfit().

survfit2() tracks the environment from the function call, which allows the plot to have better default values for labeling and p-value reporting.

s2 <- survfit2(Surv(os_years, os_event) ~ 1, data = bc_rt_data)

Plotting the curves

s2 |> 
  ggsurvfit() +
  labs(
    x = "Years from mastectomy",
    y = "Overall survival probability"
  ) + 
  scale_y_continuous(
    limits = c(0, 1)) + 
  scale_x_continuous(
    limits = c(0, 15), 
    breaks = seq(0, 15, 5))

Add confidence intervals

s2 |> 
  ggsurvfit() +
  labs(
    x = "Years from mastectomy",
    y = "Overall survival probability"
  ) + 
  scale_y_continuous(
    limits = c(0, 1)) + 
  scale_x_continuous(
    limits = c(0, 15), 
    breaks = seq(0, 15, 5)) + 
  add_confidence_interval()

Add risk table

s2 |> 
  ggsurvfit() +
  labs(
    x = "Years from mastectomy",
    y = "Overall survival probability"
  ) + 
  scale_y_continuous(
    limits = c(0, 1)) + 
  scale_x_continuous(
    limits = c(0, 15), 
    breaks = seq(0, 15, 5)) + 
  add_confidence_interval() + 
  add_risktable(
    risktable_stats = "n.risk")

Survival curves: common mistakes

  1. Plotting the y-axis on a scale other than 0, 1
  2. Plotting the x-axis beyond the limit of reasonable confidence
  3. Adding too much extra information to plot face, e.g. hazard ratios, p-values, median survival time, etc
  4. Using default/non-descriptive axis labels
  5. Ignoring negative or missing survival times

Estimating x-time survival

One quantity of interest in a survival analysis is the probability of surviving beyond a certain point in time (x).

For example, to estimate the probability of surviving to 10 years, use summary() with the times argument.

summary(survfit(Surv(os_years, os_event) ~ 1, data = bc_rt_data), times = 10)
Call: survfit(formula = Surv(os_years, os_event) ~ 1, data = bc_rt_data)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   10    386     501    0.727  0.0125        0.703        0.752

We find that the 10-year probability of survival in this study is 73%.

The associated lower and upper bounds of the 95% confidence interval are also displayed.

What does x-time survival mean?

The 10-year survival probability is the point on the y-axis that corresponds to 10 years on the x-axis for the survival curve.

x-time survival: common mistakes

Using a “naive” estimate:

501 of the 3000 patients in the data died by 10 years so the “naive” estimate is calculated as:

\[\Big(1 - \frac{501}{3000}\Big) \times 100 = 83\%\]

You get an incorrect estimate of the 10-year probability of survival when you ignore the fact that 2113 patients were censored before 10 years.

Recall the correct estimate of the 10-year probability of survival, accounting for censoring using the Kaplan-Meier method, was 73%.
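
For completeness, a sketch of this naive calculation done directly in R (assuming the os_years and os_event variables created above):

# Naive (incorrect) 10-year survival: ignores censoring before 10 years
n_dead_by_10 <- sum(bc_rt_data$os_years <= 10 & bc_rt_data$os_event == 1)
(1 - n_dead_by_10 / nrow(bc_rt_data)) * 100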

Illustration: overestimation using naive estimate

Ignoring censoring leads to an overestimate of the overall survival probability. Censored subjects only contribute information for a portion of the follow-up time, and then fall out of the risk set, pulling down the cumulative probability of survival. Ignoring censoring erroneously treats patients who are censored as part of the risk set for the entire follow-up period.

Table of x-time survival probability

We can produce nice tables of x-time survival probability estimates using the tbl_survfit() function from the gtsummary package:

survfit(Surv(os_years, os_event) ~ 1, 
        data = bc_rt_data) |> 
  tbl_survfit(
    times = 10,
    label_header = 
      "**10-year survival (95% CI)**"
  )
Characteristic 10-year survival (95% CI)
Overall 73% (70%, 75%)

Estimating median survival time

Another quantity of interest in a survival analysis is the average survival time, which we quantify using the median.

Note that survival times are not expected to be normally distributed so the mean is not an appropriate summary.

We can obtain the median survival directly from the survfit object:
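
The output below comes from simply printing the survfit object (e.g., the s1 object created earlier, or the equivalent call):

survfit(Surv(os_years, os_event) ~ 1, data = bc_rt_data)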

Call: survfit(formula = Surv(os_years, os_event) ~ 1, data = bc_rt_data)

        n events median 0.95LCL 0.95UCL
[1,] 3000    526     NA      NA      NA

We see the median survival time is NA, which means that it has not yet been reached in this study. This is common when the event rate is low or follow-up time is short.

What does median survival mean?

Median survival is the time on the x-axis corresponding to a survival probability of 0.5 on the y-axis. Here we see there is no time where the horizontal line at 0.5 meets the survival curve.

Median survival: common mistakes

Using a “naive” estimate.

Summarize the median survival time among the 526 patients who died:
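
A sketch of this naive calculation using dplyr (restricted to patients with an observed event):

# Naive (incorrect) median: uses only the observed event times
bc_rt_data |> 
  filter(os_event == 1) |> 
  summarize(median_surv = median(os_years))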

# A tibble: 1 × 1
  median_surv
        <dbl>
1        4.79

You get an incorrect estimate of median survival time of 4.8 years when you ignore the fact that censored patients also contribute follow-up time.

Recall the correct estimate of median survival was not reached.

Illustration: underestimation of median survival

Ignoring censoring leads to an underestimate of median survival time because the follow-up time that censored patients contribute is excluded, and the risk set is artificially small.

Table of median survival

We can produce nice tables of median survival time estimates using the tbl_survfit() function from the gtsummary package:

survfit(Surv(os_years, os_event) ~ 1, 
        data = bc_rt_data) |> 
  tbl_survfit(
    probs = 0.5,
    label_header = 
      "**Median survival (95% CI)**"
  )
Characteristic Median survival (95% CI)
Overall — (—, —)


In this case, this table is not informative, but is included for demonstration purposes. In a dataset like this, where median survival is not reached, x-time estimates can be presented instead, sometimes for multiple timepoints.

Comparing survival times between groups

We can conduct between-group significance tests using a log-rank test.


The log-rank test equally weights observations over the entire follow-up time and is the most common way to compare survival times between groups.


There are versions that more heavily weight the early or late follow-up that could be more appropriate depending on the research question (see ?survdiff for different test options).

Conducting the log-rank test

We get the log-rank p-value using the survdiff() function from the survival package. For example, we can test whether there was a difference in survival time according to PMRT:

survdiff(Surv(os_years, os_event) ~ rt, data = bc_rt_data)
Call:
survdiff(formula = Surv(os_years, os_event) ~ rt, data = bc_rt_data)

n=2823, 177 observations deleted due to missingness.

        N Observed Expected (O-E)^2/E (O-E)^2/V
rt=0 1147      233      195      7.55      12.7
rt=1 1676      246      284      5.17      12.7

 Chisq= 12.7  on 1 degrees of freedom, p= 4e-04 

We see that there was a significant difference in overall survival according to PMRT, with p < 0.001.

Add log-rank p-value to Kaplan-Meier plot

survfit2(
  Surv(os_years, os_event) ~ rt, 
  data = bc_rt_data) |> 
  ggsurvfit() +
  labs(
    x = "Years from mastectomy",
    y = "Overall survival probability"
  ) + 
  scale_y_continuous(
    limits = c(0, 1)) + 
  scale_x_continuous(
    limits = c(0, 15), 
    breaks = seq(0, 15, 5)) + 
  add_risktable(
    risktable_stats = "n.risk"
    ) + 
  scale_color_manual(
    values = ccf_palette("contrast"),
    labels = c("No RT", "RT")
  ) +
  add_pvalue()

The Cox regression model

We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables.

The Cox regression model is a semi-parametric model that can be used to fit univariate and multivariable regression models that have survival outcomes.

\[h(t|X_i) = h_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})\]

  • \(h(t)\): the hazard, or instantaneous rate at which events occur
  • \(h_0(t)\): the underlying baseline hazard

Cox regression assumptions

Some key assumptions of the model:

  • non-informative censoring
  • proportional hazards

Note that parametric regression models for survival outcomes are also available, but they won’t be addressed in this tutorial.

How to interpret a hazard ratio

The quantity of interest from a Cox regression model is the hazard ratio (HR): the ratio, between groups, of the hazard, which is the instantaneous rate of occurrence of the event among those still at risk for it.


If you have a regression parameter \(\beta\), then HR = \(\exp(\beta)\).


A HR < 1 indicates reduced hazard of event whereas a HR > 1 indicates an increased hazard of event.
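
As a small worked example (using the PMRT hazard ratio reported later in this talk), a coefficient of \(\beta \approx -0.33\) corresponds to

\[\text{HR} = \exp(-0.33) \approx 0.72,\]

i.e. roughly a 28% reduction in the hazard of death.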

Fitting Cox models

Fit regression models using the coxph() function from the survival package, which takes a Surv() object on the left hand side and has standard syntax for regression formulas in R on the right hand side.

mod1 <- coxph(Surv(os_years, os_event) ~ rt, data = bc_rt_data)
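
To inspect the fit interactively, one can print a summary of the model object (a brief usage sketch):

summary(mod1)  # coefficients, exp(coef) = hazard ratios, confidence intervals, tests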

Table of Cox model results

We can create tables of results using the tbl_regression() function from the gtsummary package, with the exponentiate option set to TRUE to return the hazard ratio rather than the log hazard ratio:

mod1 |> 
  tbl_regression(
    exp = TRUE,
    label = list(rt ~ "PMRT")
    ) 
Characteristic   HR     95% CI       p-value
PMRT             0.72   0.60, 0.86   <0.001
HR = Hazard Ratio, CI = Confidence Interval

HR = 0.72 implies that receipt of PMRT is associated with 0.72 times the hazard of death as compared to no PMRT.

Cox regression: common mistakes

  1. Overfitting. This occurs when there are too few events to support the number of included variables. A common rule of thumb is 10-15 events per degree of freedom.
  2. Interpreting a hazard as a risk; they are related, but they are not the same.
  3. Overlooking the proportional hazards assumption (one way to check this is sketched below)
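
One common check of the proportional hazards assumption uses scaled Schoenfeld residuals via cox.zph() from the survival package (a sketch, applied to the model fit earlier):

# Test and visualize the proportional hazards assumption for mod1
zph <- cox.zph(mod1)
zph        # chi-square test for each covariate and globally
plot(zph)  # smoothed residuals over time; a flat trend supports PH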

Connect with me

zabore2@ccf.org

https://www.emilyzabor.com/

https://github.com/zabore

https://www.linkedin.com/in/emily-zabor-59b902b7/

https://bsky.app/profile/zabore.bsky.social/

Further reading

Clark, T. G., Bradburn, M. J., Love, S. B., & Altman, D. G. (2003). Survival analysis part I: Basic concepts and first analyses. British Journal of Cancer, 89(2), 232-238.

Bradburn, M. J., Clark, T. G., Love, S. B., & Altman, D. G. (2003). Survival analysis part II: Multivariate data analysis – an introduction to concepts and methods. British Journal of Cancer, 89(3), 431-436.

Bradburn, M. J., Clark, T. G., Love, S. B., & Altman, D. G. (2003). Survival analysis part III: Multivariate data analysis – choosing a model and assessing its adequacy and fit. British Journal of Cancer, 89(4), 605-611.

Clark, T. G., Bradburn, M. J., Love, S. B., & Altman, D. G. (2003). Survival analysis part IV: Further concepts and methods in survival analysis. British Journal of Cancer, 89(5), 781-786.
