R-Ladies Philly
2024-12-17
Who am I
Associate Staff Biostatistician at the Cleveland Clinic in the Department of Quantitative Health Sciences and the Taussig Cancer Institute.
Applied cancer biostatistics and methods research in early phase oncology clinical trial design and methods for retrospective data analysis.
Why am I here
Sometimes, despite the actual focus of your career, you end up best known for a website post on survival analysis in R that you made for an internal training back in 2018 🤷.
Full tutorial is available on my website.
Because time-to-event data are common in many fields, this type of analysis also goes by names besides survival analysis, including:
Censoring occurs when the event of interest is not observed after a period of follow-up
Binary data doesn’t have the ability to change depending on the time of analysis, e.g. 5-year survival will have the same value whether it is analyzed at 5 years and 1 day, 5 years and 2 days, 6 years, etc. Either a participant died by 5 years or they didn’t.
Time-to-event data may have different values depending on the time of analysis, e.g. overall survival will have different values depending on whether it is analyzed at 5 years and 1 day or at 6 years, since additional participants can die between those two time points.
A subject may be censored due to:
Left censoring: when the event or censoring occurred before the study started or before data collection began
Interval censoring: when the event or censoring occurred between two dates, but the exact time is not known
Today we will focus only on right censoring.
How would we compute the proportion who are event-free at 15 years?
And how would we compute the median time-to-event when the event time is unknown for some patients?
To analyze survival data, we need to know the observed time (\(Y_i\)) and the event indicator (\(\delta_i\)). For a subject (denoted by \(i\)):
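As a minimal sketch (hypothetical event and censoring times, not the example data), each subject's observed time is the earlier of the true event time and the censoring time, and the event indicator records which one was observed:

```r
# Hypothetical true event times and censoring times for 5 subjects
event_time <- c(2.0, 8.5, 3.2, 10.0, 6.1)
censor_time <- c(5.0, 4.0, 9.0, 7.5, 6.1)

# Observed time Y_i is whichever comes first
y <- pmin(event_time, censor_time)

# Event indicator delta_i: 1 if the event was observed, 0 if censored
delta <- as.numeric(event_time <= censor_time)

data.frame(y, delta)
```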
To access the example data used throughout this talk, install and load the cancersimdata package from my GitHub repo:
bc_rt_data is a synthetic dataset based on real breast cancer data. The dataset contains information on 3000 women with T1-2N1M0 breast cancer who had a mastectomy between 1995 and 2015.
The original study examined the association between post-mastectomy radiation therapy and disease recurrence.
Sittenfeld SMC, Zabor EC, …, Tendulkar RD. A multi-institutional prediction model to estimate the risk of recurrence and mortality after mastectomy for T1-2N1 breast cancer. Cancer. 2022 Aug 15;128(16):3057-3066. doi: 10.1002/cncr.34352. Epub 2022 Jun 17. PMID: 35713598; PMCID: PMC9539507.
Relevant variables include:
# A tibble: 3,000 × 4
rt os_event date_of_mastectomy date_last_follow_up_death
<dbl> <dbl> <chr> <chr>
1 0 0 09/11/2004 07/24/2009
2 0 0 08/06/2011 11/07/2016
3 0 0 06/10/1998 11/04/2005
4 1 0 10/06/2013 04/30/2019
5 1 0 11/22/2002 01/17/2008
6 0 0 01/31/2013 01/12/2019
7 1 0 05/03/2014 08/23/2016
8 0 0 02/16/1998 11/12/2004
9 1 0 03/04/2014 02/07/2020
10 1 1 03/06/1998 09/08/2003
# ℹ 2,990 more rows
It is important to pay attention to the format of the event indicator.
The Surv() function in the survival package accepts by default: TRUE/FALSE, where TRUE is event and FALSE is censored; 1/0, where 1 is event and 0 is censored; or 2/1, where 2 is event and 1 is censored. Take care to ensure the event indicator is properly formatted.
Here we see that the documentation stipulates 1 is event (death) and 0 is censored.
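A minimal sketch (made-up times and statuses, not the example data) showing that the TRUE/FALSE and 1/0 codings produce the same survival object:

```r
library(survival)

times <- c(5, 10, 2)
status_01 <- c(1, 0, 1)            # 1 = event, 0 = censored
status_tf <- c(TRUE, FALSE, TRUE)  # TRUE = event, FALSE = censored

# Both encodings yield the same survival object;
# censored times print with a trailing +
Surv(times, status_01)
Surv(times, status_tf)
```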
Here, the start and end dates are in the dataset, formatted as character variables.
We need to:
The lubridate package offers a more comprehensive and user-friendly suite of functions for date manipulation than base R. See the lubridate website for details.
The mdy() function converts character values ordered month, day, year:
# A tibble: 3,000 × 4
rt os_event date_of_mastectomy date_last_follow_up_death
<dbl> <dbl> <date> <date>
1 0 0 2004-09-11 2009-07-24
2 0 0 2011-08-06 2016-11-07
3 0 0 1998-06-10 2005-11-04
4 1 0 2013-10-06 2019-04-30
5 1 0 2002-11-22 2008-01-17
6 0 0 2013-01-31 2019-01-12
7 1 0 2014-05-03 2016-08-23
8 0 0 1998-02-16 2004-11-12
9 1 0 2014-03-04 2020-02-07
10 1 1 1998-03-06 2003-09-08
# ℹ 2,990 more rows
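As a standalone sketch of the parsing step (using the first two dates from the printout above, assuming lubridate is installed):

```r
library(lubridate)

# Character dates in month/day/year order, as in the raw data
raw_dates <- c("09/11/2004", "07/24/2009")

# mdy() parses them into Date objects
parsed <- mdy(raw_dates)
parsed
```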
%--% is a special operator that creates an interval from one instant to another, and dividing the interval by dyears(1) converts it to the years scale.
Note that os_years is already a variable in this dataset; we are simply overwriting it here with our own calculation, for demonstration purposes.
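A minimal sketch of the calculation using the first subject's dates from the printout above (assuming lubridate is installed); the result is about 4.87 years:

```r
library(lubridate)

start <- mdy("09/11/2004")
end   <- mdy("07/24/2009")

# %--% builds an interval; dividing by dyears(1) expresses it in years
os_years <- (start %--% end) / dyears(1)
os_years
```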
In real-world data it is common to encounter errors, such as end dates that come before start dates.
Typically as a quick check, I look at the numeric and visual distribution of follow-up times.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.02464 4.46749 6.35729 6.56711 8.55852 21.26489
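A sketch of that check with hypothetical follow-up times (not bc_rt_data):

```r
# Hypothetical follow-up times in years; a negative value would flag
# a data error such as an end date before a start date
fu_years <- c(0.02, 4.5, 6.4, 8.6, 21.3, -0.5)

# Numeric distribution
summary(fu_years)

# Flag impossible values
which(fu_years < 0)

# Visual distribution
hist(fu_years, xlab = "Years of follow-up", main = "Follow-up time")
```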
The basis of the survival ecosystem in R.
Create highly customizable tables, see https://www.danieldsjoberg.com/gtsummary/ for details.
Use tbl_survfit() to create:
Use tbl_uvregression() to create:
Use tbl_regression() to create:
Uses ggplot2 as the basis, so known customizations are available with the + operator. See https://www.danieldsjoberg.com/ggsurvfit/ for details.
ggsurvfit
ggcuminc
The Surv() function creates a survival object for use as the response in a model formula. There is one value for each subject: the survival time, followed by a + if the subject was censored.
[1] 4.865161+ 5.256674+ 7.403149+ 5.563313+ 5.152635+ 5.946612+ 2.308008+
[8] 6.737851+ 5.930185+ 5.508556
We see that the first 9 subjects were censored at various times, and the 10th subject had an event at 5.5 years.
The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.
The survfit() function creates survival curves using the Kaplan-Meier method based on a formula.
Some key components of this survfit object that will be used to create survival curves include:
time: the timepoints at which the curve has a step, i.e. at least one event occurred
surv: the survival probability estimate at the corresponding time
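A minimal sketch using the survival package's built-in lung dataset rather than the talk's bc_rt_data:

```r
library(survival)

# Overall Kaplan-Meier fit; ~ 1 means no grouping variable
# (lung codes status as 1 = censored, 2 = dead, one of the accepted formats)
fit <- survfit(Surv(time, status) ~ 1, data = lung)

# Step times and the survival probability estimate at each step
head(fit$time)
head(fit$surv)
```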
The ggsurvfit package works best if you create the survfit object using the included survfit2() function, which uses the same syntax as what we saw previously with survfit(). survfit2() tracks the environment from the function call, which allows the plot to have better default values for labeling and p-value reporting.
One quantity of interest in a survival analysis is the probability of surviving beyond a certain point in time (x).
For example, to estimate the probability of surviving to 10 years, use summary() with the times argument.
Call: survfit(formula = Surv(os_years, os_event) ~ 1, data = bc_rt_data)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
10 386 501 0.727 0.0125 0.703 0.752
We find that the 10-year probability of survival in this study is 73%.
The associated lower and upper bounds of the 95% confidence interval are also displayed.
The 10-year survival probability is the point on the y-axis that corresponds to 10 years on the x-axis for the survival curve.
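The same pattern, sketched with the built-in lung dataset (time in days) rather than bc_rt_data:

```r
library(survival)

fit <- survfit(Surv(time, status) ~ 1, data = lung)

# Survival probability at 1 year (lung's time scale is days)
s1 <- summary(fit, times = 365)
s1$surv   # point estimate
s1$lower  # lower bound of the 95% CI
s1$upper  # upper bound of the 95% CI
```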
Using a “naive” estimate:
501 of the 3000 patients in the data died by 10 years so the “naive” estimate is calculated as:
\[\Big(1 - \frac{501}{3000}\Big) \times 100 = 83\%\]
You get an incorrect estimate of the 10-year probability of survival when you ignore the fact that 2113 patients were censored before 10 years.
Recall the correct estimate of the 10-year probability of survival, accounting for censoring using the Kaplan-Meier method, was 73%.
Ignoring censoring leads to an overestimate of the overall survival probability. Censored subjects only contribute information for a portion of the follow-up time, and then fall out of the risk set, pulling down the cumulative probability of survival. Ignoring censoring erroneously treats patients who are censored as part of the risk set for the entire follow-up period.
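To see the direction of the bias concretely, a sketch with the built-in lung dataset: the naive estimate at a fixed time is never below the Kaplan-Meier estimate, and exceeds it whenever subjects are censored before that time.

```r
library(survival)

t0 <- 365  # evaluate survival at 1 year (lung time is in days)

# Kaplan-Meier estimate, which accounts for censoring
fit <- survfit(Surv(time, status) ~ 1, data = lung)
km_est <- summary(fit, times = t0)$surv

# "Naive" estimate: 1 - deaths by t0 / n, which ignores censoring
# (lung codes status as 2 = dead)
naive_est <- 1 - with(lung, sum(status == 2 & time <= t0)) / nrow(lung)

c(km = km_est, naive = naive_est)
```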
We can produce nice tables of x-time survival probability estimates using the tbl_survfit() function from the gtsummary package:
Another quantity of interest in a survival analysis is the average survival time, which we quantify using the median.
Note that survival times are not expected to be normally distributed so the mean is not an appropriate summary.
We can obtain the median survival directly from the survfit object:
Call: survfit(formula = Surv(os_years, os_event) ~ 1, data = bc_rt_data)
n events median 0.95LCL 0.95UCL
[1,] 3000 526 NA NA NA
We see the median survival time is NA, which means that it has not yet been reached in this study. This is common in scenarios where the event rate is low or follow-up time is short.
Median survival is the time on the x-axis corresponding to a survival probability of 0.5 on the y-axis. Here we see there is no time where the horizontal line at 0.5 meets the survival curve.
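For contrast, a sketch with the built-in lung dataset, where the median is reached:

```r
library(survival)

fit <- survfit(Surv(time, status) ~ 1, data = lung)

# Printing the survfit object reports the median and its 95% CI;
# in lung the median is reached, unlike in bc_rt_data
fit

# Pull the median programmatically from the summary table
med <- summary(fit)$table["median"]
med
```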
Using a “naive” estimate.
Summarize the median survival time among the 526 patients who died:
# A tibble: 1 × 1
median_surv
<dbl>
1 4.79
You get an incorrect estimate of median survival time of 4.8 years when you ignore the fact that censored patients also contribute follow-up time.
Recall the correct estimate of median survival was not reached.
Ignoring censoring leads to an underestimate of median survival time because the follow-up time that censored patients contribute is excluded, and the risk set is artificially small.
We can produce nice tables of median survival time estimates using the tbl_survfit() function from the gtsummary package:
| Characteristic | Median survival (95% CI) |
|---|---|
| Overall | — (—, —) |
In this case the table is not informative, but it is included for demonstration purposes. In a dataset like this, where median survival is not reached, x-time estimates can be presented instead, sometimes for multiple timepoints.
We can conduct between-group significance tests using a log-rank test.
The log-rank test equally weights observations over the entire follow-up time and is the most common way to compare survival times between groups.
There are versions that more heavily weight the early or late follow-up that could be more appropriate depending on the research question (see ?survdiff for the different test options).
We get the log-rank p-value using the survdiff() function from the survival package. For example, we can test whether there was a difference in survival time according to PMRT:
Call:
survdiff(formula = Surv(os_years, os_event) ~ rt, data = bc_rt_data)
n=2823, 177 observations deleted due to missingness.
N Observed Expected (O-E)^2/E (O-E)^2/V
rt=0 1147 233 195 7.55 12.7
rt=1 1676 246 284 5.17 12.7
Chisq= 12.7 on 1 degrees of freedom, p= 4e-04
We see that there was a significant difference in overall survival according to PMRT (p < 0.001).
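The same test, sketched on the built-in lung dataset rather than bc_rt_data:

```r
library(survival)

# Log-rank test of survival by sex in the built-in lung data
sd_fit <- survdiff(Surv(time, status) ~ sex, data = lung)
sd_fit

# Recover the p-value from the chi-square statistic
pval <- pchisq(sd_fit$chisq, df = length(sd_fit$n) - 1, lower.tail = FALSE)
pval
```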
survfit2(
Surv(os_years, os_event) ~ rt,
data = bc_rt_data) |>
ggsurvfit() +
labs(
x = "Years from mastectomy",
y = "Overall survival probability"
) +
scale_y_continuous(
limits = c(0, 1)) +
scale_x_continuous(
limits = c(0, 15),
breaks = seq(0, 15, 5)) +
add_risktable(
risktable_stats = "n.risk"
) +
scale_color_manual(
values = ccf_palette("contrast"),
labels = c("No RT", "RT")
) +
add_pvalue()
We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables.
The Cox regression model is a semi-parametric model that can be used to fit univariate and multivariable regression models that have survival outcomes.
\[h(t|X_i) = h_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})\]
\(h(t)\): hazard, or the instantaneous rate at which events occur
\(h_0(t)\): underlying baseline hazard
Some key assumptions of the model:
Note that parametric regression models for survival outcomes are also available, but they won’t be addressed in this tutorial.
The quantity of interest from a Cox regression model is the hazard ratio (HR). The hazard is the instantaneous rate of occurrence of the event of interest among those still at risk for the event, and the HR compares the hazards of two groups.
If you have a regression parameter \(\beta\), then HR = \(\exp(\beta)\).
A HR < 1 indicates reduced hazard of event whereas a HR > 1 indicates an increased hazard of event.
Fit regression models using the coxph() function from the survival package, which takes a Surv() object on the left-hand side and has standard syntax for regression formulas in R on the right-hand side.
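A minimal sketch with the built-in lung dataset rather than bc_rt_data:

```r
library(survival)

# Univariate Cox model for sex in the built-in lung data
cox_fit <- coxph(Surv(time, status) ~ sex, data = lung)

# Exponentiate the log hazard ratio to get the HR
hr <- exp(coef(cox_fit))
unname(hr)

# HR with 95% CI
summary(cox_fit)$conf.int
```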
We can create tables of results using the tbl_regression() function from the gtsummary package, with the exponentiate option set to TRUE to return the hazard ratio rather than the log hazard ratio:
| Characteristic | HR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| PMRT | 0.72 | 0.60, 0.86 | <0.001 |

¹ HR = Hazard Ratio, CI = Confidence Interval
HR = 0.72 implies that receipt of PMRT is associated with 0.72 times the hazard of death as compared to no PMRT.