install.packages("janitor")
install.packages("dplyr")
install.packages("readr")
install.packages("gt")
install.packages("gtsummary")
Descriptive statistics
It will be common to want some basic tabulations of categorical variables and summaries of continuous variables, to examine our data and make sure they are in line with what we expect, and to report results for papers.
In this section, we will learn some basic descriptive statistics including one-way and two-way contingency tables, summary statistics for continuous variables, and summary statistics by group. We will be using functions from the {janitor}, {dplyr}, and {readr} packages, which we have used previously, and we will also be introducing the {gt} and {gtsummary} packages. First, make sure these packages are installed and loaded in your current R session.
Note that all of these packages are already installed on Posit Workbench on the server
library(janitor)
library(dplyr)
library(readr)
library(gt)
library(gtsummary)
The {gt} package will be used to create nicely styled tables.
Also make sure the breastcancer data are loaded into your current R session, with names standardized using the clean_names()
function from the {janitor} package.
<- read_csv("~/mmedr/breastcancer.csv") |>
mydf clean_names()
One-way frequency table
We can use the tabyl()
function from the {janitor} package to make a basic one-way frequency table.
Let’s say we want a table of grade:
|>
mydf tabyl(grade)
grade n percent
I 362 0.1206667
II 1310 0.4366667
III 1328 0.4426667
Then we could additionally format the percentages using the adorn_pct_formatting()
function:
|>
mydf tabyl(grade) |>
adorn_pct_formatting()
grade n percent
I 362 12.1%
II 1310 43.7%
III 1328 44.3%
And with just three short lines of code, we have a basic one-way contingency table for our categorical variable of interest.
Two-way contingency table
We can similarly use thetabyl()
function from the {janitor} package to make two-way contingency tables by simply adding another variable name, separated by a comma. Let’s say we now want to see grade tabulated by rt:
|>
mydf tabyl(grade, rt)
grade 0 1
I 189 173
II 562 748
III 477 851
And we get a basic two-way contingency table with counts of grade according to the value of rt.
To enhance the appearance of our table, we can use the {gt} package and it’s associated features to customize our two-way contingency table.
See the {gt} package website for many details and features.
Let’s label the row variable and column variable, and add a title:
|>
mydf tabyl(grade, rt) |>
gt(
rowname_col = "grade"
|>
) tab_spanner(
columns = 2:3,
label = "Radiation therapy?"
|>
) tab_stubhead(
label = "Grade"
|>
) tab_header(
title = "Grade according to radiation therapy"
)
Grade according to radiation therapy | ||
---|---|---|
Grade |
Radiation therapy?
|
|
0 | 1 | |
I | 189 | 173 |
II | 562 | 748 |
III | 477 | 851 |
And now you see we have a nicely styled html table with the frequencies of grade according to rt, with row and column headers and a title.
Alternatively, we could display percentages in our table. Let’s say we are interested in row percentages, to see what percentage of subjects received radiation therapy within each level of grade.
|>
mydf tabyl(grade, rt) |>
adorn_percentages(denominator = "row") |>
gt(
rowname_col = "grade"
|>
) fmt_percent(
columns = -1,
decimals = 1
|>
) tab_spanner(
columns = 2:3,
label = "Radiation therapy?"
|>
) tab_stubhead(
label = "Grade"
|>
) tab_header(
title = "Grade according to radiation therapy"
)
Grade according to radiation therapy | ||
---|---|---|
Grade |
Radiation therapy?
|
|
0 | 1 | |
I | 52.2% | 47.8% |
II | 42.9% | 57.1% |
III | 35.9% | 64.1% |
The case_match()
function from the {dplyr} package is useful for recoding the levels of a categorical variable to get more desriptive levels when making tables or presenting summaries, e.g. from 0 to “No” and 1 to “Yes”
To make the table look even nicer, we could use the case_match()
function from the {dplyr} package, which works similarly to case_when()
, and is useful for recoding the levels of a categorical variable. We may want to create more descriptive levels than the 1/0 indicators these variables currently contain.
|>
mydf mutate(
rt_text = case_match(
rt,1 ~ "Yes",
0 ~ "No"
)|>
) tabyl(grade, rt_text) |>
adorn_percentages(denominator = "row") |>
gt(
rowname_col = "grade"
|>
) fmt_percent(
columns = -1,
decimals = 1
|>
) tab_spanner(
columns = 2:3,
label = "Radiation therapy?"
|>
) tab_stubhead(
label = "Grade"
|>
) tab_header(
title = "Grade according to radiation therapy"
)
Grade according to radiation therapy | ||
---|---|---|
Grade |
Radiation therapy?
|
|
No | Yes | |
I | 52.2% | 47.8% |
II | 42.9% | 57.1% |
III | 35.9% | 64.1% |
Summary statistics
We can use the summarize()
function from the {dplyr} package to compute summary statistics.
Let’s compute the mean and standard deviation of age_dx_yrs:
|>
mydf summarize(
avg_age = mean(age_dx_yrs),
sd_age = sd(age_dx_yrs)
)
# A tibble: 1 × 2
avg_age sd_age
<dbl> <dbl>
1 57.3 13.9
Summary statistics by group
And we can easily extend this code to generate summary statistics by group by using the group_by()
function from the {dplyr} package prior to using summarize()
.
The group_by()
function is useful in many settings, and it takes an existing dataframe and converts it to a grouped dataframe where all following operations are done within the group.
Note: It is best practice to ungroup()
when group operations are complete, to avoid unintended results.
For example:
|>
mydf group_by(rt) |>
summarize(
avg_age = mean(age_dx_yrs),
sd_age = sd(age_dx_yrs)
)
# A tibble: 2 × 3
rt avg_age sd_age
<dbl> <dbl> <dbl>
1 0 60.5 13.7
2 1 55.1 13.5
Recall that if there are missing values, these must be accounted for by supplying the appropriate additional argument to the mean()
and sd()
functions. For example, if we want to summarize tumor_size_cm according to grade, we can use:
|>
mydf group_by(grade) |>
summarize(
avg_size = mean(tumor_size_cm, na.rm = TRUE),
sd_size = sd(tumor_size_cm, na.rm = TRUE)
|>
) ungroup()
# A tibble: 3 × 3
grade avg_size sd_size
<chr> <dbl> <dbl>
1 I 2.40 0.937
2 II 2.39 0.983
3 III 2.38 0.999
Summarize multiple variables
You can summarize multiple variables using a single function by name:
|>
mydf summarise_at(
c("age_dx_yrs", "tumor_size_cm"), mean, na.rm = TRUE
)
# A tibble: 1 × 2
age_dx_yrs tumor_size_cm
<dbl> <dbl>
1 57.3 2.39
Or using selection helpers inside vars()
:
|>
mydf summarise_at(
vars(age_dx_yrs:tumor_size_cm), mean, na.rm = TRUE
)
# A tibble: 1 × 2
age_dx_yrs tumor_size_cm
<dbl> <dbl>
1 57.3 2.39
You can also select all variables of a certain type, like all numeric variables:
|>
mydf summarise_if(is.numeric, mean, na.rm = TRUE)
# A tibble: 1 × 12
time event rt age_dx_yrs tumor_size_cm n_ln_pos_3_vs_1or2 nodal_ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.07 0.813 0.591 57.3 2.39 0.156 0.300
# ℹ 5 more variables: lvi <dbl>, er_or_pr_pos <dbl>, her2_pos <dbl>,
# quadrant_inner_vs_upper <dbl>, optimal_systemic_therapy <dbl>
And you can summarize select variables using multiple functions by including them in a list:
|>
mydf summarise_at(
c("age_dx_yrs", "tumor_size_cm"), list(mean, sd), na.rm = TRUE
)
# A tibble: 1 × 4
age_dx_yrs_fn1 tumor_size_cm_fn1 age_dx_yrs_fn2 tumor_size_cm_fn2
<dbl> <dbl> <dbl> <dbl>
1 57.3 2.39 13.9 0.984
To specify the string appended to the variable names in the results, make a named list instead:
|>
mydf summarise_at(
c("age_dx_yrs", "tumor_size_cm"), list(mean = mean, sd = sd), na.rm = TRUE
)
# A tibble: 1 × 4
age_dx_yrs_mean tumor_size_cm_mean age_dx_yrs_sd tumor_size_cm_sd
<dbl> <dbl> <dbl> <dbl>
1 57.3 2.39 13.9 0.984
And you can also do this by group, by adding in a group_by()
statement:
|>
mydf group_by(rt) |>
summarise_at(
c("age_dx_yrs", "tumor_size_cm"), list(mean = mean, sd = sd), na.rm = TRUE
)
# A tibble: 2 × 5
rt age_dx_yrs_mean tumor_size_cm_mean age_dx_yrs_sd tumor_size_cm_sd
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 60.5 2.18 13.7 0.986
2 1 55.1 2.53 13.5 0.958
Creating publication-ready tables
In clinical research, we always start our papers with a “Table 1” showing characteristics of the included patient sample, which usually includes a mix of categorical and continuous variables. The {gtsummary} package can help with this.
Let’s make a table of all variables in the breast cancer data except for the outcomes, time and event, using the tbl_summary()
function from the {gtsummary} package:
|>
mydf select(
-time,
-event
|>
) tbl_summary()
Characteristic | N = 3,0001 |
---|---|
rt | 1,772 (59%) |
age_dx_yrs | 57 (48, 67) |
tumor_size_cm | 2.37 (1.69, 3.06) |
Unknown | 149 |
grade | |
I | 362 (12%) |
II | 1,310 (44%) |
III | 1,328 (44%) |
n_ln_pos_3_vs_1or2 | 467 (16%) |
nodal_ratio | 0.28 (0.16, 0.41) |
lvi | 1,240 (41%) |
er_or_pr_pos | 2,537 (87%) |
Unknown | 87 |
her2_pos | 511 (18%) |
Unknown | 132 |
quadrant_inner_vs_upper | 517 (17%) |
optimal_systemic_therapy | 2,660 (89%) |
1 n (%); Median (Q1, Q3) |
And we see that with these few lines of code, by default we get a table that shows the number and percentage for binary or categorical variables and the median and first and third quartiles for continuous variables.
The tables are highly customizable.
We can label the row variables:
|>
mydf select(
-time,
-event
|>
) tbl_summary(
label = list(
rt = "Radiation therapy",
age_dx_yrs = "Age at diagnosis (years)",
tumor_size_cm = "Tumor size (cm)",
grade = "Grade",
n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
nodal_ratio = "Nodal ratio",
lvi = "LVI",
er_or_pr_pos = "ER+ or PR+",
her2_pos = "Her2+",
quadrant_inner_vs_upper = "Inner vs upper quadrant",
optimal_systemic_therapy = "Optimal systemic therapy"
) )
Characteristic | N = 3,0001 |
---|---|
Radiation therapy | 1,772 (59%) |
Age at diagnosis (years) | 57 (48, 67) |
Tumor size (cm) | 2.37 (1.69, 3.06) |
Unknown | 149 |
Grade | |
I | 362 (12%) |
II | 1,310 (44%) |
III | 1,328 (44%) |
3 vs 1/2 positive lymph nodes | 467 (16%) |
Nodal ratio | 0.28 (0.16, 0.41) |
LVI | 1,240 (41%) |
ER+ or PR+ | 2,537 (87%) |
Unknown | 87 |
Her2+ | 511 (18%) |
Unknown | 132 |
Inner vs upper quadrant | 517 (17%) |
Optimal systemic therapy | 2,660 (89%) |
1 n (%); Median (Q1, Q3) |
We can make the variable names bold, and the levels italic:
|>
mydf select(
-time,
-event
|>
) tbl_summary(
label = list(
rt = "Radiation therapy",
age_dx_yrs = "Age at diagnosis (years)",
tumor_size_cm = "Tumor size (cm)",
grade = "Grade",
n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
nodal_ratio = "Nodal ratio",
lvi = "LVI",
er_or_pr_pos = "ER+ or PR+",
her2_pos = "Her2+",
quadrant_inner_vs_upper = "Inner vs upper quadrant",
optimal_systemic_therapy = "Optimal systemic therapy"
)|>
) bold_labels() |>
italicize_levels()
Characteristic | N = 3,0001 |
---|---|
Radiation therapy | 1,772 (59%) |
Age at diagnosis (years) | 57 (48, 67) |
Tumor size (cm) | 2.37 (1.69, 3.06) |
Unknown | 149 |
Grade | |
I | 362 (12%) |
II | 1,310 (44%) |
III | 1,328 (44%) |
3 vs 1/2 positive lymph nodes | 467 (16%) |
Nodal ratio | 0.28 (0.16, 0.41) |
LVI | 1,240 (41%) |
ER+ or PR+ | 2,537 (87%) |
Unknown | 87 |
Her2+ | 511 (18%) |
Unknown | 132 |
Inner vs upper quadrant | 517 (17%) |
Optimal systemic therapy | 2,660 (89%) |
1 n (%); Median (Q1, Q3) |
We could add a column variable, and summarize by that, and also overall:
|>
mydf select(
-time,
-event
|>
) tbl_summary(
label = list(
rt = "Radiation therapy",
age_dx_yrs = "Age at diagnosis (years)",
tumor_size_cm = "Tumor size (cm)",
grade = "Grade",
n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
nodal_ratio = "Nodal ratio",
lvi = "LVI",
er_or_pr_pos = "ER+ or PR+",
her2_pos = "Her2+",
quadrant_inner_vs_upper = "Inner vs upper quadrant",
optimal_systemic_therapy = "Optimal systemic therapy"
),by = rt
|>
) bold_labels() |>
italicize_levels() |>
add_overall()
Characteristic | Overall N = 3,0001 |
0 N = 1,2281 |
1 N = 1,7721 |
---|---|---|---|
Age at diagnosis (years) | 57 (48, 67) | 60 (51, 70) | 55 (46, 64) |
Tumor size (cm) | 2.37 (1.69, 3.06) | 2.14 (1.49, 2.83) | 2.53 (1.87, 3.18) |
Unknown | 149 | 60 | 89 |
Grade | |||
I | 362 (12%) | 189 (15%) | 173 (9.8%) |
II | 1,310 (44%) | 562 (46%) | 748 (42%) |
III | 1,328 (44%) | 477 (39%) | 851 (48%) |
3 vs 1/2 positive lymph nodes | 467 (16%) | 92 (7.5%) | 375 (21%) |
Nodal ratio | 0.28 (0.16, 0.41) | 0.25 (0.15, 0.38) | 0.30 (0.18, 0.43) |
LVI | 1,240 (41%) | 435 (35%) | 805 (45%) |
ER+ or PR+ | 2,537 (87%) | 1,045 (88%) | 1,492 (86%) |
Unknown | 87 | 41 | 46 |
Her2+ | 511 (18%) | 208 (18%) | 303 (18%) |
Unknown | 132 | 51 | 81 |
Inner vs upper quadrant | 517 (17%) | 218 (18%) | 299 (17%) |
Optimal systemic therapy | 2,660 (89%) | 1,030 (84%) | 1,630 (92%) |
1 Median (Q1, Q3); n (%) |
We can customize the column labels:
|>
mydf select(
-time,
-event
|>
) tbl_summary(
label = list(
rt = "Radiation therapy",
age_dx_yrs = "Age at diagnosis (years)",
tumor_size_cm = "Tumor size (cm)",
grade = "Grade",
n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
nodal_ratio = "Nodal ratio",
lvi = "LVI",
er_or_pr_pos = "ER+ or PR+",
her2_pos = "Her2+",
quadrant_inner_vs_upper = "Inner vs upper quadrant",
optimal_systemic_therapy = "Optimal systemic therapy"
),by = rt
|>
) bold_labels() |>
italicize_levels() |>
add_overall() |>
modify_header(
stat_1 = "**No radiation** \nN = 1,228",
stat_2 = "**Radiation** \nN = 1,772"
)
Characteristic | Overall N = 3,0001 |
No radiation N = 1,2281 |
Radiation N = 1,7721 |
---|---|---|---|
Age at diagnosis (years) | 57 (48, 67) | 60 (51, 70) | 55 (46, 64) |
Tumor size (cm) | 2.37 (1.69, 3.06) | 2.14 (1.49, 2.83) | 2.53 (1.87, 3.18) |
Unknown | 149 | 60 | 89 |
Grade | |||
I | 362 (12%) | 189 (15%) | 173 (9.8%) |
II | 1,310 (44%) | 562 (46%) | 748 (42%) |
III | 1,328 (44%) | 477 (39%) | 851 (48%) |
3 vs 1/2 positive lymph nodes | 467 (16%) | 92 (7.5%) | 375 (21%) |
Nodal ratio | 0.28 (0.16, 0.41) | 0.25 (0.15, 0.38) | 0.30 (0.18, 0.43) |
LVI | 1,240 (41%) | 435 (35%) | 805 (45%) |
ER+ or PR+ | 2,537 (87%) | 1,045 (88%) | 1,492 (86%) |
Unknown | 87 | 41 | 46 |
Her2+ | 511 (18%) | 208 (18%) | 303 (18%) |
Unknown | 132 | 51 | 81 |
Inner vs upper quadrant | 517 (17%) | 218 (18%) | 299 (17%) |
Optimal systemic therapy | 2,660 (89%) | 1,030 (84%) | 1,630 (92%) |
1 Median (Q1, Q3); n (%) |
Note: to see the underlying header names, do the following:
<-
temp |>
mydf select(
-time,
-event
|>
) tbl_summary(
by = rt
)
show_header_names(temp)
Column Name Header level* N* n* p*
label "**Characteristic**" 3,000 <int>
stat_1 "**0** \nN = 1,228" 0 <chr> 3,000 <int> 1,228 <int> 0.409 <dbl>
stat_2 "**1** \nN = 1,772" 1 <chr> 3,000 <int> 1,772 <int> 0.591 <dbl>
* These values may be dynamically placed into headers (and other locations).
ℹ Review the `modify_header()` (`?gtsummary::modify_header()`) help for
examples.
And we can add a table header:
|>
mydf select(
-time,
-event
|>
) tbl_summary(
label = list(
rt = "Radiation therapy",
age_dx_yrs = "Age at diagnosis (years)",
tumor_size_cm = "Tumor size (cm)",
grade = "Grade",
n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
nodal_ratio = "Nodal ratio",
lvi = "LVI",
er_or_pr_pos = "ER+ or PR+",
her2_pos = "Her2+",
quadrant_inner_vs_upper = "Inner vs upper quadrant",
optimal_systemic_therapy = "Optimal systemic therapy"
),by = rt
|>
) bold_labels() |>
italicize_levels() |>
add_overall() |>
modify_header(
stat_1 = "**No radiation** \nN = 1,228",
stat_2 = "**Radiation** \nN = 1,772"
|>
) modify_caption("Table of patient characteristics by radiation")
Characteristic | Overall N = 3,0001 |
No radiation N = 1,2281 |
Radiation N = 1,7721 |
---|---|---|---|
Age at diagnosis (years) | 57 (48, 67) | 60 (51, 70) | 55 (46, 64) |
Tumor size (cm) | 2.37 (1.69, 3.06) | 2.14 (1.49, 2.83) | 2.53 (1.87, 3.18) |
Unknown | 149 | 60 | 89 |
Grade | |||
I | 362 (12%) | 189 (15%) | 173 (9.8%) |
II | 1,310 (44%) | 562 (46%) | 748 (42%) |
III | 1,328 (44%) | 477 (39%) | 851 (48%) |
3 vs 1/2 positive lymph nodes | 467 (16%) | 92 (7.5%) | 375 (21%) |
Nodal ratio | 0.28 (0.16, 0.41) | 0.25 (0.15, 0.38) | 0.30 (0.18, 0.43) |
LVI | 1,240 (41%) | 435 (35%) | 805 (45%) |
ER+ or PR+ | 2,537 (87%) | 1,045 (88%) | 1,492 (86%) |
Unknown | 87 | 41 | 46 |
Her2+ | 511 (18%) | 208 (18%) | 303 (18%) |
Unknown | 132 | 51 | 81 |
Inner vs upper quadrant | 517 (17%) | 218 (18%) | 299 (17%) |
Optimal systemic therapy | 2,660 (89%) | 1,030 (84%) | 1,630 (92%) |
1 Median (Q1, Q3); n (%) |
We can also customize the statistics in the table, for example use the mean and standard deviation for all continuous variables:
|>
mydf select(
-time,
-event
|>
) tbl_summary(
label = list(
rt = "Radiation therapy",
age_dx_yrs = "Age at diagnosis (years)",
tumor_size_cm = "Tumor size (cm)",
grade = "Grade",
n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
nodal_ratio = "Nodal ratio",
lvi = "LVI",
er_or_pr_pos = "ER+ or PR+",
her2_pos = "Her2+",
quadrant_inner_vs_upper = "Inner vs upper quadrant",
optimal_systemic_therapy = "Optimal systemic therapy"
),by = rt,
statistic = list(all_continuous() ~ "{mean} ({sd})")
|>
) bold_labels() |>
italicize_levels() |>
add_overall() |>
modify_header(
stat_1 = "**No radiation** \nN = 1,228",
stat_2 = "**Radiation** \nN = 1,772"
|>
) modify_caption("Table of patient characteristics by radiation")
Characteristic | Overall N = 3,0001 |
No radiation N = 1,2281 |
Radiation N = 1,7721 |
---|---|---|---|
Age at diagnosis (years) | 57 (14) | 60 (14) | 55 (13) |
Tumor size (cm) | 2.39 (0.98) | 2.18 (0.99) | 2.53 (0.96) |
Unknown | 149 | 60 | 89 |
Grade | |||
I | 362 (12%) | 189 (15%) | 173 (9.8%) |
II | 1,310 (44%) | 562 (46%) | 748 (42%) |
III | 1,328 (44%) | 477 (39%) | 851 (48%) |
3 vs 1/2 positive lymph nodes | 467 (16%) | 92 (7.5%) | 375 (21%) |
Nodal ratio | 0.30 (0.17) | 0.28 (0.17) | 0.32 (0.18) |
LVI | 1,240 (41%) | 435 (35%) | 805 (45%) |
ER+ or PR+ | 2,537 (87%) | 1,045 (88%) | 1,492 (86%) |
Unknown | 87 | 41 | 46 |
Her2+ | 511 (18%) | 208 (18%) | 303 (18%) |
Unknown | 132 | 51 | 81 |
Inner vs upper quadrant | 517 (17%) | 218 (18%) | 299 (17%) |
Optimal systemic therapy | 2,660 (89%) | 1,030 (84%) | 1,630 (92%) |
1 Mean (SD); n (%) |
See the {gtsummary} website for details and examples: https://www.danieldsjoberg.com/gtsummary/