Descriptive statistics

It will be common to want some basic tabulations of categorical variables and summaries of continuous variables, to examine our data and make sure they are in line with what we expect, and to report results for papers.

In this section, we will learn some basic descriptive statistics including one-way and two-way contingency tables, summary statistics for continuous variables, and summary statistics by group. We will be using functions from the {janitor}, {dplyr}, and {readr} packages, which we have used previously, and we will also be introducing the {gt} and {gtsummary} packages. First, make sure these packages are installed and loaded in your current R session.

Note that all of these packages are already installed on Posit Workbench on the server

install.packages("janitor")
install.packages("dplyr")
install.packages("readr")
install.packages("gt")
install.packages("gtsummary")
library(janitor)
library(dplyr)
library(readr)
library(gt)
library(gtsummary)

The {gt} package will be used to create nicely styled tables.

Also make sure the breastcancer data are loaded into your current R session, with names standardized using the clean_names() function from the {janitor} package.

mydf <- read_csv("~/mmedr/breastcancer.csv") |> 
  clean_names()

One-way frequency table

We can use the tabyl() function from the {janitor} package to make a basic one-way frequency table.

Let’s say we want a table of grade:

mydf |> 
  tabyl(grade)
 grade    n   percent
     I  362 0.1206667
    II 1310 0.4366667
   III 1328 0.4426667

Then we could additionally format the percentages using the adorn_pct_formatting() function:

mydf |> 
  tabyl(grade) |> 
  adorn_pct_formatting()
 grade    n percent
     I  362   12.1%
    II 1310   43.7%
   III 1328   44.3%

And with just three short lines of code, we have a basic one-way contingency table for our categorical variable of interest.

Two-way contingency table

We can similarly use thetabyl() function from the {janitor} package to make two-way contingency tables by simply adding another variable name, separated by a comma. Let’s say we now want to see grade tabulated by rt:

mydf |> 
  tabyl(grade, rt)
 grade   0   1
     I 189 173
    II 562 748
   III 477 851

And we get a basic two-way contingency table with counts of grade according to the value of rt.

To enhance the appearance of our table, we can use the {gt} package and it’s associated features to customize our two-way contingency table.

See the {gt} package website for many details and features.

Let’s label the row variable and column variable, and add a title:

mydf |> 
  tabyl(grade, rt) |> 
  gt(
    rowname_col = "grade" 
  ) |> 
  tab_spanner(          
    columns = 2:3,
    label = "Radiation therapy?"
  ) |> 
  tab_stubhead(         
    label = "Grade" 
  ) |> 
  tab_header(           
    title = "Grade according to radiation therapy"
  )
Grade according to radiation therapy
Grade
Radiation therapy?
0 1
I 189 173
II 562 748
III 477 851

And now you see we have a nicely styled html table with the frequencies of grade according to rt, with row and column headers and a title.

Alternatively, we could display percentages in our table. Let’s say we are interested in row percentages, to see what percentage of subjects received radiation therapy within each level of grade.

mydf |> 
  tabyl(grade, rt) |> 
  adorn_percentages(denominator = "row") |> 
  gt(
    rowname_col = "grade" 
  ) |> 
  fmt_percent(
    columns = -1,
    decimals = 1
  ) |> 
  tab_spanner(          
    columns = 2:3,
    label = "Radiation therapy?"
  ) |> 
  tab_stubhead(         
    label = "Grade" 
  ) |> 
  tab_header(           
    title = "Grade according to radiation therapy"
  )
Grade according to radiation therapy
Grade
Radiation therapy?
0 1
I 52.2% 47.8%
II 42.9% 57.1%
III 35.9% 64.1%
Tip

The case_match() function from the {dplyr} package is useful for recoding the levels of a categorical variable to get more desriptive levels when making tables or presenting summaries, e.g. from 0 to “No” and 1 to “Yes”

To make the table look even nicer, we could use the case_match() function from the {dplyr} package, which works similarly to case_when(), and is useful for recoding the levels of a categorical variable. We may want to create more descriptive levels than the 1/0 indicators these variables currently contain.

mydf |> 
  mutate(
    rt_text = case_match(
      rt,
      1 ~ "Yes",
      0 ~ "No"
    )
  ) |> 
  tabyl(grade, rt_text) |> 
  adorn_percentages(denominator = "row") |> 
  gt(
    rowname_col = "grade" 
  ) |> 
  fmt_percent(
    columns = -1,
    decimals = 1
  ) |> 
  tab_spanner(          
    columns = 2:3,
    label = "Radiation therapy?"
  ) |> 
  tab_stubhead(         
    label = "Grade" 
  ) |> 
  tab_header(           
    title = "Grade according to radiation therapy"
  )
Grade according to radiation therapy
Grade
Radiation therapy?
No Yes
I 52.2% 47.8%
II 42.9% 57.1%
III 35.9% 64.1%

Summary statistics

We can use the summarize() function from the {dplyr} package to compute summary statistics.

Let’s compute the mean and standard deviation of age_dx_yrs:

mydf |> 
  summarize(
    avg_age = mean(age_dx_yrs),
    sd_age = sd(age_dx_yrs)
  )
# A tibble: 1 × 2
  avg_age sd_age
    <dbl>  <dbl>
1    57.3   13.9

Summary statistics by group

And we can easily extend this code to generate summary statistics by group by using the group_by() function from the {dplyr} package prior to using summarize().

Tip

The group_by() function is useful in many settings, and it takes an existing dataframe and converts it to a grouped dataframe where all following operations are done within the group.

Note: It is best practice to ungroup() when group operations are complete, to avoid unintended results.

For example:

mydf |> 
  group_by(rt) |> 
  summarize(
    avg_age = mean(age_dx_yrs),
    sd_age = sd(age_dx_yrs)
  ) 
# A tibble: 2 × 3
     rt avg_age sd_age
  <dbl>   <dbl>  <dbl>
1     0    60.5   13.7
2     1    55.1   13.5

Recall that if there are missing values, these must be accounted for by supplying the appropriate additional argument to the mean() and sd() functions. For example, if we want to summarize tumor_size_cm according to grade, we can use:

mydf |> 
  group_by(grade) |> 
  summarize(
    avg_size = mean(tumor_size_cm, na.rm = TRUE),
    sd_size = sd(tumor_size_cm, na.rm = TRUE)
  ) |> 
  ungroup()
# A tibble: 3 × 3
  grade avg_size sd_size
  <chr>    <dbl>   <dbl>
1 I         2.40   0.937
2 II        2.39   0.983
3 III       2.38   0.999

Summarize multiple variables

You can summarize multiple variables using a single function by name:

mydf |> 
  summarise_at(
    c("age_dx_yrs", "tumor_size_cm"), mean, na.rm = TRUE
    )
# A tibble: 1 × 2
  age_dx_yrs tumor_size_cm
       <dbl>         <dbl>
1       57.3          2.39

Or using selection helpers inside vars():

mydf |> 
  summarise_at(
    vars(age_dx_yrs:tumor_size_cm), mean, na.rm = TRUE
    )
# A tibble: 1 × 2
  age_dx_yrs tumor_size_cm
       <dbl>         <dbl>
1       57.3          2.39

You can also select all variables of a certain type, like all numeric variables:

mydf |> 
  summarise_if(is.numeric, mean, na.rm = TRUE)
# A tibble: 1 × 12
   time event    rt age_dx_yrs tumor_size_cm n_ln_pos_3_vs_1or2 nodal_ratio
  <dbl> <dbl> <dbl>      <dbl>         <dbl>              <dbl>       <dbl>
1  3.07 0.813 0.591       57.3          2.39              0.156       0.300
# ℹ 5 more variables: lvi <dbl>, er_or_pr_pos <dbl>, her2_pos <dbl>,
#   quadrant_inner_vs_upper <dbl>, optimal_systemic_therapy <dbl>

And you can summarize select variables using multiple functions by including them in a list:

mydf |> 
  summarise_at(
    c("age_dx_yrs", "tumor_size_cm"), list(mean, sd), na.rm = TRUE
    )
# A tibble: 1 × 4
  age_dx_yrs_fn1 tumor_size_cm_fn1 age_dx_yrs_fn2 tumor_size_cm_fn2
           <dbl>             <dbl>          <dbl>             <dbl>
1           57.3              2.39           13.9             0.984

To specify the string appended to the variable names in the results, make a named list instead:

mydf |> 
  summarise_at(
    c("age_dx_yrs", "tumor_size_cm"), list(mean = mean, sd = sd), na.rm = TRUE
    )
# A tibble: 1 × 4
  age_dx_yrs_mean tumor_size_cm_mean age_dx_yrs_sd tumor_size_cm_sd
            <dbl>              <dbl>         <dbl>            <dbl>
1            57.3               2.39          13.9            0.984

And you can also do this by group, by adding in a group_by() statement:

mydf |> 
  group_by(rt) |> 
  summarise_at(
    c("age_dx_yrs", "tumor_size_cm"), list(mean = mean, sd = sd), na.rm = TRUE
    )
# A tibble: 2 × 5
     rt age_dx_yrs_mean tumor_size_cm_mean age_dx_yrs_sd tumor_size_cm_sd
  <dbl>           <dbl>              <dbl>         <dbl>            <dbl>
1     0            60.5               2.18          13.7            0.986
2     1            55.1               2.53          13.5            0.958

Creating publication-ready tables

In clinical research, we always start our papers with a “Table 1” showing characteristics of the included patient sample, which usually includes a mix of categorical and continuous variables. The {gtsummary} package can help with this.

Let’s make a table of all variables in the breast cancer data except for the outcomes, time and event, using the tbl_summary() function from the {gtsummary} package:

mydf |> 
  select(
    -time,
    -event
  ) |> 
  tbl_summary()
Characteristic N = 3,0001
rt 1,772 (59%)
age_dx_yrs 57 (48, 67)
tumor_size_cm 2.37 (1.69, 3.06)
    Unknown 149
grade
    I 362 (12%)
    II 1,310 (44%)
    III 1,328 (44%)
n_ln_pos_3_vs_1or2 467 (16%)
nodal_ratio 0.28 (0.16, 0.41)
lvi 1,240 (41%)
er_or_pr_pos 2,537 (87%)
    Unknown 87
her2_pos 511 (18%)
    Unknown 132
quadrant_inner_vs_upper 517 (17%)
optimal_systemic_therapy 2,660 (89%)
1 n (%); Median (Q1, Q3)

And we see that with these few lines of code, by default we get a table that shows the number and percentage for binary or categorical variables and the median and first and third quartiles for continuous variables.

The tables are highly customizable.

We can label the row variables:

mydf |> 
  select(
    -time,
    -event
  ) |> 
  tbl_summary(
    label = list(
      rt = "Radiation therapy",
      age_dx_yrs = "Age at diagnosis (years)",
      tumor_size_cm = "Tumor size (cm)",
      grade = "Grade",
      n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
      nodal_ratio = "Nodal ratio",
      lvi = "LVI",
      er_or_pr_pos = "ER+ or PR+",
      her2_pos = "Her2+",
      quadrant_inner_vs_upper = "Inner vs upper quadrant",
      optimal_systemic_therapy = "Optimal systemic therapy"
    )
  )
Characteristic N = 3,0001
Radiation therapy 1,772 (59%)
Age at diagnosis (years) 57 (48, 67)
Tumor size (cm) 2.37 (1.69, 3.06)
    Unknown 149
Grade
    I 362 (12%)
    II 1,310 (44%)
    III 1,328 (44%)
3 vs 1/2 positive lymph nodes 467 (16%)
Nodal ratio 0.28 (0.16, 0.41)
LVI 1,240 (41%)
ER+ or PR+ 2,537 (87%)
    Unknown 87
Her2+ 511 (18%)
    Unknown 132
Inner vs upper quadrant 517 (17%)
Optimal systemic therapy 2,660 (89%)
1 n (%); Median (Q1, Q3)

We can make the variable names bold, and the levels italic:

mydf |> 
  select(
    -time,
    -event
  ) |> 
  tbl_summary(
    label = list(
      rt = "Radiation therapy",
      age_dx_yrs = "Age at diagnosis (years)",
      tumor_size_cm = "Tumor size (cm)",
      grade = "Grade",
      n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
      nodal_ratio = "Nodal ratio",
      lvi = "LVI",
      er_or_pr_pos = "ER+ or PR+",
      her2_pos = "Her2+",
      quadrant_inner_vs_upper = "Inner vs upper quadrant",
      optimal_systemic_therapy = "Optimal systemic therapy"
    )
  ) |> 
  bold_labels() |> 
  italicize_levels()
Characteristic N = 3,0001
Radiation therapy 1,772 (59%)
Age at diagnosis (years) 57 (48, 67)
Tumor size (cm) 2.37 (1.69, 3.06)
    Unknown 149
Grade
    I 362 (12%)
    II 1,310 (44%)
    III 1,328 (44%)
3 vs 1/2 positive lymph nodes 467 (16%)
Nodal ratio 0.28 (0.16, 0.41)
LVI 1,240 (41%)
ER+ or PR+ 2,537 (87%)
    Unknown 87
Her2+ 511 (18%)
    Unknown 132
Inner vs upper quadrant 517 (17%)
Optimal systemic therapy 2,660 (89%)
1 n (%); Median (Q1, Q3)

We could add a column variable, and summarize by that, and also overall:

mydf |> 
  select(
    -time,
    -event
  ) |> 
  tbl_summary(
    label = list(
      rt = "Radiation therapy",
      age_dx_yrs = "Age at diagnosis (years)",
      tumor_size_cm = "Tumor size (cm)",
      grade = "Grade",
      n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
      nodal_ratio = "Nodal ratio",
      lvi = "LVI",
      er_or_pr_pos = "ER+ or PR+",
      her2_pos = "Her2+",
      quadrant_inner_vs_upper = "Inner vs upper quadrant",
      optimal_systemic_therapy = "Optimal systemic therapy"
    ),
    by = rt
  ) |> 
  bold_labels() |> 
  italicize_levels() |> 
  add_overall()
Characteristic Overall
N = 3,0001
0
N = 1,2281
1
N = 1,7721
Age at diagnosis (years) 57 (48, 67) 60 (51, 70) 55 (46, 64)
Tumor size (cm) 2.37 (1.69, 3.06) 2.14 (1.49, 2.83) 2.53 (1.87, 3.18)
    Unknown 149 60 89
Grade


    I 362 (12%) 189 (15%) 173 (9.8%)
    II 1,310 (44%) 562 (46%) 748 (42%)
    III 1,328 (44%) 477 (39%) 851 (48%)
3 vs 1/2 positive lymph nodes 467 (16%) 92 (7.5%) 375 (21%)
Nodal ratio 0.28 (0.16, 0.41) 0.25 (0.15, 0.38) 0.30 (0.18, 0.43)
LVI 1,240 (41%) 435 (35%) 805 (45%)
ER+ or PR+ 2,537 (87%) 1,045 (88%) 1,492 (86%)
    Unknown 87 41 46
Her2+ 511 (18%) 208 (18%) 303 (18%)
    Unknown 132 51 81
Inner vs upper quadrant 517 (17%) 218 (18%) 299 (17%)
Optimal systemic therapy 2,660 (89%) 1,030 (84%) 1,630 (92%)
1 Median (Q1, Q3); n (%)

We can customize the column labels:

mydf |> 
  select(
    -time,
    -event
  ) |> 
  tbl_summary(
    label = list(
      rt = "Radiation therapy",
      age_dx_yrs = "Age at diagnosis (years)",
      tumor_size_cm = "Tumor size (cm)",
      grade = "Grade",
      n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
      nodal_ratio = "Nodal ratio",
      lvi = "LVI",
      er_or_pr_pos = "ER+ or PR+",
      her2_pos = "Her2+",
      quadrant_inner_vs_upper = "Inner vs upper quadrant",
      optimal_systemic_therapy = "Optimal systemic therapy"
    ),
    by = rt
  ) |> 
  bold_labels() |> 
  italicize_levels() |> 
  add_overall() |> 
  modify_header(
    stat_1 = "**No radiation**  \nN = 1,228",
    stat_2 = "**Radiation**  \nN = 1,772"
  )
Characteristic Overall
N = 3,0001
No radiation
N = 1,2281
Radiation
N = 1,7721
Age at diagnosis (years) 57 (48, 67) 60 (51, 70) 55 (46, 64)
Tumor size (cm) 2.37 (1.69, 3.06) 2.14 (1.49, 2.83) 2.53 (1.87, 3.18)
    Unknown 149 60 89
Grade


    I 362 (12%) 189 (15%) 173 (9.8%)
    II 1,310 (44%) 562 (46%) 748 (42%)
    III 1,328 (44%) 477 (39%) 851 (48%)
3 vs 1/2 positive lymph nodes 467 (16%) 92 (7.5%) 375 (21%)
Nodal ratio 0.28 (0.16, 0.41) 0.25 (0.15, 0.38) 0.30 (0.18, 0.43)
LVI 1,240 (41%) 435 (35%) 805 (45%)
ER+ or PR+ 2,537 (87%) 1,045 (88%) 1,492 (86%)
    Unknown 87 41 46
Her2+ 511 (18%) 208 (18%) 303 (18%)
    Unknown 132 51 81
Inner vs upper quadrant 517 (17%) 218 (18%) 299 (17%)
Optimal systemic therapy 2,660 (89%) 1,030 (84%) 1,630 (92%)
1 Median (Q1, Q3); n (%)

Note: to see the underlying header names, do the following:

temp <-
  mydf |> 
  select(
    -time,
    -event
  ) |> 
  tbl_summary(
    by = rt
  )

show_header_names(temp)
Column Name   Header                 level*    N*            n*            p*             
label         "**Characteristic**"             3,000 <int>                                
stat_1        "**0**  \nN = 1,228"   0 <chr>   3,000 <int>   1,228 <int>   0.409 <dbl>    
stat_2        "**1**  \nN = 1,772"   1 <chr>   3,000 <int>   1,772 <int>   0.591 <dbl>    
* These values may be dynamically placed into headers (and other locations).
ℹ Review the `modify_header()` (`?gtsummary::modify_header()`) help for
  examples.

And we can add a table header:

mydf |> 
  select(
    -time,
    -event
  ) |> 
  tbl_summary(
    label = list(
      rt = "Radiation therapy",
      age_dx_yrs = "Age at diagnosis (years)",
      tumor_size_cm = "Tumor size (cm)",
      grade = "Grade",
      n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
      nodal_ratio = "Nodal ratio",
      lvi = "LVI",
      er_or_pr_pos = "ER+ or PR+",
      her2_pos = "Her2+",
      quadrant_inner_vs_upper = "Inner vs upper quadrant",
      optimal_systemic_therapy = "Optimal systemic therapy"
    ),
    by = rt
  ) |> 
  bold_labels() |> 
  italicize_levels() |> 
  add_overall() |> 
  modify_header(
    stat_1 = "**No radiation**  \nN = 1,228",
    stat_2 = "**Radiation**  \nN = 1,772"
  ) |> 
  modify_caption("Table of patient characteristics by radiation")
Table of patient characteristics by radiation
Characteristic Overall
N = 3,0001
No radiation
N = 1,2281
Radiation
N = 1,7721
Age at diagnosis (years) 57 (48, 67) 60 (51, 70) 55 (46, 64)
Tumor size (cm) 2.37 (1.69, 3.06) 2.14 (1.49, 2.83) 2.53 (1.87, 3.18)
    Unknown 149 60 89
Grade


    I 362 (12%) 189 (15%) 173 (9.8%)
    II 1,310 (44%) 562 (46%) 748 (42%)
    III 1,328 (44%) 477 (39%) 851 (48%)
3 vs 1/2 positive lymph nodes 467 (16%) 92 (7.5%) 375 (21%)
Nodal ratio 0.28 (0.16, 0.41) 0.25 (0.15, 0.38) 0.30 (0.18, 0.43)
LVI 1,240 (41%) 435 (35%) 805 (45%)
ER+ or PR+ 2,537 (87%) 1,045 (88%) 1,492 (86%)
    Unknown 87 41 46
Her2+ 511 (18%) 208 (18%) 303 (18%)
    Unknown 132 51 81
Inner vs upper quadrant 517 (17%) 218 (18%) 299 (17%)
Optimal systemic therapy 2,660 (89%) 1,030 (84%) 1,630 (92%)
1 Median (Q1, Q3); n (%)

We can also customize the statistics in the table, for example use the mean and standard deviation for all continuous variables:

mydf |> 
  select(
    -time,
    -event
  ) |> 
  tbl_summary(
    label = list(
      rt = "Radiation therapy",
      age_dx_yrs = "Age at diagnosis (years)",
      tumor_size_cm = "Tumor size (cm)",
      grade = "Grade",
      n_ln_pos_3_vs_1or2 = "3 vs 1/2 positive lymph nodes",
      nodal_ratio = "Nodal ratio",
      lvi = "LVI",
      er_or_pr_pos = "ER+ or PR+",
      her2_pos = "Her2+",
      quadrant_inner_vs_upper = "Inner vs upper quadrant",
      optimal_systemic_therapy = "Optimal systemic therapy"
    ),
    by = rt,
    statistic = list(all_continuous() ~ "{mean} ({sd})")
  ) |> 
  bold_labels() |> 
  italicize_levels() |> 
  add_overall() |> 
  modify_header(
    stat_1 = "**No radiation**  \nN = 1,228",
    stat_2 = "**Radiation**  \nN = 1,772"
  ) |> 
  modify_caption("Table of patient characteristics by radiation")
Table of patient characteristics by radiation
Characteristic Overall
N = 3,0001
No radiation
N = 1,2281
Radiation
N = 1,7721
Age at diagnosis (years) 57 (14) 60 (14) 55 (13)
Tumor size (cm) 2.39 (0.98) 2.18 (0.99) 2.53 (0.96)
    Unknown 149 60 89
Grade


    I 362 (12%) 189 (15%) 173 (9.8%)
    II 1,310 (44%) 562 (46%) 748 (42%)
    III 1,328 (44%) 477 (39%) 851 (48%)
3 vs 1/2 positive lymph nodes 467 (16%) 92 (7.5%) 375 (21%)
Nodal ratio 0.30 (0.17) 0.28 (0.17) 0.32 (0.18)
LVI 1,240 (41%) 435 (35%) 805 (45%)
ER+ or PR+ 2,537 (87%) 1,045 (88%) 1,492 (86%)
    Unknown 87 41 46
Her2+ 511 (18%) 208 (18%) 303 (18%)
    Unknown 132 51 81
Inner vs upper quadrant 517 (17%) 218 (18%) 299 (17%)
Optimal systemic therapy 2,660 (89%) 1,030 (84%) 1,630 (92%)
1 Mean (SD); n (%)

See the {gtsummary} website for details and examples: https://www.danieldsjoberg.com/gtsummary/