Adjustment for multiple comparisons

In this session, we will introduce methods to adjust p-values to account for multiple testing, learn advanced programming techniques including for loops and writing custom functions, and cover basic statistical tests for hypothesis testing with different combinations of continuous, categorical, and paired data.

This problem is referred to as multiple comparisons, multiple testing, or multiplicity; all three terms mean the same thing.

What do we mean by multiple comparisons, and what is the issue?

When multiple statistical tests are conducted simultaneously, the chance of making at least one type I error increases. The standard 0.05 threshold used to judge whether p-values are significant is therefore no longer appropriate, because the overall type I error rate has been inflated by the multiple testing.
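To see the inflation concretely: if we run m independent tests, each at the 0.05 level, the probability of at least one false positive is 1 - (1 - 0.05)^m. A minimal sketch (the independence assumption is ours, made for simplicity):

# Probability of at least one type I error across m independent tests,
# each conducted at the 0.05 level
m <- c(1, 5, 10, 20)
1 - (1 - 0.05)^m
[1] 0.0500000 0.2262191 0.4012631 0.6415141

With 10 tests we already have about a 40% chance of at least one false positive, even if no true associations exist.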

The multiple comparisons problem affects both p-values and confidence intervals.

Prior to significance testing, we need to identify a stricter p-value threshold or, alternatively, directly adjust our p-values.
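For example, under the Bonferroni approach the stricter threshold is 0.05 divided by the number of tests. A minimal sketch with made-up p-values:

p <- c(0.003, 0.020, 0.400)  # hypothetical p-values from m = 3 tests
p < 0.05 / length(p)         # stricter threshold: 0.05 / 3, about 0.0167
[1]  TRUE FALSE FALSE

Equivalently, we can multiply each p-value by the number of tests (capping at 1) and compare the adjusted values to the usual 0.05; both routes lead to the same decisions.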

A number of corrections for multiple comparisons can be implemented with the R function p.adjust().
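The full set of supported corrections is stored in the built-in character vector p.adjust.methods:

p.adjust.methods
[1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"
[6] "BY"         "fdr"        "none"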

Consider the setting where we have p-values for the association between 10 different gene mutations and treatment response:

library(tibble)
library(gt)

ptab <-
  tibble(
    Gene = paste0("gene", 1:10),
    `p-value` = c(0.001, 0.245, 0.784, 0.034, 0.004, 0.123, 0.089, 0.063,
                  0.228, 0.802)
  )

ptab |> 
  gt() |> 
  tab_header("Table of p-values for association with treatment response")
Table of p-values for association with treatment response
Gene p-value
gene1 0.001
gene2 0.245
gene3 0.784
gene4 0.034
gene5 0.004
gene6 0.123
gene7 0.089
gene8 0.063
gene9 0.228
gene10 0.802

First, adjust these for multiple testing using the false discovery rate (FDR) approach. Pass the vector of p-values to p.adjust() and specify method = "fdr":

p.adjust(
  p = c(0.001, 0.245, 0.784, 0.034, 0.004, 0.123, 0.089, 0.063, 0.228, 0.802),
  method = "fdr"
)
 [1] 0.0100000 0.3062500 0.8020000 0.1133333 0.0200000 0.2050000 0.1780000
 [8] 0.1575000 0.3062500 0.8020000

We get back a vector of the adjusted p-values, listed in the same order as the p-values originally provided.

The false discovery rate (FDR) is the expected proportion of false positives among all tests declared significant. It is an appropriate method when a study is viewed as exploratory and significant results will be followed up in an independent study.
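For intuition, the FDR adjustment (method "fdr" is an alias for "BH", the Benjamini-Hochberg procedure) can be computed by hand: sort the p-values, multiply each by the number of tests divided by its rank, and enforce monotonicity. A short sketch, using the same ten p-values, that reproduces the p.adjust() output above:

p <- ptab$`p-value`
m <- length(p)
o <- order(p, decreasing = TRUE)       # positions, from largest p to smallest
ro <- order(o)                         # map back to the original order
pmin(1, cummin(p[o] * m / (m:1)))[ro]  # p * m / rank, made monotone
 [1] 0.0100000 0.3062500 0.8020000 0.1133333 0.0200000 0.2050000 0.1780000
 [8] 0.1575000 0.3062500 0.8020000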

Alternatively, we could adjust the p-values for multiple testing using a family-wise error rate (FWER) approach. Options include the Bonferroni correction (method = "bonferroni"), which is the most conservative (i.e., the hardest to achieve significance with), and the Holm-Bonferroni correction (method = "holm").

p.adjust(
  p = c(0.001, 0.245, 0.784, 0.034, 0.004, 0.123, 0.089, 0.063, 0.228, 0.802),
  method = "bonferroni"
)
 [1] 0.01 1.00 1.00 0.34 0.04 1.00 0.89 0.63 1.00 1.00

We see that the Bonferroni-adjusted p-values are larger than the FDR-adjusted ones. The family-wise error rate is the probability of making at least one type I error among a specified group (“family”) of tests.
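For comparison, the Holm-Bonferroni correction also controls the family-wise error rate but is uniformly less conservative than Bonferroni:

p.adjust(
  p = c(0.001, 0.245, 0.784, 0.034, 0.004, 0.123, 0.089, 0.063, 0.228, 0.802),
  method = "holm"
)
 [1] 0.010 0.912 1.000 0.272 0.036 0.615 0.534 0.441 0.912 1.000

The Holm-adjusted values are never larger than the Bonferroni-adjusted ones, so Holm rejects every hypothesis that Bonferroni rejects, and possibly more.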

After adjusting the p-values, we can compare them to the standard 0.05 level of significance.
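For example, we can flag the genes whose FDR-adjusted p-values fall below 0.05:

q <- p.adjust(ptab$`p-value`, method = "fdr")
ptab$Gene[q < 0.05]
[1] "gene1" "gene5"

Only gene1 and gene5 remain significant after the FDR adjustment.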

We can place the FDR-adjusted p-values into our table by directly applying the p.adjust() function to our column of p-values as follows:

library(dplyr)

ptab |> 
  mutate(
    `q-value` = p.adjust(`p-value`, method = "fdr")
  ) |> 
  gt() |> 
  fmt_number(columns = `q-value`, decimals = 3) |> 
  tab_header("Table of adjusted p-values for association with treatment response")
Table of adjusted p-values for association with treatment response
Gene p-value q-value
gene1 0.001 0.010
gene2 0.245 0.306
gene3 0.784 0.802
gene4 0.034 0.113
gene5 0.004 0.020
gene6 0.123 0.205
gene7 0.089 0.178
gene8 0.063 0.158
gene9 0.228 0.306
gene10 0.802 0.802

Note that “q-value” is a common term for p-values that have been adjusted for multiple comparisons, most often via the FDR approach.