`vignettes/eh_test.Rmd`

`eh_test.Rmd`

In epidemiologic studies polytomous logistic regression is commonly used in the study of etiologic heterogeneity when data are from a case-control study, and the method has good statistical properties. Although polytomous logistic regression can be implemented using available software, the additional calculations needed to perform a thorough analysis of etiologic heterogeneity are cumbersome. To facilitate use of this method we provide functions `eh_test_subtype()`

and `eh_test_marker()`

to address two key questions regarding etiologic heterogeneity:

- Do risk factor effects differ according to disease subtypes?
- Do risk factor effects differ according to individual disease markers that combine to form disease subtypes?

Whether disease subtypes are pre-specified or formed by cross-classification of individual disease markers, the resulting polytomous logistic regression model is the same. Let \(i\) index study subjects, \(i = 1, \ldots, N\), let \(m\) index disease subtypes, \(m = 0, \ldots M\), where \(m=0\) denotes control subjects, and let \(p\) index risk factors, \(p = 1, \ldots, P\). The polytomous logistic regression model is specified as

\[\Pr(Y = m | \mathbf{X}) = \frac{\exp(\mathbf{X}^T \boldsymbol{\beta}_{\boldsymbol{\cdot} m})}{\mathbf{1} + \exp(\mathbf{X}^T \boldsymbol{\beta}) \mathbf{1}}\] where \(\mathbf{X}\) is the \((P+1) \times N\) matrix of risk factor values, with the first row all ones for the intercept, and \(\boldsymbol{\beta}\) is the \((P+1) \times M\) matrix of regression coefficients. \(\boldsymbol{\beta}_{\boldsymbol{\cdot} m}\) indicates the \(m\)th column of the matrix \(\boldsymbol{\beta}\) and \(\mathbf{1}\) represents a vector of ones of length \(M\).

If disease subtypes are pre-specified, either based on clustering high-dimensional disease marker data or based on a single disease marker or combinations of disease markers, then statistical tests for etiologic heterogeneity according to each risk factor can be conducted using the `eh_test_subtype()`

function.

Estimates of the parameters of interest related to the question of whether risk factor effects differ across subtypes of disease, \(\hat{\boldsymbol{\beta}}\), and the associated estimated variance-covariance matrix, \(\widehat{cov}(\hat{\boldsymbol{\beta}})\), are obtained directly from the resulting polytomous logistic regression model. Each \(\beta_{pm}\) parameter represents the log odds ratio for a one-unit change in risk factor \(p\) for subtype \(m\) disease versus controls. Hypothesis tests for the question of whether a specific risk factor effect differs across subtypes of disease can be conducted separately for each risk factor \(p\) using a Wald test of the hypothesis

\[H_{0_{\beta_{p.}}}: \beta_{p1} = \dots = \beta_{pM}\] Using the `subtype_data`

simulated dataset, we can examine the influence of risk factors `x1`

, `x2`

, and `x3`

on the 4 pre-specified disease subtypes in variable `subtype`

using the following code:

```
library(riskclustr)
mod1 <- eh_test_subtype(
label = "subtype",
M = 4,
factors = list("x1", "x2", "x3"),
data = subtype_data)
```

See the function documentation for details of function arguments.

The resulting estimates \(\hat{\boldsymbol{\beta}}\) can be accessed with

1 | 2 | 3 | 4 | |
---|---|---|---|---|

x1 | 1.5555082 | 0.8232515 | 0.2410591 | 0.1086845 |

x2 | 0.3031594 | 0.4335048 | 0.3518870 | 0.3714092 |

x3 | 0.8000998 | 1.9909315 | 3.0115985 | 1.5594139 |

the associated standard deviation estimates \(\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}\) with

1 | 2 | 3 | 4 | |
---|---|---|---|---|

x1 | 0.0875330 | 0.0749353 | 0.0758686 | 0.0693273 |

x2 | 0.0783898 | 0.0732283 | 0.0759600 | 0.0697852 |

x3 | 0.2246070 | 0.1833106 | 0.1783101 | 0.1823138 |

and the heterogeneity p-values with

p_het | |
---|---|

x1 | 0.0000000 |

x2 | 0.4778092 |

x3 | 0.0000000 |

An overall formatted dataframe containing \(\hat{\boldsymbol{\beta}} \Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}\Big)\) and heterogeneity p-values `p_het`

to test the null hypotheses \(H_{0_{\beta_{p.}}}\) can be obtained as

1 | 2 | 3 | 4 | p_het | |
---|---|---|---|---|---|

x1 | 1.56 (0.09) | 0.82 (0.07) | 0.24 (0.08) | 0.11 (0.07) | <.001 |

x2 | 0.3 (0.08) | 0.43 (0.07) | 0.35 (0.08) | 0.37 (0.07) | 0.478 |

x3 | 0.8 (0.22) | 1.99 (0.18) | 3.01 (0.18) | 1.56 (0.18) | <.001 |

Because it is often of interest to examine associations in a case-control study on the odds ratio (OR) scale rather than the original parameter estimate scale, it is also possible to obtain a matrix containing \(OR=\exp(\hat{\boldsymbol{\beta}})\), along with 95% confidence intervals and heterogeneity p-values `p_het`

to test the null hypotheses \(H_{0_{\beta_{p.}}}\) using

1 | 2 | 3 | 4 | p_het | |
---|---|---|---|---|---|

x1 | 4.74 (3.99-5.62) | 2.28 (1.97-2.64) | 1.27 (1.1-1.48) | 1.11 (0.97-1.28) | <.001 |

x2 | 1.35 (1.16-1.58) | 1.54 (1.34-1.78) | 1.42 (1.23-1.65) | 1.45 (1.26-1.66) | 0.478 |

x3 | 2.23 (1.43-3.46) | 7.32 (5.11-10.49) | 20.32 (14.33-28.82) | 4.76 (3.33-6.8) | <.001 |

If disease subtypes are formed by cross-classifying individual binary disease markers, then statistical tests for associations between risk factors and individual disease markers can be conducted using the `eh_test_marker()`

funtion.

Let \(k\) index disease markers, \(k = 1, \ldots, K\). Here the \(M\) disease subtypes are formed by cross-classification of the \(K\) binary disease markers, so that we have \(M = 2^K\) disease subtypes.

To evaluate the independent influences of individual disease markers, it is convenient to transform the parameters in \(\boldsymbol{\beta}\) using the one-to-one linear transformation

\[\hat{\boldsymbol{\gamma}} = \frac{\hat{\boldsymbol{\beta}} \mathbf{L}}{M/2}.\] Here \(\mathbf{L}\) is an \(M \times K\) contrast matrix such that the entries are -1 if disease marker \(k\) is absent for disease subtype \(m\) and 1 if disease marker \(k\) is present for disease subtype \(m\). \(\boldsymbol{\gamma}\) is then the \((P+1) \times K\) matrix of parameters that reflect the independent effects of distinct disease markers. Each element of the \(\boldsymbol{\gamma}\) parameters represents the average of differences in log odds ratios between disease subtypes defined by different levels of the \(k\)th disease marker with respect to the \(p\)th risk factor when the other disease markers are held constant. Variance estimates corresponding to each \(\hat{\gamma}_{pk}\) are obtained using

\[\widehat{var}(\hat{\gamma}_{pk}) = \left(\frac{M}{2}\right)^{-2} \mathbf{L}_{\boldsymbol{\cdot} k}^T \widehat{cov}(\hat{\boldsymbol{\beta}}_{p \boldsymbol{\cdot}}^T) \mathbf{L}_{\boldsymbol{\cdot} k}\] where \(\mathbf{L}_{\boldsymbol{\cdot} k}\) is the \(k\)th column of the \(\mathbf{L}\) matrix and the estimated variance-covariance matrix \(\widehat{cov}(\hat{\boldsymbol{\beta}}_{p \boldsymbol{\cdot}})\) for each risk factor \(p\) is obtained directly from the polytomous logistic regression model.

Hypothesis tests for the question of whether a risk factor effect differs across levels of each individual disease marker of which the disease subtypes are comprised can be conducted separately for each combination of risk factor \(p\) and disease marker \(k\) using a Wald test of the hypothesis

\[H_{0_{{\gamma_{pk}}}}: \gamma_{pk} = 0.\] Using the `subtype_data`

simulated dataset, we can examine the influence of risk factors `x1`

, `x2`

, and `x3`

on the two individual disease markers `marker1`

and `marker2`

. These two binary disease markers will be cross-classified to form four disease subtypes that will be used as the outcome in the polytomous logistic regression model to obtain the \(\hat{\boldsymbol{\beta}}\) estimates, which are then transformed in order to obtain estimates and hypothesis tests related to the individual disease markers.

```
library(riskclustr)
mod2 <- eh_test_marker(
markers = list("marker1", "marker2"),
factors = list("x1", "x2", "x3"),
case = "case",
data = subtype_data)
```

See the function documentation for details of function arguments.

The resulting estimates \(\hat{\boldsymbol{\gamma}}\) can be accessed with

marker1 | marker2 | |
---|---|---|

x1 | -1.0145081 | -0.4323157 |

x2 | -0.0066840 | 0.0749338 |

x3 | 0.8899905 | -0.1306765 |

the associated standard deviation estimates \(\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}\) with

marker1 | marker2 | |
---|---|---|

x1 | 0.0681025 | 0.0601803 |

x2 | 0.0631465 | 0.0588423 |

x3 | 0.1450606 | 0.1348479 |

and the associated p-values with

marker1 | marker2 | |
---|---|---|

x1 | 0.0000000 | 0.0000000 |

x2 | 0.9157016 | 0.2028521 |

x3 | 0.0000000 | 0.3325126 |

An overall formatted dataframe containing the \(\hat{\boldsymbol{\gamma}} \Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}\Big)\) and p-values to test the null hypotheses \(H_{0_{\gamma_{pk}}}\) can be obtained as

marker1 est | marker1 pval | marker2 est | marker2 pval | |
---|---|---|---|---|

x1 | -1.01 (0.07) | <.001 | -0.43 (0.06) | <.001 |

x2 | -0.01 (0.06) | 0.916 | 0.07 (0.06) | 0.203 |

x3 | 0.89 (0.15) | <.001 | -0.13 (0.13) | 0.333 |

The estimates and heterogeneity p-values for disease subtypes formed by cross-classifying these individual disease markers can also be accessed in objects `beta_se_p`

and `or_ci_p`

, as described in the section on Pre-specified subtypes.