Obtain optimal D solution based on k-means clustering of disease marker data in a case-control study

optimal_kmeans_d applies k-means clustering using the kmeans function with many random starts. The D value is calculated for the cluster solution at each random start using the d function, and the cluster solution that maximizes D is returned along with its value of D. In this way the subtype solution with the greatest etiologic heterogeneity can be identified from possibly high-dimensional disease marker data.
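The search described above can be sketched in a few lines of base R. This is only an illustration of the "score every random start, keep the best" idea, not the package's actual implementation: the helper best_of_random_starts, the score_fn argument, and toy_score are all hypothetical names, and toy_score merely stands in for the D calculation so that the sketch runs without riskclustr.

```r
# Sketch only: `score_fn` is a hypothetical stand-in for the D calculation.
best_of_random_starts <- function(x, M, score_fn, nstart = 100, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  best <- list(score = -Inf, labels = NULL)
  for (i in seq_len(nstart)) {
    # One k-means fit from a single random start
    fit <- stats::kmeans(x, centers = M, nstart = 1)
    score <- score_fn(fit$cluster)
    # Keep the cluster solution with the highest score seen so far
    if (score > best$score) {
      best <- list(score = score, labels = fit$cluster)
    }
  }
  best
}

# Toy data: two well-separated groups of 20 points in two dimensions
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

# Hypothetical score that favors balanced cluster sizes (NOT the D statistic)
toy_score <- function(labels) -max(table(labels))

res <- best_of_random_starts(x, M = 2, score_fn = toy_score, nstart = 10, seed = 42)
table(res$labels)
```

In the real function the score would be D, computed from the risk factors and the case-control data rather than from the cluster labels alone.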

optimal_kmeans_d(markers, M, factors, case, data, nstart = 100, seed = NULL)

Arguments

markers: a vector of the names of the disease markers. These markers should be of a type that is suitable for use with kmeans clustering. All markers will be missing for control subjects. e.g. markers = c("marker1", "marker2")
M: the number of clusters to identify using k-means clustering. Must be at least 2 (M >= 2).
factors: a list of the names of the binary or continuous risk factors. For binary risk factors the lowest level will be used as the reference level. e.g. factors = list("age", "sex", "race")
case: denotes the variable that contains each subject's status as a case or control. This value should be 1 for cases and 0 for controls. Argument must be supplied in quotes, e.g. case = "status".
data: the name of the data frame that contains the relevant variables.
nstart: the number of random starts to use with kmeans clustering. Defaults to 100.
seed: an integer argument passed to set.seed. Default is NULL. Recommended to set in order to obtain reproducible results.
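As a rough guide to the data layout these arguments imply, the following builds a small made-up data frame. The object name toy and all of its values are hypothetical; the column names simply reuse the illustrative names from the argument descriptions. It is far too small to fit anything and is meant only to show the expected structure: a 0/1 case indicator, risk factor columns for every subject, and marker columns that are NA for controls.

```r
# Hypothetical toy data illustrating the expected layout (not real data)
set.seed(2024)
n_case <- 6
n_ctrl <- 4
n <- n_case + n_ctrl

toy <- data.frame(
  status  = c(rep(1, n_case), rep(0, n_ctrl)),   # 1 = case, 0 = control
  age     = round(rnorm(n, mean = 55, sd = 8)),  # continuous risk factor
  sex     = rbinom(n, size = 1, prob = 0.5),     # binary risk factor
  marker1 = c(rnorm(n_case), rep(NA, n_ctrl)),   # markers missing for controls
  marker2 = c(rnorm(n_case), rep(NA, n_ctrl))
)

markers <- c("marker1", "marker2")  # character vector of marker names
factors <- list("age", "sex")       # list of risk factor names

# Basic checks on the layout described in the arguments
stopifnot(all(toy$status %in% c(0, 1)))
stopifnot(all(is.na(toy[toy$status == 0, markers])))
stopifnot(!any(is.na(toy[toy$status == 1, markers])))
```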

Value

Returns a list:

optimal_d: The D value for the optimal D solution

optimal_d_data: The original data frame supplied through the data argument, with a column called optimal_d_label added for the optimal D subtype label. This has the subtype assignment for cases, and is 0 for all controls.

References

Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052.

Examples

# \donttest{
# Cluster 30 disease markers to identify the optimally
# etiologically heterogeneous 3-subtype solution
res <- optimal_kmeans_d(
  markers = paste0("y", 1:30),
  M = 3,
  factors = list("x1", "x2", "x3"),
  case = "case",
  data = subtype_data,
  nstart = 100,
  seed = 81110224
)
#> Warning: `data_frame()` was deprecated in tibble 1.1.0.
#> ℹ Please use `tibble()` instead.
#> ℹ The deprecated feature was likely used in the riskclustr package.
#>   Please report the issue at <https://github.com/zabore/riskclustr/issues>.

# Look at the value of D for the optimal D solution
res[["optimal_d"]]
#> [1] 0.3385876

# Look at a table of the optimal D solution
table(res[["optimal_d_data"]]$optimal_d_label)
#> 
#>   0   1   2   3 
#> 800 301 300 599 
# }