optimal_kmeans_d applies k-means clustering using the kmeans function with many random starts. The D value is then calculated for the cluster solution at each random start using the d function, and the cluster solution that maximizes D is returned, along with the corresponding value of D. In this way the optimally etiologically heterogeneous subtype solution can be identified from possibly high-dimensional disease marker data.
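The search described above can be sketched in base R. This is only an illustration of the "many random starts, keep the best" idea, not the package's implementation: `best_of_starts`, `score_fun`, and `toy_score` are hypothetical names, and `toy_score` is a simple stand-in for the real D calculation performed by the d function.

```r
# Sketch: run kmeans from several random starts and keep the solution
# with the highest score. `score_fun` is a hypothetical stand-in for D.
best_of_starts <- function(x, M, nstart, score_fun, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  best <- list(score = -Inf, cluster = NULL)
  for (i in seq_len(nstart)) {
    fit <- kmeans(x, centers = M, nstart = 1)  # one random start per iteration
    s <- score_fun(fit$cluster)
    if (s > best$score) best <- list(score = s, cluster = fit$cluster)
  }
  best
}

# Toy data: 60 rows and 4 markers, plus a binary factor used by the score.
set.seed(1)
x <- matrix(rnorm(240), ncol = 4)
g <- rep(c(0, 1), each = 30)

# Stand-in score: spread of the mean of `g` across clusters (NOT the real D).
toy_score <- function(cl) var(tapply(g, cl, mean))

res <- best_of_starts(x, M = 3, nstart = 20, score_fun = toy_score, seed = 42)
res$score
table(res$cluster)
```

Selecting on the score rather than on the within-cluster sum of squares is the point of the approach: the usual kmeans criterion is used only to generate candidate solutions.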

optimal_kmeans_d(markers, M, factors, case, data, nstart = 100, seed = NULL)

Arguments

markers

a vector of the names of the disease markers, e.g. markers = c("marker1", "marker2"). These markers should be of a type that is suitable for use with kmeans clustering. All markers will be missing for control subjects.

M

the number of clusters to identify using kmeans clustering. Must be at least 2 (M >= 2).

factors

a list of the names of the binary or continuous risk factors. For binary risk factors the lowest level will be used as the reference level. e.g. factors = list("age", "sex", "race")

case

denotes the variable that contains each subject's status as a case or control. This value should be 1 for cases and 0 for controls. Argument must be supplied in quotes, e.g. case = "status".

data

the name of the dataframe that contains the relevant variables.
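As a rough illustration of the data layout the arguments above describe, the following builds a small toy data frame. All variable names (`toy`, `status`, `age`, `sex`, `marker1`, `marker2`) are made up for this sketch, and the checks are a suggestion for sanity-testing input rather than anything the function is known to enforce.

```r
# Toy data frame in the layout described by the arguments (names are illustrative).
set.seed(10)
n <- 8
toy <- data.frame(
  status  = rep(c(1, 0), each = n / 2),  # 1 = case, 0 = control
  age     = round(rnorm(n, 50, 10)),     # continuous risk factor
  sex     = rbinom(n, 1, 0.5),           # binary risk factor
  marker1 = rnorm(n),
  marker2 = rnorm(n)
)

# Markers are only observed for cases, so set them to NA for controls.
toy[toy$status == 0, c("marker1", "marker2")] <- NA

markers <- c("marker1", "marker2")  # a vector of marker names
factors <- list("age", "sex")       # a list of risk factor names

# Basic layout checks before passing the pieces to the function.
stopifnot(
  all(toy$status %in% c(0, 1)),
  all(is.na(toy[toy$status == 0, markers])),
  all(c(markers, unlist(factors)) %in% names(toy))
)
```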

nstart

the number of random starts to use with kmeans clustering. Defaults to 100.

seed

an integer argument passed to set.seed. Default is NULL. Recommended to set in order to obtain reproducible results.
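The effect of fixing the seed can be seen with kmeans alone. The sketch below uses simulated data and base R only; it shows the general behaviour of set.seed with kmeans and does not call optimal_kmeans_d itself.

```r
# Simulated data: 100 observations on 2 variables.
set.seed(2)
x <- matrix(rnorm(200), ncol = 2)

# Resetting the seed before each call gives the same random starts.
set.seed(123)
fit1 <- kmeans(x, centers = 3, nstart = 5)
set.seed(123)
fit2 <- kmeans(x, centers = 3, nstart = 5)

identical(fit1$cluster, fit2$cluster)  # TRUE: same seed, same solution
```

Without a fixed seed, repeated runs may settle on different cluster solutions, and so possibly on a different optimal D.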

Value

Returns a list with two elements:

optimal_d The D value for the optimal D solution.

optimal_d_data The original data frame supplied through the data argument, with an added column called optimal_d_label that holds the optimal D subtype label. This column has the subtype assignment for cases, and is 0 for all controls.
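The labelling convention for the returned column (subtype number for cases, 0 for controls) can be reproduced on toy data as follows. This is a sketch of one way to arrive at such a column, assuming clustering is run on cases only; the names `dat`, `m1`, `m2`, and `label` are hypothetical, and the package's internal steps may differ.

```r
# Toy data: 20 cases with observed markers, 20 controls with missing markers.
set.seed(3)
dat <- data.frame(
  case = rep(c(1, 0), each = 20),
  m1   = c(rnorm(20), rep(NA, 20)),
  m2   = c(rnorm(20), rep(NA, 20))
)

# Cluster the cases only, since controls have no marker data.
is_case <- dat$case == 1
fit <- kmeans(dat[is_case, c("m1", "m2")], centers = 2, nstart = 10)

dat$label <- 0                     # controls keep label 0
dat$label[is_case] <- fit$cluster  # cases get their cluster number, 1 to M

# Cross-tabulate case status against the label.
table(dat$case, dat$label)
```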

References

Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052.

Examples

# \donttest{
# Cluster 30 disease markers to identify the optimally
# etiologically heterogeneous 3-subtype solution
res <- optimal_kmeans_d(
  markers = paste0("y", 1:30),
  M = 3,
  factors = list("x1", "x2", "x3"),
  case = "case",
  data = subtype_data,
  nstart = 100,
  seed = 81110224
)
#> Warning: `data_frame()` was deprecated in tibble 1.1.0.
#>  Please use `tibble()` instead.
#>  The deprecated feature was likely used in the riskclustr package.
#>   Please report the issue at <https://github.com/zabore/riskclustr/issues>.

# Look at the value of D for the optimal D solution
res[["optimal_d"]]
#> [1] 0.3385876

# Look at a table of the optimal D solution
table(res[["optimal_d_data"]]$optimal_d_label)
#> 
#>   0   1   2   3 
#> 800 301 300 599 
# }