R/optimal_kmeans_d.R
optimal_kmeans_d applies k-means clustering using the
kmeans function with many random starts. The D value is
then calculated for the cluster solution at each random start using the
d function, and the cluster solution that maximizes D is returned,
along with the corresponding value of D. In this way the optimally
etiologically heterogeneous subtype solution can be identified from possibly
high-dimensional disease marker data.
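The selection loop described above can be sketched in base R. This is a minimal illustration, not the package's implementation: best_of_random_starts and score_fn are hypothetical names, and the toy score (the spread of a covariate's mean across clusters) only stands in for the D calculation performed by the d function.

```r
# Sketch: run k-means from many single random starts and keep the
# solution with the highest score. `score_fn` is a hypothetical stand-in
# for the D calculation; it is NOT the package's d() function.
best_of_random_starts <- function(x, M, nstart, score_fn, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  best <- list(score = -Inf, cluster = NULL)
  for (i in seq_len(nstart)) {
    fit <- kmeans(x, centers = M, nstart = 1)  # one random start per fit
    s <- score_fn(fit$cluster)
    if (s > best$score) {
      best <- list(score = s, cluster = fit$cluster)
    }
  }
  best
}

# Toy data (assumed for illustration): 100 "cases" with 2 markers
# and one covariate z.
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)
z <- rnorm(nrow(x))

# Toy score: variance of the cluster-specific means of z.
toy_score <- function(cluster) as.numeric(var(tapply(z, cluster, mean)))

res <- best_of_random_starts(x, M = 3, nstart = 20,
                             score_fn = toy_score, seed = 42)
res$score
table(res$cluster)
```

Fitting with a single start inside the loop, rather than passing nstart to kmeans directly, matters here: kmeans with nstart > 1 keeps only the solution with the smallest within-cluster sum of squares, whereas this procedure needs to score every solution by a different criterion.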
optimal_kmeans_d(markers, M, factors, case, data, nstart = 100, seed = NULL)
markers A vector of the names of the disease markers. These markers
should be of a type that is suitable for use with
kmeans clustering. All markers will be missing
for control subjects. e.g. markers = c("marker1", "marker2")
M The number of clusters to identify using
kmeans clustering. Must be at least 2.
factors A list of the names of the binary or continuous risk factors.
For binary risk factors the lowest level will be used as the reference level.
e.g. factors = list("age", "sex", "race")
case The name of the variable that contains each subject's status as a
case or control. This value should be 1 for cases and 0 for controls.
Argument must be supplied in quotes, e.g. case = "status".
data The name of the data frame that contains the relevant variables.
nstart The number of random starts to use with
kmeans clustering. Defaults to 100.
seed An integer argument passed to set.seed.
Default is NULL. Recommended to set in order to obtain reproducible results.
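To make the input conventions above concrete, the following base R sketch builds a small data frame in the expected layout and checks it. The data frame toy, its column names, and the checks themselves are assumptions made for illustration; they are not part of the package.

```r
# Hypothetical toy data: 6 cases followed by 4 controls.
set.seed(2024)
n_case <- 6
n_ctrl <- 4
n <- n_case + n_ctrl
toy <- data.frame(
  status  = c(rep(1, n_case), rep(0, n_ctrl)),   # 1 = case, 0 = control
  age     = round(rnorm(n, mean = 55, sd = 8)),  # continuous risk factor
  sex     = rbinom(n, size = 1, prob = 0.5),     # binary risk factor
  marker1 = c(rnorm(n_case), rep(NA, n_ctrl)),   # markers are NA for controls
  marker2 = c(rnorm(n_case), rep(NA, n_ctrl))
)

# Arguments in the form described above.
markers <- c("marker1", "marker2")  # character vector of marker names
factors <- list("age", "sex")       # list of risk factor names
case    <- "status"                 # quoted name of the case/control column
M       <- 2                        # number of clusters

# Illustrative sanity checks (kmeans itself does not accept missing
# values, so markers should be complete among cases).
is_case <- toy[[case]] == 1
checks <- c(
  case_coded_0_1           = all(toy[[case]] %in% c(0, 1)),
  controls_missing_markers = all(is.na(toy[!is_case, markers])),
  cases_have_markers       = !any(is.na(toy[is_case, markers])),
  columns_present          = all(c(markers, unlist(factors), case) %in% names(toy)),
  M_at_least_2             = M >= 2
)
checks
```

With the riskclustr package loaded, these objects would be passed as optimal_kmeans_d(markers = markers, M = M, factors = factors, case = case, data = toy). A data set this small is only meant to show the layout, not to give a meaningful fit.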
Returns a list with the following elements:
optimal_d The D value for the optimal D solution
optimal_d_data The original data frame supplied through the
data argument, with a column called optimal_d_label
added for the optimal D subtype label. This has the subtype assignment for cases, and is 0 for all controls.
Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052.
# \donttest{
# Cluster 30 disease markers to identify the optimally
# etiologically heterogeneous 3-subtype solution
res <- optimal_kmeans_d(
  markers = paste0("y", 1:30),
  M = 3,
  factors = list("x1", "x2", "x3"),
  case = "case",
  data = subtype_data,
  nstart = 100,
  seed = 81110224
)
#> Warning: `data_frame()` was deprecated in tibble 1.1.0.
#> ℹ Please use `tibble()` instead.
#> ℹ The deprecated feature was likely used in the riskclustr package.
#> Please report the issue at <https://github.com/zabore/riskclustr/issues>.
# Look at the value of D for the optimal D solution
res[["optimal_d"]]
#> [1] 0.3385876
# Look at a table of the optimal D solution
table(res[["optimal_d_data"]]$optimal_d_label)
#>
#>   0   1   2   3
#> 800 301 300 599
# }