`R/optimal_kmeans_d.R`

`optimal_kmeans_d.Rd`

`optimal_kmeans_d` applies k-means clustering using the `kmeans` function with many random starts. The D value is then calculated for the cluster solution at each random start using the `d` function, and the cluster solution that maximizes D is returned, along with the corresponding value of D. In this way the optimally etiologically heterogeneous subtype solution can be identified from possibly high-dimensional disease marker data.
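The search described above can be sketched in base R as a loop over single-start `kmeans` fits that keeps the best-scoring solution. This is a minimal illustration, not the package's implementation: `best_kmeans_by_score`, `score_fn`, and `toy_score` are hypothetical names, and `toy_score` is a placeholder that rewards balanced cluster sizes. It is not the D measure computed by `d`.

```r
# Sketch of a "many random starts, keep the best score" search.
# score_fn is a hypothetical stand-in for the package's `d` function.
best_kmeans_by_score <- function(x, M, score_fn, nstart = 100, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  best <- list(score = -Inf, cluster = NULL)
  for (i in seq_len(nstart)) {
    # One random start per iteration, so each start can be scored separately
    fit <- kmeans(x, centers = M, nstart = 1)
    score <- score_fn(fit$cluster)
    if (score > best$score) {
      best <- list(score = score, cluster = fit$cluster)
    }
  }
  best
}

# Toy data: two well-separated groups of 50 points in two dimensions
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)

# Placeholder score (NOT the D measure): share of points in the smallest cluster
toy_score <- function(cluster) min(table(cluster)) / length(cluster)

res <- best_kmeans_by_score(x, M = 2, score_fn = toy_score, nstart = 20, seed = 42)
table(res$cluster)
```

Scoring each start separately matters here because `kmeans` with `nstart > 1` only keeps the solution with the smallest within-cluster sum of squares, which need not be the solution with the largest value of an external criterion.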

`optimal_kmeans_d(markers, M, factors, case, data, nstart = 100, seed = NULL)`

- markers
a vector of the names of the disease markers. These markers should be of a type that is suitable for use with `kmeans` clustering. All markers will be missing for control subjects. e.g. `markers = c("marker1", "marker2")`

- M
the number of clusters to identify using `kmeans` clustering. Must be at least 2 (M >= 2).

- factors
a list of the names of the binary or continuous risk factors. For binary risk factors the lowest level will be used as the reference level. e.g. `factors = list("age", "sex", "race")`

- case
denotes the variable that contains each subject's status as a case or control. This value should be 1 for cases and 0 for controls. Argument must be supplied in quotes, e.g. `case = "status"`.

- data
the name of the data frame that contains the relevant variables.

- nstart
the number of random starts to use with `kmeans` clustering. Defaults to 100.

- seed
an integer argument passed to `set.seed`. Default is NULL. Setting a seed is recommended in order to obtain reproducible results.
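As an illustration of the data layout these arguments describe, the sketch below builds a small toy data frame and checks it before it would be passed to `optimal_kmeans_d`. All column names and values (`status`, `age`, `sex`, `marker1`, `marker2`) are made up for the example, and the checks are a suggested sanity test rather than anything the package is known to run itself. The check that cases have complete marker values is an assumption based on `kmeans` not accepting missing values.

```r
# Toy data in the layout described above; all names and values are illustrative
set.seed(2024)
n_case <- 6
n_control <- 4
toy <- data.frame(
  status  = c(rep(1, n_case), rep(0, n_control)),       # 1 = case, 0 = control
  age     = round(rnorm(n_case + n_control, mean = 55, sd = 8)),
  sex     = rbinom(n_case + n_control, size = 1, prob = 0.5),
  marker1 = c(rnorm(n_case), rep(NA, n_control)),       # markers missing for controls
  marker2 = c(rnorm(n_case), rep(NA, n_control))
)

markers <- c("marker1", "marker2")
factors <- list("age", "sex")
case <- "status"

is_control <- toy[[case]] == 0

# Case status is coded 0/1
status_ok <- all(toy[[case]] %in% c(0, 1))
# Every marker is missing for every control
controls_ok <- all(is.na(toy[is_control, markers]))
# Assumed requirement: no marker is missing for any case
cases_ok <- !anyNA(toy[!is_control, markers])

c(status_ok = status_ok, controls_ok = controls_ok, cases_ok = cases_ok)
```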

Returns a list with the following elements:

- `optimal_d`
the D value for the optimal D solution.

- `optimal_d_data`
the original data frame supplied through the `data` argument, with a column called `optimal_d_label` added for the optimal D subtype label. This has the subtype assignment for cases, and is 0 for all controls.

Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052.

```
library(riskclustr)

# Cluster 30 disease markers to identify the optimally
# etiologically heterogeneous 3-subtype solution
res <- optimal_kmeans_d(
  markers = paste0("y", 1:30),
  M = 3,
  factors = list("x1", "x2", "x3"),
  case = "case",
  data = subtype_data,
  nstart = 100,
  seed = 81110224
)
#> Warning: `data_frame()` was deprecated in tibble 1.1.0.
#> ℹ Please use `tibble()` instead.
#> ℹ The deprecated feature was likely used in the riskclustr package.
#>   Please report the issue at <https://github.com/zabore/riskclustr/issues>.

# Look at the value of D for the optimal D solution
res[["optimal_d"]]
#> [1] 0.3385876

# Look at a table of the optimal D solution
# (label 0 is controls; labels 1 to 3 are the case subtypes)
table(res[["optimal_d_data"]]$optimal_d_label)
#>
#>   0   1   2   3
#> 800 301 300 599
```