Perform G-means clustering on a data matrix.
Arguments
- x
numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
- k_init
integer(1)
initial amount of centers. Default is2L
.- k_max
integer(1)
maximum amount of centers. Default is10L
.- level
numeric(1)
significance level for the Anderson-Darling test. Default is0.05
. Seead.test()
for more information.- ...
additional arguments passed to
stats::kmeans()
.
Details
The G-means clustering algorithm is an extension of the traditional k-means
algorithm that automatically determines the number of clusters by iteratively
testing the Gaussianity of data within clusters. The process begins with a specified
initial number of clusters (k_init
) and iteratively increases the number of
clusters until it reaches the specified maximum (k_max
) or the data within
clusters is determined to be Gaussian at the specified significance level (level
).
The algorithm is outlined as follows:
Let \(C\) be the initial set of centers (usually \(C \leftarrow \{\bar{x}\}\)).
Perform k-means clustering on the dataset \(X\) using the current set of centers \(C\), i.e., \(C \leftarrow \text{kmeans}(C, X)\).
For each center \(c_j\), identify the set of data points \(\{x_i \mid \text{class}(x_i) = j\}\) that are assigned to \(c_j\).
Use the Anderson-Darling test to check if the set of data points \(\{x_i \mid \text{class}(x_i) = j\}\) follows a Gaussian distribution
If the data points appear Gaussian, keep \(c_j\). at the confidence level \(\alpha\). Otherwise, replace \(c_j\) with two new centers.
Repeat from step 2 until no more centers are added.
References
Hamerly, Greg, Elkan, Charles (2003). “Learning the k in k-means.” In Thrun S, Saul L, Schölkopf B (eds.), Advances in Neural Information Processing Systems, volume 16. https://proceedings.neurips.cc/paper_files/paper/2003/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf.