Skip to contents

Perform G-means clustering on a data matrix.

Usage

gmeans(x, k_init = 2L, k_max = 10L, level = 0.05, ...)

Arguments

x

numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).

k_init

integer(1) initial amount of centers. Default is 2L.

k_max

integer(1) maximum amount of centers. Default is 10L.

level

numeric(1) significance level for the Anderson-Darling test. Default is 0.05. See ad.test() for more information.

...

additional arguments passed to stats::kmeans().

Details

The G-means clustering algorithm is an extension of the traditional k-means algorithm that automatically determines the number of clusters by iteratively testing the Gaussianity of data within clusters. The process begins with a specified initial number of clusters (k_init) and iteratively increases the number of clusters until it reaches the specified maximum (k_max) or the data within clusters is determined to be Gaussian at the specified significance level (level).

The algorithm is outlined as follows:

  1. Let \(C\) be the initial set of centers (usually \(C \leftarrow \{\bar{x}\}\)).

  2. Perform k-means clustering on the dataset \(X\) using the current set of centers \(C\), i.e., \(C \leftarrow \text{kmeans}(C, X)\).

  3. For each center \(c_j\), identify the set of data points \(\{x_i \mid \text{class}(x_i) = j\}\) that are assigned to \(c_j\).

  4. Use the Anderson-Darling test to check if the set of data points \(\{x_i \mid \text{class}(x_i) = j\}\) follows a Gaussian distribution

  5. If the data points appear Gaussian, keep \(c_j\). at the confidence level \(\alpha\). Otherwise, replace \(c_j\) with two new centers.

  6. Repeat from step 2 until no more centers are added.

References

Hamerly, Greg, Elkan, Charles (2003). “Learning the k in k-means.” In Thrun S, Saul L, Schölkopf B (eds.), Advances in Neural Information Processing Systems, volume 16. https://proceedings.neurips.cc/paper_files/paper/2003/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf.

Examples

set.seed(123)
x <- rbind(
  matrix(rnorm(100, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)
)
colnames(x) <- c("x", "y")
cl <- gmeans(x)