
Clustering Large Datasets by Merging K-Means Solutions


Abstract

Existing clustering methods range from simple but very restrictive to complex but more flexible. The K-means algorithm is one of the most popular clustering procedures due to its computational speed and intuitive construction. Unfortunately, the application of K-means in its traditional form based on Euclidean distances is limited to cases with spherical clusters of approximately the same volume and spread of points. Recent developments in the area of merging mixture components for clustering show good promise. We propose a general framework for hierarchical merging based on pairwise overlap between components which can be readily applied in the context of the K-means algorithm to produce meaningful clusters. Such an approach preserves the main advantage of the K-means algorithm—its speed. The developed ideas are illustrated on examples, studied through simulations, and applied to the problem of digit recognition.



Author information

Corresponding author

Correspondence to Semhar Michael.


Appendix A

A.1 HoEC-K-means: K-means with Homoscedastic Elliptical Components

This form of the K-means algorithm is based on Mahalanobis distances and assigns \(\boldsymbol{y}_i\) to the \(k\)th component if the inequality \(\sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_k)} < \sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})}\) holds for all \(k' = 1, 2, \ldots, k-1, k+1, \ldots, K\). This inequality can be obtained from Eq. 10 by letting \(\boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_K = \boldsymbol{\Sigma}\) and \(\tau_1 = \cdots = \tau_K = 1/K\). Thus, HoEC-K-means can be seen as model-based clustering through the CEM algorithm for Gaussian mixtures with homoscedastic elliptical components and equal mixing proportions. Under these restrictions, the \(\tilde Q\)-function is given by

$$ \tilde Q(\boldsymbol{\Psi}|\boldsymbol{\Psi}^{(b-1)}) = \sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log\frac1K + \log \phi_{p}(\boldsymbol{y}_{i}| \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma})\right], $$

where the parameter vector is given by \(\boldsymbol{\Psi} = (\boldsymbol{\mu}_1^T, \boldsymbol{\mu}_2^T, \ldots, \boldsymbol{\mu}_K^T, \text{vech}(\boldsymbol{\Sigma})^T)^T\) with \(\text{vech}(\boldsymbol{\Sigma})\) being the vector of unique elements of \(\boldsymbol{\Sigma}\), namely, \(\text{vech}(\boldsymbol{\Sigma})^T = (\sigma_{11}, \ldots, \sigma_{p1}, \sigma_{22}, \ldots, \sigma_{p2}, \ldots, \sigma_{pp})\). Thus, the total number of parameters is \(M = Kp + p(p+1)/2\). As before, the \(k\)th mean vector is estimated at the M-step according to Eq. 1. The common variance-covariance matrix \(\boldsymbol{\Sigma}\) is estimated by

$$ \boldsymbol{\Sigma}^{(b)} = \frac{1}{n}\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})^{T}. $$
(13)

Incorporating the above-listed restrictions in Eq. 8, the overlap can be calculated by

$$ \omega_{kk^{\prime}} = 2{\Phi}\left( -\frac12 \sqrt{(\boldsymbol{\mu}_{k} - \boldsymbol{\mu}_{k^{\prime}})^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{k} - \boldsymbol{\mu}_{k^{\prime}})}\right). $$
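To make the HoEC recipe concrete, the sketch below performs one CEM iteration of HoEC-K-means and evaluates the pairwise overlap formula above. It is a minimal NumPy/SciPy illustration under the stated restrictions, not the authors' implementation; the function names are ours, and empty clusters are not handled.

```python
import numpy as np
from scipy.stats import norm

def hoec_cem_step(Y, mu, Sigma):
    """One CEM iteration of HoEC-K-means: Mahalanobis-distance assignment (C-step),
    then mean updates (Eq. 1) and the pooled covariance update (Eq. 13).
    (Empty clusters are not handled in this sketch.)"""
    n, p = Y.shape
    K = mu.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    diffs = Y[:, None, :] - mu[None, :, :]                      # n x K x p array of y_i - mu_k
    d2 = np.einsum('nkp,pq,nkq->nk', diffs, Sigma_inv, diffs)   # squared Mahalanobis distances
    z = d2.argmin(axis=1)                                       # hard assignments z_i
    mu_new = np.vstack([Y[z == k].mean(axis=0) for k in range(K)])
    resid = Y - mu_new[z]
    Sigma_new = resid.T @ resid / n                             # Eq. 13
    return z, mu_new, Sigma_new

def hoec_overlap(mu_k, mu_kp, Sigma):
    """Pairwise overlap for homoscedastic elliptical components with equal weights:
    2 * Phi(-d/2), where d is the Mahalanobis distance between the two means."""
    d = np.sqrt((mu_k - mu_kp) @ np.linalg.solve(Sigma, mu_k - mu_kp))
    return 2.0 * norm.cdf(-0.5 * d)
```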

A.2 HeSC-K-means: K-means with Heteroscedastic Spherical Components

Under this form of K-means, \(\boldsymbol{y}_i\) is assigned to the \(k\)th component if the inequality \(\sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_k)^T(\boldsymbol{y}_i - \boldsymbol{\mu}_k)/\sigma_k^2} < \sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})^T(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})/\sigma_{k'}^2}\) is satisfied for all \(k' = 1, 2, \ldots, k-1, k+1, \ldots, K\). This inequality is equivalent to decision rule (10) under the restrictions \(\boldsymbol{\Sigma}_1 = \sigma_1^2\boldsymbol{I}, \ldots, \boldsymbol{\Sigma}_K = \sigma_K^2\boldsymbol{I}\) and \(\tau_1 = \sigma_1^p / \sum_{h=1}^K \sigma_h^p, \ldots, \tau_K = \sigma_K^p / \sum_{h=1}^K \sigma_h^p\). Thus, HeSC-K-means is equivalent to model-based clustering based on the CEM algorithm for Gaussian mixtures with heteroscedastic spherical components and the mixing proportions given above. The \(\tilde Q\)-function of the CEM algorithm in this setting is given by

$$ \tilde Q(\boldsymbol{\Psi}|\boldsymbol{\Psi}^{(b-1)}) = \sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log\frac{{\sigma_{k}^{p}}}{{\sum}_{h = 1}^{K} {\sigma_{h}^{p}}} + \log \phi_{p}(\boldsymbol{y}_{i}| \boldsymbol{\mu}_{k}, {\sigma_{k}^{2}})\right], $$

where \(\boldsymbol {\Psi } = ({\boldsymbol {\mu }_{1}^{T}}, {\boldsymbol {\mu }_{2}^{T}}, \ldots , {\boldsymbol {\mu }_{K}^{T}}, {\sigma _{1}^{2}}, {\sigma _{2}^{2}}, \ldots , {\sigma _{K}^{2}})^{T}\) is the parameter vector with M = K(p + 1) parameters. Mean vectors are estimated by Eq. 1 and variance parameters are calculated by the expression

$$ ({\sigma^{2}_{k}})^{(b)} = \frac{{\sum}_{i = 1}^{n} I(z_{i}^{(b)} = k)(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})^{T}(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})}{pn\tau_{k}^{(b)}}, $$
(14)

where mixing proportions are estimated by \(\tau _{k}^{(b)} = ({\sigma _{k}^{p}})^{(b-1)} / {\sum }_{h = 1}^{K} ({\sigma _{h}^{p}})^{(b-1)}\). Incorporating the above-mentioned restrictions into result (7) leads to calculating the overlap value by

$$ \omega_{kk^{\prime}} = \text{Pr}\left[\chi^{2}_{p,\nu {\sigma_{k}^{2}}} \le \nu\sigma_{k^{\prime}}^{2}\right] + \text{Pr}\left[\chi^{2}_{p,\nu \sigma_{k^{\prime}}^{2}} \le {\nu\sigma_{k}^{2}}\right], $$

where \(\nu = (\boldsymbol{\mu}_k - \boldsymbol{\mu}_{k'})^T(\boldsymbol{\mu}_k - \boldsymbol{\mu}_{k'}) / (\sigma_k^2 - \sigma_{k'}^2)^2\), and \(\chi^2_{p,\nu\sigma_k^2}\) and \(\chi^2_{p,\nu\sigma_{k'}^2}\) are noncentral \(\chi^2\) random variables with \(p\) degrees of freedom and noncentrality parameters \(\nu\sigma_k^2\) and \(\nu\sigma_{k'}^2\), respectively.
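The overlap above can be evaluated directly from the noncentral \(\chi^2\) distribution function. The sketch below is a small SciPy illustration with a function name of our choosing; it assumes \(\sigma_k^2 \ne \sigma_{k'}^2\), since \(\nu\) is undefined otherwise.

```python
import numpy as np
from scipy.stats import ncx2

def hesc_overlap(mu_k, mu_kp, s2_k, s2_kp, p):
    """Overlap of two heteroscedastic spherical components via the formula above;
    nu = ||mu_k - mu_k'||^2 / (sigma_k^2 - sigma_k'^2)^2 (assumes unequal variances)."""
    nu = np.sum((mu_k - mu_kp) ** 2) / (s2_k - s2_kp) ** 2
    # sum of the two misclassification probabilities, each a noncentral chi-square probability
    return (ncx2.cdf(nu * s2_kp, df=p, nc=nu * s2_k)
            + ncx2.cdf(nu * s2_k, df=p, nc=nu * s2_kp))
```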

A.3 HeEC-K-means: K-means with Heteroscedastic Elliptical Components

This final variation of K-means assigns \(\boldsymbol{y}_i\) to the \(k\)th component if the inequality \(\sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}_k^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_k)} < \sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})^T\boldsymbol{\Sigma}_{k'}^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})}\) holds for all \(k' = 1, 2, \ldots, k-1, k+1, \ldots, K\). This inequality can be obtained from the decision rule given in Eq. 10 by imposing the restrictions \(\tau_1 = |\boldsymbol{\Sigma}_1|^{1/2} / \sum_{h=1}^K |\boldsymbol{\Sigma}_h|^{1/2}, \ldots, \tau_K = |\boldsymbol{\Sigma}_K|^{1/2} / \sum_{h=1}^K |\boldsymbol{\Sigma}_h|^{1/2}\). Thus, HeEC-K-means can be seen as model-based clustering relying on the CEM algorithm for Gaussian mixtures with heteroscedastic elliptical components and mixing weights proportional to the square root of the covariance matrix determinant. The \(\tilde Q\)-function of the CEM algorithm is given by

$$ \tilde Q(\boldsymbol{\Psi}|\boldsymbol{\Psi}^{(b-1)}) = \sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log\frac{|\boldsymbol{\Sigma}_{k}|^{\frac{1}{2}}}{{\sum}_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}} + \log \phi_{p}(\boldsymbol{y}_{i}| \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})\right], $$
(15)

where \(\boldsymbol {\Psi } = ({\boldsymbol {\mu }_{1}^{T}}, {\boldsymbol {\mu }_{2}^{T}}, \ldots , {\boldsymbol {\mu }_{K}^{T}}, \text {vech}(\boldsymbol {\Sigma }_{1})^{T}, \text {vech}(\boldsymbol {\Sigma }_{2})^{T}, \ldots , \text {vech}(\boldsymbol {\Sigma }_{K})^{T})^{T}\). The total number of parameters is M = Kp + Kp(p + 1)/2. It can be shown that the M-step involves updating mean vectors by Eq. 1 and covariance matrices by

$$ \boldsymbol{\Sigma}_{k}^{(b)} = \frac{{\sum}_{i = 1}^{n} I(z_{i}^{(b)} = k)(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})^{T}}{n\tau_{k}^{(b)}}, $$
(16)

where \(\tau_k^{(b)}\) represents the current mixing proportion estimated by \(\tau_k^{(b)} = |\boldsymbol{\Sigma}_k^{(b-1)}|^{1/2} / \sum_{h=1}^K |\boldsymbol{\Sigma}_h^{(b-1)}|^{1/2}\). Although we do not provide calculations for estimates (11), (13), and (14), the derivation of Eq. 16 is more delicate and is provided below. To derive expression (16), the \(\tilde Q\)-function (15) has to be maximized with respect to \(\boldsymbol{\Sigma}_k\). From matrix differential calculus, it is known that \(\frac{\partial |\boldsymbol{\Xi}|}{\partial \boldsymbol{\Xi}} = |\boldsymbol{\Xi}|(\boldsymbol{\Xi}^{-1})^T\) and \(\frac{\partial \boldsymbol{A}\boldsymbol{\Xi}^{-1}\boldsymbol{B}}{\partial \boldsymbol{\Xi}} = -(\boldsymbol{\Xi}^{-1}\boldsymbol{B}\boldsymbol{A}\boldsymbol{\Xi}^{-1})^T\), where \(\boldsymbol{A}\) and \(\boldsymbol{B}\) are matrices of constants and \(\boldsymbol{\Xi}\) is an unstructured nonsingular matrix. These two results are used in the derivation below.

$$ \begin{array}{ll} \frac{\partial \tilde Q(\boldsymbol{\Psi}|\boldsymbol{\Psi}^{(b-1)})}{\partial \boldsymbol{\Sigma}_{k}} & = \frac{\partial}{\partial \boldsymbol{\Sigma}_{k}}\left\{\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log\frac{|\boldsymbol{\Sigma}_{k}|^{\frac{1}{2}}}{{\sum}_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}} + \log \phi_{p}(\boldsymbol{y}_{i}| \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})\right]\right\}\\ & = -\frac{\partial}{\partial \boldsymbol{\Sigma}_{k}}\left\{\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log{\sum\limits_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}} + \frac{1}{2} (\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})^{T} \boldsymbol{\Sigma}_{k}^{-1} (\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})\right]\right\}\\ & = -\sum\limits_{i = 1}^{n} \left[\frac{1}{{\sum}_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}}\frac{\partial |\boldsymbol{\Sigma}_{k}|^{\frac{1}{2}}}{\partial \boldsymbol{\Sigma}_{k}} - \frac12I(z_{i}^{(b)} = k)\boldsymbol{\Sigma}_{k}^{-1}(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})^{T}\boldsymbol{\Sigma}_{k}^{-1}\right]\\ & = \frac12\left[-\frac{n|\boldsymbol{\Sigma}_{k}|^{\frac12}\boldsymbol{\Sigma}_{k}^{-1}}{{\sum}_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}} + \sum\limits_{i = 1}^{n} I(z_{i}^{(b)} = k)\boldsymbol{\Sigma}_{k}^{-1}(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})^{T}\boldsymbol{\Sigma}_{k}^{-1}\right]. \end{array} $$

Setting this derivative equal to zero and letting \(\tau _{k} = |\boldsymbol {\Sigma }_{k}|^{\frac 12} / {\sum }_{h = 1}^{K}|\boldsymbol {\Sigma }_{h}|^{\frac {1}{2}}\) yields result (16).
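A brief sketch of the corresponding M-step, reflecting Eq. 16 and the restriction on the mixing proportions, is given below. This is a NumPy illustration with names of our choosing, not the authors' code; mean updates follow Eq. 1 and empty clusters are not handled.

```python
import numpy as np

def heec_m_step(Y, z, Sigma_prev):
    """HeEC-K-means M-step: means by Eq. 1, covariances by Eq. 16,
    with tau_k proportional to |Sigma_k^{(b-1)}|^{1/2}."""
    n, p = Y.shape
    K = Sigma_prev.shape[0]
    dets = np.sqrt(np.abs(np.linalg.det(Sigma_prev)))   # |Sigma_k|^{1/2} for each k
    tau = dets / dets.sum()
    mu = np.vstack([Y[z == k].mean(axis=0) for k in range(K)])
    Sigma = np.empty_like(Sigma_prev)
    for k in range(K):
        R = Y[z == k] - mu[k]
        Sigma[k] = R.T @ R / (n * tau[k])                # Eq. 16: divisor n * tau_k, not n_k
    return mu, Sigma, tau
```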

Under considered restrictions, the overlap is calculated as the sum of the misclassification probabilities \(\omega _{kk^{\prime }} = \omega _{k^{\prime } | k} + \omega _{k | k^{\prime }}\), where \(\omega _{k^{\prime } | k}\) is given by

$$ \omega_{k^{\prime}|k} = \text{Pr}\left[\sum\limits_{j = 1 \atop j: \lambda_{j} \ne 1}^{p} (\lambda_{j} - 1)U_{j} + 2\sum\limits_{j = 1 \atop j: \lambda_{j} = 1}^{p} \delta_{j}W_{j} \le \sum\limits_{j = 1 \atop j: \lambda_{j} \ne 1}^{p} \frac{\lambda_{j}{\delta_{j}^{2}}}{\lambda_{j} - 1} - \sum\limits_{j = 1 \atop j: \lambda_{j} = 1}^{p} {\delta_{j}^{2}}\right] $$

with parameters as described in Eq. 6 and \(\omega _{k | k^{\prime }}\) defined similarly.

This version of the K-means algorithm closely resembles the CEM algorithm for the general form of Gaussian mixture models but with mixing proportions of a special form. Recall that the generalized variance of \(\boldsymbol{\Sigma}_k\) is given by \(|\boldsymbol{\Sigma}_k| = \prod_{j=1}^p \lambda_{kj}\), where \(\lambda_{k1}, \lambda_{k2}, \ldots, \lambda_{kp}\) are the eigenvalues of \(\boldsymbol{\Sigma}_k\). The length of the \(j\)th axis of the hyperellipsoid corresponding to \(\boldsymbol{\Sigma}_k\) is proportional to \(\sqrt{\lambda_{kj}}\), and the volume of the hyperellipsoid \(V_{\boldsymbol{\Sigma}_k}\) is proportional to \(\prod_{j=1}^p \sqrt{\lambda_{kj}}\). As a result, we conclude that \(\tau_k = V_{\boldsymbol{\Sigma}_k} / \sum_{h=1}^K V_{\boldsymbol{\Sigma}_h}\), i.e., the restriction imposed on the mixing proportions requires that each represent the proportion of the overall volume associated with the \(k\)th component. Alternatively, consider the expected size of the \(k\)th cluster defined as \(n_k = \sum_{i=1}^n \pi_{ik}\). The log-term in Eq. 10 is equal to zero when

$$ \frac{\tau_{k}|\boldsymbol{\Sigma}_{k^{\prime}}|^{\frac{1}{2}}}{\tau_{k^{\prime}}|\boldsymbol{\Sigma}_{k}|^{\frac{1}{2}}} = \frac{n_{k} V_{\boldsymbol{\Sigma}_{k^{\prime}}}}{n_{k^{\prime}} V_{\boldsymbol{\Sigma}_{k}}} = \frac{D_{k}}{D_{k^{\prime}}}, $$

where \(D_k = n_k / V_{\boldsymbol{\Sigma}_k}\) can be seen as the “density” of the \(k\)th cluster. As a result, we conclude that the HeEC assumption is appropriate when clusters have similar “densities” of data points. Intuitively, such a restriction is often reasonable. At the same time, traditional model-based clustering with unrestricted mixing proportions allows for considerably more modeling flexibility.
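This “density” diagnostic is straightforward to compute; the short helper below (our naming) takes cluster sizes and covariance estimates and returns \(D_k\) with the volume taken proportional to \(|\boldsymbol{\Sigma}_k|^{1/2}\). Roughly comparable values of \(D_k\) across clusters suggest that the HeEC assumption is tenable.

```python
import numpy as np

def cluster_densities(counts, Sigmas):
    """D_k = n_k / V_{Sigma_k}, with V_{Sigma_k} taken proportional to |Sigma_k|^{1/2}."""
    vols = np.sqrt(np.abs(np.linalg.det(np.asarray(Sigmas))))   # one volume proxy per cluster
    return np.asarray(counts, dtype=float) / vols
```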

A.4 An Illustration of the Robustness of the Proposed Procedure to the Choice of K

In this example, we repeat the experiment outlined in Section 4.2.3 in a more challenging setting. A mixture model with three ten-dimensional skew-normal components is employed. Two of the components overlap very substantially as they share the same mean parameter. In such a setting, the choice of linkage predetermines the obtained partitioning. If single linkage is chosen, one cannot expect the clusters associated with the three mixture components to be detected due to the considerable overlap between the two components with a common mean. If Ward’s linkage is selected, approximately elliptical clusters will be produced and the problem of substantial overlap between components is alleviated. Regardless of the linkage chosen, our main question is whether the proposed procedure is sensitive to the choice of the number of K-means components K. One hundred datasets are simulated for each sample size \(n = 10^3, 10^4, 10^5, 10^6\). Figure 7 provides the median adjusted Rand index obtained as a result of merging K-means solutions for different numbers of components K. The first plot shows the performance of the algorithm with single linkage; the second corresponds to Ward’s linkage. As we can see, single linkage produces almost identical results for K ≥ 7 across all considered sample sizes. Thus, the results are very robust to the choice of the initial number of K-means components. As expected, the situation is somewhat more challenging when Ward’s linkage is used. In this case, hierarchical clustering targets approximately elliptical clusters rather than well-separated ones, and we notice some sample size effect: the results are slightly worse for \(n = 1000\). With regard to the robustness of the procedure to the choice of the number of K-means components, roughly the same result is produced beyond K = 25 for \(n = 1000\) and beyond K = 30 for larger sample sizes. It is worth mentioning that the considered situation is rather challenging as two ten-dimensional clusters share a common mean. Despite that, the achieved value of the adjusted Rand index is about 0.84–0.85 and the corresponding classification agreement is around 94%. Due to the high sensitivity of the adjusted Rand index, we also observe slight variation for K ≥ 30 when \(n = 10^4, 10^5, 10^6\). However, this variation is so minor that the maximum difference in the proportion of classification agreement does not exceed 0.003 for any K ≥ 30.

Fig. 7 Correspondence between the median adjusted Rand index values obtained as a result of merging K-means solutions with different numbers of components for various sample sizes
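For readers who wish to reproduce the linkage comparison, the sketch below merges K-means components hierarchically from a \(K \times K\) pairwise-overlap matrix using SciPy. Converting the overlaps into the dissimilarity \(1 - \omega_{kk'}\) is our simplifying assumption, not necessarily the transformation used in the paper, and the function name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def merge_components(overlap, n_clusters, method="single"):
    """Merge K-means components by hierarchical clustering of a K x K
    pairwise-overlap matrix; method='single' or 'ward' mirrors the comparison above."""
    D = 1.0 - np.asarray(overlap, dtype=float)   # assumed dissimilarity: 1 - overlap
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method=method)
    # labels[k] is the merged cluster to which the k-th K-means component belongs
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```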

A.5 Alternative Solutions for Illustrative Examples from Section 4.1

In this section, we present an alternative solution for the HoSC-K-means algorithm with K = 20. Figure 8 shows the obtained solutions as well as the corresponding overlap maps.

Fig. 8 Illustrative examples considered in Section 4.1. The description is similar to that of Fig. 3. The number of HoSC-K-means components is chosen to be equal to 20

A.6 Details of the Simulation Studies Considered in Section 4.2

Some technical details associated with the application of the abovementioned clustering methods are as follows. The Nyström sample size used to estimate eigenvalues is set to 100 (option “nystrom.sample” in the function specc). The number of samples drawn by Clara is 100. The K-means algorithm is initialized by seeding the initial cluster centers at random, and the reported results are based on 100 restarts. Finally, mclust is applied assuming the most general form of Gaussian components (i.e., unrestricted in volume, shape, and orientation, denoted “VVV”). Model parameters for the three experiments considered in Sections 4.2.1–4.2.3 are provided in Table 4.

Table 4 Parameters of the models considered in Experiment 1 (Section 4.2.1), Experiment 2 (Section 4.2.2), and Experiment 3 (Section 4.2.3)
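For reference, an analogous K-means setup (random seeding of the initial centers and 100 restarts, keeping the best solution by within-cluster sum of squares) can be sketched with scikit-learn rather than the R functions named above; the data matrix below is a synthetic placeholder, not the simulated data from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for a simulated n x p data matrix
Y = np.random.default_rng(0).normal(size=(1000, 10))

# random seeding of initial centers, 100 restarts; KMeans keeps the best run
km = KMeans(n_clusters=30, init="random", n_init=100).fit(Y)
labels, centers = km.labels_, km.cluster_centers_
```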

A.7 Plot for Selecting the Number of K-means Components in the Pen-Written Digits Application Considered in Section 5

Figure 9 shows the relationship between the proportion of the overall variability explained by the between-cluster variability and the number of K-means clusters. As we can see, substantial improvements are obtained for K up to about 20–25. Beyond K = 30, the increase in the proportion becomes rather marginal. Therefore, we chose K = 30.

Fig. 9 Proportion of the total variability explained by between-cluster variability versus the number of K-means clusters
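The selection criterion behind Fig. 9 can be sketched as follows: for each candidate K, compute the proportion of the total sum of squares explained by the between-cluster sum of squares and look for the point beyond which further gains are marginal. The helper below (scikit-learn, our naming) is one way to produce such a curve; it is an illustrative sketch rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def between_ss_proportion(Y, k_values):
    """Proportion of total variability explained by between-cluster variability
    for each K in k_values (inertia_ is the within-cluster sum of squares)."""
    total_ss = np.sum((Y - Y.mean(axis=0)) ** 2)
    return [1.0 - KMeans(n_clusters=K, init="random", n_init=10).fit(Y).inertia_ / total_ss
            for K in k_values]
```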


Cite this article

Melnykov, V., Michael, S. Clustering Large Datasets by Merging K-Means Solutions. J Classif 37, 97–123 (2020). https://doi.org/10.1007/s00357-019-09314-8
