
Clustering Large Datasets by Merging K-Means Solutions


Abstract

Existing clustering methods range from simple but very restrictive to complex but more flexible. The K-means algorithm is one of the most popular clustering procedures due to its computational speed and intuitive construction. Unfortunately, the application of K-means in its traditional form based on Euclidean distances is limited to cases with spherical clusters of approximately the same volume and spread of points. Recent developments in the area of merging mixture components for clustering show good promise. We propose a general framework for hierarchical merging based on pairwise overlap between components which can be readily applied in the context of the K-means algorithm to produce meaningful clusters. Such an approach preserves the main advantage of the K-means algorithm—its speed. The developed ideas are illustrated on examples, studied through simulations, and applied to the problem of digit recognition.



Author information

Corresponding author

Correspondence to Semhar Michael.


Appendix A

A.1 HoEC-K-means: K-means with Homoscedastic Elliptical Components

This form of the K-means algorithm is based on Mahalanobis distances and assigns \(\boldsymbol{y}_i\) to the \(k\)th component if the inequality \(\sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_k)} < \sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})}\) holds for all \(k' = 1, 2, \ldots, k-1, k+1, \ldots, K\). This inequality can be obtained from Eq. 10 by letting \(\boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_K = \boldsymbol{\Sigma}\) and \(\tau_1 = \cdots = \tau_K = 1/K\). Thus, HoEC-K-means can be seen as model-based clustering through the CEM algorithm for Gaussian mixtures with homoscedastic elliptical components and equal mixing proportions. Under these restrictions, the \(\tilde Q\)-function is given by

$$ \tilde Q(\boldsymbol{\Psi}|\boldsymbol{\Psi}^{(b-1)}) = \sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log\frac1K + \log \phi_{p}(\boldsymbol{y}_{i}| \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma})\right], $$

where the parameter vector is given by \(\boldsymbol{\Psi} = (\boldsymbol{\mu}_1^T, \boldsymbol{\mu}_2^T, \ldots, \boldsymbol{\mu}_K^T, \text{vech}(\boldsymbol{\Sigma})^T)^T\) with \(\text{vech}(\boldsymbol{\Sigma})\) being the vector of unique elements of \(\boldsymbol{\Sigma}\), namely, \(\text{vech}(\boldsymbol{\Sigma})^T = (\sigma_{11}, \ldots, \sigma_{p1}, \sigma_{22}, \ldots, \sigma_{p2}, \ldots, \sigma_{pp})\). Thus, the total number of parameters is \(M = Kp + p(p+1)/2\). As before, the \(k\)th mean vector is estimated at the M-step according to Eq. 1. The common variance-covariance matrix \(\boldsymbol{\Sigma}\) is estimated by

$$ \boldsymbol{\Sigma}^{(b)} = \frac{1}{n}\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})^{T}. $$
(13)

Incorporating the above-listed restrictions in Eq. 8, the overlap can be calculated by

$$ \omega_{kk^{\prime}} = 2{\Phi}\left( -\frac12 \sqrt{(\boldsymbol{\mu}_{k} - \boldsymbol{\mu}_{k^{\prime}})^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{k} - \boldsymbol{\mu}_{k^{\prime}})}\right). $$
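To make the HoEC recipe concrete, the sketch below performs one CEM iteration of HoEC-K-means and evaluates the pairwise overlap formula above. It is a minimal NumPy/SciPy illustration under the stated restrictions, not the authors' implementation; the function names are ours, and empty clusters are not handled.

```python
import numpy as np
from scipy.stats import norm

def hoec_cem_step(Y, mu, Sigma):
    """One CEM iteration of HoEC-K-means: Mahalanobis-distance assignment (C-step),
    then mean updates (Eq. 1) and the pooled covariance update (Eq. 13).
    (Empty clusters are not handled in this sketch.)"""
    n, p = Y.shape
    K = mu.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    diffs = Y[:, None, :] - mu[None, :, :]                      # n x K x p array of y_i - mu_k
    d2 = np.einsum('nkp,pq,nkq->nk', diffs, Sigma_inv, diffs)   # squared Mahalanobis distances
    z = d2.argmin(axis=1)                                       # hard assignments z_i
    mu_new = np.vstack([Y[z == k].mean(axis=0) for k in range(K)])
    resid = Y - mu_new[z]
    Sigma_new = resid.T @ resid / n                             # Eq. 13
    return z, mu_new, Sigma_new

def hoec_overlap(mu_k, mu_kp, Sigma):
    """Pairwise overlap for homoscedastic elliptical components with equal weights:
    2 * Phi(-d/2), where d is the Mahalanobis distance between the two means."""
    d = np.sqrt((mu_k - mu_kp) @ np.linalg.solve(Sigma, mu_k - mu_kp))
    return 2.0 * norm.cdf(-0.5 * d)
```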

A.2 HeSC-K-means: K-means with Heteroscedastic Spherical Components

Under this form of K-means, \(\boldsymbol{y}_i\) is assigned to the \(k\)th component if the inequality \(\sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_k)^T(\boldsymbol{y}_i - \boldsymbol{\mu}_k)/\sigma_k^2} < \sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})^T(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})/\sigma_{k'}^2}\) is satisfied for all \(k' = 1, 2, \ldots, k-1, k+1, \ldots, K\). This inequality is equivalent to decision rule (10) under the restrictions \(\boldsymbol{\Sigma}_1 = \sigma_1^2\boldsymbol{I}, \ldots, \boldsymbol{\Sigma}_K = \sigma_K^2\boldsymbol{I}\) and \(\tau_1 = \sigma_1^p / \sum_{h=1}^K \sigma_h^p, \ldots, \tau_K = \sigma_K^p / \sum_{h=1}^K \sigma_h^p\). Thus, HeSC-K-means is equivalent to model-based clustering based on the CEM algorithm for Gaussian mixtures with heteroscedastic spherical components and the mixing proportions given above. The \(\tilde Q\)-function of the CEM algorithm in this setting is given by

$$ \tilde Q(\boldsymbol{\Psi}|\boldsymbol{\Psi}^{(b-1)}) = \sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log\frac{{\sigma_{k}^{p}}}{{\sum}_{h = 1}^{K} {\sigma_{h}^{p}}} + \log \phi_{p}(\boldsymbol{y}_{i}| \boldsymbol{\mu}_{k}, {\sigma_{k}^{2}})\right], $$

where \(\boldsymbol {\Psi } = ({\boldsymbol {\mu }_{1}^{T}}, {\boldsymbol {\mu }_{2}^{T}}, \ldots , {\boldsymbol {\mu }_{K}^{T}}, {\sigma _{1}^{2}}, {\sigma _{2}^{2}}, \ldots , {\sigma _{K}^{2}})^{T}\) is the parameter vector with M = K(p + 1) parameters. Mean vectors are estimated by Eq. 1 and variance parameters are calculated by the expression

$$ ({\sigma^{2}_{k}})^{(b)} = \frac{{\sum}_{i = 1}^{n} I(z_{i}^{(b)} = k)(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})^{T}(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})}{pn\tau_{k}^{(b)}}, $$
(14)

where mixing proportions are estimated by \(\tau _{k}^{(b)} = ({\sigma _{k}^{p}})^{(b-1)} / {\sum }_{h = 1}^{K} ({\sigma _{h}^{p}})^{(b-1)}\). Incorporating the above-mentioned restrictions into result (7) leads to calculating the overlap value by

$$ \omega_{kk^{\prime}} = \text{Pr}\left[\chi^{2}_{p,\nu {\sigma_{k}^{2}}} \le \nu\sigma_{k^{\prime}}^{2}\right] + \text{Pr}\left[\chi^{2}_{p,\nu \sigma_{k^{\prime}}^{2}} \le {\nu\sigma_{k}^{2}}\right], $$

where \(\nu = (\boldsymbol{\mu}_k - \boldsymbol{\mu}_{k'})^T(\boldsymbol{\mu}_k - \boldsymbol{\mu}_{k'}) / (\sigma_k^2 - \sigma_{k'}^2)^2\), and \(\chi^2_{p,\nu\sigma_k^2}\) and \(\chi^2_{p,\nu\sigma_{k'}^2}\) are noncentral \(\chi^2\) random variables with \(p\) degrees of freedom and noncentrality parameters \(\nu\sigma_k^2\) and \(\nu\sigma_{k'}^2\), respectively.
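The overlap above can be evaluated directly from the noncentral \(\chi^2\) distribution function. The sketch below is a small SciPy illustration with a function name of our choosing; it assumes \(\sigma_k^2 \ne \sigma_{k'}^2\), since \(\nu\) is undefined otherwise.

```python
import numpy as np
from scipy.stats import ncx2

def hesc_overlap(mu_k, mu_kp, s2_k, s2_kp, p):
    """Overlap of two heteroscedastic spherical components via the formula above;
    nu = ||mu_k - mu_k'||^2 / (sigma_k^2 - sigma_k'^2)^2 (assumes unequal variances)."""
    nu = np.sum((mu_k - mu_kp) ** 2) / (s2_k - s2_kp) ** 2
    # sum of the two misclassification probabilities, each a noncentral chi-square probability
    return (ncx2.cdf(nu * s2_kp, df=p, nc=nu * s2_k)
            + ncx2.cdf(nu * s2_k, df=p, nc=nu * s2_kp))
```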

A.3 HeEC-K-means: K-means with Heteroscedastic Elliptical Components

This final variation of K-means assigns \(\boldsymbol{y}_i\) to the \(k\)th component if the inequality \(\sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}_k^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_k)} < \sqrt{(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})^T\boldsymbol{\Sigma}_{k'}^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_{k'})}\) holds for all \(k' = 1, 2, \ldots, k-1, k+1, \ldots, K\). This inequality can be obtained from the decision rule given in Eq. 10 by imposing the restrictions \(\tau_1 = |\boldsymbol{\Sigma}_1|^{1/2} / \sum_{h=1}^K |\boldsymbol{\Sigma}_h|^{1/2}, \ldots, \tau_K = |\boldsymbol{\Sigma}_K|^{1/2} / \sum_{h=1}^K |\boldsymbol{\Sigma}_h|^{1/2}\). Thus, HeEC-K-means can be seen as model-based clustering relying on the CEM algorithm for Gaussian mixtures with heteroscedastic elliptical components and mixing weights proportional to the square root of the covariance matrix determinant. The \(\tilde Q\)-function of the CEM algorithm is given by

$$ \tilde Q(\boldsymbol{\Psi}|\boldsymbol{\Psi}^{(b-1)}) = \sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log\frac{|\boldsymbol{\Sigma}_{k}|^{\frac{1}{2}}}{{\sum}_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}} + \log \phi_{p}(\boldsymbol{y}_{i}| \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})\right], $$
(15)

where \(\boldsymbol {\Psi } = ({\boldsymbol {\mu }_{1}^{T}}, {\boldsymbol {\mu }_{2}^{T}}, \ldots , {\boldsymbol {\mu }_{K}^{T}}, \text {vech}(\boldsymbol {\Sigma }_{1})^{T}, \text {vech}(\boldsymbol {\Sigma }_{2})^{T}, \ldots , \text {vech}(\boldsymbol {\Sigma }_{K})^{T})^{T}\). The total number of parameters is M = Kp + Kp(p + 1)/2. It can be shown that the M-step involves updating mean vectors by Eq. 1 and covariance matrices by

$$ \boldsymbol{\Sigma}_{k}^{(b)} = \frac{{\sum}_{i = 1}^{n} I(z_{i}^{(b)} = k)(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k}^{(b)})^{T}}{n\tau_{k}^{(b)}}, $$
(16)

where \(\tau_k^{(b)}\) represents the current mixing proportion estimated by \(\tau_k^{(b)} = |\boldsymbol{\Sigma}_k^{(b-1)}|^{1/2} / \sum_{h=1}^K |\boldsymbol{\Sigma}_h^{(b-1)}|^{1/2}\). Although we do not provide calculations for estimates (11), (13), and (14), the derivation of Eq. 16 is more delicate and is provided below. To derive expression (16), the \(\tilde Q\)-function (15) has to be maximized with respect to \(\boldsymbol{\Sigma}_k\). From matrix differential calculus, it is known that \(\frac{\partial |\boldsymbol{\Xi}|}{\partial \boldsymbol{\Xi}} = |\boldsymbol{\Xi}|(\boldsymbol{\Xi}^{-1})^T\) and \(\frac{\partial \boldsymbol{A}\boldsymbol{\Xi}^{-1}\boldsymbol{B}}{\partial \boldsymbol{\Xi}} = -(\boldsymbol{\Xi}^{-1}\boldsymbol{B}\boldsymbol{A}\boldsymbol{\Xi}^{-1})^T\), where \(\boldsymbol{A}\) and \(\boldsymbol{B}\) are matrices of constants and \(\boldsymbol{\Xi}\) is an unstructured nonsingular matrix. These two results are used in the derivation below.

$$ \begin{array}{ll} \frac{\partial \tilde Q(\boldsymbol{\Psi}|\boldsymbol{\Psi}^{(b-1)})}{\partial \boldsymbol{\Sigma}_{k}} & = \frac{\partial}{\partial \boldsymbol{\Sigma}_{k}}\left\{\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log\frac{|\boldsymbol{\Sigma}_{k}|^{\frac{1}{2}}}{{\sum}_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}} + \log \phi_{p}(\boldsymbol{y}_{i}| \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})\right]\right\}\\ & = -\frac{\partial}{\partial \boldsymbol{\Sigma}_{k}}\left\{\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{K} I(z_{i}^{(b)} = k)\left[\log{\sum\limits_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}} + \frac{1}{2} (\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})^{T} \boldsymbol{\Sigma}_{k}^{-1} (\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})\right]\right\}\\ & = -\sum\limits_{i = 1}^{n} \left[\frac{1}{{\sum}_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}}\frac{\partial |\boldsymbol{\Sigma}_{k}|^{\frac{1}{2}}}{\partial \boldsymbol{\Sigma}_{k}} - \frac12I(z_{i}^{(b)} = k)\boldsymbol{\Sigma}_{k}^{-1}(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})^{T}\boldsymbol{\Sigma}_{k}^{-1}\right]\\ & = \frac12\left[-\frac{n|\boldsymbol{\Sigma}_{k}|^{\frac12}\boldsymbol{\Sigma}_{k}^{-1}}{{\sum}_{h = 1}^{K}|\boldsymbol{\Sigma}_{h}|^{\frac{1}{2}}} + \sum\limits_{i = 1}^{n} I(z_{i}^{(b)} = k)\boldsymbol{\Sigma}_{k}^{-1}(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})(\boldsymbol{y}_{i} - \boldsymbol{\mu}_{k})^{T}\boldsymbol{\Sigma}_{k}^{-1}\right]. \end{array} $$

Setting this derivative equal to zero and letting \(\tau _{k} = |\boldsymbol {\Sigma }_{k}|^{\frac 12} / {\sum }_{h = 1}^{K}|\boldsymbol {\Sigma }_{h}|^{\frac {1}{2}}\) yields result (16).
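A brief sketch of the corresponding M-step, reflecting Eq. 16 and the restriction on the mixing proportions, is given below. This is a NumPy illustration with names of our choosing, not the authors' code; mean updates follow Eq. 1 and empty clusters are not handled.

```python
import numpy as np

def heec_m_step(Y, z, Sigma_prev):
    """HeEC-K-means M-step: means by Eq. 1, covariances by Eq. 16,
    with tau_k proportional to |Sigma_k^{(b-1)}|^{1/2}."""
    n, p = Y.shape
    K = Sigma_prev.shape[0]
    dets = np.sqrt(np.abs(np.linalg.det(Sigma_prev)))   # |Sigma_k|^{1/2} for each k
    tau = dets / dets.sum()
    mu = np.vstack([Y[z == k].mean(axis=0) for k in range(K)])
    Sigma = np.empty_like(Sigma_prev)
    for k in range(K):
        R = Y[z == k] - mu[k]
        Sigma[k] = R.T @ R / (n * tau[k])                # Eq. 16: divisor n * tau_k, not n_k
    return mu, Sigma, tau
```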

Under considered restrictions, the overlap is calculated as the sum of the misclassification probabilities \(\omega _{kk^{\prime }} = \omega _{k^{\prime } | k} + \omega _{k | k^{\prime }}\), where \(\omega _{k^{\prime } | k}\) is given by

$$ \omega_{k^{\prime}|k} = \text{Pr}\left[\sum\limits_{j = 1 \atop j: \lambda_{j} \ne 1}^{p} (\lambda_{j} - 1)U_{j} + 2\sum\limits_{j = 1 \atop j: \lambda_{j} = 1}^{p} \delta_{j}W_{j} \le \sum\limits_{j = 1 \atop j: \lambda_{j} \ne 1}^{p} \frac{\lambda_{j}{\delta_{j}^{2}}}{\lambda_{j} - 1} - \sum\limits_{j = 1 \atop j: \lambda_{j} = 1}^{p} {\delta_{j}^{2}}\right] $$

with parameters as described in Eq. 6 and \(\omega _{k | k^{\prime }}\) defined similarly.

This version of the K-means algorithm closely resembles the CEM algorithm for the general form of Gaussian mixture models but with mixing proportions of a special form. Recall that the generalized variance of \(\boldsymbol{\Sigma}_k\) is given by \(|\boldsymbol{\Sigma}_k| = \prod_{j=1}^p \lambda_{kj}\), where \(\lambda_{k1}, \lambda_{k2}, \ldots, \lambda_{kp}\) are the eigenvalues of \(\boldsymbol{\Sigma}_k\). The length of the \(j\)th axis of the hyperellipsoid corresponding to \(\boldsymbol{\Sigma}_k\) is proportional to \(\sqrt{\lambda_{kj}}\), and the volume of the hyperellipsoid \(V_{\boldsymbol{\Sigma}_k}\) is proportional to \(\prod_{j=1}^p \sqrt{\lambda_{kj}}\). As a result, we conclude that \(\tau_k = V_{\boldsymbol{\Sigma}_k} / \sum_{h=1}^K V_{\boldsymbol{\Sigma}_h}\), i.e., the restriction imposed on the mixing proportions requires that each represent the proportion of the overall volume associated with the \(k\)th component. Alternatively, consider the expected size of the \(k\)th cluster defined as \(n_k = \sum_{i=1}^n \pi_{ik}\). The log-term in Eq. 10 is equal to zero when

$$ \frac{\tau_{k}|\boldsymbol{\Sigma}_{k^{\prime}}|^{\frac{1}{2}}}{\tau_{k^{\prime}}|\boldsymbol{\Sigma}_{k}|^{\frac{1}{2}}} = \frac{n_{k} V_{\boldsymbol{\Sigma}_{k^{\prime}}}}{n_{k^{\prime}} V_{\boldsymbol{\Sigma}_{k}}} = \frac{D_{k}}{D_{k^{\prime}}}, $$

where \(D_k = n_k / V_{\boldsymbol{\Sigma}_k}\) can be seen as the “density” of the \(k\)th cluster. As a result, we conclude that the HeEC assumption is appropriate when clusters have similar “densities” of data points. Intuitively, such a restriction is often reasonable. At the same time, traditional model-based clustering with unrestricted mixing proportions allows for considerably more modeling flexibility.
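This “density” diagnostic is straightforward to compute; the short helper below (our naming) takes cluster sizes and covariance estimates and returns \(D_k\) with the volume taken proportional to \(|\boldsymbol{\Sigma}_k|^{1/2}\). Roughly comparable values of \(D_k\) across clusters suggest that the HeEC assumption is tenable.

```python
import numpy as np

def cluster_densities(counts, Sigmas):
    """D_k = n_k / V_{Sigma_k}, with V_{Sigma_k} taken proportional to |Sigma_k|^{1/2}."""
    vols = np.sqrt(np.abs(np.linalg.det(np.asarray(Sigmas))))   # one volume proxy per cluster
    return np.asarray(counts, dtype=float) / vols
```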

A.4 An Illustration of the Robustness of the Proposed Procedure to the Choice of K

In this example, we repeat the experiment outlined in Section 4.2.3 in a more challenging setting. A mixture model with three ten-dimensional skew-normal components is employed. Two of the components overlap very substantially as they share the same mean parameter. In such a setting, the choice of linkage predetermines the obtained partitioning. If single linkage is chosen, one cannot expect the clusters associated with the three mixture components to be detected due to the considerable overlap between the two components with a common mean. If Ward’s linkage is selected, approximately elliptical clusters will be produced and the problem of substantial overlap between components is alleviated. Regardless of the linkage chosen, our main question is whether the proposed procedure is sensitive to the choice of the number of K-means components K. One hundred datasets are simulated for each sample size \(n = 10^3, 10^4, 10^5, 10^6\). Figure 7 provides the median adjusted Rand index obtained as a result of merging K-means solutions for different numbers of components K. The first plot shows the performance of the algorithm with single linkage; the second corresponds to Ward’s linkage. As we can see, single linkage produces almost identical results for K ≥ 7 across all considered sample sizes. Thus, the results are very robust to the choice of the initial number of K-means components. As expected, the situation is somewhat more challenging when Ward’s linkage is used. In this case, hierarchical clustering targets approximately elliptical clusters rather than well-separated ones, and we notice some sample size effect: the results are slightly worse for \(n = 1000\). With regard to the robustness of the procedure to the choice of the number of K-means components, roughly the same result is produced beyond K = 25 for \(n = 1000\) and beyond K = 30 for larger sample sizes. It is worth mentioning that the considered situation is rather challenging as two ten-dimensional clusters share a common mean. Despite that, the achieved value of the adjusted Rand index is about 0.84–0.85 and the corresponding classification agreement is around 94%. Due to the high sensitivity of the adjusted Rand index, we also observe slight variation for K ≥ 30 when \(n = 10^4, 10^5, 10^6\). However, this variation is so minor that the maximum difference in the proportion of classification agreement does not exceed 0.003 for any K ≥ 30.

Fig. 7 Correspondence between the median adjusted Rand index values obtained as a result of merging K-means solutions with different numbers of components for various sample sizes
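For readers who wish to reproduce the linkage comparison, the sketch below merges K-means components hierarchically from a \(K \times K\) pairwise-overlap matrix using SciPy. Converting the overlaps into the dissimilarity \(1 - \omega_{kk'}\) is our simplifying assumption, not necessarily the transformation used in the paper, and the function name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def merge_components(overlap, n_clusters, method="single"):
    """Merge K-means components by hierarchical clustering of a K x K
    pairwise-overlap matrix; method='single' or 'ward' mirrors the comparison above."""
    D = 1.0 - np.asarray(overlap, dtype=float)   # assumed dissimilarity: 1 - overlap
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method=method)
    # labels[k] is the merged cluster to which the k-th K-means component belongs
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```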

A.5 Alternative Solutions for Illustrative Examples from Section 4.1

In this section, we present an alternative solution for the HoSC-K-means algorithm with K = 20. Figure 8 shows the obtained solutions as well as the corresponding overlap maps.

Fig. 8 Illustrative examples considered in Section 4.1. The description is similar to that of Fig. 3. The number of HoSC-K-means components is chosen to be equal to 20

A.6 Details of the Simulation Studies Considered in Section 4.2

Some technical details associated with the application of the abovementioned clustering methods are as follows. The Nyström sample size used to estimate eigenvalues is set to 100 (option “nystrom.sample” in the function specc). The number of samples drawn by Clara is 100. The K-means algorithm is initialized by seeding the initial cluster centers at random, and the reported results are based on 100 restarts. Finally, mclust is applied assuming the most general form of Gaussian components (i.e., unrestricted in volume, shape, and orientation, denoted “VVV”). Model parameters for the three experiments considered in Sections 4.2.1–4.2.3 are provided in Table 4.

Table 4 Parameters of the models considered in Experiment 1 (Section 4.2.1), Experiment 2 (Section 4.2.2), and Experiment 3 (Section 4.2.3)
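For reference, an analogous K-means setup (random seeding of the initial centers and 100 restarts, keeping the best solution by within-cluster sum of squares) can be sketched with scikit-learn rather than the R functions named above; the data matrix below is a synthetic placeholder, not the simulated data from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for a simulated n x p data matrix
Y = np.random.default_rng(0).normal(size=(1000, 10))

# random seeding of initial centers, 100 restarts; KMeans keeps the best run
km = KMeans(n_clusters=30, init="random", n_init=100).fit(Y)
labels, centers = km.labels_, km.cluster_centers_
```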

A.7 Plot for Selecting the Number of K-means Components in the Pen-Written Digits Application Considered in Section 5

Figure 9 shows the relationship between the proportion of the overall variability explained by the between-cluster variability and the number of K-means clusters. As we can see, substantial improvements are obtained for K up to about 20–25. Beyond K = 30, the increase in the proportion becomes rather marginal. Therefore, we chose K = 30.

Fig. 9 Proportion of the total variability explained by between-cluster variability versus the number of K-means clusters
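The selection criterion behind Fig. 9 can be sketched as follows: for each candidate K, compute the proportion of the total sum of squares explained by the between-cluster sum of squares and look for the point beyond which further gains are marginal. The helper below (scikit-learn, our naming) is one way to produce such a curve; it is an illustrative sketch rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def between_ss_proportion(Y, k_values):
    """Proportion of total variability explained by between-cluster variability
    for each K in k_values (inertia_ is the within-cluster sum of squares)."""
    total_ss = np.sum((Y - Y.mean(axis=0)) ** 2)
    return [1.0 - KMeans(n_clusters=K, init="random", n_init=10).fit(Y).inertia_ / total_ss
            for K in k_values]
```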


Cite this article

Melnykov, V., Michael, S. Clustering Large Datasets by Merging K-Means Solutions. J Classif 37, 97–123 (2020). https://doi.org/10.1007/s00357-019-09314-8
