Abstract
Cluster validation indices are widely used to determine the number of groups in a data set and are thus a crucial part of the model validation process in clustering. The study presented in this paper demonstrates how the accuracy of certain indices can be improved significantly when they are calculated repeatedly on data sets resampled from the original data. There are many ways to resample data; this study uses three very common options: bootstrapping, data splitting (without subset overlap between two subsamples), and random subsetting (with subset overlap between two subsamples). Index values calculated on the resampled data sets are compared to the values obtained from the original data partition. The primary hypothesis of the study states that resampling generally improves index accuracy. The hypothesis is based on the notion of cluster stability: if there are stable clusters in a data set, a clustering algorithm should produce consistent results for data sampled or resampled from the same source. The primary hypothesis was partly confirmed: it does indeed hold for external validation measures. The secondary hypothesis states that the choice of resampling strategy itself does not play a significant role. This was also shown to be accurate, although slight deviations between the resampling schemes suggest that splitting yields marginally better results.
References
Ball, G.H., & Hall, D.J. (1965). ISODATA, a novel method of data analysis and pattern classification. Technical report, Stanford Research Institute, Menlo Park.
Banfield, J.D., & Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3), 803–821.
Ben-Hur, A., Elisseeff, A., Guyon, I. (2001). A stability based method for discovering structure in clustered data. In Biocomputing 2002 (pp. 6–17). World Scientific.
Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1–27.
Davies, D.L., & Bouldin, D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), 224–227.
Desgraupes, B. (2013). Clustering indices. University Paris Ouest Lab Modal’X.
Dimitriadou, E., Dolničar, S., Weingessel, A. (2002). An examination of indexes for determining the number of clusters in binary data sets. Psychometrika, 67(1), 137–159.
Dolnicar, S., Grün, B., Leisch, F., Schmidt, K. (2014). Required sample sizes for data-driven market segmentation analyses in tourism. Journal of Travel Research, 53(3), 296–306.
Dudoit, S., & Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7).
Dunn, J.C. (1974). Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4(1), 95–104.
Fowlkes, E.B., & Mallows, C.L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383), 553–569.
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2–3), 107–145. https://doi.org/10.1023/A:1012801612483.
Handl, J., Knowles, J.D., Kell, D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15), 3201–3212. https://doi.org/10.1093/bioinformatics/bti517.
Hartigan, J.A. (1975). Clustering algorithms. New York: Wiley. ISBN 047135645X.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218. https://doi.org/10.1007/BF01908075. ISSN 0176-4268.
Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37–50.
Krzanowski, W.J., & Lai, Y.T. (1988). A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics, 44(1), 23–34.
Kulczyński, S. (1928). Die Pflanzenassoziationen der Pieninen. Imprimerie de l’Université.
Lange, T., Roth, V., Braun, M.L., Buhmann, J.M. (2004). Stability-based validation of clustering solutions. Neural Computation, 16(6), 1299–1323. https://doi.org/10.1162/089976604773717621.
Leisch, F. (2006). A toolbox for k-centroids cluster analysis. Computational Statistics and Data Analysis, 51(2), 526–544.
Levine, E., & Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13, 2573–2593.
McLachlan, G.J., & Khan, N. (2004). On a resampling approach for tests on the number of clusters with mixture model-based clustering of tissue samples. Journal of Multivariate Analysis, 90, 90–105.
Monti, S., Tamayo, P., Mesirov, J., Golub, T. (2003). Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52(1–2), 91–118.
Mufti, B.G., Bertrand, P., Moubarki, E.L. (2005). Determining the number of groups from measures of cluster stability. In Proceedings of international symposium on applied stochastic models and data analysis (pp. 17–20).
Pakhira, M.K., Bandyopadhyay, S., Maulik, U. (2004). Validity index for crisp and fuzzy clusters. Pattern Recognition, 37(3), 487–501.
Qiu, W., & Joe, H. (2006). Generation of random clusters with specified degree of separation. Journal of Classification, 23(2), 315–334. https://doi.org/10.1007/s00357-006-0018-y.
Qiu, W., & Joe, H. (2013). clusterGeneration: random cluster generation (with specified degree of separation). http://CRAN.R-project.org/package=clusterGeneration. R package version 1.3.1.
R Core Team. (2014). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
Rogers, D.J., & Tanimoto, T.T. (1960). A computer program for classifying plants. Science, 132(3434), 1115–1118. http://www.sciencemag.org/content/132/3434/1115.short.
Roth, V., Lange, T., Braun, M., Buhmann, J. (2002). A resampling approach to cluster validation. In Intl. conf. on computational statistics (pp. 123–128).
Rousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Sokal, R.R., Sneath, P.H.A., et al. (1963). Principles of numerical taxonomy. Principles of Numerical Taxonomy.
Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511–528.
Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411–423.
Tseng, G.C., & Wong, W.H. (2006). Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics, 61, 10–16.
Volkovich, Z., Barzily, Z., Morozensky, L. (2008). A statistical model of cluster stability. Pattern Recognition, 41(7), 2174–2188.
Xu, L. (1997). Bayesian Ying–Yang machine, clustering and number of clusters. Pattern Recognition Letters, 18(11), 1167–1178.
Appendices
Appendix A: External Indices
As already mentioned, external indices use pairwise comparisons of group labels to calculate similarity. When comparing the group labels of two randomly selected observations from two partitions Y1 and Y2, four combinations are possible:
- the two observations are located in the same cluster in both partitions (a),
- the two observations are located in different clusters in both partitions (d),
- the two observations are in the same cluster in Y1 but in different clusters in Y2 (b), and vice versa (c).
Further notation:
- n denotes the total number of observations in X,
- m denotes a specific observation in X,
- k is the number of clusters, and K is the maximum number of clusters considered,
- L denotes a clustering procedure.
Other notation is introduced when required.
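To make this notation concrete, the following R sketch (ours, not from the paper) computes the pair counts a, b, c, and d for two label vectors and derives some of the pair-counting indices defined below; all function names are ours.

```r
# Sketch (not the authors' code): pair counts a, b, c, d for two partitions,
# given as label vectors y1 and y2 of equal length n.
pair_counts <- function(y1, y2) {
  n <- length(y1)
  same1 <- outer(y1, y1, "==")[lower.tri(diag(n))]  # pair in same cluster in Y1?
  same2 <- outer(y2, y2, "==")[lower.tri(diag(n))]  # pair in same cluster in Y2?
  c(a = sum(same1 & same2),    # same cluster in both partitions
    b = sum(same1 & !same2),   # same in Y1, different in Y2
    c = sum(!same1 & same2),   # different in Y1, same in Y2
    d = sum(!same1 & !same2))  # different clusters in both
}

# Pair-counting indices from Appendix A, written in terms of a, b, c, d.
rand_index    <- function(p) (p["a"] + p["d"]) / sum(p)
jaccard_index <- function(p) p["a"] / (p["a"] + p["b"] + p["c"])
fm_index      <- function(p) p["a"] / sqrt((p["a"] + p["b"]) * (p["a"] + p["c"]))

# Example: two 3-cluster partitions of 12 observations.
y1 <- rep(1:3, each = 4)
y2 <- y1; y2[12] <- 1    # one observation reassigned
p  <- pair_counts(y1, y2)
rand_index(p); jaccard_index(p); fm_index(p)
```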
1.1 Rand Index
The Rand index (Rand 1971) is defined by
$$R = \frac{a + d}{a + b + c + d}$$
1.2 Adjusted Rand Index
The ARI is a modified form of the Rand index that corrects for matches that occur by pure chance. In contrast to the Rand index, the ARI can take on values between −1 and 1 (Hubert and Arabie 1985):
$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{n_{i.}}{2} \sum_{j} \binom{n_{.j}}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{i} \binom{n_{i.}}{2} + \sum_{j} \binom{n_{.j}}{2}\right] - \left[\sum_{i} \binom{n_{i.}}{2} \sum_{j} \binom{n_{.j}}{2}\right] / \binom{n}{2}}$$
where nij is the number of observations common to cluster i in Y1 and cluster j in Y2, and ni. and n.j denote the number of observations in cluster i of Y1 and cluster j of Y2, respectively.
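As a minimal sketch, this contingency-table formula transcribes directly into R (the function name is ours; the same quantity is also available, e.g., as adjustedRandIndex() in the mclust package):

```r
# Sketch: adjusted Rand index from the contingency table of two partitions.
adjusted_rand <- function(y1, y2) {
  tab <- table(y1, y2)                      # n_ij
  nij <- sum(choose(tab, 2))                # sum of C(n_ij, 2)
  ni  <- sum(choose(rowSums(tab), 2))       # sum of C(n_i., 2)
  nj  <- sum(choose(colSums(tab), 2))       # sum of C(n_.j, 2)
  expected <- ni * nj / choose(sum(tab), 2) # chance-corrected term
  (nij - expected) / ((ni + nj) / 2 - expected)
}
```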
1.3 Fowlkes-Mallows Index
The Fowlkes-Mallows index (Fowlkes and Mallows 1983) is given by Eq. 3:
$$FM = \frac{a}{\sqrt{(a + b)(a + c)}}$$
1.4 Prediction Strength
Prediction Strength, proposed by Tibshirani and Walther (2005), modifies the concept of the external index somewhat in that it measures agreement between partitions not globally, but cluster-wise. The algorithm works as follows:
1. Split the original data X into a training set and a test set (\(X^{\prime }_{1}\) and \(X^{\prime }_{2}\), respectively).

2. Cluster \(X^{\prime }_{1}\) and \(X^{\prime }_{2}\) into k clusters, where 2 ≤ k ≤ K.

3. For each test set cluster, count the proportion of pairs of observations that would also be in the same cluster if they were assigned to the closest centroid of the training set clusters.

4. The minimum over all clusters (i.e., the least stable cluster) is defined as the Prediction Strength.
Thus, for a candidate number of clusters k, let \(A_{k1}, A_{k2}, \ldots, A_{kk}\) denote the indices of the test observations in test clusters \(1, 2, \ldots, k\), and let \(n_{k1}, n_{k2}, \ldots, n_{kk}\) denote the numbers of observations in the test clusters. The Prediction Strength for a particular k is then given by
$$PS(k) = \min_{1 \le j \le k} \frac{1}{n_{kj}(n_{kj} - 1)} \sum_{i \ne i^{\prime} \in A_{kj}} D\left[P(X_{tr}, k), X_{te}\right]_{i i^{\prime}}$$
where \(D [ P(X_{tr},k), X_{te} ]_{ii^{\prime }}\) is an N × N matrix whose ii′th element equals 1 if observations i and i′ from the test data fall into the same cluster when the cluster centers from the training set clustering (P(Xtr,k)) are used for assignment. A slightly different notation is used here: P denotes the partition of X, so P(Xtr,k) is defined as the partition of the training set using k clusters.
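As a minimal sketch (ours, not the authors' code), the four steps translate into R as follows, with k-means standing in for the clustering procedure L:

```r
# Sketch: Prediction Strength for a single k, with k-means as L.
prediction_strength <- function(X, k, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(X), nrow(X) %/% 2)          # step 1: split into halves
  tr  <- X[idx, , drop = FALSE]                  # training set X'_1
  te  <- X[-idx, , drop = FALSE]                 # test set X'_2
  km_tr <- kmeans(tr, centers = k, nstart = 10)  # step 2: cluster both sets
  km_te <- kmeans(te, centers = k, nstart = 10)
  # Assign test observations to the closest *training* centroid.
  d2   <- as.matrix(dist(rbind(km_tr$centers, te)))[-(1:k), 1:k]
  pred <- apply(d2, 1, which.min)
  # Step 3: per test cluster, proportion of pairs that stay together.
  ps_k <- sapply(seq_len(k), function(j) {
    in_j <- which(km_te$cluster == j)
    if (length(in_j) < 2) return(NA_real_)
    same <- outer(pred[in_j], pred[in_j], "==")
    (sum(same) - length(in_j)) / (length(in_j) * (length(in_j) - 1))
  })
  min(ps_k, na.rm = TRUE)                        # step 4: least stable cluster
}
```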
1.5 Jaccard Similarity
Using the same notation as for the FM and Rand index, the Jaccard similarity (Jaccard 1912) is given by
$$J = \frac{a}{a + b + c}$$
1.6 McNemar Index
The McNemar index, as described in Desgraupes (2013), is defined by
1.7 Sokal-Sneath Index
The Sokal-Sneath index (Sokal et al. 1963) is given by
1.8 Czekanowski-Dice Index
The Czekanowski-Dice index, as illustrated in Desgraupes (2013), is defined as
$$CD = \frac{2a}{2a + b + c}$$
The index is the harmonic mean of precision and recall coefficients and thus identical to the F measure.
1.9 CLEST
The CLEST algorithm (Dudoit and Fridlyand 2002) works as follows: for each k, 2 ≤ k ≤ K, where K (with K < n) denotes the maximum number of clusters considered, perform steps 1–4.
1. Repeat B times:

   (a) Randomly split the original data set X into two non-overlapping sets, a learning set \(X^{\prime }_{1}\) and a test set \(X^{\prime }_{2}\).

   (b) Apply the clustering procedure L to the learning set \(X^{\prime }_{1}\) to obtain a partition Y1.

   (c) Build a classifier Cl using the learning set and the cluster labels from Y1.

   (d) Apply the clustering procedure L to the test set to obtain a partition Y2.

   (e) Use an external index to compare Y2 with the labels predicted for the test set by the classifier Cl.

2. Let tk be the median of the B similarity statistics obtained for k in step 1.

3. Generate B0 data sets under a suitable null hypothesis. For each reference data set, repeat steps 1 and 2 to obtain B0 similarity statistics \(t_{k,1}, \ldots, t_{k,B_{0}}\).

4. Define \({t^{0}_{k}}\) as the average of these B0 statistics, and let pk be the proportion of the \(t_{k,b}\), 1 ≤ b ≤ B0, that are at least as large as the observed value tk, i.e., the p value for tk. Finally, define \(d_{k} = t_{k} - {t^{0}_{k}}\) as the difference between the observed and the expected similarity statistic under the null hypothesis.
Now, the set V is defined as \(V = \{2 \le k \le K : p_k \le p_{max},\ d_k \ge d_{min}\}\), where pmax and dmin are preset thresholds. If the set is empty, the null hypothesis of no cluster structure is retained. Otherwise, the number of clusters is estimated as the k corresponding to the largest significant difference statistic dk.
Crucial parameters for the CLEST algorithm are the choice of the classifier and of the external index used for comparison. Dudoit and Fridlyand (2002) recommend linear discriminant analysis with a diagonal covariance matrix (DLDA) as the classifier and the Fowlkes-Mallows similarity measure (FM) as the external index. pmax and dmin are both set to 0.05, and the reference data sets are proposed to be sampled from the uniform distribution (similar to GAP).
In this study, the split of the original data in step 1a is used unchanged in the splitting scheme, but is replaced by two bootstrap samples or two random subsets in the respective schemes. In the simple scheme, the original data set is used twice.
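A minimal sketch of one CLEST iteration for a fixed k follows (not the authors' implementation): k-means stands in for the clustering procedure L, a nearest-centroid rule stands in for the recommended DLDA classifier, and ext is any external index function, e.g., the fm_index sketch from above combined with pair_counts.

```r
# Sketch of the inner CLEST loop for a fixed k (not the authors' code).
# Uses k-means as L and a nearest-centroid classifier in place of DLDA.
clest_tk <- function(X, k, B = 20, ext) {
  sims <- replicate(B, {
    idx   <- sample(nrow(X), nrow(X) %/% 2)          # step 1a: split the data
    learn <- X[idx, , drop = FALSE]
    test  <- X[-idx, , drop = FALSE]
    km <- kmeans(learn, centers = k, nstart = 10)    # step 1b: partition Y1
    # step 1c/1d: classify test points by nearest learning-set centroid
    d2 <- as.matrix(dist(rbind(km$centers, test)))[-(1:k), 1:k]
    predicted <- apply(d2, 1, which.min)
    clustered <- kmeans(test, centers = k, nstart = 10)$cluster  # Y2
    ext(predicted, clustered)                        # step 1e: compare labels
  })
  median(sims)                                       # step 2: t_k
}

# Usage, e.g.: clest_tk(X, k = 3, ext = function(a, b) fm_index(pair_counts(a, b)))
```

The resulting tk would then be compared against the same statistic computed on B0 reference data sets, exactly as in steps 3 and 4.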
1.10 Kulczynski Index
The Kulczynski index (Kulczyński 1928) is the arithmetic mean of the precision and recall coefficients:
$$KI = \frac{1}{2}\left(\frac{a}{a + b} + \frac{a}{a + c}\right)$$
1.11 Hubert \(\hat {\Gamma }\) Index
The Hubert \(\hat {\Gamma }\) index described in Halkidi et al. (2001) is the correlation coefficient of two indicator variables Z1 and Z2, defined as binary variables that take on the value 1 if observations mi and mj (i < j) are in the same cluster of the respective partition and 0 otherwise. The index is thus defined as
$$\hat{\Gamma} = \frac{\operatorname{Cov}(Z_1, Z_2)}{\sigma_{Z_1} \sigma_{Z_2}}$$
Using the pairwise group membership counts, the index can also be written as
$$\hat{\Gamma} = \frac{N_t\, a - (a + b)(a + c)}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$
where \(N_t = a + b + c + d\) is the total number of pairs.
Unlike most other external indices, HUB takes on values between -1 and 1.
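Because Z1 and Z2 are simply binary co-membership indicators over all pairs, \(\hat{\Gamma}\) can be computed directly with cor(); a minimal sketch (ours):

```r
# Sketch: Hubert's Gamma-hat as the correlation of co-membership indicators.
hubert_gamma <- function(y1, y2) {
  n  <- length(y1)
  lt <- lower.tri(diag(n))                 # all pairs i < j
  z1 <- outer(y1, y1, "==")[lt]            # Z1: same cluster in Y1
  z2 <- outer(y2, y2, "==")[lt]            # Z2: same cluster in Y2
  cor(as.numeric(z1), as.numeric(z2))      # Pearson correlation
}
```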
1.12 Rogers-Tanimoto Index
The Rogers-Tanimoto index (Rogers and Tanimoto 1960) is defined as follows:
$$RT = \frac{a + d}{a + d + 2(b + c)}$$
Appendix B: Internal Indices
For internal indices the notation from external indices is extended by:
- W and B denote the within-group and between-group sums of squares, respectively,
- m1, …, mn denote the points representing all observations,
- C denotes a cluster and G a cluster centroid.
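For concreteness, a short sketch (ours) of how W and B can be computed for a given partition, assuming Euclidean distances and a label vector:

```r
# Sketch: within- and between-group sums of squares for a partition.
wb_sums <- function(X, labels) {
  X <- as.matrix(X)
  overall <- colMeans(X)                           # data set centroid
  W <- 0; B <- 0
  for (j in unique(labels)) {
    Cj <- X[labels == j, , drop = FALSE]
    Gj <- colMeans(Cj)                             # cluster centroid G_j
    W  <- W + sum(sweep(Cj, 2, Gj)^2)              # within-cluster SS
    B  <- B + nrow(Cj) * sum((Gj - overall)^2)     # between-cluster SS
  }
  c(W = W, B = B)
}
```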
2.1 Hartigan Index
The Hartigan index (Hartigan 1975) is defined as
$$H(k) = \left(\frac{W_k}{W_{k+1}} - 1\right)(n - k - 1)$$
where Wk denotes the within-group sum of squares of the k-cluster solution.
2.2 Banfield-Raftery Index
Banfield and Raftery (1993) define an index that is calculated as the weighted sum of the logarithms of the mean within-cluster sum of squares of each cluster:
$$BR = \sum_{j=1}^{k} n_j \log\left(\frac{W_j}{n_j}\right)$$
where Wj denotes the within-cluster sum of squares of cluster Cj and nj its size.
2.3 Calinski-Harabasz Index
The CH index, introduced by Calinski and Harabasz (1974), is defined as
$$CH(k) = \frac{B / (k - 1)}{W / (n - k)}$$
2.4 Krzanowski-Lai Index
The KL index, proposed by Krzanowski and Lai (1988), is given by Eq. 16:
$$KL(k) = \left|\frac{DIFF_k}{DIFF_{k+1}}\right|$$
where DIFF is defined as
$$DIFF_k = (k - 1)^{2/p}\, W_{k-1} - k^{2/p}\, W_k$$
and p is the number of variables. The maximum value of KL(k) indicates the optimal number of clusters.
2.5 Log-SS-Ratio Index
The Log-SS-Ratio index used in Dimitriadou et al. (2002) is given by Eq. 18:
$$LogSSR(k) = \log\left(\frac{B}{W}\right)$$
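Several of these indices (Hartigan, Calinski-Harabasz, Krzanowski-Lai, Log-SS-Ratio) depend only on the sums of squares of the solutions for successive k. The following sketch computes them jointly, assuming k-means produces the partitions (function and variable names are ours):

```r
# Sketch: sum-of-squares based indices along a k-means path for k = 2..K-1.
ss_indices <- function(X, K) {
  n <- nrow(X); p <- ncol(X)
  fits <- lapply(1:K, function(k) kmeans(X, centers = k, nstart = 10))
  W <- sapply(fits, function(f) f$tot.withinss)   # W_1, ..., W_K
  B <- sapply(fits, function(f) f$betweenss)      # B_1, ..., B_K
  k <- 2:(K - 1)
  hartigan <- (W[k] / W[k + 1] - 1) * (n - k - 1)
  ch       <- (B[k] / (k - 1)) / (W[k] / (n - k))
  diffk    <- (k - 1)^(2 / p) * W[k - 1] - k^(2 / p) * W[k]    # DIFF_k
  diffk1   <- k^(2 / p) * W[k] - (k + 1)^(2 / p) * W[k + 1]    # DIFF_{k+1}
  kl       <- abs(diffk / diffk1)
  logssr   <- log(B[k] / W[k])
  data.frame(k = k, hartigan = hartigan, ch = ch, kl = kl, logssr = logssr)
}
```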
2.6 Trace-W Index
The Trace-W index in Eq. 19 is defined as the within-group sum of squares of the partition:
$$TraceW(k) = W = \sum_{j=1}^{k} \sum_{i \in C_j} \|m_i - G_j\|^2$$
2.7 Ball-Hall Index
Ball and Hall (1965) introduced an index that is calculated as the mean of the squared distances of all points with respect to their cluster centroid:
$$BH(k) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{n_j} \sum_{i \in C_j} \|m_i - G_j\|^2$$
In the special case where all clusters have equal size, the equation reduces to \(\frac {1}{n}W\).
2.8 Davies-Bouldin Index
For the Davies-Bouldin index (Davies and Bouldin 1979), we first define δk as the mean distance of the points that belong to cluster Ck to their cluster center Gk:
$$\delta_k = \frac{1}{n_k} \sum_{i \in C_k} \|m_i - G_k\|$$
We also define \(\delta _{k,k^{\prime }}\) as the distance between two cluster centers Gk and \(G_{k^{\prime }}\):
$$\delta_{k,k^{\prime}} = \|G_k - G_{k^{\prime}}\|$$
For each cluster k, the maximum Mk of \(\frac {\delta _{k} + \delta _{k^{\prime }}}{\delta _{k,k^{\prime }}}\) over all k′≠k is computed. The Davies-Bouldin index is the mean of the values Mk:
$$DB = \frac{1}{k} \sum_{j=1}^{k} M_j$$
2.9 Dunn Index
The Dunn index (Dunn 1974) essentially measures the distance between two clusters Ck and \(C_{k^{\prime }}\) by calculating the distance between their closest points:
$$d_{k,k^{\prime}} = \min_{i \in C_k,\, j \in C_{k^{\prime}}} \|m_i - m_j\|$$
and taking dmin as the minimum of the vector of distances \(d_{k,k^{\prime }}\):
$$d_{min} = \min_{k \ne k^{\prime}} d_{k,k^{\prime}}$$
Furthermore, we denote by Dk the largest distance between two points within a cluster:
$$D_k = \max_{i, j \in C_k} \|m_i - m_j\|$$
and define dmax as the largest of the cluster diameters Dk:
$$d_{max} = \max_{k} D_k$$
Finally, the Dunn index is defined as follows in Eq. 28:
$$Dunn = \frac{d_{min}}{d_{max}}$$
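These definitions translate directly into operations on a distance matrix; a minimal sketch (ours):

```r
# Sketch: Dunn index from a full distance matrix and a label vector.
dunn_index <- function(X, labels) {
  D  <- as.matrix(dist(X))
  ks <- unique(labels)
  # d_min: smallest between-cluster point distance
  d_min <- min(sapply(ks, function(k)
    min(D[labels == k, labels != k, drop = FALSE])))
  # d_max: largest within-cluster diameter D_k
  d_max <- max(sapply(ks, function(k)
    max(D[labels == k, labels == k, drop = FALSE])))
  d_min / d_max
}
```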
2.10 PBM Index
The PBM index developed by Pakhira et al. (2004) is calculated as follows from Eq. 29:
$$PBM(k) = \left(\frac{1}{k} \cdot \frac{E_T}{E_W} \cdot D_B\right)^2$$
where EW is the sum of the distances of the points in each cluster to their centroid and ET is the analogous sum with respect to the data set centroid (i.e., the one-cluster solution). Obviously, ET does not depend on the partition or the number of clusters, but is a constant. DB denotes the largest distance between two cluster centroids:
$$D_B = \max_{k, k^{\prime}} \|G_k - G_{k^{\prime}}\|$$
2.11 Silhouette Index
The silhouette index, introduced by Rousseeuw (1987), is computed by Eq. 31:
$$SIL = \frac{1}{n} \sum_{i=1}^{n} s(i)$$
where
$$s(i) = \frac{d(i) - a(i)}{\max\{a(i),\, d(i)\}}$$
a(i) and d(i) have been slightly modified in this study. Usually, these values denote the average dissimilarity of the ith object to all other objects of the same cluster and of the nearest cluster, respectively. In this study, we have replaced this with the dissimilarity to the respective cluster centroids. This reduces the computational burden immensely and is a justifiable trade-off for a possibly slight decrease in accuracy. The index is therefore abbreviated QSIL (Quasi-Silhouette) in this paper.
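A sketch of this centroid-based variant (the abbreviation QSIL is from the paper; the code is ours):

```r
# Sketch: quasi-silhouette with centroid distances instead of
# average pairwise dissimilarities.
qsil <- function(X, labels) {
  X  <- as.matrix(X)
  ks <- sort(unique(labels))
  G  <- t(sapply(ks, function(k) colMeans(X[labels == k, , drop = FALSE])))
  # distance of every observation to every cluster centroid
  D  <- as.matrix(dist(rbind(G, X)))[-(1:nrow(G)), 1:nrow(G)]
  s  <- sapply(seq_len(nrow(X)), function(i) {
    own <- match(labels[i], ks)
    a_i <- D[i, own]                   # distance to own centroid, a(i)
    d_i <- min(D[i, -own])             # distance to nearest other centroid, d(i)
    (d_i - a_i) / max(a_i, d_i)
  })
  mean(s)                              # QSIL value of the partition
}
```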
2.12 Xu Index
Proposed by Xu (1997), the Xu index is given by Eq. 36:
$$Xu(k) = d \log_2\left(\sqrt{\frac{W}{d\, n^2}}\right) + \log k$$
where d is the number of variables in the data set.
2.13 Gap Statistic
The Gap statistic, proposed by Tibshirani et al. (2001), is computed as follows in Eq. 37:
$$Gap(k) = \frac{1}{B} \sum_{b=1}^{B} \log(W_{kb}) - \log(W_k)$$
where B is the number of reference data sets generated from a uniform distribution over the bounding rectangle of the original data, and Wkb denotes the within-cluster sum of squares of the bth reference data set. The optimal number of clusters is indicated by the largest value of the statistic.
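A sketch of Eq. 37 with k-means standing in for the clustering procedure; the reference data are drawn uniformly within the bounding rectangle, as described above:

```r
# Sketch: Gap statistic for a single k, with k-means as the clusterer.
gap_stat <- function(X, k, B = 50) {
  X   <- as.matrix(X)
  Wk  <- kmeans(X, centers = k, nstart = 10)$tot.withinss
  Wkb <- replicate(B, {
    # uniform reference data within the bounding rectangle of X
    ref <- apply(X, 2, function(col)
      runif(nrow(X), min = min(col), max = max(col)))
    kmeans(ref, centers = k, nstart = 10)$tot.withinss
  })
  mean(log(Wkb)) - log(Wk)             # Eq. 37
}
```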
Appendix C: Simulation Settings
The following table lists the parameters for each of the simulation settings. These values are used as input to the function genRandomClust of the package clusterGeneration. The size of the clusters is determined by taking \(\frac {Observations}{k}\) as a baseline value and multiplying it by 0.97 and 1.03 to obtain a lower and an upper boundary. Within this range, a value for each cluster is randomly selected. In this way, the clusters have roughly equal, but not exactly the same, size. All other inputs of genRandomClust have been left at their default values.
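A sketch of this size-sampling step follows. The setting values (n, k, sepVal, numNonNoisy) are illustrative placeholders, not values from the paper's table, and the call assumes the clustSizes/clustszind interface of genRandomClust, where clustszind = 3 selects user-supplied cluster sizes:

```r
library(clusterGeneration)

# Sketch (illustrative values): draw roughly equal cluster sizes within
# +/- 3% of the equal-size baseline, then generate the data.
n <- 500; k <- 5                                   # placeholder setting
base  <- n / k                                     # equal-size baseline
sizes <- round(runif(k, 0.97 * base, 1.03 * base)) # one size per cluster

sim <- genRandomClust(numClust     = k,
                      sepVal       = 0.21,         # illustrative separation
                      numNonNoisy  = 4,            # illustrative dimensionality
                      clustszind   = 3,            # user-specified sizes
                      clustSizes   = sizes,
                      numReplicate = 1)
```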