
Effects of Resampling in Determining the Number of Clusters in a Data Set


Abstract

Cluster validation indices are widely used to determine the number of groups in a data set and are thus a crucial step in the model validation process in clustering. The study presented in this paper demonstrates how the accuracy of certain indices can be significantly improved when they are calculated repeatedly on data sets resampled from the original data. There are many ways to resample data; in this study, three very common options are used: bootstrapping, data splitting (without subset overlap of the two subsamples), and random subsetting (with subset overlap of the two subsamples). Index values calculated on the resampled data sets are compared to the values obtained from the original data partition. The primary hypothesis of the study states that resampling generally improves index accuracy. The hypothesis is based on the notion of cluster stability: if there are stable clusters in a data set, a clustering algorithm should produce consistent results for data sampled or resampled from the same source. The primary hypothesis was partly confirmed: it does indeed hold for external validation measures. The secondary hypothesis states that the resampling strategy itself does not play a significant role. This was also shown to be accurate, although slight deviations between the resampling schemes suggest that splitting yields slightly better results.






Corresponding author

Correspondence to Rainer Dangl.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: External Indices

As already mentioned, external indices use pairwise comparisons of group labels to calculate similarity. When comparing the group labels of two randomly selected observations from two partitions Y1 and Y2, four combinations are possible:

  • the two observations are located in the same cluster in both partitions (a),

  • they are located in different clusters in both partitions (d),

  • they are in the same cluster in Y1 but in different clusters in Y2 (b), or vice versa (c).
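Most of the indices below are simple functions of these four pair counts. A minimal R sketch that tallies them for two label vectors over the same observations (helper and variable names are our own, not from the paper):

    # Count the four pair types (a, b, c, d) for two partitions y1, y2
    pair_counts <- function(y1, y2) {
      stopifnot(length(y1) == length(y2))
      lower <- lower.tri(matrix(0, length(y1), length(y1)))
      same1 <- outer(y1, y1, "==")[lower]   # pair in the same cluster in Y1?
      same2 <- outer(y2, y2, "==")[lower]   # pair in the same cluster in Y2?
      c(a = sum(same1 & same2),  b = sum(same1 & !same2),
        c = sum(!same1 & same2), d = sum(!same1 & !same2))
    }

    # Example: two k-means partitions of the iris measurements
    y1  <- kmeans(iris[, 1:4], 3)$cluster
    y2  <- kmeans(iris[, 1:4], 4)$cluster
    cnt <- pair_counts(y1, y2)
    RI  <- (cnt["a"] + cnt["d"]) / sum(cnt)             # Rand index, Eq. 1
    J   <- cnt["a"] / (cnt["a"] + cnt["b"] + cnt["c"])  # Jaccard similarity, Eq. 5

The outer/lower.tri construction enumerates each unordered pair exactly once, so the four counts always sum to n(n − 1)/2.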

Further notation:

  • n denotes the total number of observations in X

  • m denotes a specific observation in X

  • k is the number of clusters, where K is the maximum number of clusters that is calculated

  • L denotes a clustering procedure

Other notation is introduced when required.

1.1 Rand Index

The Rand index (Rand 1971) is defined by

$$ RI = \frac{a+d}{a+b+c+d} $$
(1)

1.2 Adjusted Rand Index

The ARI is a modified form of the Rand index that corrects for matches that are due to pure chance; in contrast to the Rand index, the ARI can take on values between −1 and 1 (Hubert and Arabie 1985).

$$ ARI = \frac{{\sum}_{ij} {n_{ij} \choose 2} - \left[{\sum}_{i} {n_{i.} \choose 2} {\sum}_{j} {n_{.j} \choose 2}\right] / {n \choose 2}}{\frac{1}{2} \left[{\sum}_{i} {n_{i.} \choose 2} + {\sum}_{j} {n_{.j} \choose 2}\right] - \left[{\sum}_{i} {n_{i.} \choose 2} {\sum}_{j} {n_{.j} \choose 2}\right] / {n \choose 2}} $$
(2)

where nij is the number of observations that are common to cluster i in Y1 and cluster j in Y2 (i.e., a), and ni. and n.j denote the number of observations in cluster i and cluster j in Y1 and Y2, respectively.
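Equation 2 translates almost literally into R via the contingency table of the two partitions. A sketch (mclust::adjustedRandIndex provides a tested implementation; the function below is our own illustration):

    # Adjusted Rand index from the contingency table n_ij of two partitions
    ari <- function(y1, y2) {
      tab    <- table(y1, y2)                     # n_ij
      s_ij   <- sum(choose(tab, 2))               # sum over all cells
      s_i    <- sum(choose(rowSums(tab), 2))      # n_i. terms
      s_j    <- sum(choose(colSums(tab), 2))      # n_.j terms
      chance <- s_i * s_j / choose(sum(tab), 2)   # agreement expected by chance
      (s_ij - chance) / ((s_i + s_j) / 2 - chance)
    }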

1.3 Fowlkes-Mallows Index

The Fowlkes-Mallows index (Fowlkes and Mallows 1983) is given by Eq. 3

$$ FM = \sqrt{\frac{a}{a+b} \frac{a}{a+c}} $$
(3)

1.4 Prediction Strength

Prediction Strength, proposed by Tibshirani and Walther (2005), modifies the concept of an external index in that it measures agreement between partitions not globally, but cluster-wise. The algorithm works as follows:

  1. Split the original data X into a training set and a test set (\(X^{\prime }_{1}\) and \(X^{\prime }_{2}\), respectively).

  2. Cluster \(X^{\prime }_{1}\) and \(X^{\prime }_{2}\) into k clusters, where 2 ≤ k ≤ K.

  3. For each test set cluster, count the proportion of pairs of observations that would also be in the same cluster if they were assigned to the closest centroid of the training set clusters.

  4. The minimum over all clusters (i.e., the least stable cluster) is defined as the Prediction Strength.

Thus, for a candidate number of clusters k, Ak1, Ak2, …, Akk denote the indices of the test observations in test clusters 1, 2, …, k. Furthermore, nk1, nk2, …, nkk are defined as the numbers of observations in the test clusters. The Prediction Strength for a particular k is then given by

$$ PS(k) = \min\limits_{1 \leq j \leq k} \frac{1}{n_{kj}(n_{kj} - 1)} \sum\limits_{i \neq i^{\prime} \in A_{kj}} D [ P(X_{tr},k), X_{te} ]_{ii^{\prime}} $$
(4)

where \(D [ P(X_{tr},k), X_{te} ]_{ii^{\prime }}\) is an N × N matrix (N being the number of test observations) whose ii′th element equals 1 if observations i and i′ from the test data fall into the same cluster when the cluster centers from the training set clustering (P(Xtr,k)) are used for assignment. A slightly different notation is used here: P denotes a partition of X; thus, P(Xtr,k) is the partition of the training set into k clusters.
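A compact R sketch of the four steps above, using k-means both for the clustering and for the centroid assignment (the paper does not prescribe this particular implementation; names are ours):

    # Prediction Strength for one candidate k (Eq. 4), sketched with k-means
    prediction_strength <- function(X, k) {
      idx <- sample(nrow(X), nrow(X) %/% 2)
      tr  <- X[idx, , drop = FALSE]                  # training set
      te  <- X[-idx, , drop = FALSE]                 # test set
      km_tr <- kmeans(tr, k)
      km_te <- kmeans(te, k)
      # assign each test point to the closest training centroid
      dmat <- as.matrix(dist(rbind(km_tr$centers, te)))[-(1:k), 1:k]
      tr_label <- apply(dmat, 1, which.min)
      ps <- sapply(1:k, function(j) {
        members <- which(km_te$cluster == j)         # test cluster A_kj
        nkj <- length(members)
        if (nkj < 2) return(0)
        same <- outer(tr_label[members], tr_label[members], "==")
        (sum(same) - nkj) / (nkj * (nkj - 1))        # co-assigned pairs, i != i'
      })
      min(ps)                                        # least stable cluster
    }

    prediction_strength(as.matrix(iris[, 1:4]), 3)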

1.5 Jaccard Similarity

Using the same notation as for the FM and Rand Index, the Jaccard similarity (Jaccard 1912) is given by

$$ J = \frac{a}{a+b+c} $$
(5)

1.6 McNemar Index

The McNemar index, as described in Desgraupes (2013), is defined by

$$ NI = \frac{d-c}{\sqrt{d+c}} $$
(6)

1.7 Sokal-Sneath Index

The Sokal-Sneath index (Sokal et al. 1963) is given by

$$ SSI = \frac{a}{a+2(b+c)} $$
(7)

1.8 Czekanowski-Dice Index

The Czekanowski-Dice index, as described in Desgraupes (2013), is defined as

$$ CDI = \frac{2a}{2a+b+c} $$
(8)

The index is the harmonic mean of precision and recall coefficients and thus identical to the F measure.

1.9 CLEST

The CLEST algorithm (Dudoit and Fridlyand 2002) works as follows: for each k, 2 ≤ k ≤ K, where K is the maximum number of clusters considered (K < n), perform steps 1–4.

  1. Repeat B times:

     (a) Randomly split the original data set X into two non-overlapping sets, a learning set \(X^{\prime }_{1}\) and a test set \(X^{\prime }_{2}\).

     (b) Apply the clustering procedure L to the learning set \(X^{\prime }_{1}\) to obtain a partition Y1.

     (c) Build a classifier Cl using the learning set and the cluster labels from Y1.

     (d) Apply the clustering procedure L to the test set \(X^{\prime }_{2}\) to obtain a partition Y2.

     (e) Apply the classifier Cl to the test set and use an external index to compare the predicted labels with Y2.

  2. Let tk be the median of the B similarity statistics obtained in step 1.

  3. Generate B0 data sets under a suitable null hypothesis. For each reference data set, repeat steps 1 and 2 to obtain B0 similarity statistics tk,1, …, \(t_{k,B_{0}}\).

  4. Define \({t^{0}_{k}}\) as the average of these B0 statistics and let pk be the proportion of the tk,b, 1 ≤ b ≤ B0, that are at least as large as the observed value tk, i.e., the p value for tk. Finally, define \(d_{k} = t_{k} - {t^{0}_{k}}\) as the difference between the observed and expected similarity statistics under the null hypothesis.

Now, the set V is defined as V = {2 ≤ k ≤ K : pk ≤ pmax, dk ≥ dmin}, where pmax and dmin are preset thresholds. If the set is empty, the null hypothesis of no cluster structure cannot be rejected. Otherwise, the number of clusters is estimated as the k corresponding to the largest significant difference statistic dk.

Crucial parameters for the CLEST algorithm are the choice of the classifier and of the external index used for comparison. Dudoit and Fridlyand (2002) recommend linear discriminant analysis with a diagonal covariance matrix (DLDA) as the classifier and the Fowlkes-Mallows similarity measure (FM) as the external index. pmax and dmin are both set to 0.05, and the reference data sets are proposed to be sampled from the uniform distribution (similar to GAP).

In this study, the data splitting in step 1a is used as stated only in the splitting scheme; in the bootstrap and subsetting schemes, the two sets are replaced by two bootstrap samples and two subsets, respectively. In the simple scheme, the original data set is used twice.
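A condensed sketch of one CLEST statistic tk in R, reusing pair_counts from above, with k-means as the clustering procedure and a nearest-centroid classifier standing in for the recommended DLDA (all implementation choices here are illustrative substitutions):

    # Median similarity statistic t_k for one candidate k (steps 1-2)
    clest_stat <- function(X, k, B = 20) {
      fm <- function(p1, p2) {                       # Fowlkes-Mallows, Eq. 3
        cnt <- pair_counts(p1, p2)
        unname(sqrt(cnt["a"]^2 / ((cnt["a"] + cnt["b"]) * (cnt["a"] + cnt["c"]))))
      }
      stats <- replicate(B, {
        idx   <- sample(nrow(X), nrow(X) %/% 2)
        learn <- X[idx, , drop = FALSE]
        test  <- X[-idx, , drop = FALSE]
        km <- kmeans(learn, k)                       # partition Y1 on the learning set
        # nearest-centroid classifier trained on the learning-set labels
        dmat <- as.matrix(dist(rbind(km$centers, test)))[-(1:k), 1:k]
        pred <- apply(dmat, 1, which.min)            # classifier labels on the test set
        y2   <- kmeans(test, k)$cluster              # partition Y2 on the test set
        fm(pred, y2)                                 # step 1e
      })
      median(stats)                                  # step 2
    }

Steps 3 and 4 then repeat the same computation on reference data drawn from a uniform distribution and compare tk against the resulting null distribution.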

1.10 Kulczynski Index

The Kulczynski index is defined by Kulczyński (1928)

$$ KI = \frac{1}{2}(\frac{a}{a+c} + \frac{a}{a+b}) $$
(9)

1.11 Hubert \(\hat {\Gamma }\) Index

The Hubert \(\hat {\Gamma }\) index described in Halkidi et al. (2001) is the correlation coefficient of two indicator variables Z1 and Z2, binary variables that take on the value 1 if observations mi and mj (i < j) are in the same cluster of the respective partition and 0 otherwise. With M = n(n − 1)/2 denoting the number of observation pairs, the index is defined as

$$ HUB = Corr(Z_{1}, Z_{2}) = \frac{{\sum}_{i<j}(Z_{1}(i,j)-\mu_{Z_{1}})(Z_{2}(i,j)-\mu_{Z_{2}})}{M \sigma_{Z_{1}} \sigma_{Z_{2}}} $$
(10)

Using the pairwise group membership counts, where M = a + b + c + d, the index can also be written as

$$ HUB = \frac{M \times a - (a+b)(a+c)}{\sqrt{(a+b)(a+c)(d+b)(d+c)}} $$
(11)

Unlike most other external indices, HUB takes on values between -1 and 1.

1.12 Rogers-Tanimoto Index

The Rogers-Tanimoto index (Rogers and Tanimoto 1960) is defined as follows:

$$ RTI(k) = \frac{a + d}{a + d + 2(b + c)} $$
(12)

Appendix B: Internal Indices

For internal indices the notation from external indices is extended by:

  • W and B are defined as the within-group and between-group sums of squares, respectively

  • m1, …, mn denote the points representing all observations

  • C denotes a cluster and G a cluster centroid
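Several of the indices below require only Wk (and sometimes Bk) over a range of k. A small R helper computing both from k-means fits, together with two of the indices (a sketch; names are ours):

    # Within- and between-group sums of squares for k = 1..K via k-means
    ss_by_k <- function(X, K) {
      t(sapply(1:K, function(k) {
        km <- kmeans(X, centers = k, nstart = 10)
        c(k = k, W = km$tot.withinss, B = km$betweenss)
      }))
    }

    ss <- ss_by_k(as.matrix(iris[, 1:4]), K = 8)
    n  <- nrow(iris)
    # Hartigan (Eq. 13) for k = 1..K-1; Calinski-Harabasz (Eq. 15), undefined at k = 1
    hart <- (head(ss[, "W"], -1) / ss[-1, "W"] - 1) * (n - head(ss[, "k"], -1) - 1)
    ch   <- (n - ss[, "k"]) / (ss[, "k"] - 1) * ss[, "B"] / ss[, "W"]

The same table also yields Trace-W (Eq. 19), the Log-SS-Ratio (Eq. 18), and the Xu index (Eq. 36) directly.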

2.1 Hartigan Index

The Hartigan index (Hartigan 1975) is defined as

$$ H(k) = \left(\frac{W_{k}}{W_{k+1}} - 1\right)(n - k - 1) $$
(13)

2.2 Banfield-Raftery Index

Banfield and Raftery (1993) define an index calculated as the weighted sum of the logarithms of the mean within-cluster sum of squares Wj of each cluster j.

$$ BFI(k) = \sum\limits_{j=1}^{k} n_{j} \log\left(\frac{W_{j}}{n_{j}}\right) $$
(14)

2.3 Calinski-Harabasz Index

The CH index, introduced by Calinski and Harabasz (1974), is defined as

$$ CH(k) = \frac{n - k}{k - 1} \frac{B}{W} $$
(15)

2.4 Krzanowski-Lai Index

The KL index, proposed by Krzanowski and Lai (1988), is given by Eq. 16:

$$ KL(k) = |\frac{DIFF_{k}}{DIFF_{k+1}}| $$
(16)

where DIFF is defined as

$$ DIFF(k) = (k-1)^{2/p}\, trace(W_{k-1}) - k^{2/p}\, trace(W_{k}) $$
(17)

where p is the number of variables. The maximum value of KL(k) indicates the optimal number of clusters.

2.5 Log-SS-Ratio Index

The Log-SS-Ratio index used in Dimitriadou et al. (2002) is given by Eq. 18:

$$ LSR(k) = log(\frac{B_{k}}{W_{k}}) $$
(18)

2.6 Trace-W Index

The Trace-W index in Eq. 19 is defined as the within-sum of squares of the partition:

$$ TR(k) = W_{k} $$
(19)

2.7 Ball-Hall Index

Ball and Hall (1965) introduced an index calculated as the mean of the squared distances of all points to their cluster centroid:

$$ BALL(k) = \frac{1}{K} \sum\limits_{k=1}^{K} \frac{1}{n_{k}} \sum\limits_{i \in C_{k}} \|{m_{i}^{k}} - G^{k}\|^{2} $$
(20)

In the special case where all clusters have equal size, the equation reduces to \(\frac {1}{n}W\).

2.8 Davies-Bouldin Index

For the Davies-Bouldin index (Davies and Bouldin 1979), we first define δk as the mean distance of the points belonging to cluster Ck to their cluster center Gk:

$$ \delta_{k} = \frac{1}{n_{k}} \sum\limits_{i \in C_{k}} \|{m_{i}^{k}} - G^{k} \| $$
(21)

We also define \({\Delta}_{k,k^{\prime }}\) as the distance between two cluster centers Gk and \(G^{k^{\prime }}\):

$$ {\Delta}_{k,k^{\prime}} = d(G^{k}, G^{k^{\prime}}) = \| G^{k^{\prime}} - G^{k} \| $$
(22)

For each cluster k, the maximum Mk of \(\frac {\delta _{k} + \delta _{k^{\prime }}}{{\Delta}_{k,k^{\prime }}}\) is computed over all k′ ≠ k. The Davies-Bouldin index is the mean of the values Mk:

$$ DB(k) = \frac{1}{K} \sum\limits^{K}_{k=1} M_{k} = \frac{1}{K} \sum\limits^{K}_{k=1} \max\limits_{k \neq k^{\prime}} \frac{\delta_{k} + \delta_{k^{\prime}}}{{\Delta}_{k,k^{\prime}}} $$
(23)
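A direct transcription of Eqs. 21-23 in R for a given partition with centroids (a sketch; names are ours):

    # Davies-Bouldin index from data, cluster labels, and centroid matrix
    davies_bouldin <- function(X, cluster, centers) {
      k <- nrow(centers)
      delta <- sapply(1:k, function(j) {             # Eq. 21
        pts <- X[cluster == j, , drop = FALSE]
        mean(sqrt(rowSums(sweep(pts, 2, centers[j, ])^2)))
      })
      Delta <- as.matrix(dist(centers))              # Eq. 22
      M <- sapply(1:k, function(j)                   # worst ratio per cluster
        max((delta[j] + delta[-j]) / Delta[j, -j]))
      mean(M)                                        # Eq. 23
    }

    km <- kmeans(as.matrix(iris[, 1:4]), 3)
    davies_bouldin(as.matrix(iris[, 1:4]), km$cluster, km$centers)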

2.9 Dunn Index

The Dunn index (Dunn 1974) essentially measures the distance between two clusters Ck and \(C_{k^{\prime }}\) by calculating the distance between their closest points:

$$ d_{k,k^{\prime}} = \min\limits_{i \in C_{k},\, j \in C_{k^{\prime}}} \|{m^{k}_{i}} - m^{k^{\prime}}_{j} \| $$
(24)

and taking dmin as the minimum of the vector of distances \(d_{k,k^{\prime }}\):

$$ d_{min} = \min\limits_{k \neq k^{\prime}} d_{k,k^{\prime}} $$
(25)

Furthermore, we denote by Dk the largest distance between two points within a cluster:

$$ D_{k} = \max\limits_{i,j \in C_{k},\, i \neq j} \|{m^{k}_{i}} - {m^{k}_{j}} \| $$
(26)

and define dmax as the largest of the cluster diameters Dk:

$$ d_{max} = \max\limits_{1 \leq k \leq K} D_{k} $$
(27)

Finally, the Dunn index is defined as follows in Eq. 28:

$$ DUNN(k) = \frac{d_{min}}{d_{max}} $$
(28)
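Equations 24-28 in R, working directly from a distance matrix (a sketch; names are ours):

    # Dunn index from a distance matrix and cluster labels
    dunn <- function(D, cluster) {
      D  <- as.matrix(D)
      ks <- sort(unique(cluster))
      d_min <- min(sapply(ks, function(k)            # Eqs. 24-25
        min(sapply(setdiff(ks, k), function(k2)
          min(D[cluster == k, cluster == k2])))))
      d_max <- max(sapply(ks, function(k)            # Eqs. 26-27
        max(D[cluster == k, cluster == k])))
      d_min / d_max                                  # Eq. 28
    }

    dunn(dist(iris[, 1:4]), kmeans(iris[, 1:4], 3)$cluster)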

2.10 PBM Index

The PBM index, developed by Pakhira et al. (2004), is calculated according to Eq. 29:

$$ PBM(k) = (\frac{1}{k} \times \frac{E_{T}}{E_{W}} \times D_{B})^{2} $$
(29)

where EW is the sum of the distances of the points in each cluster to their cluster centroid, and ET is the analogous sum of distances to the overall data centroid (i.e., the one-cluster solution). ET therefore does not depend on the partition or the number of clusters but is a constant. DB denotes the largest distance between two cluster centroids:

$$ D_{B} = \max\limits_{k \neq k^{\prime}} d(G_{k}, G_{k^{\prime}}) $$
(30)
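Equations 29 and 30 in R for a k-means partition of a numeric matrix (a sketch; names are ours):

    # PBM index for a k-means partition of a numeric matrix X
    pbm <- function(X, k) {
      km  <- kmeans(X, k)
      E_T <- sum(sqrt(rowSums(sweep(X, 2, colMeans(X))^2)))        # distances to data centroid
      E_W <- sum(sqrt(rowSums((X - km$centers[km$cluster, ])^2)))  # distances to own centroid
      D_B <- max(dist(km$centers))                                 # Eq. 30
      ((1 / k) * (E_T / E_W) * D_B)^2                              # Eq. 29
    }

    pbm(as.matrix(iris[, 1:4]), 3)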

2.11 Silhouette Index

The silhouette index, introduced by Rousseeuw (1987), is computed by Eq. 31

$$ Silhouette(k) = \frac{{\sum}^{n}_{i=1} S(i)}{n}, \qquad Silhouette \in [-1,1] $$
(31)

where

$$ S(i) = \frac{b(i)-a(i)}{\max\{a(i), b(i)\}} $$
(32)

and

$$ a(i) = \frac{{\sum}_{j \in \{ C_{r} \backslash i\}} d_{ij}}{n_{r} - 1} $$
(33)
$$ b(i) = \min\limits_{s \neq r} \{d_{i_{C_{s}}}\} $$
(34)
$$ d_{i_{C_{s}}} = \frac{{\sum}_{j \in C_{s}} d_{ij}}{n_{s}} $$
(35)

a(i) and b(i) have been slightly modified in this study. Usually, these values indicate the average dissimilarity of the ith object to all other objects of the same cluster and of the nearest cluster, respectively. In this study, we have replaced this with the dissimilarity to the respective cluster centroids. This reduces the computational burden immensely and is a justifiable trade-off for a possibly slightly decreased accuracy. The index is therefore abbreviated QSIL (Quasi-Silhouette) in the paper.
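A sketch of this centroid-based variant in R (our reading of the modification described above; names are ours):

    # Quasi-silhouette: a(i) and b(i) via distances to centroids instead of all points
    qsil <- function(X, cluster, centers) {
      k <- nrow(centers)
      d_cent <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]  # point-to-centroid distances
      own <- cbind(seq_len(nrow(X)), cluster)
      a <- d_cent[own]                    # distance to own centroid
      d_cent[own] <- Inf
      b <- apply(d_cent, 1, min)          # distance to nearest other centroid
      mean((b - a) / pmax(a, b))          # Eqs. 31-32 with centroid distances
    }

    km <- kmeans(as.matrix(iris[, 1:4]), 3)
    qsil(as.matrix(iris[, 1:4]), km$cluster, km$centers)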

2.12 Xu Index

Proposed by Xu (1997), the Xu index is given by Eq. 36:

$$ XU(k) = d \log (\sqrt{\frac{W_{k}}{dn^{2}}}) + \log(k) $$
(36)

where d is the number of variables in the data set.

2.13 Gap Statistic

The Gap statistic, proposed by Tibshirani et al. (2001), is computed as follows in Eq. 37:

$$ GAP(k) = \frac{1}{B} \sum\limits^{B}_{b=1} \log W_{kb} - \log W_{k} $$
(37)

where B is the number of reference data sets generated from a uniform distribution over the bounding rectangle of the original data, and Wkb denotes the within-cluster sum of squares of the bth reference data set. The optimal number of clusters is the value of k that maximizes the Gap statistic.
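A sketch of Eq. 37 in R for a single k, drawing the reference sets column-wise from the bounding rectangle (names are ours; cluster::clusGap provides a full implementation):

    # Gap statistic for one candidate k with B uniform reference sets
    gap_k <- function(X, k, B = 20) {
      logW <- function(Z) log(kmeans(Z, k, nstart = 5)$tot.withinss)
      ref <- replicate(B, {
        Z <- apply(X, 2, function(col) runif(nrow(X), min(col), max(col)))
        logW(Z)                           # log W_kb on reference data
      })
      mean(ref) - logW(X)                 # Eq. 37
    }

    gap_k(as.matrix(iris[, 1:4]), 3)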

Appendix C: Simulation Settings

The following table lists the parameters for each simulation setting. These values are used as input to the function genRandomClust of the package clusterGeneration. The size of the clusters is determined by selecting the value \(\frac {Observations \times Variables}{k}\) and multiplying it by 0.97 and 1.03 to obtain a lower and an upper boundary; within this range, a value for each cluster is randomly selected. The clusters thus have roughly equal, but not exactly the same, sizes. All other inputs of genRandomClust are left at their default values.

Table 15 Summary of simulation parameters
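As an illustration of the size rule (a sketch assuming the per-cluster target is the number of observations divided by k; the values of k, n_obs, and numNonNoisy are placeholders, not taken from Table 15):

    # Roughly equal cluster sizes drawn within 97%-103% of the target size
    library(clusterGeneration)
    k      <- 4
    n_obs  <- 500                                    # assumed total number of observations
    target <- n_obs / k
    sizes  <- round(runif(k, 0.97 * target, 1.03 * target))
    dat <- genRandomClust(numClust = k, numNonNoisy = 10,
                          clustszind = 3, clustSizes = sizes)  # user-specified sizes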


Cite this article

Dangl, R., Leisch, F. Effects of Resampling in Determining the Number of Clusters in a Data Set. J Classif 37, 558–583 (2020). https://doi.org/10.1007/s00357-019-09328-2

