
Effects of Resampling in Determining the Number of Clusters in a Data Set


Abstract

Cluster validation indices are widely used to determine the number of groups in a data set and are thus a crucial step in the model validation process in clustering. The study presented in this paper demonstrates how the accuracy of certain indices can be significantly improved when they are calculated repeatedly on data sets resampled from the original data. There are many ways to resample data; in this study, three very common options are used: bootstrapping, data splitting (without subset overlap of the two subsamples), and random subsetting (with subset overlap of the two subsamples). Index values calculated on the resampled data sets are compared to the values obtained from the original data partition. The primary hypothesis of the study states that resampling generally improves index accuracy. The hypothesis is based on the notion of cluster stability: if there are stable clusters in a data set, a clustering algorithm should produce consistent results for data sampled or resampled from the same source. The primary hypothesis was partly confirmed: it does indeed hold for external validation measures. The secondary hypothesis states that the resampling strategy itself does not play a significant role. This was also shown to be accurate, although slight deviations between the resampling schemes suggest that splitting yields slightly better results.






Corresponding author

Correspondence to Rainer Dangl.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: External Indices

As already mentioned, external indices use pairwise comparisons of group labels to calculate similarity. When comparing the group labels of two randomly selected observations from two partitions Y1 and Y2, four combinations are possible:

  • the two observations are located in the same cluster in both partitions (a),

  • they are located in different clusters in both partitions (d),

  • they are in the same cluster in Y1 but in different clusters in Y2 (b), or vice versa (c).
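Most of the indices below are simple functions of these four pair counts. A minimal R sketch that tallies them for two label vectors over the same observations (helper and variable names are our own, not from the paper):

    # Count the four pair types (a, b, c, d) for two partitions y1, y2
    pair_counts <- function(y1, y2) {
      stopifnot(length(y1) == length(y2))
      lower <- lower.tri(matrix(0, length(y1), length(y1)))
      same1 <- outer(y1, y1, "==")[lower]   # pair in the same cluster in Y1?
      same2 <- outer(y2, y2, "==")[lower]   # pair in the same cluster in Y2?
      c(a = sum(same1 & same2),  b = sum(same1 & !same2),
        c = sum(!same1 & same2), d = sum(!same1 & !same2))
    }

    # Example: two k-means partitions of the iris measurements
    y1  <- kmeans(iris[, 1:4], 3)$cluster
    y2  <- kmeans(iris[, 1:4], 4)$cluster
    cnt <- pair_counts(y1, y2)
    RI  <- (cnt["a"] + cnt["d"]) / sum(cnt)             # Rand index, Eq. 1
    J   <- cnt["a"] / (cnt["a"] + cnt["b"] + cnt["c"])  # Jaccard similarity, Eq. 5

The outer/lower.tri construction enumerates each unordered pair exactly once, so the four counts always sum to n(n − 1)/2.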

Further notation:

  • n denotes the total number of observations in X

  • m denotes a specific observation in X

  • k is the number of clusters, where K is the maximum number of clusters that is calculated

  • L denotes a clustering procedure

Other notation is introduced when required.

1.1 Rand Index

The Rand index (Rand 1971) is defined by

$$ RI = \frac{a+d}{a+b+c+d} $$
(1)

1.2 Adjusted Rand Index

The ARI is a modified form of the Rand index that corrects for matches that are due to pure chance; in contrast to the Rand index, the ARI can take on values between −1 and 1 (Hubert and Arabie 1985).

$$ ARI = \frac{{\sum}_{ij} {n_{ij} \choose 2} - \left[{\sum}_{i} {n_{i.} \choose 2} {\sum}_{j} {n_{.j} \choose 2}\right] / {n \choose 2}}{\frac{1}{2} \left[{\sum}_{i} {n_{i.} \choose 2} + {\sum}_{j} {n_{.j} \choose 2}\right] - \left[{\sum}_{i} {n_{i.} \choose 2} {\sum}_{j} {n_{.j} \choose 2}\right] / {n \choose 2}} $$
(2)

where nij is the number of observations that are common to cluster i in Y1 and cluster j in Y2 (i.e., a), and ni. and n.j denote the number of observations in cluster i and cluster j in Y1 and Y2, respectively.
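Equation 2 translates almost literally into R via the contingency table of the two partitions. A sketch (mclust::adjustedRandIndex provides a tested implementation; the function below is our own illustration):

    # Adjusted Rand index from the contingency table n_ij of two partitions
    ari <- function(y1, y2) {
      tab    <- table(y1, y2)                     # n_ij
      s_ij   <- sum(choose(tab, 2))               # sum over all cells
      s_i    <- sum(choose(rowSums(tab), 2))      # n_i. terms
      s_j    <- sum(choose(colSums(tab), 2))      # n_.j terms
      chance <- s_i * s_j / choose(sum(tab), 2)   # agreement expected by chance
      (s_ij - chance) / ((s_i + s_j) / 2 - chance)
    }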

1.3 Fowlkes-Mallows Index

The Fowlkes-Mallows index (Fowlkes and Mallows 1983) is given by Eq. 3

$$ FM = \sqrt{\frac{a}{a+b} \frac{a}{a+c}} $$
(3)

1.4 Prediction Strength

Prediction Strength, proposed by Tibshirani and Walther (2005), modifies the concept of an external index in that it measures agreement between partitions not globally, but cluster-wise. The algorithm works as follows:

  1. Split the original data X into a training set and a test set (\(X^{\prime }_{1}\) and \(X^{\prime }_{2}\), respectively).

  2. Cluster \(X^{\prime }_{1}\) and \(X^{\prime }_{2}\) into k clusters, where 2 ≤ k ≤ K.

  3. For each test set cluster, count the proportion of pairs of observations that would also be in the same cluster if they were assigned to the closest centroid of the training set clusters.

  4. The minimum over all clusters (i.e., the least stable cluster) is defined as the Prediction Strength.

Thus, for a candidate number of clusters k, Ak1, Ak2, …, Akk denote the indices of the test observations in test clusters 1, 2, …, k. Furthermore, nk1, nk2, …, nkk are defined as the numbers of observations in the test clusters. The Prediction Strength for a particular k is then given by

$$ PS(k) = \min\limits_{1 \leq j \leq k} \frac{1}{n_{kj}(n_{kj} - 1)} \sum\limits_{i \neq i^{\prime} \in A_{kj}} D [ P(X_{tr},k), X_{te} ]_{ii^{\prime}} $$
(4)

where \(D [ P(X_{tr},k), X_{te} ]_{ii^{\prime }}\) is an N × N matrix (N being the number of test observations) whose ii′th element equals 1 if observations i and i′ from the test data fall into the same cluster when the cluster centers from the training set clustering (P(Xtr,k)) are used for assignment. A slightly different notation is used here: P denotes a partition of X; thus, P(Xtr,k) is the partition of the training set into k clusters.
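A compact R sketch of the four steps above, using k-means both for the clustering and for the centroid assignment (the paper does not prescribe this particular implementation; names are ours):

    # Prediction Strength for one candidate k (Eq. 4), sketched with k-means
    prediction_strength <- function(X, k) {
      idx <- sample(nrow(X), nrow(X) %/% 2)
      tr  <- X[idx, , drop = FALSE]                  # training set
      te  <- X[-idx, , drop = FALSE]                 # test set
      km_tr <- kmeans(tr, k)
      km_te <- kmeans(te, k)
      # assign each test point to the closest training centroid
      dmat <- as.matrix(dist(rbind(km_tr$centers, te)))[-(1:k), 1:k]
      tr_label <- apply(dmat, 1, which.min)
      ps <- sapply(1:k, function(j) {
        members <- which(km_te$cluster == j)         # test cluster A_kj
        nkj <- length(members)
        if (nkj < 2) return(0)
        same <- outer(tr_label[members], tr_label[members], "==")
        (sum(same) - nkj) / (nkj * (nkj - 1))        # co-assigned pairs, i != i'
      })
      min(ps)                                        # least stable cluster
    }

    prediction_strength(as.matrix(iris[, 1:4]), 3)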

1.5 Jaccard Similarity

Using the same notation as for the FM and Rand Index, the Jaccard similarity (Jaccard 1912) is given by

$$ J = \frac{a}{a+b+c} $$
(5)

1.6 McNemar Index

The McNemar index, as described in Desgraupes (2013), is defined by

$$ NI = \frac{d-c}{\sqrt{d+c}} $$
(6)

1.7 Sokal-Sneath Index

The Sokal-Sneath index (Sokal et al. 1963) is given by

$$ SSI = \frac{a}{a+2(b+c)} $$
(7)

1.8 Czekanowski-Dice Index

The Czekanowski-Dice index, as described in Desgraupes (2013), is defined as

$$ CDI = \frac{2a}{2a+b+c} $$
(8)

The index is the harmonic mean of precision and recall coefficients and thus identical to the F measure.

1.9 CLEST

The CLEST algorithm (Dudoit and Fridlyand 2002) works as follows: for each k, 2 ≤ k ≤ K, where K is the maximum number of clusters considered (K < n), perform steps 1–4.

  1. Repeat B times:

     (a) Randomly split the original data set X into two non-overlapping sets, a learning set \(X^{\prime }_{1}\) and a test set \(X^{\prime }_{2}\).

     (b) Apply the clustering procedure L to the learning set \(X^{\prime }_{1}\) to obtain a partition Y1.

     (c) Build a classifier Cl using the learning set and the cluster labels from Y1.

     (d) Apply the clustering procedure L to the test set \(X^{\prime }_{2}\) to obtain a partition Y2.

     (e) Apply the classifier Cl to the test set and use an external index to compare the predicted labels with Y2.

  2. Let tk be the median of the B similarity statistics obtained in step 1.

  3. Generate B0 data sets under a suitable null hypothesis. For each reference data set, repeat steps 1 and 2 to obtain B0 similarity statistics tk,1, …, \(t_{k,B_{0}}\).

  4. Define \({t^{0}_{k}}\) as the average of these B0 statistics and let pk be the proportion of the tk,b, 1 ≤ b ≤ B0, that are at least as large as the observed value tk, i.e., the p value for tk. Finally, define \(d_{k} = t_{k} - {t^{0}_{k}}\) as the difference between the observed and expected similarity statistics under the null hypothesis.

Now, the set V is defined as V = {2 ≤ k ≤ K : pk ≤ pmax, dk ≥ dmin}, where pmax and dmin are preset thresholds. If the set is empty, the null hypothesis of no cluster structure cannot be rejected. Otherwise, the number of clusters is estimated as the k corresponding to the largest significant difference statistic dk.

Crucial parameters for the CLEST algorithm are the choice of the classifier and of the external index used for comparison. Dudoit and Fridlyand (2002) recommend linear discriminant analysis with a diagonal covariance matrix (DLDA) as the classifier and the Fowlkes-Mallows similarity measure (FM) as the external index. pmax and dmin are both set to 0.05, and the reference data sets are proposed to be sampled from the uniform distribution (similar to GAP).

In this study, the data splitting in step 1a is used as stated only in the splitting scheme; in the bootstrap and subsetting schemes, the two sets are replaced by two bootstrap samples and two subsets, respectively. In the simple scheme, the original data set is used twice.
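A condensed sketch of one CLEST statistic tk in R, reusing pair_counts from above, with k-means as the clustering procedure and a nearest-centroid classifier standing in for the recommended DLDA (all implementation choices here are illustrative substitutions):

    # Median similarity statistic t_k for one candidate k (steps 1-2)
    clest_stat <- function(X, k, B = 20) {
      fm <- function(p1, p2) {                       # Fowlkes-Mallows, Eq. 3
        cnt <- pair_counts(p1, p2)
        unname(sqrt(cnt["a"]^2 / ((cnt["a"] + cnt["b"]) * (cnt["a"] + cnt["c"]))))
      }
      stats <- replicate(B, {
        idx   <- sample(nrow(X), nrow(X) %/% 2)
        learn <- X[idx, , drop = FALSE]
        test  <- X[-idx, , drop = FALSE]
        km <- kmeans(learn, k)                       # partition Y1 on the learning set
        # nearest-centroid classifier trained on the learning-set labels
        dmat <- as.matrix(dist(rbind(km$centers, test)))[-(1:k), 1:k]
        pred <- apply(dmat, 1, which.min)            # classifier labels on the test set
        y2   <- kmeans(test, k)$cluster              # partition Y2 on the test set
        fm(pred, y2)                                 # step 1e
      })
      median(stats)                                  # step 2
    }

Steps 3 and 4 then repeat the same computation on reference data drawn from a uniform distribution and compare tk against the resulting null distribution.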

1.10 Kulczynski Index

The Kulczynski index is defined by Kulczyński (1928)

$$ KI = \frac{1}{2}(\frac{a}{a+c} + \frac{a}{a+b}) $$
(9)

1.11 Hubert \(\hat {\Gamma }\) Index

The Hubert \(\hat {\Gamma }\) index described in Halkidi et al. (2001) is the correlation coefficient of two indicator variables Z1 and Z2, binary variables that take on the value 1 if observations mi and mj (i < j) are in the same cluster of the respective partition and 0 otherwise. With M = n(n − 1)/2 denoting the number of observation pairs, the index is defined as

$$ HUB = Corr(Z_{1}, Z_{2}) = \frac{{\sum}_{i<j}(Z_{1}(i,j)-\mu_{Z_{1}})(Z_{2}(i,j)-\mu_{Z_{2}})}{M \sigma_{Z_{1}} \sigma_{Z_{2}}} $$
(10)

Using the pairwise group membership counts, where M = a + b + c + d, the index can also be written as

$$ HUB = \frac{M \times a - (a+b)(a+c)}{\sqrt{(a+b)(a+c)(d+b)(d+c)}} $$
(11)

Unlike most other external indices, HUB takes on values between -1 and 1.

1.12 Rogers-Tanimoto Index

The Rogers-Tanimoto index (Rogers and Tanimoto 1960) is defined as follows:

$$ RTI(k) = \frac{a + d}{a + d + 2(b + c)} $$
(12)

Appendix B: Internal Indices

For internal indices the notation from external indices is extended by:

  • W and B are defined as the within-group and between-group sums of squares, respectively

  • m1, …, mn denote the points representing all observations

  • C denotes a cluster and G a cluster centroid
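Several of the indices below require only Wk (and sometimes Bk) over a range of k. A small R helper computing both from k-means fits, together with two of the indices (a sketch; names are ours):

    # Within- and between-group sums of squares for k = 1..K via k-means
    ss_by_k <- function(X, K) {
      t(sapply(1:K, function(k) {
        km <- kmeans(X, centers = k, nstart = 10)
        c(k = k, W = km$tot.withinss, B = km$betweenss)
      }))
    }

    ss <- ss_by_k(as.matrix(iris[, 1:4]), K = 8)
    n  <- nrow(iris)
    # Hartigan (Eq. 13) for k = 1..K-1; Calinski-Harabasz (Eq. 15), undefined at k = 1
    hart <- (head(ss[, "W"], -1) / ss[-1, "W"] - 1) * (n - head(ss[, "k"], -1) - 1)
    ch   <- (n - ss[, "k"]) / (ss[, "k"] - 1) * ss[, "B"] / ss[, "W"]

The same table also yields Trace-W (Eq. 19), the Log-SS-Ratio (Eq. 18), and the Xu index (Eq. 36) directly.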

2.1 Hartigan Index

The Hartigan index (Hartigan 1975) is defined as

$$ H(k) = \left(\frac{W_{k}}{W_{k+1}} - 1\right)(n - k - 1) $$
(13)

2.2 Banfield-Raftery Index

Banfield and Raftery (1993) define an index calculated as the weighted sum of the logarithms of the mean within-cluster sum of squares Wj of each cluster j.

$$ BFI(k) = \sum\limits_{j=1}^{k} n_{j} \log\left(\frac{W_{j}}{n_{j}}\right) $$
(14)

2.3 Calinski-Harabasz Index

The CH index, introduced by Calinski and Harabasz (1974), is defined as

$$ CH(k) = \frac{n - k}{k - 1} \frac{B}{W} $$
(15)

2.4 Krzanowski-Lai Index

The KL index, proposed by Krzanowski and Lai (1988), is given by Eq. 16:

$$ KL(k) = |\frac{DIFF_{k}}{DIFF_{k+1}}| $$
(16)

where DIFF is defined as

$$ DIFF(k) = (k-1)^{2/p}\, trace(W_{k-1}) - k^{2/p}\, trace(W_{k}) $$
(17)

where p is the number of variables. The maximum value of KL(k) indicates the optimal number of clusters.

2.5 Log-SS-Ratio Index

The Log-SS-Ratio index used in Dimitriadou et al. (2002) is given by Eq. 18:

$$ LSR(k) = log(\frac{B_{k}}{W_{k}}) $$
(18)

2.6 Trace-W Index

The Trace-W index in Eq. 19 is defined as the within-sum of squares of the partition:

$$ TR(k) = W_{k} $$
(19)

2.7 Ball-Hall Index

Ball and Hall (1965) introduced an index calculated as the mean of the squared distances of all points to their cluster centroid:

$$ BALL(k) = \frac{1}{K} \sum\limits_{k=1}^{K} \frac{1}{n_{k}} \sum\limits_{i \in C_{k}} \|{m_{i}^{k}} - G^{k}\|^{2} $$
(20)

In the special case where all clusters have equal size, the equation reduces to \(\frac {1}{n}W\).

2.8 Davies-Bouldin Index

For the Davies-Bouldin index (Davies and Bouldin 1979), we first define δk as the mean distance of the points belonging to cluster Ck to their cluster center Gk:

$$ \delta_{k} = \frac{1}{n_{k}} \sum\limits_{i \in C_{k}} \|{m_{i}^{k}} - G^{k} \| $$
(21)

We also define \({\Delta}_{k,k^{\prime }}\) as the distance between two cluster centers Gk and \(G^{k^{\prime }}\):

$$ {\Delta}_{k,k^{\prime}} = d(G^{k}, G^{k^{\prime}}) = \| G^{k^{\prime}} - G^{k} \| $$
(22)

For each cluster k, the maximum Mk of \(\frac {\delta _{k} + \delta _{k^{\prime }}}{{\Delta}_{k,k^{\prime }}}\) is computed over all k′ ≠ k. The Davies-Bouldin index is the mean of the values Mk:

$$ DB(k) = \frac{1}{K} \sum\limits^{K}_{k=1} M_{k} = \frac{1}{K} \sum\limits^{K}_{k=1} \max\limits_{k \neq k^{\prime}} \frac{\delta_{k} + \delta_{k^{\prime}}}{{\Delta}_{k,k^{\prime}}} $$
(23)
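A direct transcription of Eqs. 21-23 in R for a given partition with centroids (a sketch; names are ours):

    # Davies-Bouldin index from data, cluster labels, and centroid matrix
    davies_bouldin <- function(X, cluster, centers) {
      k <- nrow(centers)
      delta <- sapply(1:k, function(j) {             # Eq. 21
        pts <- X[cluster == j, , drop = FALSE]
        mean(sqrt(rowSums(sweep(pts, 2, centers[j, ])^2)))
      })
      Delta <- as.matrix(dist(centers))              # Eq. 22
      M <- sapply(1:k, function(j)                   # worst ratio per cluster
        max((delta[j] + delta[-j]) / Delta[j, -j]))
      mean(M)                                        # Eq. 23
    }

    km <- kmeans(as.matrix(iris[, 1:4]), 3)
    davies_bouldin(as.matrix(iris[, 1:4]), km$cluster, km$centers)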

2.9 Dunn Index

The Dunn index (Dunn 1974) essentially measures the distance between two clusters Ck and \(C_{k^{\prime }}\) by calculating the distance between their closest points:

$$ d_{k,k^{\prime}} = \min\limits_{i \in C_{k},\, j \in C_{k^{\prime}}} \|{m^{k}_{i}} - m^{k^{\prime}}_{j} \| $$
(24)

and taking dmin as the minimum of the vector of distances \(d_{k,k^{\prime }}\):

$$ d_{min} = \min\limits_{k \neq k^{\prime}} d_{k,k^{\prime}} $$
(25)

Furthermore, we denote by Dk the largest distance between two points within a cluster:

$$ D_{k} = \max\limits_{i,j \in C_{k},\, i \neq j} \|{m^{k}_{i}} - {m^{k}_{j}} \| $$
(26)

and define dmax as the largest of the cluster diameters Dk:

$$ d_{max} = \max\limits_{1 \leq k \leq K} D_{k} $$
(27)

Finally, the Dunn index is defined as follows in Eq. 28:

$$ DUNN(k) = \frac{d_{min}}{d_{max}} $$
(28)
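Equations 24-28 in R, working directly from a distance matrix (a sketch; names are ours):

    # Dunn index from a distance matrix and cluster labels
    dunn <- function(D, cluster) {
      D  <- as.matrix(D)
      ks <- sort(unique(cluster))
      d_min <- min(sapply(ks, function(k)            # Eqs. 24-25
        min(sapply(setdiff(ks, k), function(k2)
          min(D[cluster == k, cluster == k2])))))
      d_max <- max(sapply(ks, function(k)            # Eqs. 26-27
        max(D[cluster == k, cluster == k])))
      d_min / d_max                                  # Eq. 28
    }

    dunn(dist(iris[, 1:4]), kmeans(iris[, 1:4], 3)$cluster)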

2.10 PBM Index

The PBM index, developed by Pakhira et al. (2004), is calculated according to Eq. 29:

$$ PBM(k) = (\frac{1}{k} \times \frac{E_{T}}{E_{W}} \times D_{B})^{2} $$
(29)

where EW is the sum of the distances of the points in each cluster to their cluster centroid, and ET is the analogous sum of distances to the overall data centroid (i.e., the one-cluster solution). ET therefore does not depend on the partition or the number of clusters but is a constant. DB denotes the largest distance between two cluster centroids:

$$ D_{B} = \max\limits_{k \neq k^{\prime}} d(G_{k}, G_{k^{\prime}}) $$
(30)
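Equations 29 and 30 in R for a k-means partition of a numeric matrix (a sketch; names are ours):

    # PBM index for a k-means partition of a numeric matrix X
    pbm <- function(X, k) {
      km  <- kmeans(X, k)
      E_T <- sum(sqrt(rowSums(sweep(X, 2, colMeans(X))^2)))        # distances to data centroid
      E_W <- sum(sqrt(rowSums((X - km$centers[km$cluster, ])^2)))  # distances to own centroid
      D_B <- max(dist(km$centers))                                 # Eq. 30
      ((1 / k) * (E_T / E_W) * D_B)^2                              # Eq. 29
    }

    pbm(as.matrix(iris[, 1:4]), 3)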

2.11 Silhouette Index

The silhouette index, introduced by Rousseeuw (1987), is computed by Eq. 31

$$ Silhouette(k) = \frac{{\sum}^{n}_{i=1} S(i)}{n}, \qquad Silhouette \in [-1,1] $$
(31)

where

$$ S(i) = \frac{b(i)-a(i)}{\max\{a(i), b(i)\}} $$
(32)

and

$$ a(i) = \frac{{\sum}_{j \in \{ C_{r} \backslash i\}} d_{ij}}{n_{r} - 1} $$
(33)
$$ b(i) = \min\limits_{s \neq r} \{d_{i_{C_{s}}}\} $$
(34)
$$ d_{i_{C_{s}}} = \frac{{\sum}_{j \in C_{s}} d_{ij}}{n_{s}} $$
(35)

a(i) and b(i) have been slightly modified in this study. Usually, these values indicate the average dissimilarity of the ith object to all other objects of the same cluster and of the nearest cluster, respectively. In this study, we have replaced this with the dissimilarity to the respective cluster centroids. This reduces the computational burden immensely and is a justifiable trade-off for a possibly slightly decreased accuracy. The index is therefore abbreviated QSIL (Quasi-Silhouette) in the paper.
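A sketch of this centroid-based variant in R (our reading of the modification described above; names are ours):

    # Quasi-silhouette: a(i) and b(i) via distances to centroids instead of all points
    qsil <- function(X, cluster, centers) {
      k <- nrow(centers)
      d_cent <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]  # point-to-centroid distances
      own <- cbind(seq_len(nrow(X)), cluster)
      a <- d_cent[own]                    # distance to own centroid
      d_cent[own] <- Inf
      b <- apply(d_cent, 1, min)          # distance to nearest other centroid
      mean((b - a) / pmax(a, b))          # Eqs. 31-32 with centroid distances
    }

    km <- kmeans(as.matrix(iris[, 1:4]), 3)
    qsil(as.matrix(iris[, 1:4]), km$cluster, km$centers)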

2.12 Xu Index

Proposed by Xu (1997), the Xu index is given by Eq. 36:

$$ XU(k) = d \log (\sqrt{\frac{W_{k}}{dn^{2}}}) + \log(k) $$
(36)

where d is the number of variables in the data set.

2.13 Gap Statistic

The Gap statistic, proposed by Tibshirani et al. (2001), is computed as follows in Eq. 37:

$$ GAP(k) = \frac{1}{B} \sum\limits^{B}_{b=1} \log W_{kb} - \log W_{k} $$
(37)

where B is the number of reference data sets generated from a uniform distribution over the bounding rectangle of the original data, and Wkb denotes the within-cluster sum of squares of the bth reference data set. The optimal number of clusters is the value of k that maximizes the Gap statistic.
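A sketch of Eq. 37 in R for a single k, drawing the reference sets column-wise from the bounding rectangle (names are ours; cluster::clusGap provides a full implementation):

    # Gap statistic for one candidate k with B uniform reference sets
    gap_k <- function(X, k, B = 20) {
      logW <- function(Z) log(kmeans(Z, k, nstart = 5)$tot.withinss)
      ref <- replicate(B, {
        Z <- apply(X, 2, function(col) runif(nrow(X), min(col), max(col)))
        logW(Z)                           # log W_kb on reference data
      })
      mean(ref) - logW(X)                 # Eq. 37
    }

    gap_k(as.matrix(iris[, 1:4]), 3)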

Appendix C: Simulation Settings

The following table lists the parameters for each simulation setting. These values are used as input to the function genRandomClust of the package clusterGeneration. The size of the clusters is determined by selecting the value \(\frac {Observations \times Variables}{k}\) and multiplying it by 0.97 and 1.03 to obtain a lower and an upper boundary; within this range, a value for each cluster is randomly selected. The clusters thus have roughly equal, but not exactly the same, sizes. All other inputs of genRandomClust are left at their default values.

Table 15 Summary of simulation parameters
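As an illustration of the size rule (a sketch assuming the per-cluster target is the number of observations divided by k; the values of k, n_obs, and numNonNoisy are placeholders, not taken from Table 15):

    # Roughly equal cluster sizes drawn within 97%-103% of the target size
    library(clusterGeneration)
    k      <- 4
    n_obs  <- 500                                    # assumed total number of observations
    target <- n_obs / k
    sizes  <- round(runif(k, 0.97 * target, 1.03 * target))
    dat <- genRandomClust(numClust = k, numNonNoisy = 10,
                          clustszind = 3, clustSizes = sizes)  # user-specified sizes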


Cite this article

Dangl, R., Leisch, F. Effects of Resampling in Determining the Number of Clusters in a Data Set. J Classif 37, 558–583 (2020). https://doi.org/10.1007/s00357-019-09328-2

