Why Clusters and Other Patterns Can Seem to be Found in Analyses of High-Dimensional Data

  • Research Article
Evolutionary Biology

Abstract

Recent papers by Cardini et al. (Evolutionary Biology 46:307–316, 2019) and Bookstein (Evolutionary Biology 46:271–302, 2019) show that, when there are many variables and sample sizes are small, scatterplots made using the between-groups principal components analysis method can appear to indicate clear group differences, with little or no overlap between samples, even though the samples are all drawn from a single multivariate normally distributed population. The corresponding scatterplots made after a canonical variates analysis (CVA) show an even more extreme separation of groups, even though the usual test statistics yield the correct uniform distribution of probabilities. Users of CVA are usually concerned about the problems of small sample sizes and correlated variables, but the problems discussed here are present even for large samples and uncorrelated variables. Some less-appreciated properties of sampling from high-dimensional spaces and the “curse of dimensionality” are reviewed to find a simple explanation for these problems. The ratio of the number of variables to the sample size is a useful index for predicting when false clusters and these other problems may arise. Although it depends on the same quantities, this index is not based on the results of Marchenko and Pastur (Mathematics of the USSR–Sbornik 1:457–483, 1967) as discussed by Bookstein (Evolutionary Biology 44:522–541, 2017). It is also shown that multiple regression analysis can have related problems when there are large numbers of independent variables. The explanation for these problems is that a single plot cannot show both points separated by their full p-dimensional distances and low-dimensional projections of those points. Some implications for geometric morphometric and other multivariate analyses in biology are also discussed.
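The effect described in the abstract can be illustrated with a small simulation. The following is a minimal sketch in Python with NumPy, not the author's MATLAB code: the function names (`bgpca_scores`, `separation_ratio`, `simulate`), the group sizes, and the two values of p are all assumptions chosen for illustration. Every sample is drawn from one spherical normal population, so there are no true group differences; the sketch compares how separated the groups look in a between-groups PCA projection when the number of variables is small versus large relative to the sample size.

```python
import numpy as np


def bgpca_scores(X, labels):
    """Project samples onto the principal axes of the group means."""
    groups = np.unique(labels)
    Xc = X - X.mean(axis=0)  # center on the grand mean
    means = np.vstack([Xc[labels == g].mean(axis=0) for g in groups])
    # Principal axes of the group means; at most (groups - 1) are non-trivial.
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    axes = vt[: len(groups) - 1]
    return Xc @ axes.T


def separation_ratio(scores, labels):
    """Between-group over within-group sum of squares of the projected scores."""
    grand = scores.mean(axis=0)
    between = 0.0
    within = 0.0
    for g in np.unique(labels):
        s = scores[labels == g]
        between += len(s) * np.sum((s.mean(axis=0) - grand) ** 2)
        within += np.sum((s - s.mean(axis=0)) ** 2)
    return between / within


def simulate(p, n_per_group=10, n_groups=3, seed=0):
    """Draw all groups from the same N(0, I_p) population and measure separation."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_groups), n_per_group)
    X = rng.standard_normal((n_groups * n_per_group, p))
    return separation_ratio(bgpca_scores(X, labels), labels)


# Same sample size, same (absent) group structure, different numbers of variables.
low = simulate(p=5)      # few variables relative to the sample size
high = simulate(p=1000)  # many variables relative to the sample size
print(f"p = 5:    between/within ratio = {low:.2f}")
print(f"p = 1000: between/within ratio = {high:.2f}")
```

Under these assumptions the ratio for the larger p should be much higher, even though the labels are arbitrary. Plotting the two columns of `bgpca_scores` for the high-p case would be expected to show the apparently distinct clusters that the abstract warns about.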

Data Availability

Only simulated data were used.

Code Availability

No formal documented software was produced. Sampling experiments were carried out using MATLAB.

References

  • Affentranger, A. (1991). The convex hull of random points with spherically symmetric distributions. Rend. Sem. Mat. Univ. Pol. Torino, 49(3), 359–383.

  • Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In Database Theory (pp. 420–434). Berlin, Heidelberg: Springer.

  • Anderson, T. W. (2004). An introduction to multivariate statistical analysis (3rd ed.). Hoboken: John Wiley.

  • Bellman, R. (1961). Adaptive control processes: A guided tour (Karreman mathematics research collection). Princeton: Princeton University Press.

  • Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.

  • Bellman, R. L. (1961). Adaptive control processes. N.J.: Princeton University Press.

  • Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “nearest neighbor” meaningful? In 7th International Conference on Database Theory – ICDT’99 (Lecture Notes in Computer Science, Vol. 1540, pp. 217–235). New York: Springer. https://doi.org/10.1007/3-540-49257-7_15

  • Bickel, P. J., Kur, G., & Nadler, B. (2018). Projection pursuit in high dimensions. Proceedings of the National Academy of Sciences, 115(37), 9151–9156. https://doi.org/10.1073/pnas.1801177115

  • Bookstein, F. L. (2002). Creases as morphometric characters. In N. MacLeod & P. L. Forey (Eds.), Morphology, shape and phylogeny (pp. 139–174). New York: Taylor & Francis.

  • Bookstein, F. L. (2017). A newly noticed formula enforces fundamental limits on geometric morphometric analyses. Evolutionary Biology, 44(4), 522–541. https://doi.org/10.1007/s11692-017-9424-9

  • Bookstein, F. L. (2019). Pathologies of between-groups principal components analysis in geometric morphometrics. Evolutionary Biology, 46(4), 271–302. https://doi.org/10.1101/627448

  • Campbell, N. A. (1979). Some practical aspects of canonical variate analysis. Journal of Applied Statistics, 6(1), 7–18. https://doi.org/10.1080/02664767900000002

  • Campbell, N. A., & Atchley, W. R. (1981). The geometry of canonical variates analysis. Systematic Zoology, 30(3), 268–280. https://doi.org/10.1093/sysbio/30.3.268

  • Cardini, A. (2003). The geometry of the marmot (Rodentia: Sciuridae) mandible: Phylogeny and patterns of morphological evolution. Systematic Biology, 52, 186–205. https://doi.org/10.1080/10635150390192807

  • Cardini, A. (2020). Less tautology, more biology? A comment on “high-density” morphometrics. Zoomorphology. https://doi.org/10.1007/s00435-020-00499-w

  • Cardini, A., O’Higgins, P., & Rohlf, F. J. (2019). Seeing distinct groups where there are none: spurious patterns from between-group PCA. Evolutionary Biology, 46(1), 307–316. https://doi.org/10.1007/s11692-019-09487-5

  • Cardini, A., & Polly, P. D. (2020). Cross-validated between group PCA scatterplots: A solution to spurious group separation? Evolutionary Biology, 47, 85–95. https://doi.org/10.1007/s11692-020-09494-x

  • Dhillon, I. S., Modha, D. S., & Spangler, W. S. (2002). Class visualization of high-dimensional data with applications. Computational Statistics & Data Analysis, 41, 59–90.

  • Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87. https://doi.org/10.1145/2347736.2347755

  • Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.

  • Friedman, J. H., & Tukey, J. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, 23, 881–885.

  • Goswami, A., Watanabe, A., Felice, R. N., Bardua, C., Fabre, A.-C., & Polly, P. D. (2020). High-density morphometric analysis of shape and integration: The good, the bad, and the not-really-a-problem. Integrative and Comparative Biology, 59(3), 669–683.

  • Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3/4), 325–338. https://doi.org/10.2307/2333639

  • Hou, S. F., & Wentzell, P. D. (2011). Fast and simple methods for the optimization of kurtosis used as a projection pursuit index. Analytica Chimica Acta, 704, 1–15.

  • Houle, M. R., Kriegel, H.-P., Kröger, P., Schubert, E., & Zimek, A. (2010). Can shared-neighbor distance defeat the curse of dimensionality? Paper presented at the 22nd International Conference, SSDBM, Heidelberg, Germany.

  • Klingenberg, C. P., & Monteiro, L. R. (2005). Distances and directions in multidimensional shape spaces: Implications for morphometric applications. Systematic Biology, 54(4), 678–688.

  • Kovarovic, K., Aiello, L. C., Cardini, A., & Lockwood, C. A. (2011). Discriminant function analyses in archaeology: Are classification rates too good to be true? Journal of Archaeological Science, 38(11), 3006–3018. https://doi.org/10.1016/j.jas.2011.06.028

  • Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27.

  • Lamb, E. (2016). Why you should care about high dimensional sphere packing. Roots of unity, Scientific American, New York

  • Marchenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR Sbornik, 1, 457–483.

  • Mitteroecker, P., & Bookstein, F. (2011). Linear discrimination, ordination, and the visualization of selection gradients in modern morphometrics. Evolutionary Biology, 38(1), 100–114. https://doi.org/10.1007/s11692-011-9109-8

  • Nørgaard, L., Bro, R., Westad, F., & Engelsen, S. B. (2006). A modification of canonical variates analysis to handle highly collinear multivariate data. Journal of Chemometrics, 20, 425–435.

  • Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B, 10(2), 159–203.

  • Rohlf, F. J., Loy, A., & Corti, M. (1996). Morphometric analysis of Old World Talpidae (Mammalia, Insectivora) using partial warp scores. Systematic Biology, 45, 344–362. https://doi.org/10.1093/sysbio/45.3.344

  • van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

  • Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge: Cambridge University Press.

  • Yendle, P. W., & MacFie, H. J. H. (1989). Discriminant principal components analysis. Journal of Chemometrics, 3(4), 589–600. https://doi.org/10.1002/cem.1180030407

Acknowledgements

Special thanks to Fred L. Bookstein for his extensive and insightful comments on an earlier version of this paper. Helpful discussions with Andrea Cardini about problems with using very large numbers of landmarks are also appreciated as well as the helpful comments from anonymous reviewers.

Funding

Self-funded.

Author information

Corresponding author

Correspondence to F. James Rohlf.

Ethics declarations

Conflict of interest

The author declares that he has no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 357 kb)

About this article

Cite this article

Rohlf, F.J. Why Clusters and Other Patterns Can Seem to be Found in Analyses of High-Dimensional Data. Evol Biol 48, 1–16 (2021). https://doi.org/10.1007/s11692-020-09518-6

  • DOI: https://doi.org/10.1007/s11692-020-09518-6
