Abstract
Recent papers by Cardini et al. (Evolutionary Biology 46:307–316, 2019) and Bookstein (Evolutionary Biology 46:271–302, 2019) show that, when there are many variables and sample sizes are small, scatterplots made using the between-groups principal components analysis method can appear to indicate clear group differences, with little or no overlap between samples, even though the samples are all drawn from a single multivariate normally distributed population. The corresponding scatterplots made after a canonical variates analysis (CVA) show an even more extreme separation of groups, even though the usual test statistics yield the correct uniform distribution of probabilities. Users of CVA are usually concerned about the problems of small sample sizes and correlated variables, but the problems discussed here are present even for large samples and uncorrelated variables. Some less-appreciated properties of sampling from high-dimensional spaces and the “curse of dimensionality” are reviewed to find a simple explanation for these problems. The ratio of the number of variables to the sample size is a useful index for predicting when false clusters and these other problems may arise. While it depends on the same quantities, this index is not based on Marchenko and Pastur (Mathematics of the USSR–Sbornik 1:457–483, 1967) as discussed by Bookstein (Evolutionary Biology 44:522–541, 2017). It is also shown that multiple regression analysis can have related problems when there are large numbers of independent variables. The explanation for these problems is the incompatibility of showing, in the same plot, both points separated by their full p-dimensional distances and low-dimensional projections of those points. Some implications for geometric morphometric and other multivariate analyses in biology are also discussed.
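The spurious separation described in the abstract can be illustrated with a small sampling experiment. The sketch below is not the author's MATLAB code; it is a minimal NumPy illustration under stated assumptions: three groups of 20 specimens are drawn from a single standard multivariate normal population, all specimens are projected onto the principal axes of the group means (a simple form of between-groups PCA), and the between-group share of the variance in the projected scores is reported. The function names (`bgpca_scores`, `separation_index`, `simulate`), the separation measure, and the chosen values of p are hypothetical choices made for illustration only.

```python
import numpy as np


def bgpca_scores(X, labels):
    """Project all rows of X onto the principal axes of the group means."""
    Xc = X - X.mean(axis=0)  # centre on the grand mean
    groups = np.unique(labels)
    means = np.vstack([Xc[labels == g].mean(axis=0) for g in groups])
    means_c = means - means.mean(axis=0)  # centre the group means
    # Rows of vt are the principal axes of the group means.
    _, _, vt = np.linalg.svd(means_c, full_matrices=False)
    axes = vt[: len(groups) - 1]  # at most g - 1 between-group axes
    return Xc @ axes.T


def separation_index(scores, labels):
    """Between-group sum of squares as a share of the total, in score space."""
    grand = scores.mean(axis=0)
    total = ((scores - grand) ** 2).sum()
    between = 0.0
    for g in np.unique(labels):
        members = scores[labels == g]
        between += len(members) * ((members.mean(axis=0) - grand) ** 2).sum()
    return between / total


def simulate(p, n_per_group=20, n_groups=3, seed=0):
    """Draw every group from one N(0, I_p) population and measure separation."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_per_group * n_groups, p))
    labels = np.repeat(np.arange(n_groups), n_per_group)
    return separation_index(bgpca_scores(X, labels), labels)


# Same total sample size (60); only the number of variables p changes.
low = simulate(p=5)     # p / N is well below 1
high = simulate(p=600)  # p / N = 10
print(f"p = 5:   between-group share of score variance = {low:.2f}")
print(f"p = 600: between-group share of score variance = {high:.2f}")
```

In this sketch there are no true group differences, so any between-group share well above zero is an artifact of sampling. Holding the sample size fixed and increasing p raises the ratio of variables to sample size, and the share of projected variance attributable to the arbitrary group labels rises with it.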
Data Availability
Only simulated data were used.
Code Availability
No formal documented software was produced. Sampling experiments were carried out using MATLAB.
References
Affentranger, A. (1991). The convex hull of random points with spherically symmetric distributions. Rend. Sem. Mat. Univ. Pol. Torino, 49(3), 359–383.
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In Database Theory (pp. 420–434). Berlin, Heidelberg: Springer.
Anderson, T. W. (2004). An introduction to multivariate statistical analysis (3rd ed.). Hoboken: John Wiley.
Bellman, R. (1961). Adaptive control processes: A guided tour (Karreman mathematics research collection). Princeton: Princeton University Press.
Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Bellman, R. L. (1961). Adaptive control processes. Princeton, NJ: Princeton University Press.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “nearest neighbor” meaningful? In 7th International Conference on Database Theory – ICDT’99 (Lecture Notes in Computer Science, Vol. 1540, pp. 217–235). New York: Springer. https://doi.org/10.1007/3-540-49257-7_15
Bickel, P. J., Kur, G., & Nadler, B. (2018). Projection pursuit in high dimensions. Proceedings of the National Academy of Sciences, 115(37), 9151–9156. https://doi.org/10.1073/pnas.1801177115
Bookstein, F. L. (2002). Creases as morphometric characters. In N. MacLeod & P. L. Forey (Eds.), Morphology, shape and phylogeny (pp. 139–174). New York: Taylor & Francis.
Bookstein, F. L. (2017). A newly noticed formula enforces fundamental limits on geometric morphometric analyses. Evolutionary Biology, 44(4), 522–541. https://doi.org/10.1007/s11692-017-9424-9
Bookstein, F. L. (2019). Pathologies of between-groups principal components analysis in geometric morphometrics. Evolutionary Biology, 46(4), 271–302. https://doi.org/10.1101/627448
Campbell, N. A. (1979). Some practical aspects of canonical variate analysis. Journal of Applied Statistics, 6(1), 7–18. https://doi.org/10.1080/02664767900000002
Campbell, N. A., & Atchley, W. R. (1981). The geometry of canonical variates analysis. Systematic Zoology, 30(3), 268–280. https://doi.org/10.1093/sysbio/30.3.268
Cardini, A. (2003). The geometry of the marmot (Rodentia: Sciuridae) mandible: Phylogeny and patterns of morphological evolution. Systematic Biology, 52, 186–205. https://doi.org/10.1080/10635150390192807
Cardini, A. (2020). Less tautology, more biology? A comment on “high-density” morphometrics. Zoomorphology. https://doi.org/10.1007/s00435-020-00499-w
Cardini, A., O’Higgins, P., & Rohlf, F. J. (2019). Seeing distinct groups where there are none: Spurious patterns from between-group PCA. Evolutionary Biology, 46(4), 307–316. https://doi.org/10.1007/s11692-019-09487-5
Cardini, A., & Polly, P. D. (2020). Cross-validated between group PCA scatterplots: A solution to spurious group separation? Evolutionary Biology, 47, 85–95. https://doi.org/10.1007/s11692-020-09494-x
Dhillon, I. S., Modha, D. S., & Spangler, W. S. (2002). Class visualization of high-dimensional data with applications. Computational Statistics & Data Analysis, 41, 59–90.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87. https://doi.org/10.1145/2347736.2347755
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
Friedman, J. H., & Tukey, J. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, 23, 881–885.
Goswami, A., Watanabe, A., Felice, R. N., Bardua, C., Fabre, A.-C., & Polly, P. D. (2020). High-density morphometric analysis of shape and integration: The good, the bad, and the not-really-a-problem. Integrative and Comparative Biology, 59(3), 669–683.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3/4), 325–338. https://doi.org/10.2307/2333639
Hou, S. F., & Wentzell, P. D. (2011). Fast and simple methods for the optimization of kurtosis used as a projection pursuit index. Analytica Chimica Acta, 704, 1–15.
Houle, M. R., Kriegel, H.-P., Kröger, P., Schubert, E., & Zimek, A. (2010). Can shared-neighbor distance defeat the curse of dimensionality? Paper presented at the 22nd International Conference, SSDBM, Heidelberg, Germany.
Klingenberg, C. P., & Monteiro, L. R. (2005). Distances and directions in multidimensional shape spaces: Implications for morphometric applications. Systematic Biology, 54(4), 678–688.
Kovarovic, K., Aiello, L. C., Cardini, A., & Lockwood, C. A. (2011). Discriminant function analyses in archaeology: Are classification rates too good to be true? Journal of Archaeological Science, 38(11), 3006–3018. https://doi.org/10.1016/j.jas.2011.06.028
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27.
Lamb, E. (2016). Why you should care about high dimensional sphere packing. Roots of Unity, Scientific American, New York.
Marchenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR–Sbornik, 1, 457–483.
Mitteroecker, P., & Bookstein, F. (2011). Linear discrimination, ordination, and the visualization of selection gradients in modern morphometrics. Evolutionary Biology, 38(1), 100–114. https://doi.org/10.1007/s11692-011-9109-8
Nørgaard, L., Bro, R., Westad, F., & Engelsen, S. B. (2006). A modification of canonical variates analysis to handle highly collinear multivariate data. Journal of Chemometrics, 20, 425–435.
Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B, 10(2), 159–203.
Rohlf, F. J., Loy, A., & Corti, M. (1996). Morphometric analysis of Old World Talpidae (Mammalia, Insectivora) using partial warp scores. Systematic Biology, 45, 344–362. https://doi.org/10.1093/sysbio/45.3.344
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge: Cambridge University Press.
Yendle, P. W., & MacFie, H. J. H. (1989). Discriminant principal components analysis. Journal of Chemometrics, 3(4), 589–600. https://doi.org/10.1002/cem.1180030407
Acknowledgements
Special thanks to Fred L. Bookstein for his extensive and insightful comments on an earlier version of this paper. Helpful discussions with Andrea Cardini about problems with using very large numbers of landmarks are also appreciated as well as the helpful comments from anonymous reviewers.
Funding
Self-funded.
Ethics declarations
Conflict of interest
The author declares that he has no conflict of interest.
About this article
Cite this article
Rohlf, F.J. Why Clusters and Other Patterns Can Seem to be Found in Analyses of High-Dimensional Data. Evol Biol 48, 1–16 (2021). https://doi.org/10.1007/s11692-020-09518-6
DOI: https://doi.org/10.1007/s11692-020-09518-6