Why Clusters and Other Patterns Can Seem to be Found in Analyses of High-Dimensional Data
Evolutionary Biology (IF 2.5), Pub Date: 2020-10-27, DOI: 10.1007/s11692-020-09518-6
F. James Rohlf

Recent papers by Cardini et al. (Evolutionary Biology 46:307–316, 2019) and Bookstein (Evolutionary Biology 46:271–302, 2019) show that, when there are many variables and when sample sizes are small, scatterplots made using the between-groups principal components analysis method can appear to indicate clear group differences with little or no overlap between samples even though the samples are all drawn from a single multivariate normally distributed population. The corresponding scatterplots made after a canonical variates analysis (CVA) show an even more extreme separation of groups even though the usual test statistics yield the correct uniform distribution of probabilities. Users of CVA are usually concerned about the problems of small sample sizes and correlated variables but the problems discussed here are present even for large samples and uncorrelated variables. Some less-appreciated properties of sampling from high-dimensional spaces and the “curse of dimensionality” are reviewed to find a simple explanation for these problems. The ratio of variables to sample size is a useful index to predict when false clusters and these other problems may arise. While dependent upon the same variables, this index is not based on Marchenko and Pastur (Mathematics of the USSR–Sbornik 1:457–483, 1967) as discussed by Bookstein (Evolutionary Biology 44:522–541, 2017). It is also shown that multiple regression analysis can have related problems when there are large numbers of independent variables. The explanation for these problems is an incompatibility of showing both points separated by their full p-dimensional distances and low-dimensional projections of points in the same plot. Some implications for geometric morphometric and other multivariate analyses in biology are also discussed.
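The effect described in the abstract can be reproduced with a small simulation. The following is a minimal sketch, not the author's code: it assumes isotropic standard normal data, equal group sizes, and arbitrary group labels, and the helper names `bgpca_scores`, `separation_ratio`, and `simulate` are hypothetical. It projects every observation onto the principal axes of the group means (the between-groups PCA idea) and reports the ratio of between-group to within-group sum of squares in that projection.

```python
import numpy as np


def bgpca_scores(X, labels):
    """Project all observations onto the principal axes of the group means."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centered = X - X.mean(axis=0)  # remove the grand mean
    groups = np.unique(labels)
    means = np.vstack([centered[labels == g].mean(axis=0) for g in groups])
    # Principal axes of the group means via SVD. With k groups, at most
    # k - 1 axes carry between-group variation, so only those are kept.
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    axes = vt[: len(groups) - 1]
    return centered @ axes.T


def separation_ratio(scores, labels):
    """Between-group over within-group sum of squares of the scores."""
    labels = np.asarray(labels)
    grand = scores.mean(axis=0)
    between = 0.0
    within = 0.0
    for g in np.unique(labels):
        block = scores[labels == g]
        centre = block.mean(axis=0)
        between += len(block) * np.sum((centre - grand) ** 2)
        within += np.sum((block - centre) ** 2)
    return between / within


def simulate(p, n_per_group=10, n_groups=3, seed=0):
    """Draw every 'group' from one N(0, I_p) population (an assumption of
    this sketch) and return the apparent separation in the bgPCA plot."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_per_group * n_groups, p))
    labels = np.repeat(np.arange(n_groups), n_per_group)  # arbitrary labels
    return separation_ratio(bgpca_scores(X, labels), labels)


if __name__ == "__main__":
    for p in (5, 50, 500):
        print(p, simulate(p))
```

Because all observations come from a single population, any apparent separation is an artifact of the projection. Running the loop with the sample size fixed and the number of variables increasing lets one see how the ratio changes with the ratio of variables to sample size.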




Updated: 2020-10-30