Seeing Distinct Groups Where There are None: Spurious Patterns from Between-Group PCA

Evolutionary Biology (IF 2.5), Pub Date: 2019-10-18, DOI: 10.1007/s11692-019-09487-5
Andrea Cardini, Paul O’Higgins, F. James Rohlf

Using sampling experiments, we found that, when there are fewer groups than variables, between-groups PCA (bgPCA) may suggest surprisingly distinct differences among groups for data in which none exist. While apparently not noticed before, the reasons for this problem are easy to understand. When the number of variables, p, is greater than g − 1, a bgPCA captures the g − 1 dimensions of variation among the g group means, but only a fraction of the \(\sum_{i} n_{i} - g\) dimensions of within-group variation (where \(n_{i}\) are the group sample sizes). This distorts the appearance of bgPCA plots because the within-group variation is underrepresented, unless the variables are sufficiently correlated that the total variation can be accounted for with just g − 1 dimensions. The effect is most obvious when sample sizes are small relative to the number of variables, because smaller samples spread out less, but the distortion is present even for large samples. Strong covariance among variables largely reduces the magnitude of the problem, because it effectively reduces the dimensionality of the data and thus enables a larger proportion of the within-group variation to be accounted for within the (g − 1)-dimensional space of a bgPCA. The distortion remains relevant, though its strength varies from case to case depending on the structure of the data (p, g, covariances, etc.). These are important problems for a method mainly designed for the analysis of variation among groups when there are very large numbers of variables and relatively small samples. In such cases, users are likely to conclude that the groups they are comparing are much more distinct than they really are. Having many variables but only small sample sizes is a common problem in fields ranging from morphometrics (as in our examples) to molecular analyses.
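
The effect described in the abstract can be reproduced with a small sampling experiment. The sketch below is illustrative only and is not the authors' code. It assumes i.i.d. standard normal variables, three equal-sized groups assigned arbitrarily, and an unweighted PCA of the group means computed by SVD. The function names (`bgpca_scores`, `separation_ratio`, `null_separation`) are hypothetical.

```python
import numpy as np


def bgpca_scores(X, labels):
    """Project every observation onto the principal axes of the group means."""
    groups = np.unique(labels)
    grand_mean = X.mean(axis=0)
    means = np.vstack([X[labels == k].mean(axis=0) for k in groups])
    # PCA of the centered group means via SVD; with g groups at most
    # g - 1 axes carry any between-group variation.
    _, _, vt = np.linalg.svd(means - grand_mean, full_matrices=False)
    axes = vt[: len(groups) - 1]
    return (X - grand_mean) @ axes.T


def separation_ratio(scores, labels):
    """Between-group sum of squares divided by within-group sum of squares."""
    centroid = scores.mean(axis=0)
    between = 0.0
    within = 0.0
    for k in np.unique(labels):
        block = scores[labels == k]
        group_mean = block.mean(axis=0)
        between += len(block) * np.sum((group_mean - centroid) ** 2)
        within += np.sum((block - group_mean) ** 2)
    return between / within


def null_separation(p, g=3, n=20, seed=0):
    """Apparent group separation in bgPCA space for pure noise (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((g * n, p))  # no true group differences
    labels = np.repeat(np.arange(g), n)  # arbitrary, equal-sized groups
    return separation_ratio(bgpca_scores(X, labels), labels)


if __name__ == "__main__":
    for p in (2, 300):
        print(f"p = {p:3d}: between/within ratio = {null_separation(p):.2f}")
```

With many more variables than observations per group, the between/within ratio in the bgPCA plane should come out far larger than with p = 2, even though the data contain no group structure. Under these isotropic-noise assumptions the ratio is expected to grow roughly in proportion to p, which matches the abstract's point that the g − 1 axes retain all of the between-group variation but only a small share of the within-group variation.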

Updated: 2019-10-18
