Principal component analysis of incomplete data – A simple solution to an old problem
Ecological Informatics (IF 5.1), Pub Date: 2021-01-23, DOI: 10.1016/j.ecoinf.2021.101235
János Podani, Tibor Kalapos, Barbara Barta, Dénes Schmera

A long-standing problem in biological data analysis is the unintentional absence of values for some observations or variables, which prevents the use of standard multivariate exploratory methods such as principal component analysis (PCA). Existing solutions include deleting parts of the data, by which information is lost; data imputation, which is always arbitrary; and restricting the analysis to either the variables or the observations, thereby losing the advantages of biplot diagrams. We describe a minor modification of eigenanalysis-based PCA in which correlations or covariances are calculated using different numbers of observations for each pair of variables, and the resulting eigenvalues and eigenvectors are used to calculate component scores such that missing values are skipped. This procedure avoids artificial data imputation, exhausts all information from the data and allows the preparation of biplots for the simultaneous display of the ordination of variables and observations. The use of the modified PCA, called InDaPCA (PCA of Incomplete Data), is demonstrated on actual biological examples: leaf functional traits of plants, functional traits of invertebrates, cranial morphometry of crocodiles and fish hybridization data, with biologically meaningful results. Our study suggests that it is not the percentage of missing entries in the data matrix that matters; the success of InDaPCA is mostly affected by the minimum number of observations available for comparing a given pair of variables. In the present study, however, interpretation of results in the space of the first two components was not hindered.
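
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `pairwise_correlation` and `pca_incomplete` are hypothetical, and the abstract does not say how variables are standardized or how scores are scaled, so the sketch assumes per-variable standardization over the available values and unscaled scores. It also uses only correlations, although the abstract allows covariances as well. Missing values are assumed to be coded as NaN.

```python
import numpy as np


def pairwise_correlation(X):
    """Correlation matrix where each entry uses only rows in which both
    variables are observed. Also returns the per-pair observation counts."""
    n_vars = X.shape[1]
    R = np.eye(n_vars)
    n_pairs = np.zeros((n_vars, n_vars), dtype=int)
    for j in range(n_vars):
        for k in range(j, n_vars):
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            n_pairs[j, k] = n_pairs[k, j] = both.sum()
            if j != k:
                r = np.corrcoef(X[both, j], X[both, k])[0, 1]
                R[j, k] = R[k, j] = r
    return R, n_pairs


def pca_incomplete(X):
    """Eigenanalysis of the pairwise correlation matrix, followed by
    component scores in which missing entries are skipped."""
    X = np.asarray(X, dtype=float)
    R, n_pairs = pairwise_correlation(X)

    # eigh returns eigenvalues in ascending order; sort them descending.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Assumption: standardize each variable over its available values.
    Z = (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0, ddof=1)

    # Skipping a missing value is the same as letting it add zero
    # to the weighted sum that forms each component score.
    scores = np.nan_to_num(Z, nan=0.0) @ eigvecs
    return eigvals, eigvecs, scores, n_pairs


# Small synthetic example (random data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
X[0, 1] = np.nan
X[5, 2] = np.nan
eigvals, eigvecs, scores, n_pairs = pca_incomplete(X)
```

The returned `n_pairs` matrix exposes the quantity the abstract identifies as decisive: its smallest off-diagonal entry is the minimum number of observations available for any pair of variables. A correlation matrix assembled pairwise is not guaranteed to be positive semidefinite, so small negative eigenvalues can appear in this sketch.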




Updated: 2021-01-29