Missing data imputation via the expectation–maximization algorithm can improve principal component analysis aimed at deriving biomarker profiles and dietary patterns
Nutrition Research (IF 4.5), Pub Date: 2020-03-01, DOI: 10.1016/j.nutres.2020.01.001
Linda Malan 1, Cornelius M Smuts 1, Jeannine Baumgartner 2, Cristian Ricci 3

Principal component analysis (PCA) is a popular statistical tool. However, although imputing missing data before PCA is good practice with numerous advantages, it is not common. In the present work, we evaluated the hypothesis that the expectation-maximization (EM) algorithm for missing data imputation is a reliable and advantageous procedure when using PCA to derive biomarker profiles and dietary patterns. To this end, we used numerical simulations designed to mimic real data commonly observed in nutritional research. We then showed the advantages and pitfalls of EM imputation applied to plasma fatty acid concentrations and nutrient intakes from real data sets from the US National Health and Nutrition Examination Survey. PCA applied to simulated data with missing values yielded biased eigenvalues relative to the original data set without missing values. The bias between the eigenvalues from the original data set and those from the data set with missing values increased with the number of missing values and appeared to be independent of the correlation structure among the variables. In contrast, when the data were imputed, the mean of the eigenvalues over the 10 missing-data imputation runs overlapped with the eigenvalues derived from PCA applied to the original data set. These results were confirmed when real data sets from the National Health and Nutrition Examination Survey were analyzed. We accept the hypothesis that the EM algorithm for missing data imputation, applied before PCA aimed at deriving biochemical profiles and dietary patterns, is an effective technique, especially for relatively small sample sizes.
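
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure or software. It assumes a multivariate normal model, values missing completely at random, and listwise deletion as the no-imputation baseline. The function names (`em_impute`, `correlation_eigenvalues`, `simulate`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def em_impute(X, n_iter=200, tol=1e-6):
    """EM under an assumed multivariate normal model.

    Returns (filled data, estimated mean, estimated covariance).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    # Start from column means and the covariance of the mean-filled data.
    mu = np.nanmean(X, axis=0)
    filled = np.where(miss, mu, X)
    sigma = np.cov(filled, rowvar=False, bias=True)
    for _ in range(n_iter):
        # Sum of conditional covariances of the missing parts (E-step term).
        extra = np.zeros((p, p))
        for i in np.flatnonzero(miss.any(axis=1)):
            m, o = miss[i], ~miss[i]
            if not o.any():
                # Row with nothing observed: use the current mean and covariance.
                filled[i] = mu
                extra += sigma
                continue
            # Regression coefficients of missing on observed variables.
            coef = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)]).T
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        # M-step: update mean and covariance from the expected statistics.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n
        change = np.max(np.abs(new_sigma - sigma))
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return filled, mu, sigma


def correlation_eigenvalues(sigma):
    """Eigenvalues of the correlation matrix implied by sigma, largest first."""
    d = np.sqrt(np.diag(sigma))
    return np.linalg.eigvalsh(sigma / np.outer(d, d))[::-1]


def simulate(n=300, p=6, rho=0.5, frac_missing=0.2, seed=0):
    """Equicorrelated normal data plus a copy with values removed at random."""
    rng = np.random.default_rng(seed)
    true_corr = np.full((p, p), rho) + (1 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), true_corr, size=n)
    X_miss = X.copy()
    X_miss[rng.random((n, p)) < frac_missing] = np.nan
    return X, X_miss


X_full, X_miss = simulate()
complete_rows = X_miss[~np.isnan(X_miss).any(axis=1)]
_, _, sigma_em = em_impute(X_miss)

print("original data    :", correlation_eigenvalues(np.cov(X_full, rowvar=False)))
print("listwise deletion:", correlation_eigenvalues(np.cov(complete_rows, rowvar=False)))
print("EM estimate      :", correlation_eigenvalues(sigma_em))
```

The eigenvalues here come from the covariance matrix estimated by EM rather than from the covariance of the filled-in data, because the latter leaves out the conditional variance of the imputed values. Repeating the run over several seeds and missingness fractions would give a rough analogue of the simulation design summarized above.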

Updated: 2020-03-01