Nonrandom missing data can bias Principal Component Analysis inference of population genetic structure, Molecular Ecology Resources


Nonrandom missing data can bias Principal Component Analysis inference of population genetic structure
Molecular Ecology Resources (IF 5.5), Pub Date: 2021-08-30, DOI: 10.1111/1755-0998.13498
Xueling Yi, Emily K Latch

Population genetic studies in non-model systems increasingly use next-generation sequencing to obtain more loci, but such methods also generate more missing data, which may affect downstream analyses. Here we focus on principal component analysis (PCA), which is widely used to explore and visualize population structure, typically with missing genotypes mean-imputed. We simulated data under different population models with total missingness of 1%, 10%, or 20%, introduced either randomly or with bias among individuals or populations. We found that individuals with disproportionately high missingness were dragged away from their true population clusters towards the origin of PCA plots, making them indistinguishable from truly admixed individuals and potentially leading to misinterpreted population structure. We also generated empirical data for the big brown bat (Eptesicus fuscus) using restriction site-associated DNA sequencing (RADseq). We filtered these data into three data sets with 19.12%, 9.87%, and 1.35% total missingness; all showed nonrandom missing data, with biased individuals dragged towards the PCA origin, consistent with the simulation results. We highlight the importance of considering missing data effects on PCA in non-model systems, where nonrandom missing data are common due to varying sample quality. To help detect missing data effects, we recommend (1) plotting PCA with a colour gradient showing per-sample missingness, (2) interpreting samples close to the PCA origin with extra caution, (3) exploring filtering parameters with and without the missingness-biased samples, and (4) using complementary analyses (e.g., model-based methods) to cross-validate PCA results and help interpret population structure.
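The pull towards the origin follows directly from mean imputation: an imputed genotype equals the locus mean, so after centering it contributes exactly zero to that individual's PC score. The sketch below illustrates this with a toy simulation. It is not the authors' simulation design: the two-population model, the sample sizes, the 80% per-individual missing rate (about 6.7% total missingness), and all function names are assumptions made for illustration, and PC1 is obtained by simple power iteration to keep the example dependency-free.

```python
import random

def simulate(rng, n_per_pop=30, n_loci=300):
    """Toy model (assumed): two diverged populations, genotypes coded 0/1/2."""
    freqs1 = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]
    # Population 2 frequencies are perturbed copies, which creates structure.
    freqs2 = [min(0.95, max(0.05, p + rng.gauss(0, 0.25))) for p in freqs1]
    geno = []
    for freqs in (freqs1, freqs2):
        for _ in range(n_per_pop):
            # Sum of two Bernoulli draws gives a diploid genotype.
            geno.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
    return geno

def add_missing(rng, geno, biased, rate):
    """Set genotypes to None at `rate`, only for individuals in `biased`."""
    return [[None if i in biased and rng.random() < rate else g for g in row]
            for i, row in enumerate(geno)]

def mean_impute_and_center(geno):
    """Replace None with the locus mean of observed genotypes, then center."""
    centered = [row[:] for row in geno]
    for j in range(len(geno[0])):
        observed = [row[j] for row in geno if row[j] is not None]
        mean = sum(observed) / len(observed)
        for row in centered:
            # An imputed genotype equals the mean, so it becomes exactly 0.
            row[j] = 0.0 if row[j] is None else row[j] - mean
    return centered

def pc1_scores(x, rng, n_iter=50):
    """Scores on the first principal component via power iteration on X^T X."""
    n_loci = len(x[0])
    v = [rng.gauss(0, 1) for _ in range(n_loci)]
    for _ in range(n_iter):
        xv = [sum(a * b for a, b in zip(row, v)) for row in x]
        v = [sum(s * row[j] for s, row in zip(xv, x)) for j in range(n_loci)]
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return [sum(a * b for a, b in zip(row, v)) for row in x]

rng = random.Random(0)
geno = simulate(rng)
biased = set(range(5))  # first five individuals of population 1
geno_missing = add_missing(rng, geno, biased, rate=0.8)
scores = pc1_scores(mean_impute_and_center(geno_missing), rng)
# Per-sample missingness: the quantity to show as a colour gradient.
missingness = [sum(g is None for g in row) / len(row) for row in geno_missing]

def mean_abs(idx):
    """Mean distance from the origin along PC1 for the given individuals."""
    return sum(abs(scores[i]) for i in idx) / len(idx)

print("biased pop-1 individuals, mean |PC1|:", round(mean_abs(range(0, 5)), 2))
print("other pop-1 individuals,  mean |PC1|:", round(mean_abs(range(5, 30)), 2))
```

In this toy setting the five high-missingness individuals should sit much closer to the PC1 origin than the rest of their population. The `missingness` list is the per-sample value that recommendation (1) would map to a colour gradient when plotting the scores.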


Updated: 2021-08-30
