pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data
Molecular Ecology Resources (IF 7.7), Pub Date: 2021-01-16, DOI: 10.1111/1755-0998.13326
Katharine L Korunes, Kieran Samuk

Population genetic analyses often use summary statistics to describe patterns of genetic variation and provide insight into evolutionary processes. Among the most fundamental of these summary statistics are π and dXY, which are used to describe genetic diversity within and between populations, respectively. Here, we address a widespread issue in π and dXY calculation: systematic bias generated by missing data of various types. Many popular methods for calculating π and dXY operate on data encoded in the variant call format (VCF), which condenses genetic data by omitting invariant sites. When calculating π and dXY using a VCF, it is often implicitly assumed that missing genotypes (including those at sites not represented in the VCF) are homozygous for the reference allele. Here, we show how this assumption can result in substantial downward bias in estimates of π and dXY that is directly proportional to the amount of missing data. We discuss the pervasive nature and importance of this problem in population genetics, and introduce a user‐friendly UNIX command line utility, pixy, that solves this problem via an algorithm that generates unbiased estimates of π and dXY in the face of missing data. We compare pixy to existing methods using both simulated and empirical data, and show that pixy alone produces unbiased estimates of π and dXY regardless of the form or amount of missing data. In summary, our software solves a long‐standing problem in applied population genetics and highlights the importance of properly accounting for missing data in population genetic analyses.
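
The bias described in the abstract can be illustrated with a minimal Python sketch. This is not pixy's implementation: the function names (`count_site`, `pi_missing_aware`, `pi_missing_as_reference`), the haploid 0/1 allele encoding with `None` for a missing call, and the toy data are all assumptions made for illustration. The sketch estimates π as the number of pairwise differences divided by the number of pairwise comparisons, once counting only genotypes that were actually called and once treating every missing genotype as the reference allele.

```python
from itertools import combinations


def count_site(alleles):
    """Return (differences, comparisons) among the called alleles at one site.

    Missing calls are encoded as None and contribute nothing to either count.
    """
    called = [a for a in alleles if a is not None]
    pairs = list(combinations(called, 2))
    differences = sum(1 for a, b in pairs if a != b)
    return differences, len(pairs)


def pi_missing_aware(sites):
    """Pairwise differences over pairwise comparisons, using called alleles only."""
    total_diffs = 0
    total_comps = 0
    for alleles in sites:
        diffs, comps = count_site(alleles)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps if total_comps else float("nan")


def pi_missing_as_reference(sites, ref=0):
    """Same ratio, but every missing call is first replaced by the reference allele."""
    filled = [[ref if a is None else a for a in alleles] for alleles in sites]
    return pi_missing_aware(filled)


# Toy data (hypothetical): three sites, four haploid samples.
sites = [
    [0, 1, 0, 1],              # variant site, fully called
    [0, 0, 0, 0],              # invariant site, fully called
    [None, None, None, None],  # site with no data, e.g. absent from the VCF
]

aware = pi_missing_aware(sites)
naive = pi_missing_as_reference(sites)
print(f"missing-aware pi:        {aware:.4f}")
print(f"missing-as-reference pi: {naive:.4f}")
```

In this toy case the site without data adds six comparisons and no differences to the second estimate, so it falls below the first. That is the direction of the bias the abstract describes for sites not represented in the VCF.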

Updated: 2021-01-16