Fast two-stage phasing of large-scale sequence data
American Journal of Human Genetics (IF 9.8), Pub Date: 2021-09-02, DOI: 10.1016/j.ajhg.2021.08.005
Brian L. Browning 1,2, Xiaowen Tian 3, Ying Zhou 4, Sharon R. Browning 2

Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.
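The two-stage idea described above depends on first separating high-frequency markers from low-frequency ones. The following is a minimal sketch of that partitioning step only, not the Beagle 5.2 implementation: the function names, the data layout, and the `maf_threshold` value are all hypothetical choices made for illustration, and the abstract does not state what frequency cutoff the method uses.

```python
from typing import Dict, List, Tuple

Genotype = Tuple[int, int]  # pair of allele codes: 0 = reference, 1 = alternate


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Return the minor allele frequency of one biallelic marker."""
    n_alleles = 2 * len(genotypes)
    if n_alleles == 0:
        return 0.0
    alt_count = sum(a + b for a, b in genotypes)
    alt_freq = alt_count / n_alleles
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq)


def split_markers_by_frequency(
    markers: Dict[str, List[Genotype]],
    maf_threshold: float = 0.05,  # hypothetical cutoff, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Partition marker names into (stage 1, stage 2) lists.

    Stage 1 holds high-frequency markers, which would be phased directly.
    Stage 2 holds low-frequency markers, which would be phased afterwards
    by imputation onto the stage 1 haplotypes.
    """
    stage1: List[str] = []
    stage2: List[str] = []
    for name, genotypes in markers.items():
        if minor_allele_frequency(genotypes) >= maf_threshold:
            stage1.append(name)
        else:
            stage2.append(name)
    return stage1, stage2


if __name__ == "__main__":
    example = {
        "common_marker": [(0, 1), (1, 1), (0, 0), (0, 1)],
        "rare_marker": [(0, 0)] * 19 + [(0, 1)],
    }
    high, low = split_markers_by_frequency(example)
    print("stage 1 (high frequency):", high)
    print("stage 2 (low frequency):", low)
```

In this toy layout each marker maps to a list of unphased genotypes. A real pipeline would read genotypes from a VCF file and would apply the split within marker windows rather than over the whole data set at once.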




Updated: 2021-10-09