A scalable pipeline for local ancestry inference using tens of thousands of reference haplotypes
bioRxiv - Genomics, Pub Date: 2021-01-20, DOI: 10.1101/2021.01.19.427308
Eric Y. Durand, Chuong B. Do, Peter R. Wilton, Joanna L. Mountain, Adam Auton, G. David Poznik, J. Michael Macpherson

Ancestry deconvolution is the task of identifying the ancestral origins of chromosomal segments of admixed individuals. It has important applications, from mapping disease genes to identifying loci potentially under natural selection. However, most existing methods are limited to a small number of ancestral populations and are unsuitable for large-scale applications. In this article, we describe Ancestry Composition, a modular pipeline for accurate and efficient ancestry deconvolution. In the first stage, a string-kernel support-vector-machines classifier assigns provisional ancestry labels to short statistically phased genomic segments. In the second stage, an autoregressive pair hidden Markov model corrects phasing errors, smooths local ancestry estimates, and computes confidence scores. Using publicly available datasets and more than 12,000 individuals from the customer database of the personal genetics company, 23andMe, Inc., we have constructed a reference panel containing more than 14,000 unrelated individuals of unadmixed ancestry. We used principal components analysis (PCA) and uniform manifold approximation and projection (UMAP) to identify genetic clusters and define 45 distinct reference populations upon which to train our method. In cross-validation experiments, Ancestry Composition achieves high precision and recall.
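
The smoothing role of the second stage can be illustrated with a much simpler model than the one the abstract describes. The sketch below is not the autoregressive pair hidden Markov model used by Ancestry Composition: it is an ordinary single-chain HMM decoded with the Viterbi algorithm, and it does not correct phasing errors or compute confidence scores. The function name `smooth_labels`, the `switch_prob` parameter, and the use of per-window classifier probabilities as stand-ins for emission likelihoods are all assumptions made for illustration.

```python
import math


def smooth_labels(window_probs, switch_prob=0.05):
    """Return the most likely ancestry label per window (Viterbi decoding).

    window_probs: list of dicts, one per genomic window, mapping a
        population name to the first-stage classifier's probability.
    switch_prob: assumed probability that ancestry changes between
        adjacent windows (hypothetical value, not from the paper).
    """
    pops = sorted(window_probs[0])
    k = len(pops)
    eps = 1e-12  # avoids log(0) for zero probabilities
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob / (k - 1)) if k > 1 else float("-inf")

    def trans(prev, cur):
        return log_stay if prev == cur else log_switch

    # score[p] = best log-score of any label path ending in population p
    score = {p: math.log(1.0 / k) + math.log(window_probs[0][p] + eps)
             for p in pops}
    backpointers = []
    for probs in window_probs[1:]:
        new_score, pointers = {}, {}
        for cur in pops:
            best_prev = max(pops, key=lambda prev: score[prev] + trans(prev, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + trans(best_prev, cur)
                              + math.log(probs[cur] + eps))
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final window.
    path = [max(pops, key=lambda p: score[p])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# A single weakly supported "B" window inside an "A" segment is smoothed away.
noisy = [{"A": 0.9, "B": 0.1}, {"A": 0.9, "B": 0.1}, {"A": 0.45, "B": 0.55},
         {"A": 0.9, "B": 0.1}, {"A": 0.9, "B": 0.1}]
smoothed = smooth_labels(noisy)

# A well-supported change of ancestry is kept.
switch = [{"A": 0.9, "B": 0.1}] * 3 + [{"A": 0.1, "B": 0.9}] * 3
kept = smooth_labels(switch)
```

In this toy setting, a lower `switch_prob` makes the decoder more reluctant to change ancestry between windows, which is the trade-off any such smoothing step has to balance against genuine ancestry breakpoints.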

Updated: 2021-01-20