Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets.
American Journal of Human Genetics (IF 8.1), Pub Date: 2020-04-23, DOI: 10.1016/j.ajhg.2020.03.013
Sheng Yang, Xiang Zhou

Accurate construction of polygenic scores (PGS) can enable early diagnosis of diseases and facilitate the development of personalized medicine. Accurate PGS construction requires prediction models that are both adaptive to different genetic architectures and scalable to biobank-scale datasets with millions of individuals and tens of millions of genetic variants. Here, we develop such a method, the Deterministic Bayesian Sparse Linear Mixed Model (DBSLMM). DBSLMM relies on a flexible modeling assumption on the effect size distribution to achieve robust and accurate prediction performance across a range of genetic architectures. DBSLMM also relies on a simple deterministic search algorithm to yield an approximate analytic solution using summary statistics only. The deterministic search algorithm, when paired with further algebraic innovations, results in substantial computational savings. With simulations, we show that DBSLMM achieves scalable and accurate prediction performance across a range of realistic genetic architectures. We then apply DBSLMM to analyze 25 traits in UK Biobank. For these traits, compared to existing approaches, DBSLMM achieves an average accuracy gain of 2.03%-101.09% in internal cross-validations. In external validations on two separate datasets, including one from BioBank Japan, DBSLMM achieves an average accuracy gain of 14.74%-522.74%. In these real data applications, DBSLMM is 1.03-28.11 times faster and uses only 7.4%-24.8% of the physical memory of other multiple regression-based PGS methods. Overall, DBSLMM represents an accurate and scalable method for constructing PGS in biobank-scale datasets.
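
For readers new to the term, a polygenic score is conventionally a weighted sum of an individual's allele dosages, with one weight (effect size) per variant. The sketch below illustrates only that generic definition; it is not the DBSLMM estimation procedure, which the abstract does not spell out. The function name and the toy effect sizes and genotypes are hypothetical.

```python
from typing import Sequence


def polygenic_score(dosages: Sequence[float], effects: Sequence[float]) -> float:
    """Return the weighted sum of allele dosages for one individual.

    dosages: copies of the effect allele at each variant (0, 1, or 2).
    effects: per-variant effect sizes, in the same variant order.
    """
    if len(dosages) != len(effects):
        raise ValueError("dosages and effects must have the same length")
    return sum(g * b for g, b in zip(dosages, effects))


# Hypothetical toy data: 3 variants, 2 individuals (not from the paper).
effects = [0.5, -0.25, 1.0]
genotypes = [
    [0, 1, 2],
    [2, 2, 0],
]
scores = [polygenic_score(g, effects) for g in genotypes]
print(scores)
```

A method such as DBSLMM would be responsible for estimating the effect sizes; the scoring step itself stays this simple regardless of how the weights were obtained.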

Updated: 2020-04-23