Identifying digenic disease genes via machine learning in the Undiagnosed Diseases Network
American Journal of Human Genetics (IF 8.1), Pub Date: 2021-09-15, DOI: 10.1016/j.ajhg.2021.08.010
Souhrid Mukherjee, Joy D Cogan, John H Newman, John A Phillips, Rizwan Hamid, Jens Meiler, John A Capra

Rare diseases affect millions of people worldwide, and discovering their genetic causes is challenging. More than half of the individuals analyzed by the Undiagnosed Diseases Network (UDN) remain undiagnosed. The central hypothesis of this work is that many of these rare genetic disorders are caused by multiple variants in more than one gene. However, given the large number of variants in each individual genome, experimentally evaluating combinations of variants for potential to cause disease is currently infeasible. To address this challenge, we developed the digenic predictor (DiGePred), a random forest classifier for identifying candidate digenic disease gene pairs by features derived from biological networks, genomics, evolutionary history, and functional annotations. We trained the DiGePred classifier by using DIDA, the largest available database of known digenic-disease-causing gene pairs, and several sets of non-digenic gene pairs, including variant pairs derived from unaffected relatives of UDN individuals. DiGePred achieved high precision and recall in cross-validation and on a held-out test set (PR area under the curve > 77%), and we further demonstrate its utility by using digenic pairs from the recent literature. In contrast to other approaches, DiGePred also appropriately controls the number of false positives when applied in realistic clinical settings. Finally, to enable the rapid screening of variant gene pairs for digenic disease potential, we freely provide the predictions of DiGePred on all human gene pairs. Our work enables the discovery of genetic causes for rare non-monogenic diseases by providing a means to rapidly evaluate variant gene pairs for the potential to cause digenic disease.
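
The screening use case described in the abstract, looking up precomputed scores for every pair of genes that carry variants in one individual, can be sketched as follows. This is a minimal illustration, not the authors' code. The score table, its values, the placeholder gene names, the 0.5 threshold, and the function name `screen_gene_pairs` are all assumptions made for the example. In practice the table would be loaded from the predictions the authors distribute.

```python
from itertools import combinations

# Hypothetical precomputed scores for unordered gene pairs (made-up values,
# not actual DiGePred output). Keys are frozensets so gene order is irrelevant.
EXAMPLE_SCORES = {
    frozenset({"GENE_A", "GENE_B"}): 0.91,
    frozenset({"GENE_A", "GENE_C"}): 0.12,
    frozenset({"GENE_B", "GENE_C"}): 0.67,
}


def screen_gene_pairs(variant_genes, pair_scores, threshold=0.5):
    """Return (gene1, gene2, score) for pairs scoring at or above threshold.

    Pairs missing from pair_scores are skipped rather than guessed.
    Results are sorted by descending score.
    """
    hits = []
    # Deduplicate and sort the genes so each unordered pair is visited once.
    for g1, g2 in combinations(sorted(set(variant_genes)), 2):
        score = pair_scores.get(frozenset({g1, g2}))
        if score is not None and score >= threshold:
            hits.append((g1, g2, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    genes_with_variants = ["GENE_C", "GENE_A", "GENE_B", "GENE_D"]
    for g1, g2, score in screen_gene_pairs(genes_with_variants, EXAMPLE_SCORES):
        print(f"{g1}-{g2}: {score:.2f}")
```

A genome with n variant-carrying genes yields n(n-1)/2 candidate pairs, so a lookup against precomputed scores keeps the screening step cheap compared with rerunning a classifier for each pair.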




Updated: 2021-10-09