Identifying digenic disease genes using machine learning in the undiagnosed diseases network
bioRxiv - Bioinformatics Pub Date : 2020-06-01 , DOI: 10.1101/2020.05.31.125716
Souhrid Mukherjee, Joy D Cogan, John H Newman, John A Phillips, Rizwan Hamid, Jens Meiler, John A. Capra

Rare diseases affect hundreds of millions of people worldwide, and diagnosing their genetic causes is challenging. The Undiagnosed Diseases Network (UDN) was formed in 2014 to identify and treat novel rare genetic diseases, and despite many successes, more than half of UDN patients remain undiagnosed. The central hypothesis of this work is that many unsolved rare genetic disorders are caused by multiple variants in more than one gene. However, given the large number of variants in each individual genome, experimentally evaluating even just pairs of variants for potential to cause disease is currently infeasible. To address this challenge, we developed DiGePred, a random forest classifier for identifying candidate digenic disease gene pairs using features derived from biological networks, genomics, evolutionary history, and functional annotations. We trained the DiGePred classifier using DIDA, the largest available database of known digenic disease-causing gene pairs, and several sets of non-digenic gene pairs, including variant pairs derived from unaffected relatives of UDN patients. DiGePred achieved high precision and recall in cross-validation and on a held-out test set (PR area under the curve >77%), and we further demonstrate its utility using novel digenic pairs from the recent literature. In contrast to other approaches, DiGePred also appropriately controls the number of false positives when applied in realistic clinical settings like the UDN. Finally, to facilitate the rapid screening of variant gene pairs for digenic disease potential, we freely provide the predictions of DiGePred on all human gene pairs. Our work facilitates the discovery of genetic causes for rare non-monogenic diseases by providing a means to rapidly evaluate variant gene pairs for the potential to cause digenic disease.
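
The abstract reports performance as area under the precision-recall (PR) curve. As a rough illustration of how such a figure can be computed from a classifier's scores, the sketch below uses the step-wise "average precision" approximation in plain Python. It is not the authors' evaluation code: the function names, the example labels, and the example scores are all hypothetical, and the abstract does not say which PR-curve approximation was used.

```python
from typing import List, Tuple


def precision_recall_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each prediction, ranked by descending score.

    labels: 1 for a known digenic pair, 0 for a non-digenic pair (hypothetical data).
    scores: classifier scores, higher meaning "more likely digenic".
    Assumes no tied scores; ties would need to be grouped into a single step.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        recall = true_positives / total_positives
        precision = true_positives / rank
        points.append((recall, precision))
    return points


def average_precision(labels: List[int], scores: List[float]) -> float:
    """Step-wise PR area: sum of precision times the recall gained at each rank."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in precision_recall_points(labels, scores):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


if __name__ == "__main__":
    # Hypothetical toy data: 3 digenic pairs among 8 candidate gene pairs.
    toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
    toy_scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
    print(f"PR AUC (average precision): {average_precision(toy_labels, toy_scores):.3f}")
```

PR-based metrics suit settings like this one, where true digenic pairs are rare among all candidate gene pairs: precision counts false positives directly against true positives, so the score does not benefit from the large number of easy negatives.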

Updated: 2020-06-01