A random forest based computational model for predicting novel lncRNA-disease associations.
BMC Bioinformatics (IF 2.9), Pub Date: 2020-03-27, DOI: 10.1186/s12859-020-3458-1
Dengju Yao 1, Xiaojuan Zhan 2, Xiaorong Zhan 3, Chee Keong Kwoh 4, Peng Li 1, Jinke Wang 5

BACKGROUND Accumulating evidence shows that abnormal regulation of long non-coding RNA (lncRNA) is associated with various human diseases. Accurately identifying disease-associated lncRNAs helps in studying the mechanisms of lncRNAs in disease and in exploring new therapies. Many lncRNA-disease association (LDA) prediction models have been implemented by integrating multiple kinds of data resources. However, most existing models ignore the interference of noisy and redundant information among these data resources.

RESULTS To improve the performance of LDA prediction models, we implemented an LDA prediction model based on random forest and feature selection (RFLDA for short). First, RFLDA integrates experimentally supported miRNA-disease associations (MDAs) and LDAs, disease semantic similarity (DSS), lncRNA functional similarity (LFS) and lncRNA-miRNA interactions (LMI) as input features. Then, RFLDA selects the most useful features for training the prediction model, using feature selection based on the random forest variable importance score, which takes into account not only the effect of each individual feature on the prediction results but also the joint effects of multiple features. Finally, a random forest regression model is trained to score potential lncRNA-disease associations. Under 5-fold cross-validation, RFLDA achieves an area under the receiver operating characteristic curve (AUC) of 0.976 and an area under the precision-recall curve (AUPR) of 0.779, outperforming several state-of-the-art LDA prediction models. Moreover, case studies on three cancers demonstrate that 43 of the 45 lncRNAs predicted by RFLDA are validated by experimental data, and the other two predicted lncRNAs are supported by other LDA prediction models.

CONCLUSIONS Cross-validation and case studies indicate that RFLDA has an excellent ability to identify potential disease-associated lncRNAs.
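
The two evaluation metrics named in the abstract, AUC and AUPR, can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names `auc` and `average_precision` and the toy labels and scores are assumptions made for illustration, and AUPR is approximated here by average precision (a step-wise sum), which may differ from how the paper computed it.

```python
# Toy illustration only: labels and scores are made up, not taken from the paper.

def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve:
    sum over distinct score thresholds of (recall gain) * precision."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        # Consume all items sharing the same score as one threshold step.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Hypothetical scores for five lncRNA-disease pairs (1 = known association).
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auc = auc(toy_labels, toy_scores)
toy_aupr = average_precision(toy_labels, toy_scores)
print(round(toy_auc, 3), round(toy_aupr, 3))  # 0.833 0.833
```

In a 5-fold cross-validation setting, such metrics would typically be computed on the held-out fold's scores and then averaged or pooled across folds; which of the two the paper does is not stated in the abstract.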

Updated: 2020-04-22