Protein–protein interaction site prediction using random forest proximity distance, Journal of Bioinformatics and Computational Biology



Journal of Bioinformatics and Computational Biology (IF 1), Pub Date: 2020-09-17, DOI: 10.1142/s0219720020500420
Zhijun Qiu (1, 2), Qingjie Liu (1)


A front-end method based on random forest proximity distance (PD) is used to screen the test set and thereby improve protein–protein interaction site (PPIS) prediction. Distance metrics are assessed under the assumption that a higher-quality distance definition leads to higher classification performance. On an independent test set, numerical analysis based on statistical inference shows that PD has an advantage over the Mahalanobis and cosine distances. Because the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize it: the method adjusts the tree composition of the model by adjusting the size of the training set. The iterative method yields two PD metrics, 75PD and 50PD. On two independent test sets, 75PD achieved higher Matthews correlation coefficient and F1 score values than the PD produced by the original training set, and the differences were statistically significant. All numerical experiments show that the closer the test data are to the training data, the better the predictor's results. These findings indicate that the iterative method can optimize the proximity distance definition and that the distance information provided by PD can be used to indicate the reliability of prediction results.
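
The abstract does not spell out how PD is computed or how the test set is screened, so the following is only a minimal sketch under stated assumptions. It uses the common random forest proximity (the fraction of trees in which two samples fall in the same leaf) and takes distance as 1 minus proximity. The synthetic data, the `proximity_distance` helper, the nearest-training-sample rule, and the median cutoff are all hypothetical and are not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def proximity_distance(forest, X_a, X_b):
    """Return a (len(X_a), len(X_b)) matrix of 1 - proximity.

    Proximity is the fraction of trees in which two samples land in the
    same leaf (an assumed definition, not necessarily the paper's).
    """
    leaves_a = forest.apply(X_a)  # leaf index per sample per tree
    leaves_b = forest.apply(X_b)
    # Broadcast to (n_a, n_b, n_trees) and compare leaves tree by tree.
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    return 1.0 - same_leaf.mean(axis=2)


# Synthetic stand-in for residue feature vectors (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Distance from each test sample to its nearest training sample.
dist = proximity_distance(forest, X_test, X_train)
nearest = dist.min(axis=1)

# Hypothetical screening rule: keep test samples at or below the median distance.
threshold = np.median(nearest)
keep = nearest <= threshold

print("kept", int(keep.sum()), "of", len(keep), "test samples")
print("accuracy on kept samples:", forest.score(X_test[keep], y_test[keep]))
print("accuracy on all samples: ", forest.score(X_test, y_test))
```

Comparing the two accuracy figures gives a rough way to check, on one's own data, whether test samples closer to the training set are predicted more reliably, which is the relationship the abstract reports.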


Updated: 2020-09-17
