A New Sequential Forward Feature Selection (SFFS) Algorithm for Mining Best Topological and Biological Features to Predict Protein Complexes from Protein–Protein Interaction Networks (PPINs)
Interdisciplinary Sciences: Computational Life Sciences (IF 4.8), Pub Date: 2021-05-06, DOI: 10.1007/s12539-021-00433-8
Haseeb Younis 1, 2 , Muhammad Waqas Anwar 2 , Muhammad Usman Ghani Khan 3 , Aisha Sikandar 4 , Usama Ijaz Bajwa 2

Protein–protein interactions play an important role in understanding biological processes in the body. A network of dynamic protein complexes within a cell that regulates most biological processes is known as a protein–protein interaction network (PPIN). Predicting protein complexes from PPINs is a challenging task. Most previous computational approaches mine cliques, stars, linear and hybrid structures as complexes from PPINs by considering topological features, and few of them use the important biological information contained in the protein amino acid sequence. In this study, we compute a wide variety of topological features and integrate them with biological features computed from the protein amino acid sequence, such as bag-of-words, physicochemical and spectral domain features. We propose a new Sequential Forward Feature Selection (SFFS) algorithm, i.e., random forest-based Boruta feature selection, for selecting the best features from the large computed feature set. Decision tree, linear discriminant analysis and gradient boosting classifiers are used as learners. We conducted experiments on two reference protein complex datasets for yeast, CYC2008 and MIPS. Human and mouse complex information is taken from the CORUM 3.0 dataset. Protein interaction information is extracted from the Database of Interacting Proteins (DIP). Our proposed SFFS, i.e., random forest-based Boruta feature selection in combination with decision tree, linear discriminant analysis and gradient boosting classifiers, outperforms other state-of-the-art algorithms, achieving precision, recall and F-measure rates of 94.58%, 94.92% and 94.45% for MIPS, 96.31%, 93.55% and 96.02% for CYC2008, 98.84%, 98.00% and 98.87% for CORUM human, and 96.60%, 96.70% and 96.32% for CORUM mouse complexes, respectively.
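The shadow-feature idea behind Boruta-style selection can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `boruta_style_select`, the majority-vote rule (full Boruta uses a statistical test over the rounds), the synthetic data and all parameter values are assumptions made for illustration, and scikit-learn and NumPy are assumed to be installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def boruta_style_select(X, y, n_rounds=10, seed=0):
    """Simplified Boruta-style selection (illustrative, hypothetical helper).

    A feature is kept if its random forest importance beats the best
    shuffled "shadow" feature in more than half of the rounds.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for round_idx in range(n_rounds):
        # Shadow features: every column shuffled independently, which breaks
        # any link to the labels but keeps each column's value distribution.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestClassifier(
            n_estimators=100, random_state=seed + round_idx
        )
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        real = importances[:n_features]
        shadow = importances[n_features:]
        # Count a hit for each real feature that beats the best shadow.
        hits += real > shadow.max()
    return hits > n_rounds / 2


# Synthetic stand-in for a feature matrix: only columns 0 and 1 carry signal.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

mask = boruta_style_select(X, y)
print("selected feature indices:", np.flatnonzero(mask))
```

In a pipeline like the one described above, the columns kept by `mask` would then be passed to a learner such as a decision tree, linear discriminant analysis or gradient boosting classifier.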




Updated: 2021-05-07