NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data, Microbial Genomics



Microbial Genomics (IF 4.0), Pub Date: 2020-12-01, DOI: 10.1099/mgen.0.000483
Chao Wang₁, Jin Wu₂, Lei Xu₃, Quan Zou₁,₄


Non-classically secreted proteins (NCSPs) are proteins located in the extracellular environment despite lacking known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of these functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms; however, existing methods for NCSP identification perform unsatisfactorily and, in particular, suffer from data deficiency and possible overfitting. Further improvement is desirable, especially to address the lack of informative features and to mine subset-specific features in imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction in Gram-positive bacteria. First, to address the possible prediction bias caused by data imbalance, ten balanced subdatasets were generated for ensemble model construction. Second, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation of each training subdataset. Third, a subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred. In ten-fold cross-validation, the model achieved an accuracy of 93.23 %, a sensitivity of 100 %, a specificity of 89.01 %, a Matthews correlation coefficient of 87.68 % and an area under the curve of 0.9975. On the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
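The abstract names three building blocks without giving their details: balanced subdatasets, F-score feature ranking, and the reported evaluation metrics. The following is a minimal sketch of each, not the authors' implementation. It assumes that each balanced subdataset is formed by randomly undersampling the majority class, and that "F-score" refers to the commonly used two-class definition (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances). The function names `balanced_subsets`, `f_score`, `rank_features` and `binary_metrics` are hypothetical, and the sequential forward search and model integration steps are not shown.

```python
import math
import random
from statistics import mean, variance


def balanced_subsets(minority, majority, k=10, seed=0):
    """Build k balanced subsets by random undersampling (assumed scheme).

    Each subset pairs all minority samples with an equally sized random
    draw (without replacement) from the majority class.
    """
    rng = random.Random(seed)
    size = len(minority)
    return [(list(minority), rng.sample(list(majority), size)) for _ in range(k)]


def f_score(pos_values, neg_values):
    """Two-class F-score of one feature (assumed definition).

    Numerator: squared distance of each class mean from the overall mean.
    Denominator: sum of the two within-class sample variances.
    """
    pos_values, neg_values = list(pos_values), list(neg_values)
    overall = mean(pos_values + neg_values)
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = variance(pos_values) + variance(neg_values)
    return between / within if within > 0 else 0.0


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted from highest to lowest F-score."""
    columns = zip(zip(*pos_rows), zip(*neg_rows))
    scores = [f_score(p, n) for p, n in columns]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    # Toy data only: two features, two samples per class.
    pos = [[1.0, 0.0], [3.0, 1.0]]
    neg = [[5.0, 1.0], [7.0, 0.0]]
    print(rank_features(pos, neg))
    print(binary_metrics(tp=5, tn=4, fp=1, fn=0))
```

In this sketch a ranked feature list would be computed separately for each balanced subset, which is one plausible reading of "subset-specific" feature selection. Note that the Matthews correlation coefficient computed here lies in the range -1 to 1, whereas the abstract reports it as a percentage.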


Updated: 2020-12-22
