Boosting phosphorylation site prediction with sequence feature-based machine learning.
Proteins: Structure, Function, and Bioinformatics (IF 2.9), Pub Date: 2019-08-22, DOI: 10.1002/prot.25801
Shyantani Maiti, Atif Hassan, Pralay Mitra

Protein phosphorylation is an essential post-translational modification that plays a vital role in regulating many fundamental cellular processes. We propose a LightGBM-based computational approach that uses evolutionary, geometric, sequence-environment, and amino acid-specific features to identify phosphate binding sites from a protein sequence. Compared with existing methods on 2429 protein sequences from the standard Phospho.ELM (P.ELM) benchmark data set, which covers 11 organisms, our method reports a higher F1 score of 0.504 (the harmonic mean of precision and recall) and a ROC AUC of 0.836 (the area under the receiver operating characteristic curve). Its computation time is much lower than that of a recently developed deep learning-based framework. Structural analysis of selected protein sequences indicates that our predictions form a superset of the phosphorylation sites annotated in the P.ELM data set. The scheme rests on manual feature engineering and decision tree-based classification. It is therefore intuitive, and the final tree can be interpreted as a set of rules, which gives a deeper understanding of the relationships between biophysical features and phosphorylation sites. Our innovative problem transformation method permits more control over precision and recall, as demonstrated by the improvement in our prediction (F1 score = 0.546; ROC AUC = 0.849) when the output probability of the existing deep learning framework is incorporated as an additional feature. The implementation of our method can be accessed at http://cse.iitkgp.ac.in/~pralay/resources/PPSBoost/ and is mirrored at https://cosmos.iitkgp.ac.in/PPSBoost.
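
To illustrate the two metrics quoted in the abstract, the following minimal sketch computes the F1 score as the harmonic mean of precision and recall and estimates ROC AUC as the fraction of (positive, negative) pairs in which the positive site receives the higher score. It also shows how moving the decision cutoff trades precision against recall. The labels and scores are toy values invented for illustration, not data from the paper, and the function names are hypothetical rather than part of PPSBoost.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def apply_cutoff(scores, cutoff):
    """Turn predicted probabilities into 0/1 calls at a chosen cutoff."""
    return [1 if s >= cutoff else 0 for s in scores]


# Toy example (assumed values): 1 = phosphorylated site, 0 = non-site.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.35, 0.2, 0.1]

if __name__ == "__main__":
    print("ROC AUC:", roc_auc(y_true, scores))
    for cutoff in (0.5, 0.33):
        p, r, f1 = precision_recall_f1(y_true, apply_cutoff(scores, cutoff))
        print(f"cutoff={cutoff}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Lowering the cutoff in this toy example raises recall at the cost of precision. ROC AUC does not change, because it depends only on the ranking of the scores.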

Updated: 2020-01-04