A novel sequence-based prediction method for ATP-binding sites using fusion of SMOTE algorithm and random forests classifier
Biotechnology & Biotechnological Equipment ( IF 1.5 ) Pub Date : 2020-01-01 , DOI: 10.1080/13102818.2020.1840436
Jiazhi Song 1, 2, 3 , Guixia Liu 1, 2 , Chuyi Song 4 , Jingqing Jiang 3

Abstract Correctly identifying protein-ATP binding sites is valuable for both protein function annotation and new drug discovery. However, non-ATP-binding residues far outnumber ATP-binding residues, which makes the prediction a classic imbalanced learning problem. Previous studies often apply under-sampling to construct a relatively balanced dataset, but some information is inevitably lost during the sampling process. In this work, we use the SMOTE algorithm, which balances the dataset by generating synthetic ATP-binding site samples through interpolation. Random Forest is selected as the classifier to ensure acceptable training speed. Combining the method with a complementary template-based method further improves its prediction performance. In comparison with other sequence-based predictors, the proposed method achieves satisfactory performance and proves to be efficient for ATP-binding site prediction.
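
The interpolation idea behind SMOTE can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote`, the parameter `k`, and the two-dimensional feature vectors are hypothetical, chosen only to show how a synthetic minority sample is placed on the line segment between a real minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up toy data: 3 "binding" vs 9 "non-binding" residue feature vectors.
binding = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
non_binding = [[float(i), -1.0] for i in range(9)]

# Oversample the minority class until both classes have the same size.
new_points = smote(binding, len(non_binding) - len(binding), k=2)
balanced_binding = binding + new_points
```

Unlike under-sampling, no majority-class sample is discarded here; the minority class is enlarged instead, and the balanced set could then be passed to any classifier, such as a random forest.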

Updated: 2020-01-01