Protein class prediction based on Count Vectorizer and long short term memory
International Journal of Information Technology Pub Date : 2020-10-11 , DOI: 10.1007/s41870-020-00528-3
S. R. Mani Sekhar , G. M. Siddesh , Mithun Raj , Sunilkumar S. Manvi

Protein class and function prediction is one of the most significant tasks in computational bioinformatics. Information about protein function and class plays a vital role in understanding biological cells and has a great impact on human life in areas such as personalized medicine. Technical advances in biology and a better understanding of biological processes have revealed the features and characteristics of important proteins. Prediction from an amino acid sequence involves predicting how the sequence folds and what structures it forms, starting from the primary sequence. In this work, machine learning algorithms are applied to protein class prediction. The method considers biologically significant macromolecules. The solution then focuses on understanding the different protein families and subsequently classifies sequences by protein family type. This is achieved with Naive Bayes (NB) and Random Forest (RF) classifiers trained on count-vectorized features, and with a long short-term memory (LSTM) network. These algorithms classify the protein family from the protein sequence. The results show that LSTM predicts the protein class more accurately than RF and NB: LSTM achieves an accuracy of 96%, whereas RF and NB achieve 91% and 86%, respectively.
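The count-vectorized Naive Bayes baseline mentioned above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not say how sequences are tokenized, so the overlapping 3-mer tokens, the Laplace smoothing value, the helper names (`kmer_counts`, `MultinomialNB`), and the toy sequences and family labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a sequence (a simple count-vector feature)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


class MultinomialNB:
    """Minimal multinomial Naive Bayes over k-mer counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, sequences, labels, k=3):
        self.k = k
        self.class_counts = Counter(labels)          # sequences per class
        self.feature_counts = defaultdict(Counter)   # k-mer counts per class
        self.vocab = set()                           # all k-mers seen in training
        for seq, label in zip(sequences, labels):
            counts = kmer_counts(seq, k)
            self.feature_counts[label].update(counts)
            self.vocab.update(counts)
        return self

    def predict(self, sequence):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        query = kmer_counts(sequence, self.k)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each observed k-mer.
            score = math.log(n_docs / total_docs)
            total = sum(self.feature_counts[label].values())
            for kmer, count in query.items():
                if kmer not in self.vocab:
                    continue  # ignore k-mers never seen during training
                p = (self.feature_counts[label][kmer] + self.alpha) / (
                    total + self.alpha * vocab_size
                )
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data: made-up sequences and family names, not taken from the paper.
train_sequences = ["MKVLAAGIVGL", "MKVLAAGLVGA", "GSHMPEPTIDE", "GSHMPEPTLDE"]
train_labels = ["family_A", "family_A", "family_B", "family_B"]

model = MultinomialNB().fit(train_sequences, train_labels)
print(model.predict("MKVLAAGIVGA"))
```

On a real dataset the same idea is usually expressed with a library vectorizer and classifier; the sketch only shows how a raw sequence becomes a bag of k-mer counts that a probabilistic classifier can score.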




Updated: 2020-10-11