Protein class prediction based on Count Vectorizer and long short term memory

  • Original Research
International Journal of Information Technology

Abstract

Protein class and function prediction is one of the most significant tasks in computational bioinformatics. Information about protein function and class plays a vital role in understanding biological cells and has a great impact on human life in areas such as personalized medicine. Technical advances in biology and in the understanding of biological processes have revealed the features and characteristics of important proteins. Prediction from an amino acid sequence involves predicting how the sequence folds and which structures it forms, starting from the primary sequence. In this work, machine learning algorithms are applied to protein class prediction. The method considers biologically significant macromolecules, focuses on understanding the different protein families, and then classifies each sequence by protein family type. This is achieved with the Naive Bayes (NB) and Random Forest (RF) algorithms using count-vectorized features, together with a long short-term memory (LSTM) network. These algorithms classify the protein family from the protein sequence. The results show that LSTM predicts the protein class more accurately than RF and NB: LSTM achieves an accuracy of 96%, whereas RF and NB achieve 91% and 86%, respectively.
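To make the count-vectorized Naive Bayes idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state how sequences are tokenized, so the overlapping 3-mer tokens, the Laplace smoothing value, the toy sequences, and the two family labels are all assumptions made for illustration. The sketch counts k-mers per sequence (the count vector) and feeds those counts to a hand-written multinomial Naive Bayes classifier.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a sequence (one row of a count matrix)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts with Laplace smoothing."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k          # assumed token length, not taken from the paper
        self.alpha = alpha  # assumed smoothing constant

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for seq, label in zip(sequences, labels):
            counts = kmer_counts(seq, self.k)
            self.feature_counts[label].update(counts)
            self.vocabulary.update(counts)
        self.totals = {c: sum(fc.values()) for c, fc in self.feature_counts.items()}
        return self

    def log_scores(self, sequence):
        """Return the unnormalized log posterior of each class."""
        counts = kmer_counts(sequence, self.k)
        n_samples = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / n_samples)  # class prior
            denom = self.totals[label] + self.alpha * vocab_size
            for kmer, count in counts.items():
                if kmer not in self.vocabulary:
                    continue  # skip k-mers never seen during training
                num = self.feature_counts[label][kmer] + self.alpha
                score += count * math.log(num / denom)
            scores[label] = score
        return scores

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)


# Toy data: made-up sequences and hypothetical family labels.
train_sequences = [
    "MKVLAAGIVGLLLAQ",
    "MKVLAAGLVGALLAQ",
    "GSHMEEKRRSPEEKR",
    "GSHMEEKKRSPDEKR",
]
train_labels = ["HYDROLASE", "HYDROLASE", "TRANSFERASE", "TRANSFERASE"]

test_sequences = ["MKVLAAGIVGALLAQ", "GSHMEEKKRSPEEKR"]
test_labels = ["HYDROLASE", "TRANSFERASE"]

model = KmerNaiveBayes(k=3).fit(train_sequences, train_labels)
predictions = [model.predict(seq) for seq in test_sequences]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

The accuracy computed here is the same metric the abstract reports (the fraction of sequences assigned to the correct class), but on toy data it says nothing about the 86% to 96% figures in the paper. In practice a library vectorizer configured for character n-grams would replace `kmer_counts`, and the same count features could be passed to a Random Forest classifier.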





Author information


Corresponding author

Correspondence to S. R. Mani Sekhar.


About this article


Cite this article

Sekhar, S.R.M., Siddesh, G.M., Raj, M. et al. Protein class prediction based on Count Vectorizer and long short term memory. Int. j. inf. tecnol. 13, 341–348 (2021). https://doi.org/10.1007/s41870-020-00528-3


  • DOI: https://doi.org/10.1007/s41870-020-00528-3
