ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Protein sequences hold a wealth of experimental information that has yet to be fully exploited for identifying protein homologues. The published literature shows that dynamic programming, heuristic and HMM profile-based alignment techniques, as well as alignment-free techniques, do not directly use the ordered profile of a protein's physicochemical properties to identify its homologues. These works also lack crucial benchmarking or validation, without which their incorporation into search engines may appear questionable. To address this, the present work offers a fixed-dimensional numerical representation of protein sequences that extends the concept of the periodicity count value of nucleotide types (2017) so that Euclidean distance can serve as a direct similarity measure between two proteins. Rather than benchmarking against BLAST and PSI-BLAST only, the new similarity measure was also compared with Needleman–Wunsch and Smith–Waterman. To strengthen the comparison, this work introduces, for the first time, two novel benchmarking methods based on the correlation of “similarity scores” and on the “proximity of ranked outputs from a standard sequence alignment method” between all possible pairs of search techniques, including the new one presented in this paper. The numerical representation of a protein is found to reduce the computational complexity of protein sequence search to the order of O(log(n)). It may also enable various other similarity-based operations, such as clustering, phylogenetic analysis and classification of proteins on the basis of the properties used to build the representation.
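
To make the idea of a fixed-dimensional representation with Euclidean distance concrete, the following is a minimal sketch and not the authors' method. The abstract does not describe how ProtPCV vectors are built, so a hypothetical stand-in embedding (the 20-dimensional amino acid composition) is used here. The abstract also does not say which index yields the O(log(n)) search; a k-d tree is swapped in purely for illustration, and its logarithmic query time is only an average-case property that degrades as dimensionality grows. All function and variable names are assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(seq):
    """Hypothetical stand-in embedding (NOT ProtPCV): residue fractions, always 20-D."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance, used directly as the dissimilarity between two proteins."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_kdtree(points, depth=0):
    """Build a k-d tree from (vector, label) pairs; a node is (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest(node, query, depth=0, best=None):
    """Return (distance, label) of the stored vector closest to the query."""
    if node is None:
        return best
    point, left, right = node
    d = euclidean(point[0], query)
    if best is None or d < best[0]:
        best = (d, point[1])
    axis = depth % len(query)
    diff = query[axis] - point[0][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the other side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best


def brute_force(points, query):
    """Linear scan over all stored vectors, kept as a reference for checking the tree."""
    return min((euclidean(vec, query), label) for vec, label in points)


# Toy database of made-up sequences, for illustration only.
database = {"p1": "MKTAYIAKQR", "p2": "MKTAYIAKQK",
            "p3": "GGGGSGGGGS", "p4": "LLLLVVVVII"}
points = [(composition_vector(s), name) for name, s in database.items()]
index = build_kdtree(points)
query = composition_vector("MKTAYIAKQR")
print(nearest(index, query))
```

Because every sequence maps to a vector of the same length, the index is built once and each query touches only part of the tree, whereas an alignment-based search has to compare the query against every database entry. A ball-tree or another metric index could be substituted for the k-d tree without changing the rest of the sketch.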

References

  1. Pearson WR (2013) An introduction to sequence similarity (“homology”) searching. Curr Protoc Bioinform 42:1–3. https://doi.org/10.1002/0471250953.bi0301s42

  2. Vialle RA, Pedrosa FO, Weiss VA et al (2016) RAFTS3: rapid alignment-free tool for sequence similarity search. bioRxiv. https://doi.org/10.1101/055269

  3. Lambert C, Campenhout JM, DeBolle X, Depiereux E (2003) Review of common sequence alignment methods: clues to enhance reliability. Curr Genom 4:131–146. https://doi.org/10.2174/1389202033350038

  4. Vinga S, Almeida J (2003) Alignment-free sequence comparison—a review. Bioinform Oxf Engl 19:513–523. https://doi.org/10.1093/bioinformatics/btg005

  5. Zielezinski A, Vinga S, Almeida J, Karlowski WM (2017) Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 18:186. https://doi.org/10.1186/s13059-017-1319-7

  6. Krasnogor N, Pelta DA (2004) Measuring the similarity of protein structures by means of the universal similarity metric. Bioinform Oxf Engl 20:1015–1021. https://doi.org/10.1093/bioinformatics/bth031

  7. Mahmood K, Webb GI, Song J et al (2012) Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs. Nucleic Acids Res 40:e44. https://doi.org/10.1093/nar/gkr1261

  8. Steinegger M, Söding J (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35:1026–1028. https://doi.org/10.1038/nbt.3988

  9. Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM (2016) Proteogenomics: integrating next-generation sequencing and mass spectrometry to characterize human proteomic variation. Annu Rev Anal Chem Palo Alto Calif 9:521–545. https://doi.org/10.1146/annurev-anchem-071015-041722

  10. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453

  11. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197

  12. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410

  13. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448. https://doi.org/10.1073/pnas.85.8.2444

  14. Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5. National Biomedical Research Foundation. Washington, DC, pp 345–352

  15. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919. https://doi.org/10.1073/pnas.89.22.10915

  16. Yu Y-K, Altschul SF (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinform Oxf Engl 21:902–911. https://doi.org/10.1093/bioinformatics/bti070

  17. Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402. https://doi.org/10.1093/nar/25.17.3389

  18. Panchenko AR, Bryant SH (2002) A comparison of position-specific score matrices based on sequence and structure alignments. Protein Sci Publ Protein Soc 11:361–370. https://doi.org/10.1110/ps.19902

  19. Jaroszewski L, Rychlewski L, Li Z et al (2005) FFAS03: a server for profile–profile sequence alignments. Nucleic Acids Res 33:W284–W288. https://doi.org/10.1093/nar/gki418

  20. Biegert A, Söding J (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci 106:3770. https://doi.org/10.1073/pnas.0810767106

  21. Kaushik S, Nair AG, Mutt E et al (2016) Rapid and enhanced remote homology detection by cascading hidden Markov model searches in sequence space. Bioinformatics 32:338–344. https://doi.org/10.1093/bioinformatics/btv538

  22. Kaznadzey A, Alexandrova N, Novichkov V, Kaznadzey D (2013) PSimScan: algorithm and utility for fast protein similarity search. PLoS ONE 8:e58505. https://doi.org/10.1371/journal.pone.0058505

  23. Ge H, Sun L, Yu J (2017) Fast batch searching for protein homology based on compression and clustering. BMC Bioinform 18:508. https://doi.org/10.1186/s12859-017-1938-8

  24. Nguyen VH, Lavenier D (2009) PLAST: parallel local alignment search tool for database comparison. BMC Bioinform 10:329. https://doi.org/10.1186/1471-2105-10-329

  25. Qi Z-H, Jin M-Z, Li S-L, Feng J (2015) A protein mapping method based on physicochemical properties and dimension reduction. Comput Biol Med 57:1–7. https://doi.org/10.1016/j.compbiomed.2014.11.012

  26. Gupta MK, Niyogi R, Misra M (2013) An alignment-free method to find similarity among protein sequences via the general form of Chou’s pseudo amino acid composition. SAR QSAR Environ Res 24:597–609. https://doi.org/10.1080/1062936X.2013.773378

  27. Rost B (1999) Twilight zone of protein sequence alignments. Protein Eng 12:85–94. https://doi.org/10.1093/protein/12.2.85

  28. Kumar R, Mishra BK, Lahiri T et al (2017) PCV: an alignment free method for finding homologous nucleotide sequences and its application in phylogenetic study. Interdiscip Sci Comput Life Sci 9:173–183. https://doi.org/10.1007/s12539-015-0136-5

  29. Vella F (1998) The cell. A molecular approach; Edited by G H Cooper. pp 673. ASM Press, Washington DC, Sinauer Associates, Sunderland, MA. 1997 ISBN 0-87893-119-8. Biochem Educ 26:98–99

  30. Sneath PH (1966) Relations between chemical structure and biological activity in peptides. J Theor Biol 12:157–195

  31. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157:105–132

  32. Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185:862–864

  33. Rice P, Longden I, Bleasby A (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet TIG 16:276–277

  34. Zielezinski A, Girgis HZ, Bernard G et al (2019) Benchmarking of alignment-free sequence comparison methods. Genome Biol 20:144. https://doi.org/10.1186/s13059-019-1755-7

  35. Abhilash CB, Rohitaksha K (2014) A comparative study on global and local alignment algorithm methods. Int J Emerg Technol Adv Eng 4:34–43

  36. Kolekar P, Kale M, Kulkarni-Kale U (2012) Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping. Mol Phylogenet Evol 65:510–522. https://doi.org/10.1016/j.ympev.2012.07.003

  37. Dolatshah M, Hadian A, Minaei-Bidgoli B (2015) Ball*-tree: efficient spatial indexing for constrained nearest-neighbor search in metric spaces. arXiv:1511.00628 [cs]

  38. Rodgers JL, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Stat 42:59–66. https://doi.org/10.1080/00031305.1988.10475524

  39. Asamoah MK (2014) Re-examination of the limitations associated with correlational research. Educ Res Rev 2:45–52

Acknowledgements

The authors gratefully acknowledge the logistic support of the Biomedical Informatics Lab, Department of Applied Science, Computer Centre II, Indian Institute of Information Technology Allahabad (IIITA), India, and its financial support for procuring the computer systems and the software, Perl (V5.26.1) and MATLAB (R2019b), used in this work. Manoj Kumar Pal and Rajnish Kumar are also thankful to the Ministry of Human Resource Development (MHRD), Government of India, for providing a regular monthly research scholarship.

Author information

Corresponding author

Correspondence to Tapobrata Lahiri.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file1 (XLSX 17 kb)

Supplementary file2 (PDF 260 kb)

About this article

Cite this article

Pal, M.K., Lahiri, T. & Kumar, R. ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time. Interdiscip Sci Comput Life Sci 12, 276–287 (2020). https://doi.org/10.1007/s12539-020-00380-w

  • DOI: https://doi.org/10.1007/s12539-020-00380-w
