ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time

Pal, Manoj Kumar; Lahiri, Tapobrata; Kumar, Rajnish

doi:10.1007/s12539-020-00380-w

Original research article
Published: 10 June 2020

Volume 12, pages 276–287, (2020)

Interdisciplinary Sciences: Computational Life Sciences

Abstract

A protein sequence holds a wealth of experimental information that has yet to be fully exploited to identify protein homologues. The published literature shows that dynamic programming, heuristic and HMM profile-based alignment techniques, along with alignment-free techniques, do not directly use the ordered profile of a protein's physicochemical properties to identify its homologues. These works also lack crucial benchmarking or validation, without which their incorporation into search engines appears questionable. To address this, the present work offers a fixed-dimensional numerical representation of protein sequences that extends the concept of the periodicity count value (PCV) of nucleotide types (Kumar et al. 2017), so that Euclidean distance can serve as a direct similarity measure between two proteins. Rather than benchmarking against BLAST and PSI-BLAST only, the new similarity measure was also compared with Needleman–Wunsch and Smith–Waterman. To strengthen the comparison, this work introduces, for the first time, two novel benchmarking methods based on the correlation of “similarity scores” and of “proximity of ranked outputs from a standard sequence alignment method” between all possible pairs of search techniques, including the new one presented in this paper. The results indicate that this numerical representation of a protein can reduce the computational complexity of protein sequence search to the order of O(log(n)). It may also enable various other similarity-based operations, such as clustering, phylogenetic analysis and classification of proteins, on the basis of the properties used to build the representation.
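
To illustrate why a fixed-dimensional vector per protein can make Euclidean-distance search roughly logarithmic, here is a minimal sketch using a k-d tree. It is not the authors' implementation: the k-d tree is a stand-in for whatever spatial index the paper uses, the 3-dimensional vectors and the labels `protA` to `protE` are made-up placeholders rather than real ProtPCV values, and the names `build` and `nearest` are hypothetical. On a balanced tree in low dimensions a query visits about log(n) nodes on average; the advantage shrinks as the dimension grows.

```python
import math
from typing import List, Optional, Tuple

Vector = Tuple[float, ...]


class Node:
    """One k-d tree node holding a labelled point and its splitting axis."""

    def __init__(self, label: str, point: Vector, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.label, self.point, self.axis = label, point, axis
        self.left, self.right = left, right


def build(items: List[Tuple[str, Vector]], depth: int = 0) -> Optional[Node]:
    """Build a balanced k-d tree by splitting on the median of one axis per level."""
    if not items:
        return None
    axis = depth % len(items[0][1])
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    label, point = items[mid]
    return Node(label, point, axis,
                build(items[:mid], depth + 1),
                build(items[mid + 1:], depth + 1))


def nearest(node: Optional[Node], query: Vector,
            best: Optional[Tuple[float, str]] = None) -> Optional[Tuple[float, str]]:
    """Return (Euclidean distance, label) of the stored point closest to query."""
    if node is None:
        return best
    d = math.dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.label)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


# Placeholder "database" of fixed-length vectors (illustrative values only).
database = {
    "protA": (0.10, 0.80, 0.30),
    "protB": (0.90, 0.20, 0.50),
    "protC": (0.15, 0.75, 0.35),
    "protD": (0.50, 0.50, 0.50),
    "protE": (0.85, 0.25, 0.45),
}
tree = build(list(database.items()))
query = (0.12, 0.78, 0.33)
dist, label = nearest(tree, query)
print(label, round(dist, 4))
```

The pruning test in `nearest` is what avoids scanning every stored vector: a whole subtree is skipped whenever the splitting plane lies farther from the query than the best match found so far. An alignment-based search, by contrast, has to score the query against each database sequence.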

References

Pearson WR (2013) An introduction to sequence similarity (“homology”) searching. Curr Protoc Bioinform 42:1–3. https://doi.org/10.1002/0471250953.bi0301s42
Vialle RA, Pedrosa FO, Weiss VA et al (2016) RAFTS3: rapid alignment-free tool for sequence similarity search. bioRxiv. https://doi.org/10.1101/055269
Lambert C, Campenhout JM, DeBolle X, Depiereux E (2003) Review of common sequence alignment methods: clues to enhance reliability. Curr Genom 4:131–146. https://doi.org/10.2174/1389202033350038
Vinga S, Almeida J (2003) Alignment-free sequence comparison—a review. Bioinform Oxf Engl 19:513–523. https://doi.org/10.1093/bioinformatics/btg005
Zielezinski A, Vinga S, Almeida J, Karlowski WM (2017) Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 18:186. https://doi.org/10.1186/s13059-017-1319-7
Krasnogor N, Pelta DA (2004) Measuring the similarity of protein structures by means of the universal similarity metric. Bioinform Oxf Engl 20:1015–1021. https://doi.org/10.1093/bioinformatics/bth031
Mahmood K, Webb GI, Song J et al (2012) Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs. Nucleic Acids Res 40:e44. https://doi.org/10.1093/nar/gkr1261
Steinegger M, Söding J (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35:1026–1028. https://doi.org/10.1038/nbt.3988
Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM (2016) Proteogenomics: integrating next-generation sequencing and mass spectrometry to characterize human proteomic variation. Annu Rev Anal Chem Palo Alto Calif 9:521–545. https://doi.org/10.1146/annurev-anchem-071015-041722
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410
Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448. https://doi.org/10.1073/pnas.85.8.2444
Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5. National Biomedical Research Foundation. Washington, DC, pp 345–352
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919. https://doi.org/10.1073/pnas.89.22.10915
Yu Y-K, Altschul SF (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinform Oxf Engl 21:902–911. https://doi.org/10.1093/bioinformatics/bti070
Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402. https://doi.org/10.1093/nar/25.17.3389
Panchenko AR, Bryant SH (2002) A comparison of position-specific score matrices based on sequence and structure alignments. Protein Sci Publ Protein Soc 11:361–370. https://doi.org/10.1110/ps.19902
Jaroszewski L, Rychlewski L, Li Z et al (2005) FFAS03: a server for profile–profile sequence alignments. Nucleic Acids Res 33:W284–W288. https://doi.org/10.1093/nar/gki418
Biegert A, Söding J (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci 106:3770. https://doi.org/10.1073/pnas.0810767106
Kaushik S, Nair AG, Mutt E et al (2016) Rapid and enhanced remote homology detection by cascading hidden Markov model searches in sequence space. Bioinformatics 32:338–344. https://doi.org/10.1093/bioinformatics/btv538
Kaznadzey A, Alexandrova N, Novichkov V, Kaznadzey D (2013) PSimScan: algorithm and utility for fast protein similarity search. PLoS ONE 8:e58505. https://doi.org/10.1371/journal.pone.0058505
Ge H, Sun L, Yu J (2017) Fast batch searching for protein homology based on compression and clustering. BMC Bioinform 18:508. https://doi.org/10.1186/s12859-017-1938-8
Nguyen VH, Lavenier D (2009) PLAST: parallel local alignment search tool for database comparison. BMC Bioinform 10:329. https://doi.org/10.1186/1471-2105-10-329
Qi Z-H, Jin M-Z, Li S-L, Feng J (2015) A protein mapping method based on physicochemical properties and dimension reduction. Comput Biol Med 57:1–7. https://doi.org/10.1016/j.compbiomed.2014.11.012
Gupta MK, Niyogi R, Misra M (2013) An alignment-free method to find similarity among protein sequences via the general form of Chou’s pseudo amino acid composition. SAR QSAR Environ Res 24:597–609. https://doi.org/10.1080/1062936X.2013.773378
Rost B (1999) Twilight zone of protein sequence alignments. Protein Eng 12:85–94. https://doi.org/10.1093/protein/12.2.85
Kumar R, Mishra BK, Lahiri T et al (2017) PCV: an alignment free method for finding homologous nucleotide sequences and its application in phylogenetic study. Interdiscip Sci Comput Life Sci 9:173–183. https://doi.org/10.1007/s12539-015-0136-5
Vella F (1998) The cell. A molecular approach; Edited by G H Cooper. pp 673. ASM Press, Washington DC, Sinauer Associates, Sunderland, MA. 1997 ISBN 0-87893-119-8. Biochem Educ 26:98–99
Sneath PH (1966) Relations between chemical structure and biological activity in peptides. J Theor Biol 12:157–195
Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157:105–132
Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185:862–864
Rice P, Longden I, Bleasby A (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet TIG 16:276–277
Zielezinski A, Girgis HZ, Bernard G et al (2019) Benchmarking of alignment-free sequence comparison methods. Genome Biol 20:144. https://doi.org/10.1186/s13059-019-1755-7
Abhilash CB, Rohitaksha K (2014) A comparative study on global and local alignment algorithm methods. Int J Emerg Technol Adv Eng 4:34–43
Kolekar P, Kale M, Kulkarni-Kale U (2012) Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping. Mol Phylogenet Evol 65:510–522. https://doi.org/10.1016/j.ympev.2012.07.003
Dolatshah M, Hadian A, Minaei-Bidgoli B (2015) Ball*-tree: Efficient spatial indexing for constrained nearest-neighbor search in metric spaces. ArXiv:151100628 Cs
Rodgers JL, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Stat 42:59–66. https://doi.org/10.1080/00031305.1988.10475524
Asamoah MK (2014) Re-examination of the limitations associated with correlational research. Educ Res Rev 2:45–52

Acknowledgements

The authors gratefully acknowledge the Biomedical Informatics Lab, Department of Applied Science, Computer Centre II, Indian Institute of Information Technology Allahabad (IIITA), India, for logistic support and for financial support to procure the computer systems and software, Perl (v5.26.1) and MATLAB (R2019b), used in this work. Manoj Kumar Pal and Rajnish Kumar also thank the Ministry of Human Resource Development (MHRD), Government of India, for providing a regular monthly research scholarship.

Author information

Authors and Affiliations

Department of Applied Sciences, Indian Institute of Information Technology, Allahabad, UP, 211015, India
Manoj Kumar Pal, Tapobrata Lahiri & Rajnish Kumar

Corresponding author

Correspondence to Tapobrata Lahiri.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file1 (XLSX 17 kb)

Supplementary file2 (PDF 260 kb)

About this article

Cite this article

Pal, M.K., Lahiri, T. & Kumar, R. ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time. Interdiscip Sci Comput Life Sci 12, 276–287 (2020). https://doi.org/10.1007/s12539-020-00380-w

Received: 18 November 2019
Revised: 19 May 2020
Accepted: 02 June 2020
Published: 10 June 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s12539-020-00380-w
