Abstract
The chloroplast is a subcellular organelle found in green plants and algae, and it is the main site of photosynthesis. Proteins that localize within the chloroplast carry out the photosynthetic process at the molecular level. The chloroplast can be further divided into several compartments, and proteins in different compartments are involved in different steps of photosynthesis. Because the molecular function of a protein is highly correlated with its exact cellular localization, pinpointing the subchloroplast location of a chloroplast protein is an important step towards understanding its role in photosynthesis. Experimental determination of protein subchloroplast locations is costly and time-consuming. Computational approaches have therefore been developed to predict protein subchloroplast locations from primary sequences. Over the past decade, more than a dozen studies have attempted this prediction with machine learning methods, introducing a variety of sequence features and machine learning algorithms. In this review, we collect comprehensive information on all existing studies of protein subchloroplast location prediction. We compare these studies in terms of benchmarking datasets, sequence features, machine learning algorithms, predictive performance, and implementation availability. We summarize the progress and current status of this research topic, and we outline the most likely directions for future work. We hope this review not only lists all existing works, but also serves readers as a useful resource for quickly grasping the big picture of the topic and as a starting point for future methodological studies on predicting protein subchloroplast locations.
Acknowledgements
This work was supported by the National Key R&D Program of China (2018YFC0910405), the National Natural Science Foundation of China (NSFC, Grant No. 61872268), and the Open Project Funding of the CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences (CASNDST201705).
Author information
Jian Sun is a master's student in the College of Intelligence and Computing, Tianjin University, China. He received his bachelor's degree in Chemical Engineering from Qingdao University of Science and Technology, China. He expects to receive his master's degree in 2021.
Pu-Feng Du is an associate professor in the College of Intelligence and Computing, Tianjin University, China. He received his bachelor’s degree and PhD from Tsinghua University, China in 2005 and 2010, respectively. Dr. Du’s research interests include bioinformatics and machine learning.
Cite this article
Sun, J., Du, PF. Predicting protein subchloroplast locations: the 10th anniversary. Front. Comput. Sci. 15, 152901 (2021). https://doi.org/10.1007/s11704-020-9507-0
DOI: https://doi.org/10.1007/s11704-020-9507-0