Abstract
Essential proteins are considered indispensable for sustaining normal physiological function and are crucial to drug design and disease diagnosis. Discovering them is therefore of great importance for revealing molecular mechanisms and biological processes. Because biological experiments are tedious, many computational methods have been developed to discover essential proteins by mining features of high-throughput data. Appropriate integration of different types of biological information with the protein–protein interaction (PPI) network has proven useful in predicting essential proteins. This study provides a comprehensive review of methods that identify essential proteins by integrating multi-source data, and offers guidance for researchers. Current essential protein prediction algorithms are analyzed and compared in detail and tested on benchmark PPI networks. In addition, building on the previous method TEGS (short for network Topology, gene Expression, Gene ontology, and Subcellular localization), we propose CEGSO, which improves the prediction of essential proteins by incorporating known protein complex information, gene expression profiles, Gene Ontology (GO) term information, subcellular localization information, and protein orthology data into the PPI network. The simulation results show that CEGSO achieves more accurate and robust results than the compared methods across different test datasets and under various evaluation measures.
Acknowledgements
This work was partly supported by the National Natural Science Foundation (No. 61802125, No. 61763010, No. 61862026 and No. 61862025), the Natural Science Foundation of Jiangxi Province (No. 20181BAB202006 and No. 20161BAB211022), the Guangxi “BAGUI Scholar” Program under Grant [2016] 127, and the Science and Technology Major Project of Guangxi under Grant AA18118047. The authors would like to thank Dr. John Kalivas of the Chemistry Department of Idaho State University for providing the simulation code for SRD.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Zhang, W., Xue, X., Xie, C. et al. CEGSO: Boosting Essential Proteins Prediction by Integrating Protein Complex, Gene Expression, Gene Ontology, Subcellular Localization and Orthology Information. Interdiscip Sci Comput Life Sci 13, 349–361 (2021). https://doi.org/10.1007/s12539-021-00426-7
DOI: https://doi.org/10.1007/s12539-021-00426-7