Abstract
Enhancers are short cis-acting DNA elements that can be bound by proteins (activators) to increase the likelihood that a particular gene will be transcribed. Enhancers play a significant role in protein formation and in regulating gene transcription. Human diseases such as cancer, inflammatory bowel disease, Parkinson's disease, addiction, and schizophrenia have been linked to genetic variation in enhancers. In the current study, we build a novel and more robust bi-layered computational model, piEnPred. The representative feature vector was constructed from a linear combination of six features. The optimal hybrid feature vector was obtained via the novel Cascade Multi-Level Subset Feature Selection (CM-SFS) algorithm. The first layer predicts enhancers, and the second layer predicts their subtypes. On layer-1, the baseline model achieved an accuracy of 87.88%, a sensitivity of 95.29%, a specificity of 80.47%, an MCC of 0.766, and an ROC value of 0.9603. Similarly, on layer-2 the model achieved an accuracy of 68.24%, a sensitivity of 65.54%, a specificity of 70.95%, an MCC of 0.3654, and an ROC value of 0.7568. On an independent dataset, piEnPred achieved 80.4% accuracy, 82.5% sensitivity, 78.4% specificity, and an MCC of 0.6099 on layer-1, and 72.5% accuracy, 70.0% sensitivity, 75% specificity, and an MCC of 0.4506 on layer-2. The proposed method performed remarkably well compared with other state-of-the-art predictors. For the convenience of experimental scientists, a user-friendly and freely accessible web server has been developed at bienhancer.pythonanywhere.com.
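The abstract reports accuracy, sensitivity, specificity, and MCC without giving their formulas. As a minimal sketch, assuming the standard confusion-matrix definitions of these four measures, they can be computed as follows. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: positives (e.g. enhancers) predicted correctly/incorrectly.
    tn/fp: negatives (e.g. non-enhancers) predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not the paper's data).
metrics = binary_metrics(tp=95, fn=5, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

When the two classes are the same size, accuracy equals the mean of sensitivity and specificity. The reported layer-1 figures satisfy this, since (95.29% + 80.47%) / 2 = 87.88%, which is consistent with a class-balanced benchmark, although the abstract does not state the class sizes.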
Acknowledgements
The work was supported by the National Natural Science Foundation of China (Grant No. U1433116).
Author information
Additional information
Supporting information
The supporting information is available online at journal.hep.com.cn and link.springer.com.
Zaheer Ullah Khan received his master's degree in computer science from the University of Peshawar, Pakistan in 2008, and the MS degree from Abdul Wali Khan University Mardan, Pakistan in 2017. He has extensive experience in the IT sector and the software development industry. He is currently pursuing his PhD degree at Nanjing University of Aeronautics and Astronautics, China. His research areas include computational biology and bioinformatics, with particular interest in predictive models for RNA/DNA sequences and generative models. He is also working with the Jiangsu Key Laboratory of NUAA.
Dechang Pi received the BEng and MEng degrees and the PhD degree in computer engineering from the Nanjing University of Aeronautics and Astronautics (NUAA), China in 1994, 1997, and 2002, respectively, where he is currently a professor and a PhD supervisor. He has authored over 100 journal and conference papers. His research interests include data mining and privacy, intelligent optimization methods, and security issues about moving objects. He presided over 30 research projects of the National Natural Science Foundation of China, the National 863 Program, the National Technical Foundation, the Civil Aerospace Foundation, and the Aviation Science Foundation.
Shuanglong Yao is currently pursuing a PhD at Nanjing University of Aeronautics and Astronautics, China. His main research areas are knowledge graphs and knowledge representation.
Asif Nawaz received the MS degree in software engineering from the National University of Sciences and Technology, Pakistan in 2010. He is currently pursuing the PhD degree at the Nanjing University of Aeronautics and Astronautics, China. His main interests include software engineering, machine learning, geographical information systems, data analysis, and decision support systems.
Farman Ali received his BS and MS degrees in computer science from the University of Peshawar and Abdul Wali Khan University Mardan, Pakistan in 2009 and 2016, respectively. At present he is a PhD scholar in computer science and technology at Nanjing University of Science and Technology, China, with research interests in bioinformatics and machine learning. He is a member of the CSBIO group under the supervision of Prof. Dong-Jun Yu.
Shaukat Ali received his PhD degree in computer science from the University of Peshawar, Pakistan. He received his BSc and MS degrees in computer science from the same university in 2007 and 2010, respectively. He also works as a lecturer at the Department of Computer Science, Islamia College Peshawar, Pakistan. His areas of interest are information security, privacy, big data, and data analytics.
Electronic Supplementary Material
11704_2020_9504_MOESM2_ESM.pdf
piEnPred: a bi-layered discriminative model for enhancers and their subtypes via novel cascade multi-level subset feature selection algorithm
About this article
Cite this article
Khan, Z.U., Pi, D., Yao, S. et al. piEnPred: a bi-layered discriminative model for enhancers and their subtypes via novel cascade multi-level subset feature selection algorithm. Front. Comput. Sci. 15, 156904 (2021). https://doi.org/10.1007/s11704-020-9504-3
DOI: https://doi.org/10.1007/s11704-020-9504-3