Recognition of splice-junction genetic sequences using random forest and Bayesian optimization

Baareh, Abdel Karim; Elsayad, Alaa; Al-Dhaifallah, Mujahed

doi:10.1007/s11042-021-10944-7

1155T: Advanced machine learning algorithms for biomedical data and imaging
Published: 30 April 2021

Multimedia Tools and Applications, Volume 80, pages 30505–30522 (2021)

Abdel Karim Baareh¹,
Alaa Elsayad²,³ (ORCID: orcid.org/0000-0001-8053-9759) &
Mujahed Al-Dhaifallah⁴

Abstract

Bayesian Optimization (BO) has recently emerged as an efficient technique for selecting the hyperparameters of machine learning models. The BO strategy maintains a surrogate model and an acquisition function to optimize computationally expensive functions in only a few iterations. In this paper, we demonstrate the utility of BO for fine-tuning the hyperparameters of a Random Forest (RF) model on the problem of recognizing splice-junction genetic sequences. Locating these splice junctions furthers understanding of the DNA splicing process. Specifically, the BO algorithm optimizes four RF hyperparameters: number of trees, number of splitting features, splitting criterion, and leaf size. The optimized RF model automatically selects the most predictive features of the training data. The dataset is obtained from the UCI Machine Learning Repository; half of its records represent two different types of splice junctions, and the other half do not represent any splice junction. Experimental results demonstrate the advantage of BO-RF, with training and test classification accuracies of 99.96% and 97.34%, respectively. The results also demonstrate the ability of the RF model to select the most important features, ensuring the best possible results when these features are used with Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and decision tree (DT) models. Practical procedures in model development and evaluation, such as out-of-bag error and cross-validation, are also discussed.
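The role of the acquisition function mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes expected improvement (a common choice; the abstract does not say which acquisition function the authors use) and a surrogate that reports a predicted error and an uncertainty for each candidate setting. The hyperparameter names (`n_trees`, `n_split_features`, `criterion`, `leaf_size`), the candidate values, and the surrogate numbers are all hypothetical and are not taken from the paper.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement when minimizing an error.

    mu, sigma: surrogate mean and standard deviation at a candidate point.
    best_so_far: lowest error observed in earlier iterations.
    """
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(best_so_far - mu, 0.0)
    z = (best_so_far - mu) / sigma
    return (best_so_far - mu) * normal_cdf(z) + sigma * normal_pdf(z)


# Hypothetical candidates: (RF hyperparameters, surrogate mean error, surrogate std).
candidates = [
    ({"n_trees": 100, "n_split_features": 8, "criterion": "gini", "leaf_size": 1}, 0.040, 0.002),
    ({"n_trees": 300, "n_split_features": 15, "criterion": "entropy", "leaf_size": 3}, 0.045, 0.020),
    ({"n_trees": 50, "n_split_features": 30, "criterion": "gini", "leaf_size": 10}, 0.060, 0.001),
]
best_error = 0.042  # illustrative value, not a result from the paper


def pick_next(candidates, best_error):
    """Return the candidate with the highest expected improvement."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best_error))


next_params, next_mu, next_sigma = pick_next(candidates, best_error)
print(next_params)
```

In this toy example the second candidate is selected even though its predicted error is slightly worse than the best observed value, because its large uncertainty leaves more room for improvement. This trade-off between exploiting good predictions and exploring uncertain regions is what lets BO limit the number of expensive model fits.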

Author information

Authors and Affiliations

Computers Science Department, Al-Balqa Applied University, Ajloun College, Ajloun, Jordan
Abdel Karim Baareh
College of Engineering, Prince Sattam Bin Abdulaziz University, Wadi Eldawasir, Kingdom of Saudi Arabia
Alaa Elsayad
Computers and Systems Department, Electronics Research Institute, Giza, 12622, Egypt
Alaa Elsayad
Systems Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran, 31261, Kingdom of Saudi Arabia
Mujahed Al-Dhaifallah

Corresponding author

Correspondence to Alaa Elsayad.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Baareh, A.K., Elsayad, A. & Al-Dhaifallah, M. Recognition of splice-junction genetic sequences using random forest and Bayesian optimization. Multimed Tools Appl 80, 30505–30522 (2021). https://doi.org/10.1007/s11042-021-10944-7

Received: 03 February 2020
Revised: 09 February 2021
Accepted: 14 April 2021
Published: 30 April 2021
Issue Date: August 2021
DOI: https://doi.org/10.1007/s11042-021-10944-7
