Recognition of splice-junction genetic sequences using random forest and Bayesian optimization
Multimedia Tools and Applications (IF 3.6), Pub Date: 2021-04-30, DOI: 10.1007/s11042-021-10944-7
Abdel Karim Baareh, Alaa Elsayad, Mujahed Al-Dhaifallah

Bayesian optimization (BO) has recently become an efficient technique for selecting the hyperparameters of machine learning models. The BO strategy maintains a surrogate model and an acquisition function in order to optimize computationally expensive functions in only a few iterations. In this paper, we demonstrate the utility of BO for fine-tuning the hyperparameters of a Random Forest (RF) model on a problem related to the recognition of splice-junction genetic sequences. Locating these splice junctions advances the understanding of the DNA splicing process. Specifically, the BO algorithm optimizes four RF hyperparameters: number of trees, number of splitting features, splitting criterion, and leaf size. The optimized RF model automatically selects the most predictive features of the training data. The dataset is obtained from the UCI machine learning repository; half of its records represent two different types of splice junctions and the other half represent no splice junction. Experimental results demonstrate the advantage of BO-RF, which reaches classification accuracies of 99.96% on the training set and 97.34% on the test set. The results also demonstrate the ability of the RF model to select the most important features, which in turn gave the best results obtained with Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and decision tree (DT) models. Some practical procedures in model development and evaluation, such as out-of-bag error and cross-validation, are also discussed.
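The surrogate-plus-acquisition loop described in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration rather than the authors' setup: the surrogate is taken to be a Gaussian process with an RBF kernel, the acquisition function is taken to be expected improvement, only one of the four hyperparameters (leaf size) is searched, and `toy_error` is a hypothetical stand-in for the expensive step of training an RF and measuring its validation or out-of-bag error.

```python
import math
import random
import statistics


def rbf(a, b, length=0.15):
    """Squared-exponential kernel between two scalars (assumed surrogate kernel)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Zero-mean GP posterior mean and standard deviation at x_new."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k = [rbf(x, x_new) for x in xs]
    mu = sum(ki * ai for ki, ai in zip(k, solve(K, ys)))
    var = 1.0 - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mu, math.sqrt(max(var, 0.0))


def expected_improvement(mu, sigma, best):
    """Expected improvement over `best` when minimizing."""
    if sigma == 0.0:
        return 0.0
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf


def toy_error(leaf_size):
    """Hypothetical stand-in for an expensive RF error; minimum at leaf_size = 12."""
    return 0.03 + 0.0002 * (leaf_size - 12) ** 2


def bayes_opt(objective, candidates, n_init=3, n_iter=10, seed=0):
    """Minimize `objective` over a list of candidate values with a GP + EI loop."""
    rng = random.Random(seed)
    lo, hi = min(candidates), max(candidates)
    tried = rng.sample(candidates, n_init)       # random initial design
    scores = [objective(c) for c in tried]
    for _ in range(n_iter):
        remaining = [c for c in candidates if c not in tried]
        if not remaining:
            break
        # Standardize the observed scores so the unit-variance GP prior fits them.
        m, s = statistics.mean(scores), statistics.pstdev(scores) or 1.0
        zs = [(y - m) / s for y in scores]
        xs = [(c - lo) / (hi - lo) for c in tried]
        best = min(zs)

        def acquisition(c):
            mu, sigma = gp_posterior(xs, zs, (c - lo) / (hi - lo))
            return expected_improvement(mu, sigma, best)

        nxt = max(remaining, key=acquisition)    # most promising untried value
        tried.append(nxt)
        scores.append(objective(nxt))            # the only "expensive" call
    i = min(range(len(scores)), key=scores.__getitem__)
    return tried[i], scores[i]


best_leaf, best_err = bayes_opt(toy_error, list(range(1, 51)))
print(best_leaf, best_err)
```

In a real run, `toy_error` would be replaced by a function that fits an RF with the proposed hyperparameters and returns its error estimate, and the search space would cover all four hyperparameters rather than leaf size alone. The point of the loop is that the objective is evaluated only `n_init + n_iter` times, while the cheap surrogate is queried for every remaining candidate.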




Updated: 2021-05-02