Comparison of machine learning and deep learning techniques in promoter prediction across diverse species
PeerJ Computer Science ( IF 3.5 ) Pub Date : 2021-02-09 , DOI: 10.7717/peerj-cs.365
Nikita Bhandari 1 , Satyajeet Khare 2 , Rahee Walambe 3, 4 , Ketan Kotecha 1, 3

Gene promoters are the key DNA regulatory elements positioned around transcription start sites, and they regulate the process of gene transcription. Various alignment-based, signal-based and content-based approaches have been reported for promoter prediction. However, because not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Many machine learning and deep learning models have therefore been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct higher eukaryotes, namely yeast (Saccharomyces cerevisiae), a plant (Arabidopsis thaliana) and human (Homo sapiens). We compared one-hot vector encoding with frequency-based tokenization (FBT) for data pre-processing on a 1-D convolutional neural network (CNN) model. We found that FBT gives a shorter input dimension, which reduces training time without affecting the sensitivity and specificity of classification. We employed deep learning techniques, mainly a CNN and a recurrent neural network with long short-term memory (LSTM), alongside a random forest (RF) classifier, for promoter classification at k-mer sizes of 2, 4 and 8. We found the CNN to be superior both in distinguishing promoters from non-promoter sequences (binary classification) and in species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of a synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
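The pre-processing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not define the details, so several points are assumptions: FBT is taken to mean splitting a sequence into k-mers and replacing each k-mer with an integer rank by descending frequency; k-mers are taken to overlap with a stride of 1; and the synthetic negative set is taken to be built by shuffling the nucleotides of each promoter sequence. All function names (kmers, one_hot, build_vocab, tokenize, shuffled_negative) are hypothetical.

```python
import random
from collections import Counter

def kmers(seq, k):
    """Split a sequence into overlapping k-mers (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def one_hot(seq):
    """Encode each nucleotide as a 4-element indicator vector in A, C, G, T order."""
    order = "ACGT"
    return [[1 if base == b else 0 for b in order] for base in seq]

def build_vocab(seqs, k):
    """Assign each k-mer an integer rank: the most frequent k-mer gets 1.

    Index 0 is reserved for k-mers not seen when the vocabulary was built.
    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(km for s in seqs for km in kmers(s, k))
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: i + 1 for i, km in enumerate(ranked)}

def tokenize(seq, vocab, k):
    """Replace each k-mer in the sequence with its integer rank (0 if unknown)."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]

def shuffled_negative(seq, rng):
    """Build a synthetic non-promoter by shuffling the nucleotides of a promoter.

    Shuffling keeps the base composition but destroys positional signals.
    This shuffling scheme is an assumption, not taken from the abstract.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

# Toy sequences, invented for illustration only.
promoters = ["TATAAAGC", "TATATAGC"]
vocab = build_vocab(promoters, k=2)
tokens = tokenize(promoters[0], vocab, k=2)
print(tokens)  # [1, 2, 1, 3, 3, 4, 5]

rng = random.Random(0)
negative = shuffled_negative(promoters[0], rng)
print(negative)
```

Under these assumptions, a sequence of length L becomes an L x 4 matrix with one-hot encoding but a single vector of L - k + 1 integers with tokenization, which is one way to read the abstract's remark that FBT gives a shorter input dimension.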

Updated: 2021-02-09