Deep convolutional neural networks for predicting leukemia-related transcription factor binding sites from DNA sequence data
Chemometrics and Intelligent Laboratory Systems (IF 3.9), Pub Date: 2020-04-01, DOI: 10.1016/j.chemolab.2020.103976
Jian He, Xuemei Pu, Menglong Li, Chuan Li, Yanzhi Guo

Abstract Transcription factors are proteins that bind to specific DNA sequences to regulate gene expression. Identifying transcription factor binding sites in DNA sequences is important for building regulatory models of biological systems and for identifying pathogenic variations. Traditional machine-learning methods have been used successfully for biological prediction problems based on DNA or protein sequences, but they all require manual extraction of numerical features, which is tedious and can discard useful information contained in the primary sequence. In this paper, based on the principle of deep learning (DL), we constructed prediction models for transcription factor binding sites directly from raw DNA base sequences. A DL method combining a convolutional neural network (CNN) and long short-term memory (LSTM) is proposed to investigate four leukemia categories from the perspective of transcription factor binding sites, using four large non-redundant datasets for acute, chronic, myeloid and lymphatic leukemia, respectively. Compared with three widely used machine-learning methods, namely artificial neural network (ANN), support vector machine (SVM) and random forest (RF), our DL method shows clearly better prediction performance: the prediction accuracies of the three machine-learning models, whether based on sequence features or on k-mer features, are all lower than that of the DL model. The DL models for the four leukemia categories give an average prediction accuracy of 75% based only on sequence segments of 101 bases, which indicates that the DL-based method is promising and has advantages over the traditional machine-learning methods. For leukemia-related transcription factor binding site prediction specifically, further improvements, such as optimizing the base segment length and the CNN architecture, remain to be made in order to raise the current prediction accuracy.
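
The two kinds of input contrasted in the abstract can be illustrated with a minimal sketch. The abstract does not state how raw bases are encoded for the CNN/LSTM model or which k is used for the k-mer baselines, so the one-hot scheme, the choice k = 3, and the function names `one_hot_encode` and `kmer_frequencies` below are assumptions made for illustration only, not the authors' implementation.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"  # assumed base ordering for the one-hot columns


def one_hot_encode(seq):
    """Return a len(seq) x 4 matrix; bases outside ACGT (e.g. N) become all-zero rows."""
    index = {base: i for i, base in enumerate(BASES)}
    matrix = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


def kmer_frequencies(seq, k=3):
    """Return normalized frequencies of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    # Only k-mers made purely of A, C, G, T contribute to the total.
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Hypothetical 101-base segment, matching the segment length in the abstract.
segment = "ACGT" * 25 + "A"
encoded = one_hot_encode(segment)           # raw-sequence input for a DL model
features = kmer_frequencies(segment, k=3)   # hand-crafted input for ANN/SVM/RF
print(len(encoded), len(encoded[0]), len(features))
```

The one-hot matrix keeps the position of every base, whereas the k-mer vector only records how often each short word occurs; this loss of positional information is one plausible reading of the abstract's remark that manual feature extraction can discard sequence information.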

Last updated: 2020-04-01