CNN-MGP: Convolutional Neural Networks for Metagenomics Gene Prediction.
Interdisciplinary Sciences: Computational Life Sciences (IF 3.9), Pub Date: 2018-12-27, DOI: 10.1007/s12539-018-0313-4
Amani Al-Ajlan, Achraf El Allali

Accurate gene prediction in metagenomics fragments is computationally challenging because the reads are short and the data are incomplete and fragmented. Most gene-prediction programs extract a large number of features and then apply statistical or supervised classification approaches to predict genes. In this study, we introduce CNN-MGP, a convolutional neural network program for metagenomics gene prediction that predicts genes in metagenomics fragments directly from raw DNA sequences, without manual feature extraction or feature selection stages. CNN-MGP learns the characteristics of coding and non-coding regions and distinguishes coding from non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets defined by pre-defined GC content ranges. We extract ORFs from each fragment; the ORFs are then encoded numerically and fed into the CNN model that matches the fragment's GC content. The output of the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm selects the final gene list. Overall, CNN-MGP is effective and achieves 91% accuracy on the testing dataset. CNN-MGP demonstrates the ability of deep learning to predict genes in metagenomics fragments, and it achieves accuracy higher than or comparable to state-of-the-art gene-prediction programs that use pre-defined features.
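
The pipeline described in the abstract (ORF extraction, numerical encoding, GC-based model routing, greedy selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the GC ranges, the encoding scheme, the start codons, or the greedy rule, so the equal-width GC bins, the one-hot encoding, the start-codon set, the 0.5 threshold, and the no-overlap rule below are all assumptions, and every function name is hypothetical. The CNN itself is omitted: the ORF probabilities are supplied by hand. For brevity, only complete ORFs on the forward strand are considered.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common prokaryotic start codons


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(gc: float, n_bins: int = 10) -> int:
    """Index of the model to use, assuming n_bins equal-width GC ranges."""
    return min(int(gc * n_bins), n_bins - 1)


def one_hot(seq: str) -> List[List[int]]:
    """Encode each base as a 4-vector (A, C, G, T); unknown bases become zeros."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def forward_orfs(seq: str, min_len: int = 6) -> List[Tuple[int, int]]:
    """Complete forward-strand ORFs as (start, end) with end exclusive.

    In each of the three frames, an ORF runs from the first start codon
    after the previous stop codon up to and including the next stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def greedy_select(scored: List[Tuple[int, int, float]],
                  threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Keep the highest-probability ORFs that do not overlap one another.

    scored holds (start, end, probability) triples; the no-overlap rule
    and the threshold are assumptions made for this sketch.
    """
    chosen: List[Tuple[int, int, float]] = []
    for start, end, prob in sorted(scored, key=lambda t: t[2], reverse=True):
        if prob < threshold:
            break
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, prob))
    return sorted(chosen)


fragment = "ATGAAATAG"
orfs = forward_orfs(fragment)             # candidate ORFs in the fragment
model_index = gc_bin(gc_content(fragment))  # which of the 10 models to query
encoded = [one_hot(fragment[s:e]) for s, e in orfs]  # CNN input per ORF
```

In a full pipeline, each encoded ORF would be scored by the CNN selected through `model_index`, and the resulting `(start, end, probability)` triples would be passed to `greedy_select`.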

Updated: 2019-11-01