CNN-MGP: Convolutional Neural Networks for Metagenomics Gene Prediction



Interdisciplinary Sciences: Computational Life Sciences (IF 3.9), Pub Date: 2018-12-27, DOI: 10.1007/s12539-018-0313-4
Amani Al-Ajlan, Achraf El Allali

Accurate gene prediction in metagenomic fragments is computationally challenging because the reads are short and the data are incomplete and fragmented. Most gene-prediction programs extract a large number of features and then apply statistical or supervised classification approaches to predict genes. In this study, we introduce CNN-MGP, a convolutional neural network program for metagenomics gene prediction that predicts genes in metagenomic fragments directly from raw DNA sequences, without manual feature extraction or feature selection stages. CNN-MGP learns the characteristics of coding and non-coding regions and distinguishes coding from non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets defined by pre-defined GC content ranges. We extract the ORFs from each fragment, encode them numerically, and feed them into the CNN model that matches the fragment's GC content. The output of the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm selects the final gene list. Overall, CNN-MGP achieves 91% accuracy on the test dataset. CNN-MGP demonstrates that deep learning can predict genes in metagenomic fragments, with accuracy higher than or comparable to state-of-the-art gene-prediction programs that use pre-defined features.
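The pipeline described in the abstract (extract ORFs, encode them, route them to a GC-specific model, then select genes greedily) can be sketched roughly as follows. This is an illustrative outline and not the authors' implementation. Several details are assumptions, because the abstract does not specify them: the one-hot encoding, the equal-width GC bins, the start-codon set, the probability threshold, the no-overlap rule, and every function name are hypothetical. The sketch also scans only the forward strand for complete ORFs, whereas a practical gene finder would likely also handle the reverse strand and ORFs truncated by fragment ends. Each model is a stand-in callable that returns a probability; no trained CNN is included.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed codon sets (common bacterial starts and the standard stops).
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumed encoding: the abstract only says ORFs are "encoded numerically".
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

Encoded = List[Tuple[int, int, int, int]]
Model = Callable[[Encoded], float]  # stand-in for a trained CNN


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_bin(gc: float, n_bins: int = 10) -> int:
    """Map a GC fraction to an equal-width bin index (assumed binning)."""
    return min(int(gc * n_bins), n_bins - 1)


def find_orfs(seq: str, min_len: int = 60) -> List[Tuple[int, int]]:
    """Complete forward-strand ORFs as half-open (start, end) intervals."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def one_hot(seq: str) -> Encoded:
    """One-hot encode a DNA string; unknown bases become all zeros."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq]


def greedy_select(scored: Sequence[Tuple[int, int, float]],
                  threshold: float = 0.5,
                  max_overlap: int = 0) -> List[Tuple[int, int, float]]:
    """Take ORFs in order of decreasing probability, skipping any that
    overlap an already chosen ORF by more than max_overlap bases."""
    chosen: List[Tuple[int, int, float]] = []
    for start, end, prob in sorted(scored, key=lambda t: t[2], reverse=True):
        if prob < threshold:
            break
        if all(min(end, e) - max(start, s) <= max_overlap
               for s, e, _ in chosen):
            chosen.append((start, end, prob))
    return sorted(chosen)


def predict_genes(fragment: str, models: Sequence[Model],
                  threshold: float = 0.5,
                  min_len: int = 60) -> List[Tuple[int, int, float]]:
    """Score every ORF with the model for the fragment's GC bin."""
    model = models[gc_bin(gc_content(fragment), len(models))]
    scored = [(s, e, model(one_hot(fragment[s:e])))
              for s, e in find_orfs(fragment, min_len)]
    return greedy_select(scored, threshold)
```

To try the sketch, `predict_genes` can be called with a list of ten placeholder callables, for example `[lambda x: 0.9] * 10`, in place of trained models. Routing by fragment GC content means that each model only sees ORFs from fragments in its own GC range, which matches the abstract's description of ten mutually exclusive training sets.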


Updated: 2019-11-01
