SpliceFinder: ab initio prediction of splice sites using convolutional neural network.
BMC Bioinformatics ( IF 2.9 ) Pub Date : 2019-12-27 , DOI: 10.1186/s12859-019-3306-3
Ruohan Wang 1 , Zishuai Wang 1 , Jianping Wang 1 , Shuaicheng Li 1

BACKGROUND: Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur there. However, these dinucleotides also occur frequently in sequences that contain no splice site, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. This approach reduces false positives, but it misses non-canonical splice sites.

RESULTS: We have designed SpliceFinder, a splice site predictor based on a convolutional neural network (CNN). To achieve ab initio prediction, we trained the network on human genomic data. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN obtains a classification accuracy of 90.25%, which is 10% higher than existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall above 0.8. SpliceFinder also captures non-canonical splice sites. In addition, it performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining.

CONCLUSION: Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
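
The sliding-window scan mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the one-hot encoding, the window length, the threshold, and the names `one_hot`, `scan`, and `toy_donor_score` are all assumptions made for illustration, and `toy_donor_score` is a hypothetical stand-in for a trained CNN that simply fires on a central GT dinucleotide.

```python
from typing import Callable, List, Tuple

import numpy as np

BASES = "ACGT"  # column order of the one-hot encoding (an assumption)


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (len, 4) array; unknown bases such as N stay all-zero."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def scan(
    seq: str,
    window: int,
    score_fn: Callable[[np.ndarray], float],
    threshold: float = 0.5,
) -> List[Tuple[int, float]]:
    """Slide a fixed-length window over seq and return (centre position, score)
    for every window whose score reaches the threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        score = float(score_fn(one_hot(seq[start:start + window])))
        if score >= threshold:
            hits.append((start + window // 2, score))
    return hits


def toy_donor_score(x: np.ndarray) -> float:
    """Hypothetical placeholder for a trained model: 1.0 if the window centre is 'GT'."""
    mid = x.shape[0] // 2
    return float(x[mid, 2] == 1.0 and x[mid + 1, 3] == 1.0)


if __name__ == "__main__":
    print(scan("AAAGTAAA", window=4, score_fn=toy_donor_score))
```

In practice `score_fn` would wrap a trained classifier. Because every window is scored, rather than only windows that contain GT or AG, a scan of this shape does not rule out non-canonical sites in advance.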

Updated: 2019-12-27