A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples.
Biology Direct (IF 5.7), Pub Date: 2019-04-11, DOI: 10.1186/s13062-019-0236-y
Ying Zeng 1, 2 , Hongjie Yuan 1 , Zheming Yuan 1, 3 , Yuan Chen 4

BACKGROUND Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches to splice site prediction have achieved satisfactory accuracy, further improvement remains important because it helps predict gene structure more accurately. A proper window size must be determined before prediction. An overly long window may introduce irrelevant features that reduce predictive accuracy, whereas a short window that retains maximum information may perform better in both predictive accuracy and time cost. Furthermore, false splice sites that follow the GT-AG rule far outnumber true splice sites, so accurate and rapid prediction of splice sites from large imbalanced samples has always been a challenge. We therefore developed a new computational method for donor splice site prediction, the chi-square decision table (χ2-DT), based on a short window size and large imbalanced samples.
RESULTS Using a short window size of 11 bp, χ2-DT extracts improved positional and compositional features based on the chi-square test, introduces features one by one based on information gain, and constructs a balanced decision table for imbalanced pattern classification. With a training set of 2000 true sites and 271,132 false sites, χ2-DT achieves the highest independent test accuracy (93.34%) in a comparison with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and requires a short computation time (89 s). χ2-DT also shows good independent test accuracy (92.40%) when validated on BG-570 mutated sequences with frameshift errors (nucleotide insertions and deletions). Moreover, in comparisons with both long-window and short-window methods, χ2-DT outperforms all of them in predictive accuracy.
CONCLUSIONS Based on a short window size and large imbalanced samples, the proposed method not only achieves higher predictive accuracy than some existing methods but also offers high computational speed and good robustness against nucleotide insertions and deletions.
REVIEWERS This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther.
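
The abstract names two generic building blocks: a chi-square test on features from a short sequence window, and feature ranking by information gain. The following is a minimal Python sketch of those two textbook computations only. It is not the authors' χ2-DT implementation: the "improved" positional features, the compositional features, and the balanced decision table are not reproduced, and the function names and toy windows are hypothetical.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"


def chi_square_per_position(true_windows, false_windows):
    """Pearson chi-square statistic for each window position.

    At each position a 2 x 4 contingency table (class x nucleotide) is
    built; a larger statistic means the nucleotide distribution differs
    more between true and false sites at that position.
    """
    n_true, n_false = len(true_windows), len(false_windows)
    total = n_true + n_false
    scores = []
    for pos in range(len(true_windows[0])):
        true_counts = Counter(w[pos] for w in true_windows)
        false_counts = Counter(w[pos] for w in false_windows)
        stat = 0.0
        for base in NUCLEOTIDES:
            column = true_counts[base] + false_counts[base]
            if column == 0:
                continue  # base never observed here; contributes nothing
            for observed, row in ((true_counts[base], n_true),
                                  (false_counts[base], n_false)):
                expected = row * column / total
                stat += (observed - expected) ** 2 / expected
        scores.append(stat)
    return scores


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g)
                      for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Toy 2-nt windows (hypothetical data, far shorter than the 11 bp window).
    true_sites = ["AG", "AG"]
    false_sites = ["AC", "AT"]
    print(chi_square_per_position(true_sites, false_sites))
    second_base = [w[1] for w in true_sites + false_sites]
    print(information_gain(second_base, [1, 1, 0, 0]))
```

In this toy setting, a position where both classes always carry the same nucleotide scores zero, while a position that separates the classes scores high on both measures. How χ2-DT combines such scores into its decision table is not described in the abstract and is left out here.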

Updated: 2020-04-22