Predicting Coding Potential of RNA Sequences by Solving Local Data Imbalance
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IF 3.6), Pub Date: 2020-09-04, DOI: 10.1109/tcbb.2020.3021800
Xian-gan Chen 1, Shuai Liu 2, Wen Zhang 2

Non-coding RNAs (ncRNAs) play an important role in various biological processes and are associated with diseases. Distinguishing coding RNAs from ncRNAs, also known as predicting the coding potential of RNA sequences, is critical for downstream biological function analysis. Many machine learning-based methods have been proposed for this task. Recent studies reveal that most existing methods perform poorly on RNA sequences with short open reading frames (sORFs, ORF length < 303 nt). In this work, we analyze the distribution of ORF lengths of RNA sequences and observe that coding RNAs with sORFs are scarce and far fewer than ncRNAs with sORFs. Thus, RNA sequences with sORFs suffer from a local data imbalance. We propose a coding potential prediction method, CPE-SLDI, which uses data oversampling techniques to augment the samples of coding RNAs with sORFs so as to alleviate this local data imbalance. Compared with existing methods, CPE-SLDI achieves better performance, and the studies show that data augmentation by various data oversampling techniques can enhance coding potential prediction, especially for RNA sequences with sORFs. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPESLDI.
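
The idea of local oversampling can be illustrated with a minimal sketch. This is not the CPE-SLDI implementation (see the repository linked above for that). It assumes an ORF is the longest forward-strand stretch from ATG to an in-frame stop codon, with the stop codon counted in the length, and it uses plain random duplication as a stand-in for whichever oversampling techniques the method actually applies. The function names `longest_orf_length` and `oversample_sorf_coding` are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}
SORF_THRESHOLD = 303  # nt; the sORF cutoff stated in the abstract


def longest_orf_length(seq):
    """Length in nt of the longest ATG..stop ORF on the forward strand.

    Assumption for this sketch: the stop codon is counted in the length.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def oversample_sorf_coding(samples, seed=0):
    """Balance coding vs. non-coding samples inside the sORF region only.

    samples: list of (sequence, label) pairs, label 1 = coding, 0 = non-coding.
    Coding sORF samples are duplicated at random (a stand-in technique) until
    their count matches the non-coding sORF count. Long-ORF samples are kept
    unchanged.
    """
    rng = random.Random(seed)
    sorf = [(s, y) for s, y in samples if longest_orf_length(s) < SORF_THRESHOLD]
    coding = [x for x in sorf if x[1] == 1]
    noncoding = [x for x in sorf if x[1] == 0]
    deficit = len(noncoding) - len(coding)
    if not coding or deficit <= 0:
        return list(samples)  # nothing to balance
    extra = [rng.choice(coding) for _ in range(deficit)]
    return list(samples) + extra


# Toy data: one coding sORF, three non-coding sORFs, one long-ORF coding RNA.
toy = [
    ("ATGAAATAG", 1),
    ("CCCCCCCCC", 0),
    ("GGGGGGGGG", 0),
    ("ACACACACA", 0),
    ("ATG" + "AAA" * 100 + "TAG", 1),  # 306 nt ORF, outside the sORF region
]
balanced = oversample_sorf_coding(toy)
print(len(toy), "->", len(balanced))  # 5 -> 7
```

Restricting the resampling to sequences below the 303 nt cutoff leaves the class ratio among long-ORF sequences untouched, which is what makes the correction local rather than global.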

Updated: 2020-09-04