Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce. BMC Structural Biology


BMC Structural Biology, Pub Date: 2013-11-08, DOI: 10.1186/1472-6807-13-s1-s3
Boyu Zhang, Daniel T Yehdego, Kyle L Johnson, Ming-Ying Leung, Michela Taufer

BACKGROUND Ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation. Their secondary structures are crucial to RNA function, and secondary structure prediction is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting the secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure from the RNA sequence as a whole. The chunking, prediction, and reconstruction processes can use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three different chunking methods using seven popular secondary structure prediction programs applied to two datasets of RNA with known secondary structures, which include both pseudoknotted and non-pseudoknotted sequences, as well as to a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework based on Hadoop allows us to study the problem in a parallel and robust environment.

RESULTS On average, the maximum accuracy retention values are larger than one for our chunking methods and the seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted using chunking is more similar to the real structure than the secondary structure predicted using the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program using the centered chunking method. The performance analysis for 14 long RNA sequences from the Nodaviridae virus family shows that the coarse-grained mapping of chunking and predictions in the MapReduce framework yields shorter turnaround times for short RNA sequences. However, as the lengths of the RNA sequences increase, the fine-grained mapping can surpass the coarse-grained mapping in performance.

CONCLUSIONS By using our MapReduce framework together with statistical analysis of the accuracy retention results, we observe that the inversion-based chunking methods can outperform predictions using the whole sequence. Our chunk-based approach also enables us to predict secondary structures for very long RNA sequences, which is not feasible with traditional methods alone.
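
As a rough illustration of the chunk, predict, and reconstruct idea described in the abstract, the Python sketch below may help. Everything in it is an assumption made for illustration and not the authors' implementation: `chunk_fixed` uses simple fixed-length cuts rather than the paper's centered or inversion-based chunking, `toy_predict` is a placeholder for a real thermodynamic prediction program, the thread pool merely stands in for MapReduce map tasks, and the Jaccard-style `accuracy` and the ratio in `accuracy_retention` are assumed definitions chosen to match the abstract's statement that values above one favor chunking.

```python
from concurrent.futures import ThreadPoolExecutor

# Canonical and wobble base pairs accepted by the toy predictor (assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def chunk_fixed(seq, size):
    """Split seq into consecutive, non-overlapping chunks of at most `size`."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def toy_predict(chunk):
    """Placeholder predictor (hypothetical): pair mirrored positions from the
    outside in when they are complementary, leaving at least 3 unpaired bases
    between paired positions. Returns a dot-bracket string."""
    structure = ["."] * len(chunk)
    i, j = 0, len(chunk) - 1
    while j - i > 3:
        if (chunk[i], chunk[j]) in PAIRS:
            structure[i], structure[j] = "(", ")"
        i += 1
        j -= 1
    return "".join(structure)


def predict_chunked(seq, size, predictor=toy_predict):
    """Predict each chunk independently (the 'map' step), then reconstruct the
    full structure by concatenating the chunk structures in order."""
    chunks = chunk_fixed(seq, size)
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(predictor, chunks))
    return "".join(parts)


def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for idx, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(idx)
        elif symbol == ")":
            pairs.add((stack.pop(), idx))
    return pairs


def accuracy(predicted, reference):
    """Assumed similarity measure: Jaccard index of the two base-pair sets."""
    p, r = base_pairs(predicted), base_pairs(reference)
    if not p and not r:
        return 1.0
    return len(p & r) / len(p | r)


def accuracy_retention(acc_chunked, acc_whole):
    """Assumed definition: chunk-based accuracy divided by whole-sequence
    accuracy, so a value above one means chunking was more accurate."""
    if acc_whole == 0:
        raise ValueError("whole-sequence accuracy is zero; ratio undefined")
    return acc_chunked / acc_whole
```

With a real predictor substituted for `toy_predict`, one would compute `accuracy` once for `predict_chunked(seq, size)` and once for the prediction on the whole sequence, then pass both values to `accuracy_retention`. In a Hadoop setting each call to the predictor would correspond to a map task, and how chunks and prediction calls are grouped into tasks is what distinguishes the coarse-grained from the fine-grained mapping discussed above.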


Updated: 2019-11-01
