External memory BWT and LCP computation for sequence collections with applications.
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2019-03-08, DOI: 10.1186/s13015-019-0140-0
Lavinia Egidi 1 , Felipe A Louza 2 , Giovanni Manzini 1, 3 , Guilherme P Telles 4

BACKGROUND: Sequencing technologies produce ever larger collections of biosequences, which have to be stored in compressed indices that support fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory while making the best possible use of the available RAM.

RESULTS: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections small enough that their BWTs can be computed in RAM using an optimal linear-time algorithm. It then merges the partial BWTs in external or semi-external memory, computing the LCP values in the process. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of maximal repeats, the all-pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.

CONCLUSIONS: We prove that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences, but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
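To make the two output arrays concrete, the following is a minimal Python sketch that computes the BWT and LCP array of a small collection by naive in-RAM suffix sorting. It is not the external memory algorithm described in the abstract; it only illustrates what that algorithm outputs. The function name `bwt_lcp_naive`, the use of `$` as the printed end marker, and the convention that the terminator of an earlier sequence sorts before that of a later one are assumptions made for this illustration.

```python
def bwt_lcp_naive(collection):
    """Naive BWT and LCP array of a list of strings (illustration only).

    Assumption: sequence i ends with its own terminator, terminators are
    ordered by sequence index and are smaller than every real symbol.
    """
    k = len(collection)
    # Encode symbols as integers shifted by k, so terminator i (value i)
    # is distinct from, and smaller than, every real symbol.
    texts = [[ord(c) + k for c in s] + [i] for i, s in enumerate(collection)]

    # Every suffix is identified by (sequence index, start position).
    suffixes = [(i, j) for i, t in enumerate(texts) for j in range(len(t))]
    suffixes.sort(key=lambda p: texts[p[0]][p[1]:])

    # BWT: the symbol preceding each suffix in its own sequence;
    # a suffix starting at position 0 is preceded by the end marker.
    bwt = "".join(collection[i][j - 1] if j > 0 else "$" for i, j in suffixes)

    def common_prefix(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    # LCP: length of the common prefix of lexicographically adjacent suffixes.
    lcp = [0] * len(suffixes)
    for r in range(1, len(suffixes)):
        (i1, j1), (i2, j2) = suffixes[r - 1], suffixes[r]
        lcp[r] = common_prefix(texts[i1][j1:], texts[i2][j2:])
    return bwt, lcp
```

For the collection `["ab", "b"]`, this sketch returns the BWT `"bb$a$"` and the LCP array `[0, 0, 0, 0, 1]`, so maxlcp is 1 in that case. Because the terminators are distinct, an LCP value never extends past the end of a sequence, which is why maxlcp is bounded by the longest shared substring rather than by the total length n.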

Updated: 2019-11-01