

Prefix-free parsing for building big BWTs.
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2019-05-24, DOI: 10.1186/s13015-019-0148-5
Christina Boucher (1), Travis Gagie (2, 3), Alan Kuhnle (1, 4), Ben Langmead (5), Giovanni Manzini (6, 7), Taher Mun (5)


High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and, in one pass, generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable amount of internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
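
The abstract does not spell out the parsing rule, so the following is only a minimal sketch of the one-pass step, under stated assumptions: a window of length w slides over the text, and a phrase ends whenever the window's hash is 0 modulo p, with consecutive phrases overlapping by w characters. The names `window_hash`, `prefix_free_parse`, and `unparse`, the parameters `w` and `p`, and the sentinel bytes are all hypothetical choices made for illustration. The sketch produces D and P only; it does not build the BWT from them.

```python
def window_hash(window, base=256, prime=1_000_000_007):
    """Polynomial (Karp-Rabin style) hash, recomputed from scratch.

    A practical implementation would presumably update this hash in O(1)
    per position (a rolling hash); recomputing keeps the sketch short.
    """
    h = 0
    for ch in window:
        h = (h * base + ord(ch)) % prime
    return h


def prefix_free_parse(text, w=3, p=5):
    """Return (dictionary, parse) for `text` in a single left-to-right pass.

    Assumption: '\x01' (start sentinel) and '\x00' (end sentinel) do not
    occur in `text`.
    """
    padded = "\x01" + text + "\x00" * w
    phrases = []
    start = 0  # start of the phrase currently being extended
    for i in range(1, len(padded) - w + 1):
        window = padded[i:i + w]
        is_last = i + w == len(padded)  # the final all-sentinel window
        is_trigger = "\x00" not in window and window_hash(window) % p == 0
        if is_last or is_trigger:
            # The phrase ends with this window; the next phrase begins
            # with the same window, so phrases overlap by w characters.
            phrases.append(padded[start:i + w])
            start = i
    dictionary = sorted(set(phrases))  # D: distinct phrases, sorted
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]  # P: phrase ranks in text order
    return dictionary, parse


def unparse(dictionary, parse, w):
    """Rebuild the original text from D and P by undoing the w-overlap."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out[1:-w]  # drop the start sentinel and the w end sentinels
```

On a repetitive input such as `"GATTACAT" * 50 + "GATTAGAT" * 50`, the same phrases recur many times, so the dictionary stores each distinct phrase once while the parse records only its rank; `unparse(D, P, w)` recovers the input, which shows that D and P together lose no information.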


Updated: 2019-11-01
