Prefix-free parsing for building big BWTs.
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2019-05-24, DOI: 10.1186/s13015-019-0148-5
Christina Boucher 1 , Travis Gagie 2, 3 , Alan Kuhnle 1, 4 , Ben Langmead 5 , Giovanni Manzini 6, 7 , Taher Mun 5

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T, with the property that the BWT of T can be constructed from D and P in O(|T|) time using workspace proportional to their total size. Our experiments show that D and P are significantly smaller than T in practice and can therefore fit in a reasonable amount of internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73-GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
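The abstract does not spell out how the dictionary D and the parse P are produced, so the following is only a minimal sketch of the parsing pass under stated assumptions. It assumes a sliding window of length `w` and treats a window as a phrase boundary when its hash is 0 modulo `modulus`. It also assumes that consecutive phrases overlap by `w` characters, and that the sentinel characters `\x01` and `\x02` do not occur in the text. The names `window_hash`, `prefix_free_parse`, `reconstruct`, `w`, and `modulus` are hypothetical, and the step that builds the BWT from D and P is not shown.

```python
from typing import List, Tuple


def window_hash(s: str) -> int:
    """Deterministic polynomial hash of a short window (illustrative only)."""
    h = 0
    for ch in s:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text: str, w: int = 4, modulus: int = 11) -> Tuple[List[str], List[int]]:
    """Split `text` into overlapping phrases in one left-to-right pass.

    Assumptions of this sketch (not taken from the abstract):
    - '\x01' and '\x02' do not occur in `text`;
    - a window of length `w` ends a phrase when its hash is 0 mod `modulus`;
    - consecutive phrases overlap by exactly `w` characters.
    Returns (D, P): the sorted distinct phrases and the rank of each phrase.
    """
    if w < 1 or modulus < 1:
        raise ValueError("w and modulus must be positive")
    padded = "\x01" + text + "\x02" * w
    n = len(padded)
    phrases: List[str] = []
    start = 0
    for i in range(1, n - w + 1):  # i is the start of the current window
        window = padded[i:i + w]
        is_last = i == n - w  # the final window always closes a phrase
        if is_last or window_hash(window) % modulus == 0:
            phrases.append(padded[start:i + w])
            start = i  # the next phrase begins with this window
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary: List[str], parse: List[int], w: int) -> str:
    """Rebuild the original text from D and P by undoing the w-character overlap."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out[1:-w]  # drop the leading and trailing sentinels


if __name__ == "__main__":
    text = "GATTACAT" * 50 + "GATTAGATA"
    D, P = prefix_free_parse(text, w=4, modulus=7)
    assert reconstruct(D, P, 4) == text
    print(len(text), sum(len(d) for d in D), len(P))
```

This sketch recomputes the hash of every window for clarity. A practical implementation would more plausibly use a rolling hash so that the pass stays linear in |T|. On highly repetitive input the same phrases recur, which is why D and P together can be much smaller than T.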

Updated: 2019-11-01