Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space
Journal of the ACM (IF 2.5), Pub Date: 2020-01-16, DOI: 10.1145/3375890
Travis Gagie, Gonzalo Navarro, Nicola Prezza

Indexing highly repetitive texts (such as genomic databases, software repositories and versioned text collections) has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w / log σ) factor, where σ is the alphabet size and w = Ω(log n) is the RAM machine size in bits, we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log(n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1 to 2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1 to 2 orders of magnitude in space and/or 2 to 3 in time.
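
To make the quantities in the abstract concrete, here is a minimal Python sketch of what r measures and of what "counting" means on a BWT. It is not the index described in the article: the BWT is built by naive suffix sorting, nothing is run-length compressed, and the rank queries are plain string scans. The function names (`bwt`, `count_runs`, `count_occurrences`) and the `$` sentinel are illustrative assumptions, not taken from the paper.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via naive suffix sorting (illustration only).

    Assumes the sentinel does not occur in the text and sorts before
    every text character (true for '$' versus ASCII letters).
    """
    assert sentinel not in text
    s = text + sentinel
    # Suffix array: start positions of all suffixes in lexicographic order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # Each BWT symbol is the character preceding the sorted suffix
    # (index -1 wraps around to the sentinel for the suffix starting at 0).
    return "".join(s[i - 1] for i in sa)


def count_runs(b: str) -> int:
    """r: the number of maximal runs of equal symbols in the BWT."""
    return sum(1 for _ in groupby(b))


def count_occurrences(b: str, pattern: str) -> int:
    """Count pattern occurrences by backward search over the BWT.

    Rank queries are done with str.count, so this is far slower than
    the bounds quoted above; only the search logic is the same idea.
    """
    # C[c] = number of symbols in the BWT strictly smaller than c.
    C = {}
    total = 0
    for c in sorted(set(b)):
        C[c] = total
        total += b.count(c)
    lo, hi = 0, len(b)  # half-open range of suffix array rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + b[:lo].count(c)
        hi = C[c] + b[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


b = bwt("abracadabra")
print(b)                             # ard$rcaaaabb
print(count_runs(b))                 # 8
print(count_occurrences(b, "abra"))  # 2
```

The backward search narrows a range of suffix array rows one pattern symbol at a time, which is why counting needs only the BWT, whereas locating needs suffix array values for the rows in the final range; that second step is the part the article fits into O(r) space.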

Updated: 2020-01-16