Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space, Journal of the ACM


Journal of the ACM (IF 2.5), Pub Date: 2020-01-16, DOI: 10.1145/3375890
Travis Gagie, Gonzalo Navarro, Nicola Prezza


Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w / log σ) factor, where σ is the alphabet size and w = Ω(log n) is the RAM machine word size in bits, we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log(n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1-2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1-2 orders of magnitude in space and/or 2-3 in time.
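To make the measure r and the count operation concrete, here is a minimal Python sketch. It is not the authors' index: it builds the BWT by naively sorting rotations, counts its runs, and counts pattern occurrences with textbook FM-index backward search in which rank queries are answered by scanning. The function names (bwt, count_runs, count_occurrences) and the "$" sentinel are assumptions made for this illustration, and the sketch achieves none of the space or time bounds stated in the abstract.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform by naive rotation sorting (illustration only).

    Assumes the sentinel does not occur in the text and sorts before
    every text character.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(bwt_string: str) -> int:
    """Return r, the number of maximal runs of equal characters in the BWT."""
    return sum(1 for _ in groupby(bwt_string))


def count_occurrences(bwt_string: str, pattern: str) -> int:
    """Count pattern occurrences with backward search over the BWT.

    Rank queries are answered by scanning a prefix of the BWT, so this
    is far slower than a real (run-length) FM-index.
    """
    # first_row[c] = number of characters in the text strictly smaller than c
    frequency = {}
    for ch in bwt_string:
        frequency[ch] = frequency.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(frequency):
        first_row[ch] = total
        total += frequency[ch]

    lo, hi = 0, len(bwt_string)  # half-open range of sorted-suffix rows
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + bwt_string[:lo].count(ch)
        hi = first_row[ch] + bwt_string[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                            # annb$aa
    print(count_runs(transformed))                # 5
    print(count_occurrences(transformed, "ana"))  # 2
    # A highly repetitive text: 21 BWT symbols but only 3 runs.
    print(count_runs(bwt("ab" * 10)))             # 3
```

The last line shows why r is a useful measure for repetitive texts: repeating "ab" ten times gives a BWT of 21 symbols that collapses into just 3 runs, so a structure whose size depends on r rather than n stays small.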


Updated: 2020-01-16
