Optimal-Time Dictionary-Compressed Indexes
ACM Transactions on Algorithms (IF 0.9), Pub Date: 2020-12-31, DOI: 10.1145/3426473
Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result, we combine several recent findings, including string attractors (new combinatorial objects encompassing most known compressibility measures for highly repetitive texts) and grammars based on locally consistent parsing. In more detail, let $\gamma$ be the size of the smallest attractor for a text $T$ of length $n$. The measure $\gamma$ is an (asymptotic) lower bound on the size of dictionary compressors based on Lempel–Ziv, context-free grammars, and many others. The smallest known text representations in terms of attractors use space $O(\gamma \log(n/\gamma))$, and our lightest indexes work within the same asymptotic space. Let $\epsilon > 0$ be a suitably small constant fixed at construction time, $m$ be the pattern length, and $occ$ be the number of its text occurrences. Our index counts pattern occurrences in $O(m + \log^{2+\epsilon} n)$ time and locates them in $O(m + (occ + 1)\log^{\epsilon} n)$ time. These times already outperform those of most dictionary-compressed indexes, while obtaining the least asymptotic space for any index searching within $O((m + occ)\,\mathrm{polylog}\, n)$ time. Further, by increasing the space to $O(\gamma \log(n/\gamma) \log^{\epsilon} n)$, we reduce the locating time to the optimal $O(m + occ)$, and within $O(\gamma \log(n/\gamma) \log n)$ space we can also count in optimal $O(m)$ time. No dictionary-compressed index had obtained this time before. All our indexes can be constructed in $O(n)$ space and $O(n \log n)$ expected time. As a by-product of independent interest, we show how to build, in $O(n)$ expected time and without knowing the size $\gamma$ of the smallest attractor (which is NP-hard to find), a run-length context-free grammar of size $O(\gamma \log(n/\gamma))$ generating (only) $T$. As a result, our indexes can be built without knowing $\gamma$.
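The abstract names string attractors without defining them. As a rough illustration, the sketch below assumes the standard definition from the string-attractor literature: a set of text positions is an attractor if every distinct substring of the text has at least one occurrence that spans a position in the set. The function names `is_attractor` and `smallest_attractor_size` are hypothetical and do not come from the paper. The search is brute force and exponential in the text length, which is consistent with the abstract's remark that the smallest attractor is NP-hard to find, so it is only usable on toy inputs and is not the paper's construction.

```python
from itertools import combinations
from typing import Set


def is_attractor(text: str, positions: Set[int]) -> bool:
    """Check whether `positions` (0-based) is a string attractor of `text`.

    Assumed definition: every distinct substring must have at least one
    occurrence text[start:start+length] with start <= p < start+length
    for some p in `positions`.
    """
    n = len(text)
    for length in range(1, n + 1):
        all_subs = set()   # distinct substrings of this length
        covered = set()    # those with an occurrence spanning an attractor position
        for start in range(n - length + 1):
            sub = text[start:start + length]
            all_subs.add(sub)
            if any(start <= p < start + length for p in positions):
                covered.add(sub)
        if covered != all_subs:
            return False
    return True


def smallest_attractor_size(text: str) -> int:
    """Brute-force the size of a smallest attractor (toy inputs only)."""
    n = len(text)
    for k in range(n + 1):
        for candidate in combinations(range(n), k):
            if is_attractor(text, set(candidate)):
                return k
    return n


if __name__ == "__main__":
    # A highly repetitive text: two positions suffice regardless of how
    # many times "ab" is repeated.
    print(is_attractor("abababab", {3, 4}))
    print(smallest_attractor_size("abababab"))
```

On a text with no repetition, such as one made of pairwise distinct characters, every position must belong to the attractor, so the smallest attractor is as large as the text itself; the measure only becomes small when the text is repetitive.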

Updated: 2020-12-31
