I/O-efficient data structures for non-overlapping indexing, Theoretical Computer Science



I/O-efficient data structures for non-overlapping indexing
Theoretical Computer Science (IF 0.9), Pub Date: 2020-12-10, DOI: 10.1016/j.tcs.2020.12.006
Sahar Hooshmand, Paniz Abedin, M. Oğuzhan Külekci, Sharma V. Thankachan

The non-overlapping indexing problem is defined as follows: pre-process a given text $T[1, n]$ of length $n$ into a data structure such that whenever a pattern $P[1, m]$ comes as an input, we can efficiently report the largest set of non-overlapping occurrences of $P$ in $T$. The best-known solution is by Cohen and Porat [ISAAC 2009]. The size of their structure is $O(n)$ words and the query time is optimal $O(m + nocc)$, where $nocc$ is the output size. Later, Ganguly et al. [CPM 2015 and Algorithmica 2020] proposed a compressed space solution. We study this problem in the cache-oblivious model and present a new data structure of size $O(n \log n)$ words. It can answer queries in optimal $O(\frac{m}{B} + \log_{B} n + \frac{nocc}{B})$ I/O operations, where $B$ is the block size. The space can be improved to $O(n \log_{M/B} n)$ in the cache-aware model, where $M$ is the size of main memory. Additionally, we study a generalization of this problem with an additional range $[s, e]$ constraint. Here the task is to report the largest set of non-overlapping occurrences of $P$ in $T$ that are within the range $[s, e]$. We present an $O(n \log^{2} n)$ space data structure in the cache-aware model that can answer queries in optimal $O(\frac{m}{B} + \log_{B} n + \frac{nocc_{[s, e]}}{B})$ I/O operations, where $nocc_{[s, e]}$ is the output size.
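To make the query itself concrete, the following is a minimal Python sketch of what is being reported, not of the paper's data structure. It scans the text naively, so it has none of the I/O guarantees stated above. It relies on a standard fact: because all occurrences of $P$ have the same length $m$, greedily taking the leftmost occurrence that does not overlap the previously chosen one yields a largest non-overlapping set. The function names, the 0-based indexing (the abstract uses 1-based), and the reading of "within the range $[s, e]$" as "the whole occurrence lies inside the range" are assumptions made for illustration.

```python
def find_occurrences(text, pattern):
    """Return all 0-based start positions of pattern in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are also found.
        start = text.find(pattern, start + 1)
    return positions


def non_overlapping_occurrences(text, pattern, s=None, e=None):
    """Report a largest set of non-overlapping occurrences of pattern in text.

    Hypothetical helper: s and e are optional 0-based inclusive bounds.
    Assumption: an occurrence counts only if it lies entirely in [s, e].
    """
    m = len(pattern)
    if m == 0:
        return []
    lo = 0 if s is None else s
    hi = len(text) - 1 if e is None else e

    chosen = []
    next_free = lo  # earliest position at which the next occurrence may start
    for pos in find_occurrences(text, pattern):
        if pos >= next_free and pos + m - 1 <= hi:
            chosen.append(pos)
            next_free = pos + m  # skip everything overlapping this occurrence
    return chosen


if __name__ == "__main__":
    print(non_overlapping_occurrences("abababa", "aba"))
    print(non_overlapping_occurrences("aaaaa", "aa", s=1, e=4))
```

In this sketch the unconstrained query is the special case where the range covers the whole text. An index such as the ones described in the abstract avoids the linear scan entirely and pays only for the pattern length and the output size.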


Updated: 2021-01-22
