Constructing Antidictionaries of Long Texts in Output-Sensitive Space, Theory of Computing Systems



Constructing Antidictionaries of Long Texts in Output-Sensitive Space
Theory of Computing Systems (IF 0.6), Pub Date: 2020-12-14, DOI: 10.1007/s00224-020-10018-5
Lorraine A.K. Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Héliou, Solon P. Pissis

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words \(y_1,\ldots ,y_k\) over an alphabet Σ, we are asked to compute the set \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}\) of minimal absent words of length at most ℓ of the collection \(\{y_1,\ldots ,y_k\}\). The set \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}\) contains all the words x such that x is absent from all the words of the collection while there exist i, j such that the maximal proper suffix of x is a factor of \(y_i\) and the maximal proper prefix of x is a factor of \(y_j\). In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set \(\mathrm {M}^{\ell }_{y}\) of minimal absent words of length at most ℓ of a word y is equal to \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}\) for any decomposition of y into a collection of words \(y_1,\ldots ,y_k\) such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the many available \(\mathcal {O}(n)\)-time algorithms, because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when \(\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| =o(n)\), for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, n ≈ 3 × 10⁹ but \(\| \mathrm {M}^{12}_{\{y_1,\ldots ,y_k\}}\| \approx 10^{6}\). We consider a constant-sized alphabet for stating our results.
We show that all \(\mathrm {M}^{\ell }_{y_{1}},\ldots ,\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}\) can be computed in \(\mathcal {O}(kn+{\sum }^{k}_{N=1}\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )\) total time using \(\mathcal {O}(\textsc {MaxIn}+\textsc {MaxOut})\) space, where MaxIn is the length of the longest word in \(\{y_1,\ldots ,y_k\}\) and \(\textsc {MaxOut}=\max \limits \{\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| :N\in [1,k]\}\). Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
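
To make the definition concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it stores every factor of length at most ℓ in a set, so it uses far more than output-sensitive space and is only suitable for tiny inputs. The function names `factors_up_to`, `minimal_absent_words`, and `split_with_overlap` are hypothetical and chosen for illustration only.

```python
def factors_up_to(words, max_len):
    """Return the set of all factors (substrings) of length 0..max_len
    that occur in at least one word of the collection."""
    facs = {""}  # the empty word is a factor of every word
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                facs.add(w[i:j])
    return facs


def minimal_absent_words(words, max_len, alphabet):
    """Brute-force set of minimal absent words of length <= max_len.

    A word x = u + a qualifies when x occurs in no word of the collection,
    while its maximal proper prefix u and its maximal proper suffix x[1:]
    each occur in some word of the collection.
    """
    facs = factors_up_to(words, max_len)
    maws = set()
    for u in facs:  # u is the maximal proper prefix and occurs by construction
        if len(u) >= max_len:
            continue  # extending u would exceed the length bound
        for a in alphabet:
            x = u + a
            if x not in facs and x[1:] in facs:
                maws.add(x)
    return maws


def split_with_overlap(y, piece_len, overlap):
    """Hypothetical helper: cut y into pieces of length piece_len in which
    consecutive pieces share `overlap` letters (requires piece_len > overlap)."""
    step = piece_len - overlap
    pieces = []
    start = 0
    while True:
        pieces.append(y[start:start + piece_len])
        if start + piece_len >= len(y):
            break
        start += step
    return pieces
```

For y = abaab over the alphabet {a, b}, `minimal_absent_words(["abaab"], 5, "ab")` yields the set {bb, aaa, bab, aaba}. With ℓ = 3, splitting y into pieces that overlap by ℓ − 1 = 2 letters via `split_with_overlap("abaab", 3, 2)` gives the same set {bb, aaa, bab} as running on y itself, which illustrates the decomposition property stated in the abstract.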


Updated: 2020-12-14
