Constructing Antidictionaries of Long Texts in Output-Sensitive Space
Theory of Computing Systems (IF 0.6), Pub Date: 2020-12-14, DOI: 10.1007/s00224-020-10018-5
Lorraine A.K. Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Héliou, Solon P. Pissis

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words \(y_1,\ldots,y_k\) over an alphabet Σ, we are asked to compute the set \(\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}\) of minimal absent words of length at most ℓ of the collection \(\{y_1,\ldots,y_k\}\). The set \(\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}\) contains all the words x of length at most ℓ such that x is absent from all the words of the collection while there exist i, j such that the maximal proper suffix of x is a factor of \(y_i\) and the maximal proper prefix of x is a factor of \(y_j\). In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set \(\mathrm{M}^{\ell}_{y}\) of minimal absent words of a word y is equal to \(\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}\) for any decomposition of y into a collection of words \(y_1,\ldots,y_k\) such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the many available \(\mathcal{O}(n)\)-time algorithms. This is because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when \(\|\mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}}\| = o(n)\) for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, \(n \approx 3 \times 10^{9}\) but \(\|\mathrm{M}^{12}_{\{y_1,\ldots,y_k\}}\| \approx 10^{6}\). We consider a constant-sized alphabet for stating our results.
We show that all \(\mathrm {M}^{\ell }_{y_{1}},\ldots ,\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}\) can be computed in \(\mathcal {O}(kn+{\sum }^{k}_{N=1}\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )\) total time using \(\mathcal {O}(\textsc {MaxIn}+\textsc {MaxOut})\) space, where MaxIn is the length of the longest word in {y1, … , yk} and \(\textsc {MaxOut}=\max \limits \{\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| :N\in [1,k]\}\). Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
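
To make the definition concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it stores every factor of length at most ℓ, so it makes no attempt at output-sensitive space or linear time. The function names, the explicit `alphabet` argument, and the example word `abaab` are assumptions chosen for illustration only.

```python
def factors_up_to(words, max_len):
    """Collect every nonempty factor of length <= max_len from the collection."""
    seen = set()
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                seen.add(w[i:j])
    return seen


def minimal_absent_words(words, max_len, alphabet):
    """Brute-force minimal absent words of length <= max_len.

    A word x qualifies when x occurs in no word of the collection, while
    its maximal proper prefix x[:-1] and maximal proper suffix x[1:] each
    occur in some word of the collection.
    """
    factors = factors_up_to(words, max_len)
    factors.add("")  # the empty word is a factor of every word
    result = set()
    for prefix in factors:  # candidate maximal proper prefix of x
        if len(prefix) >= max_len:
            continue  # extending it would exceed the length bound
        for letter in alphabet:
            x = prefix + letter
            if x not in factors and x[1:] in factors:
                result.add(x)
    return result


if __name__ == "__main__":
    # Whole word versus a decomposition with overlap "aa" (length l - 1 = 2).
    print(sorted(minimal_absent_words(["abaab"], 3, "ab")))
    print(sorted(minimal_absent_words(["abaa", "aab"], 3, "ab")))
```

In this small example the two calls return the same set, which illustrates the decomposition property stated in the abstract: splitting y into pieces that overlap by at least ℓ − 1 letters preserves every factor of length at most ℓ, and hence the minimal absent words of length at most ℓ.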




Updated: 2020-12-14