On the Approximation Ratio of Ordered Parsings, IEEE Transactions on Information Theory



On the Approximation Ratio of Ordered Parsings
IEEE Transactions on Information Theory ( IF 2.5 ) Pub Date : 2021-02-01 , DOI: 10.1109/tit.2020.3042746
Gonzalo Navarro , Carlos Ochoa , Nicola Prezza

Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is $b$, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing $b$ is NP-complete, a popular gold standard is $z$, the number of phrases in the Lempel-Ziv parse of the text, which is computed in linear time and yields the least number of phrases when those can be copied only from the left. Almost nothing has been known for decades about the approximation ratio of $z$ with respect to $b$. In this paper we prove that $z = O(b\log (n/b))$, where $n$ is the text length. We also show that the bound is tight as a function of $n$, by exhibiting a text family where $z = \Omega (b\log n)$. Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating $b$ with $r$, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We continue by observing that Lempel-Ziv is just one particular case of greedy parses (meaning that it obtains the smallest parse by scanning the text and maximizing the phrase length at each step) and of ordered parses (meaning that phrases are larger than their sources under some order). As a new example of ordered greedy parses, we introduce lexicographical parses, where phrases can only be copied from lexicographically smaller text locations. We prove that the size $v$ of the optimal lexicographical parse is also obtained greedily in $O(n)$ time, that $v = O(b\log (n/b))$, and that there exists a text family where $v = \Omega (b\log n)$. Interestingly, we also show that $v = O(r)$ because $r$ also induces a lexicographical parse, whereas $z = \Omega (r\log n)$ holds on some text families. We obtain some results on parsing complexity and size that hold on some general classes of greedy ordered parses. Along the way, we also prove other relevant bounds between compressibility measures, especially with those related to smallest grammars of various types generating (only) the text.
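To make the three measures $z$, $v$ and $r$ concrete, here is a minimal Python sketch that computes them naively on short strings. It is an illustration only, not the paper's linear-time algorithms. The function names (`lz_parse`, `lex_parse`, `bwt_runs`) are hypothetical, and several details are assumptions not stated in the abstract: a phrase with no usable source is emitted as a single explicit character, sources may overlap their phrases, the lexicographical parse copies each phrase from the suffix that immediately precedes the current one in lexicographic order, and the Burrows-Wheeler transform is taken on the text followed by a sentinel smaller than every text character.

```python
def lz_parse(text):
    """Greedy left-to-right parse (assumed Lempel-Ziv variant): each phrase is
    the longest prefix of text[i:] that also starts at an earlier position
    (overlap allowed), or one explicit character if there is none."""
    phrases, i, n = [], 0, len(text)
    while i < n:
        best = 0
        for j in range(i):  # candidate source start, strictly left of i
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # explicit character when nothing can be copied
        phrases.append(text[i:i + step])
        i += step
    return phrases


def lex_parse(text):
    """Greedy lexicographical parse (assumed form): the phrase at i is copied
    from the suffix immediately preceding text[i:] in lexicographic order."""
    n = len(text)
    # Naive suffix sorting; fine for a sketch, far from linear time.
    sa = sorted(range(n), key=lambda k: text[k:])
    rank = {pos: r for r, pos in enumerate(sa)}
    phrases, i = [], 0
    while i < n:
        length = 0
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # lexicographically smaller source
            while i + length < n and j + length < n and text[i + length] == text[j + length]:
                length += 1
        step = max(length, 1)
        phrases.append(text[i:i + step])
        i += step
    return phrases


def bwt_runs(text, sentinel="\0"):
    """Number of equal-letter runs in the BWT of text + sentinel.
    The sentinel is assumed to be smaller than every character of text."""
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda k: s[k:])
    bwt = "".join(s[k - 1] for k in sa)  # k == 0 wraps around to the sentinel
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)


if __name__ == "__main__":
    for t in ("banana", "abababab"):
        print(t, len(lz_parse(t)), len(lex_parse(t)), bwt_runs(t))
```

On such tiny inputs the three counts are close to each other; the logarithmic gaps in the abstract concern specific text families and only show up as $n$ grows.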
