Grammar Compression By Induced Suffix Sorting
arXiv - CS - Data Structures and Algorithms Pub Date : 2020-11-25 , DOI: arxiv-2011.12898
Daniel S. N. Nunes, Felipe A. Louza, Simon Gog, Mauricio Ayala-Rincón, Gonzalo Navarro

This work introduces GCIS, a grammar compression algorithm based on the induced suffix sorting algorithm SAIS, presented by Nong et al. in 2009. The proposed solution builds on the factorization performed by SAIS during suffix sorting: a context-free grammar replaces factors with non-terminals, and the algorithm is then applied recursively to the shorter sequence of non-terminals. The resulting grammar is encoded by exploiting some redundancies, such as common prefixes between the right-hand sides of rules, which are sorted according to SAIS. GCIS stands out for the low space and time it requires for compression, while obtaining competitive compression ratios. Our experiments on regular and repetitive, moderate and very large texts show that GCIS is a very convenient choice compared to well-known compressors such as Gzip, 7-Zip, and RePair, the gold standard in grammar compression. In exchange, GCIS is slow at decompressing. Yet grammar compressors are more convenient than Lempel-Ziv compressors in that one can access text substrings directly in compressed form, without ever decompressing the text. We demonstrate that GCIS is an excellent candidate for this scenario because it is competitive with its RePair-based alternatives. We also show how the relation of GCIS to SAIS makes it a good intermediate structure for building the suffix array and the LCP array during decompression of the text.
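
The pipeline described in the abstract (factorize, replace factors with non-terminals, recurse, then exploit shared prefixes among sorted rules) can be illustrated with a small Python sketch. It is not the authors' implementation and rests on several assumptions. The text is cut at LMS (leftmost S-type) positions in the SAIS sense, with the last symbol treated as L-type as if a sentinel followed it. The prefix before the first LMS position is handled as an ordinary factor. Rules are sorted with a plain comparison sort rather than by induced sorting. The `front_code` step only shows the idea of sharing prefixes and is not the paper's actual encoding. All function names are hypothetical.

```python
def lms_positions(s):
    """Start positions of LMS suffixes of s (a list of ints).

    Assumption: the last symbol is L-type, as if a smallest sentinel followed.
    """
    n = len(s)
    is_s = [False] * n  # is_s[i]: suffix i is S-type (smaller than suffix i+1)
    for i in range(n - 2, -1, -1):
        if s[i] < s[i + 1]:
            is_s[i] = True
        elif s[i] == s[i + 1]:
            is_s[i] = is_s[i + 1]
    # An LMS position is an S-type position preceded by an L-type position.
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]


def compress_level(s):
    """Cut s at LMS positions and replace each distinct factor by a non-terminal."""
    cuts = [0] + lms_positions(s) + [len(s)]
    factors = [tuple(s[a:b]) for a, b in zip(cuts, cuts[1:])]
    rules = sorted(set(factors))  # simplification: comparison sort
    name = {f: i for i, f in enumerate(rules)}
    return rules, [name[f] for f in factors]


def gcis_compress(text):
    """Return (levels, start): one rule list per level and the final sequence."""
    seq = [ord(c) for c in text]
    levels = []
    while True:
        rules, reduced = compress_level(seq)
        if len(reduced) >= len(seq):  # no further shrinkage: stop recursing
            break
        levels.append(rules)
        seq = reduced
    return levels, seq


def gcis_decompress(levels, seq):
    """Expand non-terminals level by level, from the last level to the first."""
    for rules in reversed(levels):
        seq = [sym for nt in seq for sym in rules[nt]]
    return "".join(chr(c) for c in seq)


def front_code(rules):
    """Store each sorted rule as (length of prefix shared with previous rule, rest)."""
    out, prev = [], ()
    for r in rules:
        k = 0
        while k < min(len(prev), len(r)) and prev[k] == r[k]:
            k += 1
        out.append((k, r[k:]))
        prev = r
    return out


levels, start = gcis_compress("abracadabra abracadabra")
assert gcis_decompress(levels, start) == "abracadabra abracadabra"
```

Because consecutive LMS positions are at least two symbols apart, each level in this sketch is strictly shorter than the previous one for inputs of length two or more, so the loop terminates. Decompression has to expand every level in turn, which fits the abstract's remark that GCIS trades decompression speed for cheap compression.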

Updated: 2020-11-27