Faster & strong: string dictionary compression using sampling and fast vectorized decompression, The VLDB Journal


The VLDB Journal (IF 4.2), Pub Date: 2020-07-20, DOI: 10.1007/s00778-020-00620-x
Robert Lasch, Ismail Oukid, Roman Dementiev, Norman May, Suleyman S. Demirsoy, Kai-Uwe Sattler

String dictionaries constitute a large portion of the memory footprint of database applications. While strong string dictionary compression algorithms exist, these come with impractical access and compression times. Therefore, lightweight algorithms such as front coding (PFC) are favored in practice. This paper endeavors to make strong string dictionary compression practical. We focus on Re-Pair Front Coding (RPFC), a grammar-based compression algorithm, since it consistently offers better compression ratios than other algorithms in the literature. To accelerate compression times, we propose block-based RPFC (BRPFC) which consists in independently compressing small blocks of the dictionary. For further accelerated compression times especially on large string dictionaries, we also propose an alternative version of BRPFC that uses sampling to speed up compression. Moreover, to accelerate access times, we devise a vectorized access method, using Intel® Advanced Vector Extensions 512 (Intel® AVX-512). Our experimental evaluation shows that sampled BRPFC offers compression times up to 190× faster than RPFC, and random string lookups 2.3× faster than RPFC on average. These results move our modified RPFC into a practical range for use in database systems because the overhead of Re-Pair-based compression for access times can be reduced by 2×.
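For readers unfamiliar with the lightweight baseline the abstract mentions, the following is a minimal sketch of bucketed front coding over a sorted string dictionary. It is not the paper's implementation: the function names, the tuple-based bucket layout, and the bucket size of 4 are assumptions made for illustration, and the Re-Pair grammar step, the sampling, and the AVX-512 access path are all omitted.

```python
from typing import List, Tuple

# Hypothetical layout: each bucket stores its first string in full (the header),
# followed by (shared_prefix_length, remaining_suffix) pairs for the rest.
Bucket = Tuple[str, List[Tuple[int, str]]]


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pfc_encode(sorted_strings: List[str], bucket_size: int = 4) -> List[Bucket]:
    """Front-code a lexicographically sorted list of strings, bucket by bucket."""
    buckets: List[Bucket] = []
    for start in range(0, len(sorted_strings), bucket_size):
        block = sorted_strings[start:start + bucket_size]
        header = block[0]
        entries: List[Tuple[int, str]] = []
        prev = header
        for s in block[1:]:
            k = common_prefix_len(prev, s)
            entries.append((k, s[k:]))  # keep only what differs from the predecessor
            prev = s
        buckets.append((header, entries))
    return buckets


def pfc_lookup(buckets: List[Bucket], index: int, bucket_size: int = 4) -> str:
    """Return the string at position `index` by decoding within a single bucket."""
    header, entries = buckets[index // bucket_size]
    current = header
    for k, suffix in entries[: index % bucket_size]:
        current = current[:k] + suffix
    return current
```

A lookup decodes at most one bucket sequentially, which is why access cost grows with bucket size. Because each bucket is self-contained, buckets can be processed independently, which is the same property the block-based variant described in the abstract relies on when it compresses small blocks separately.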

Updated: 2020-07-20
