Revisiting Data Compression in Column-Stores
arXiv - CS - Performance Pub Date : 2021-05-19 , DOI: arxiv-2105.09058
Alexander Slesarev, Evgeniy Klyuchikov, Kirill Smirnov, George Chernishev

Data compression is widely used in contemporary column-oriented DBMSes to lower space usage and to speed up query processing. Pioneering systems introduced compression to tackle the disk bandwidth bottleneck by trading CPU processing power for it. The main issue here is the trade-off between the compression ratio and the CPU cost of decompression. Existing results state that light-weight compression with small decompression costs outperforms heavy-weight compression schemes in column-stores. However, since these results were obtained, CPU, RAM, and disk performance have advanced considerably. Moreover, novel compression algorithms have emerged. In this paper, we revisit the problem of compression in disk-based column-stores. More precisely, we study the I/O-RAM compression scheme, which implies two types of pages of different sizes: disk pages (compressed) and in-memory pages (uncompressed). In this scheme, the buffer manager is responsible for decompressing pages as soon as they arrive from disk. The scheme is rather popular because it is easy to implement: several modern column-stores and row-stores use it. We pose and address the following research questions: 1) Are heavy-weight compression schemes still inappropriate for disk-based column-stores? 2) Are new light-weight compression algorithms better than the old ones? 3) Is there a need for SIMD-employing decompression algorithms in a disk-based system? We study these questions experimentally using a columnar query engine and the Star Schema Benchmark.
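
The I/O-RAM scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' query engine: the names `ToyDisk`, `BufferManager`, and `PAGE_SIZE`, the 4096-byte page size, the LRU eviction policy, and the use of Python's standard `zlib` module as a stand-in codec are all assumptions made for illustration. The only point is that compressed pages live on "disk" and are decompressed by the buffer manager as soon as they are read, so everything above the buffer manager sees uncompressed pages.

```python
import zlib

PAGE_SIZE = 4096  # size of an uncompressed in-memory page (assumed value)


class ToyDisk:
    """Hypothetical disk: stores variable-size compressed pages by page id."""

    def __init__(self):
        self._pages = {}
        self.bytes_read = 0  # compressed bytes transferred from "disk"

    def write_page(self, page_id, raw):
        if len(raw) != PAGE_SIZE:
            raise ValueError("in-memory pages must be exactly PAGE_SIZE bytes")
        # zlib is only a stand-in for whichever codec a real system would use.
        self._pages[page_id] = zlib.compress(raw)

    def read_page(self, page_id):
        blob = self._pages[page_id]
        self.bytes_read += len(blob)
        return blob


class BufferManager:
    """Hypothetical buffer manager: caches uncompressed pages, LRU eviction."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self._frames = {}  # dict keeps insertion order; oldest entry is first

    def get_page(self, page_id):
        if page_id in self._frames:
            # Cache hit: re-insert to mark the page as most recently used.
            page = self._frames.pop(page_id)
            self._frames[page_id] = page
            return page
        if len(self._frames) >= self.capacity:
            # Evict the least recently used page.
            del self._frames[next(iter(self._frames))]
        # Cache miss: decompress the page as soon as it arrives from disk.
        page = zlib.decompress(self.disk.read_page(page_id))
        self._frames[page_id] = page
        return page
```

In this sketch, a highly repetitive page costs far fewer than `PAGE_SIZE` bytes of simulated I/O, while a cache hit costs none; the CPU time spent in `zlib.decompress` on each miss is the price paid for that saving, which is the trade-off the paper studies.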

Updated: 2021-05-20
