Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation

arXiv - CS - Databases, Pub Date: 2021-09-13, DOI: arxiv-2109.05877
Yuxing Han, Ziniu Wu, Peizhi Wu, Rong Zhu, Jingyi Yang, Liang Wei Tan, Kai Zeng, Gao Cong, Yanzhao Qin, Andreas Pfadler, Zhengping Qian, Jingren Zhou, Jiangneng Li, Bin Cui

Cardinality estimation (CardEst) plays a significant role in generating high-quality query plans for a query optimizer in a DBMS. In the last decade, an increasing number of advanced CardEst methods (especially ML-based ones) have been proposed with outstanding estimation accuracy and inference latency. However, no study has systematically evaluated the quality of these methods and answered the fundamental question: to what extent can these methods improve the performance of a query optimizer in real-world settings, which is the ultimate goal of a CardEst method? In this paper, we comprehensively and systematically compare the effectiveness of CardEst methods in a real DBMS. We establish a new benchmark for CardEst, which contains a new complex real-world dataset, STATS, and a diverse query workload, STATS-CEB. We integrate several of the most representative CardEst methods into the open-source database system PostgreSQL and comprehensively evaluate their true effectiveness in improving query plan quality, along with other important aspects affecting their applicability, ranging from inference latency, model size, and training time to update efficiency and accuracy. We obtain a number of key findings for the CardEst methods under different data and query settings. Furthermore, we find that the widely used estimation accuracy metric (Q-Error) cannot distinguish the importance of different sub-plan queries during query optimization and thus cannot truly reflect the quality of the query plans generated by CardEst methods. Therefore, we propose a new metric, P-Error, to evaluate the performance of CardEst methods; it overcomes the limitation of Q-Error and is able to reflect the overall end-to-end performance of CardEst methods. We have made all of the benchmark data and evaluation code publicly available at https://github.com/Nathaniel-Han/Endto-End-CardEst-Benchmark.
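The limitation of Q-Error mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the formula, so the code below assumes the commonly used definition of Q-Error (the larger of estimate/actual and actual/estimate). The function name `q_error`, the clamping to one row, and the two sub-plan cardinalities are hypothetical choices made for illustration only and are not taken from the paper or its benchmark.

```python
def q_error(estimated: float, actual: float) -> float:
    """Factor by which an estimate is off; always >= 1 (assumed common definition)."""
    # Clamp both values to at least 1 row to avoid division by zero.
    # This clamping is an assumption of this sketch, not something the abstract states.
    est = max(estimated, 1.0)
    act = max(actual, 1.0)
    return max(est / act, act / est)


# Two hypothetical sub-plan queries, both underestimated by a factor of 10.
small_subplan = q_error(estimated=10, actual=100)
large_subplan = q_error(estimated=1_000_000, actual=10_000_000)

# Both sub-plans receive the same score, although a misestimate on the
# ten-million-row intermediate result is far more likely to change the
# join order or join method the optimizer picks.
print(small_subplan, large_subplan)
```

Because the metric is a pure ratio, it carries no information about how much a given sub-plan matters to the final plan choice, which is the gap the abstract says P-Error is meant to address.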

Updated: 2021-09-14
