Flow-Loss: Learning Cardinality Estimates That Matter
arXiv - CS - Databases Pub Date : 2021-01-13 , DOI: arxiv-2101.04964 Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh
Previous approaches to learned cardinality estimation have focused on
improving average estimation error, but not all estimates matter equally. Since
learned models inevitably make mistakes, the goal should be to improve the
estimates that make the biggest difference to an optimizer. We introduce a new
loss function, Flow-Loss, that explicitly optimizes for better query plans by
approximating the optimizer's cost model and dynamic programming search
algorithm with analytical functions. At the heart of Flow-Loss is a reduction
of query optimization to a flow routing problem on a certain plan graph in
which paths correspond to different query plans. To evaluate our approach, we
introduce the Cardinality Estimation Benchmark, which contains the ground truth
cardinalities for sub-plans of over 16K queries from 21 templates with up to 15
joins. We show that across different architectures and databases, a model
trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost
model) and query runtimes despite having worse estimation accuracy than a model
trained with Q-Error. When the test set queries closely match the training
queries, both models improve performance significantly over PostgreSQL and are
close to the optimal performance (using true cardinalities). However, the
Q-Error trained model degrades significantly when evaluated on queries that are
slightly different (e.g., similar but not identical query templates), while the
Flow-Loss trained model generalizes better to such situations. For example, the
Flow-Loss model achieves up to 1.5x better runtimes on unseen templates
compared to the Q-Error model, despite leveraging the same model architecture
and training data.
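The plan-graph reduction described above can be illustrated with a toy sketch. This is not the paper's construction or its Flow-Loss relaxation; it is a minimal, self-invented example in which nodes are sets of already-joined tables, each edge joins in one more table at a cost driven by a (hypothetical) estimated cardinality of the resulting sub-plan, and a cheapest source-to-sink path corresponds to a cheapest left-deep join order. All table names and cardinality numbers are invented for illustration.

```python
# Toy sketch: query optimization as path routing on a "plan graph".
# Nodes are frozensets of joined tables; an edge adds one table and
# costs the estimated cardinality of the resulting sub-plan (a crude
# stand-in for an optimizer cost model). The cheapest path from the
# empty plan to the full join picks the join order.
import heapq

tables = ("a", "b", "c")

# Hypothetical estimated cardinalities per sub-plan (illustrative numbers).
est_card = {
    frozenset("a"): 100, frozenset("b"): 50, frozenset("c"): 200,
    frozenset("ab"): 500, frozenset("ac"): 4000, frozenset("bc"): 300,
    frozenset("abc"): 1000,
}

def edge_cost(subplan):
    # Stand-in cost model: cost of producing this sub-plan.
    return est_card[subplan]

def cheapest_plan(tables):
    """Dijkstra from the empty plan to the full join.

    Returns (total cost, join order). With a learned estimator,
    errors in est_card can reroute this search to a worse path --
    which is the plan-level effect Flow-Loss targets.
    """
    start, goal = frozenset(), frozenset(tables)
    heap = [(0, 0, start, [])]  # (cost, tie-breaker, joined set, order)
    counter = 1                 # tie-breaker: frozensets are not orderable
    best = {start: 0}
    while heap:
        cost, _, joined, order = heapq.heappop(heap)
        if joined == goal:
            return cost, order
        if cost > best.get(joined, float("inf")):
            continue  # stale heap entry
        for t in set(tables) - joined:
            nxt = joined | {t}
            ncost = cost + edge_cost(nxt)
            if ncost < best.get(nxt, float("inf")):
                best[nxt] = ncost
                heapq.heappush(heap, (ncost, counter, nxt, order + [t]))
                counter += 1
    raise ValueError("no path to the full join")

cost, order = cheapest_plan(tables)
```

The paper replaces the hard argmin over paths implied by this search with analytical (differentiable) functions, so that gradients of the resulting plan cost can flow back into the cardinality model during training.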
Updated: 2021-01-14