Flow-Loss: Learning Cardinality Estimates That Matter
arXiv - CS - Databases Pub Date : 2021-01-13 , DOI: arxiv-2101.04964 Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh
Previous approaches to learned cardinality estimation have focused on
improving average estimation error, but not all estimates matter equally. Since
learned models inevitably make mistakes, the goal should be to improve the
estimates that make the biggest difference to an optimizer. We introduce a new
loss function, Flow-Loss, that explicitly optimizes for better query plans by
approximating the optimizer's cost model and dynamic programming search
algorithm with analytical functions. At the heart of Flow-Loss is a reduction
of query optimization to a flow routing problem on a certain plan graph in
which paths correspond to different query plans. To evaluate our approach, we
introduce the Cardinality Estimation Benchmark, which contains the ground truth
cardinalities for sub-plans of over 16K queries from 21 templates with up to 15
joins. We show that across different architectures and databases, a model
trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost
model) and query runtimes despite having worse estimation accuracy than a model
trained with Q-Error. When the test set queries closely match the training
queries, both models improve performance significantly over PostgreSQL and are
close to the optimal performance (using true cardinalities). However, the
Q-Error trained model degrades significantly when evaluated on queries that are
slightly different (e.g., similar but not identical query templates), while the
Flow-Loss trained model generalizes better to such situations. For example, the
Flow-Loss model achieves up to 1.5x better runtimes on unseen templates
compared to the Q-Error model, despite leveraging the same model architecture
and training data.
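The plan-graph reduction described above can be illustrated with a toy sketch. This is not the paper's construction or its Flow-Loss relaxation; it is a minimal, self-invented example in which nodes are sets of already-joined tables, each edge joins in one more table at a cost driven by a (hypothetical) estimated cardinality of the resulting sub-plan, and a cheapest source-to-sink path corresponds to a cheapest left-deep join order. All table names and cardinality numbers are invented for illustration.

```python
# Toy sketch: query optimization as path routing on a "plan graph".
# Nodes are frozensets of joined tables; an edge adds one table and
# costs the estimated cardinality of the resulting sub-plan (a crude
# stand-in for an optimizer cost model). The cheapest path from the
# empty plan to the full join picks the join order.
import heapq

tables = ("a", "b", "c")

# Hypothetical estimated cardinalities per sub-plan (illustrative numbers).
est_card = {
    frozenset("a"): 100, frozenset("b"): 50, frozenset("c"): 200,
    frozenset("ab"): 500, frozenset("ac"): 4000, frozenset("bc"): 300,
    frozenset("abc"): 1000,
}

def edge_cost(subplan):
    # Stand-in cost model: cost of producing this sub-plan.
    return est_card[subplan]

def cheapest_plan(tables):
    """Dijkstra from the empty plan to the full join.

    Returns (total cost, join order). With a learned estimator,
    errors in est_card can reroute this search to a worse path --
    which is the plan-level effect Flow-Loss targets.
    """
    start, goal = frozenset(), frozenset(tables)
    heap = [(0, 0, start, [])]  # (cost, tie-breaker, joined set, order)
    counter = 1                 # tie-breaker: frozensets are not orderable
    best = {start: 0}
    while heap:
        cost, _, joined, order = heapq.heappop(heap)
        if joined == goal:
            return cost, order
        if cost > best.get(joined, float("inf")):
            continue  # stale heap entry
        for t in set(tables) - joined:
            nxt = joined | {t}
            ncost = cost + edge_cost(nxt)
            if ncost < best.get(nxt, float("inf")):
                best[nxt] = ncost
                heapq.heappush(heap, (ncost, counter, nxt, order + [t]))
                counter += 1
    raise ValueError("no path to the full join")

cost, order = cheapest_plan(tables)
```

The paper replaces the hard argmin over paths implied by this search with analytical (differentiable) functions, so that gradients of the resulting plan cost can flow back into the cardinality model during training.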
Updated: 2021-01-14