Query-centric regression
Information Systems (IF 3.0), Pub Date: 2021-02-14, DOI: 10.1016/j.is.2021.101736
Qingzhi Ma, Peter Triantafillou

Regression models (RMs), and machine learning (ML) models in general, aim to offer high prediction accuracy, even for unforeseen queries/datasets. This depends on their fundamental ability to generalize. However, overfitting a model to the current DB state may be best suited to offering excellent accuracy. This overfit-generalize divide has many practical implications for a data analyst. The paper will reveal, shed light on, and quantify this divide using a large number of real-world datasets and a large number of RMs. It will show that different RMs occupy different positions in this divide, which results in different RMs being better suited to answering queries on different parts of the same dataset (as queries typically target specific data subspaces defined using selection operators on attributes). It will study in detail 8 real-life datasets and data from the TPC-DS benchmark, and experiment with various dimensionalities therein. It will employ new, appropriate metrics that reveal the performance differences of RMs, and will substantiate the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art ensembles (which enjoy excellent generalization performance). It will put forth and study a new query-centric model that addresses this problem, improving per-query accuracy while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.
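As a rough illustration of the divide described in the abstract, the following sketch compares two regressors on queries that select different ranges of one attribute. It is not the paper's model or its metrics: the toy target function, the alternating noise, the two models (a global least-squares line and a 1-nearest-neighbour lookup that memorizes the stored rows), and the names `truth`, `rmse`, and `best_model` are all assumptions made up for this example.

```python
import math

def truth(x):
    # Hypothetical target: linear on [0, 5), linear plus one sine period on [5, 10).
    return x if x < 5 else x + 2 * math.sin(2 * math.pi * (x - 5) / 5)

# "Current DB state": 20 stored rows; rows with x < 5 carry alternating +/-1 noise.
train = []
for i in range(20):
    x = 0.5 * i
    noise = (1 if i % 2 == 0 else -1) if x < 5 else 0
    train.append((x, truth(x) + noise))

# Unseen rows, offset from the stored ones, labelled with the noise-free target.
unseen = [(0.5 * i + 0.2, truth(0.5 * i + 0.2)) for i in range(20)]

def fit_linear(rows):
    # Ordinary least squares for a single attribute.
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_1nn(rows):
    # Memorizes the stored rows: returns the label of the closest stored x.
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]

def rmse(model, rows, lo, hi):
    # Error restricted to the subspace selected by the predicate lo <= x < hi.
    errs = [(model(x) - y) ** 2 for x, y in rows if lo <= x < hi]
    return math.sqrt(sum(errs) / len(errs))

models = {"linear": fit_linear(train), "1nn": fit_1nn(train)}

def best_model(lo, hi):
    # Query-centric choice: the model with the lowest error on the queried range.
    return min(models, key=lambda name: rmse(models[name], unseen, lo, hi))

if __name__ == "__main__":
    for lo, hi in [(0, 5), (5, 10), (0, 10)]:
        scores = {name: round(rmse(m, unseen, lo, hi), 3) for name, m in models.items()}
        print((lo, hi), scores, "->", best_model(lo, hi))
```

In this toy setup the smoother linear model does better on the noisy range, while the memorizing model does better on the noise-free, nonlinear range, so neither model is the better choice for every range query.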




Updated: 2021-02-15