Accelerating Approximate Aggregation Queries with Expensive Predicates, arXiv - CS - Databases


arXiv - CS - Databases. Pub Date: 2021-08-13, DOI: arxiv-2108.06313
Daniel Kang, John Guibas, Peter Bailis, Tatsunori Hashimoto, Yi Sun, Matei Zaharia

Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). Because these DNNs are expensive and many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard approximate query processing techniques are not applicable because they assume the results of the predicates are available ahead of time. Furthermore, recent work using cheap approximations (i.e., proxies) does not support aggregation queries with predicates. To accelerate aggregation queries with expensive predicates, we develop and analyze ABae, a query processing algorithm that leverages proxies. ABae must account for the key challenge that it may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group records into strata so that records satisfying the predicate are ideally grouped into few strata. Given these strata, ABae uses pilot sampling and plug-in estimates to sample according to the optimal allocation. We show that ABae converges at an optimal rate in a novel analysis of stratified sampling with draws that may not satisfy the predicate. We further show that ABae outperforms baselines on six real-world datasets, reducing labeling costs by up to 2.3x.
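The two-stage procedure described in the abstract (stratify by proxy score, draw a pilot sample, then allocate the remaining budget using plug-in estimates) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `stratified_predicate_mean`, its parameters, and the record layout are hypothetical, and the allocation rule used here (budget proportional to sqrt(p_k) * sigma_k, where p_k is the estimated predicate rate and sigma_k the estimated standard deviation in stratum k) is an assumption for illustration, since the abstract does not state the allocation formula.

```python
import math
import random


def stratified_predicate_mean(records, proxy, oracle, num_strata=4,
                              budget=400, pilot_fraction=0.5, seed=0):
    """Estimate the mean value over records that satisfy an expensive predicate.

    proxy(record)  -> cheap score, ideally higher for likely matches.
    oracle(record) -> (matches, value); stands in for the expensive DNN.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Stratify: sort by proxy score and cut into near-equal-sized strata.
    ordered = sorted(records, key=proxy)
    bounds = [round(i * len(ordered) / num_strata) for i in range(num_strata + 1)]
    strata = [ordered[bounds[i]:bounds[i + 1]] for i in range(num_strata)]
    for stratum in strata:
        rng.shuffle(stratum)  # a prefix of a shuffled stratum is a uniform sample

    labels = [[] for _ in strata]  # oracle outputs collected per stratum

    def draw(k, count):
        # Label the next `count` unlabeled records of stratum k.
        start = len(labels[k])
        for record in strata[k][start:start + count]:
            labels[k].append(oracle(record))

    def stats(k):
        # Plug-in estimates: predicate rate, mean and std. dev. of matching values.
        values = [v for ok, v in labels[k] if ok]
        p = len(values) / len(labels[k]) if labels[k] else 0.0
        mean = sum(values) / len(values) if values else 0.0
        var = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)
               if len(values) > 1 else 0.0)
        return p, mean, math.sqrt(var)

    # Stage 1: pilot sample, spread evenly across strata.
    pilot_each = int(budget * pilot_fraction) // num_strata
    for k in range(num_strata):
        draw(k, pilot_each)

    # Stage 2: spend the rest proportionally to sqrt(p_k) * sigma_k (assumed rule).
    scores = [math.sqrt(p) * sd for p, _, sd in map(stats, range(num_strata))]
    total = sum(scores)
    remaining = budget - pilot_each * num_strata
    for k in range(num_strata):
        share = scores[k] / total if total > 0 else 1 / num_strata
        draw(k, int(remaining * share))

    # Combine strata, weighting each by its size and estimated predicate rate.
    numerator = denominator = 0.0
    for k in range(num_strata):
        p, mean, _ = stats(k)
        weight = len(strata[k]) / len(ordered)
        numerator += weight * p * mean
        denominator += weight * p
    return numerator / denominator if denominator > 0 else float("nan")
```

In this sketch, draws that fail the predicate still consume budget but contribute only to the estimate of p_k, which is the difficulty the abstract highlights. Reusing the pilot labels in the final estimate is a simplification made here, not a claim about how ABae handles them.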


Updated: 2021-08-16
