EntropyDB: a probabilistic approach to approximate query processing, The VLDB Journal


The VLDB Journal (IF 2.8), Pub Date: 2019-11-02, DOI: 10.1007/s00778-019-00582-9
Laurel Orr, Magdalena Balazinska, Dan Suciu

We present EntropyDB, an interactive data exploration system that uses a probabilistic approach to generate a small, query-able summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the USA and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values. We also discuss extensions that can allow for data updates and linear queries over joins.
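To make the idea of a maximum-entropy summary concrete, the following is a minimal sketch and not the EntropyDB implementation. It assumes a toy two-column table of flights (the rows, the attribute values, and the helper names `fit_maxent`, `estimate_count`, and `freq` are all hypothetical), records a handful of observed frequencies as statistics, and fits the maximum-entropy distribution consistent with them using textbook iterative proportional fitting, which may differ from the solving techniques the abstract refers to. A counting query is then answered as the table size times the probability mass of the matching tuples.

```python
from itertools import product


def fit_maxent(domain, stats, cycles=200):
    """Fit a max-entropy distribution over `domain` by iterative proportional fitting.

    `stats` is a list of (predicate, target_probability) pairs. Starting from
    the uniform distribution, each step rescales the mass inside and outside
    one predicate so that the predicate's probability equals its target.
    """
    p = {t: 1.0 / len(domain) for t in domain}
    for _ in range(cycles):
        for pred, target in stats:
            mass = sum(prob for t, prob in p.items() if pred(t))
            scale_in = target / mass if mass > 0 else 0.0
            scale_out = (1 - target) / (1 - mass) if mass < 1 else 0.0
            for t in p:
                p[t] *= scale_in if pred(t) else scale_out
    return p


def estimate_count(p, n, pred):
    """Approximate a COUNT(*) query with filter `pred` from the summary alone."""
    return n * sum(prob for t, prob in p.items() if pred(t))


# Hypothetical toy table of (origin, destination) pairs.
rows = ([("WA", "CA")] * 4 + [("WA", "NY")]
        + [("CA", "NY")] * 3 + [("CA", "WA")] * 2)
n = len(rows)
values = ["WA", "CA", "NY"]
domain = list(product(values, values))  # every possible tuple, seen or not


def freq(pred):
    """Observed fraction of rows satisfying `pred`."""
    return sum(1 for r in rows if pred(r)) / n


# Statistics kept in the summary: one frequency per value of each attribute.
stats = []
for i in (0, 1):
    for v in values:
        pred = lambda t, i=i, v=v: t[i] == v
        stats.append((pred, freq(pred)))

summary = fit_maxent(domain, stats)

wa_to_ca = lambda t: t == ("WA", "CA")
from_ny = lambda t: t[0] == "NY"
print(estimate_count(summary, n, wa_to_ca))
print(estimate_count(summary, n, from_ny))

# Adding one two-attribute statistic tightens the estimate for that pair.
refined = fit_maxent(domain, stats + [(wa_to_ca, freq(wa_to_ca))])
print(estimate_count(refined, n, wa_to_ca))
```

With only single-attribute statistics, the maximum-entropy distribution treats the two attributes as independent, so the WA-to-CA estimate comes out near 2 although the toy table holds 4 such rows. Once the pair's own frequency is added as a statistic, the estimate matches the true count. Flights originating in NY are estimated at zero because the recorded frequency for that value is zero, so the summary itself carries the information that the value does not occur.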


Updated: 2019-11-02
