Mosaic: A Sample-Based Database System for Open World Query Processing
arXiv - CS - Databases, Pub Date: 2019-12-17, DOI: arxiv-1912.07777
Laurel Orr, Samuel Ainsworth, Walter Cai, Kevin Jamieson, Magda Balazinska, Dan Suciu

Data scientists have relied on samples to analyze populations of interest for decades. Recently, with the increase in the number of public data repositories, sample data has become easier to access. It has not, however, become easier to analyze. This sample data is arbitrarily biased with an unknown sampling probability, meaning data scientists must manually debias the sample with custom techniques to avoid inaccurate results. In this vision paper, we propose Mosaic, a database system that treats samples as first-class citizens and allows users to ask questions over populations represented by these samples. Answering queries over biased samples is non-trivial as there is no existing, standard technique to answer population queries when the sampling probability is unknown. In this paper, we show how our envisioned system solves this problem by having a unique sample-based data model with extensions to the SQL language. We propose how to perform population query answering using biased samples and give preliminary results for one of our novel query answering techniques.
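To make the debiasing problem concrete, here is a minimal sketch of one classical reweighting approach, post-stratification. It is not the technique proposed in the paper, which the abstract does not describe. It also assumes the per-group population counts are known, which the abstract does not state. The function names, the group labels, and the toy data are all hypothetical.

```python
from collections import Counter


def poststratify_weights(sample_groups, population_counts):
    """Return one weight per sample row so that the weighted group
    totals match the (assumed known) population group counts."""
    sample_counts = Counter(sample_groups)
    return [population_counts[g] / sample_counts[g] for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: the population is half "urban" and half "rural",
# but the sample over-represents "urban" rows 8 to 2.
sample_groups = ["urban"] * 8 + ["rural"] * 2
sample_values = [10.0] * 8 + [20.0] * 2
population_counts = {"urban": 500, "rural": 500}  # assumed known

# Naive estimate: treats the biased sample as if it were uniform.
naive = sum(sample_values) / len(sample_values)

# Reweighted estimate: each row stands in for population/sample rows of its group.
weights = poststratify_weights(sample_groups, population_counts)
adjusted = weighted_mean(sample_values, weights)

print(naive, adjusted, sum(weights))
```

In this toy case the naive mean is pulled toward the over-sampled group, while the reweighted mean matches what an evenly split population would give. The weights also sum to the population size, so a COUNT-style population query can be answered from the same weights.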

Updated: 2020-01-14