Approximate Selection with Guarantees using Proxies
arXiv - CS - Databases. Pub Date: 2020-04-02, DOI: arxiv-2004.00827
Daniel Kang, Edward Gan, Peter Bailis, Tatsunori Hashimoto, Matei Zaharia

Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect long videos to identify relevant objects and events. To reduce this cost, recent work proposes to use cheap proxy models, such as image classifiers, to identify an approximate set of data points satisfying a data selection filter. Unfortunately, this recent work does not provide the statistical accuracy guarantees necessary in scientific and production settings. In this work, we introduce novel algorithms for approximate selection queries with statistical accuracy guarantees. Namely, given a limited number of exact identifications from an oracle, often a human or an expensive machine learning model, our algorithms meet a minimum precision or recall target with high probability. In contrast, existing approaches can catastrophically fail in satisfying these recall and precision targets. We show that our algorithms can improve query result quality by up to 30x for both the precision and recall targets in both real and synthetic datasets.
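
To make the problem setup concrete, the following is a minimal sketch of the naive baseline for a recall target: label a uniform sample with the oracle, then pick the proxy-score threshold that reaches the target recall on that sample. It is not the paper's algorithm, and it carries no statistical guarantee, because the threshold is tuned to the sample without any confidence correction. The names `select_with_recall_target`, `proxy_scores`, `oracle`, and `budget` are hypothetical and chosen only for illustration.

```python
import math
import random


def select_with_recall_target(proxy_scores, oracle, budget, recall_target, seed=0):
    """Naive baseline (illustrative only, no guarantee).

    proxy_scores: one cheap proxy score per record, higher means more likely a match.
    oracle: callable index -> bool, the expensive exact label.
    budget: maximum number of oracle calls.
    recall_target: desired fraction of true matches to return, in (0, 1].
    """
    rng = random.Random(seed)
    n = len(proxy_scores)
    # Only a uniform sample is labeled, since oracle calls are the costly step.
    sample = rng.sample(range(n), min(budget, n))
    labeled = {i: oracle(i) for i in sample}

    positive_scores = sorted(
        (proxy_scores[i] for i, is_match in labeled.items() if is_match),
        reverse=True,
    )
    if not positive_scores:
        # No matches observed, so no threshold can be estimated: return everything.
        return list(range(n))

    # Fewest sampled matches that must be kept to reach the target on the sample.
    k = math.ceil(recall_target * len(positive_scores))
    threshold = positive_scores[k - 1]

    selected = {i for i in range(n) if proxy_scores[i] >= threshold}
    # Records the oracle already confirmed are always returned.
    selected.update(i for i, is_match in labeled.items() if is_match)
    return sorted(selected)


def recall(selected, true_matches):
    """Fraction of true matches that appear in the selected set."""
    true_matches = set(true_matches)
    if not true_matches:
        return 1.0
    return len(true_matches & set(selected)) / len(true_matches)
```

Because the threshold is fit to a small sample, the recall achieved on the full dataset can fall below the target when matches are rare. That gap is the failure mode the abstract attributes to existing approaches.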

Updated: 2020-07-27