DIFF: a relational interface for large-scale data explanation
The VLDB Journal (IF 4.2), Pub Date: 2020-09-30, DOI: 10.1007/s00778-020-00633-6
Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu, Erik Meijer, Xi Wu, Jeff Naughton, Peter Bailis, Matei Zaharia

A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root-cause analysis, today’s explanation engines are designed as stand-alone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.
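
The abstract gives neither the operator's syntax nor its metrics, so the following is only a toy sketch of the general idea: compare a "test" relation against a "control" relation and report attribute values that are unusually common in the test set. The function name `diff_explain`, the two metrics (support and risk ratio), the thresholds, and the sample rows are all assumptions chosen for illustration. This is not the MB SQL implementation, and it considers single attribute values only.

```python
from collections import Counter
from math import inf


def diff_explain(test_rows, control_rows, attributes,
                 min_support=0.05, min_risk_ratio=2.0):
    """Toy explanation query (hypothetical helper, not the paper's code).

    Returns (attribute, value, support, risk_ratio) tuples for values
    that pass both thresholds.
    """
    n_test, n_control = len(test_rows), len(control_rows)
    if n_test == 0:
        return []

    # Count how often each (attribute, value) pair occurs in each relation.
    test_counts = Counter((a, r[a]) for r in test_rows for a in attributes)
    control_counts = Counter((a, r[a]) for r in control_rows for a in attributes)

    results = []
    for (attr, value), t in test_counts.items():
        c = control_counts.get((attr, value), 0)
        support = t / n_test  # fraction of test rows carrying this value

        # Risk ratio: P(test | row has value) / P(test | row lacks value).
        exposed = t + c
        unexposed = (n_test - t) + (n_control - c)
        risk_exposed = t / exposed
        risk_unexposed = (n_test - t) / unexposed if unexposed else 0.0
        risk_ratio = inf if risk_unexposed == 0 else risk_exposed / risk_unexposed

        if support >= min_support and risk_ratio >= min_risk_ratio:
            results.append((attr, value, support, risk_ratio))
    return results


# Made-up sample data: crashing sessions versus healthy sessions.
crash_rows = [
    {"version": "v2", "os": "android"},
    {"version": "v2", "os": "ios"},
    {"version": "v2", "os": "android"},
    {"version": "v1", "os": "ios"},
]
ok_rows = [
    {"version": "v1", "os": "android"},
    {"version": "v1", "os": "android"},
    {"version": "v1", "os": "android"},
    {"version": "v1", "os": "ios"},
    {"version": "v1", "os": "ios"},
    {"version": "v2", "os": "ios"},
]

explanations = diff_explain(crash_rows, ok_rows, ["version", "os"])
print(explanations)
```

In this made-up data only `version = v2` passes both thresholds, because it appears in three of the four crash rows but in just one of the six healthy rows. The `os` values are spread evenly across both relations and are filtered out.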




Updated: 2020-10-02