Adding data provenance support to Apache Spark


The VLDB Journal (IF 2.8), Pub Date: 2017-08-07, DOI: 10.1007/s00778-017-0474-5
Matteo Interlandi₁, Ari Ekmekji₂, Kshitij Shah₃, Muhammad Ali Gulzar₃, Sai Deep Tetali₃, Miryung Kim₃, Todd Millstein₃, Tyson Condie₃


Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance—tracking data through transformations—in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds—orders of magnitude faster than alternative solutions—while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
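The idea of tracking data through transformations can be illustrated with a small plain-Python sketch. This is not Titian's API and does not use Spark; the helper names (`with_ids`, `traced_map`, `traced_filter`, `traced_reduce_by_key`, `trace_back`) and the sample log lines are hypothetical, chosen only to show how carrying input identifiers alongside each record lets a suspicious output be traced back to the inputs that produced it.

```python
# Toy illustration of record-level data provenance (lineage).
# Assumption: every helper below is hypothetical and unrelated to Titian or Spark.

def with_ids(records):
    """Tag each input record with the set of input indices it derives from."""
    return [(value, frozenset([i])) for i, value in enumerate(records)]

def traced_map(fn, tagged):
    """Apply fn to each value; the lineage set is carried through unchanged."""
    return [(fn(value), ids) for value, ids in tagged]

def traced_filter(pred, tagged):
    """Keep only values satisfying pred, together with their lineage."""
    return [(value, ids) for value, ids in tagged if pred(value)]

def traced_reduce_by_key(fn, tagged):
    """Combine values per key; the result's lineage is the union of its inputs'."""
    acc = {}
    for (key, value), ids in tagged:
        if key in acc:
            prev_value, prev_ids = acc[key]
            acc[key] = (fn(prev_value, value), prev_ids | ids)
        else:
            acc[key] = (value, ids)
    return [((key, value), ids) for key, (value, ids) in acc.items()]

def trace_back(tagged, pred, records):
    """Return the input records that contributed to outputs matching pred."""
    ids = set()
    for value, rec_ids in tagged:
        if pred(value):
            ids |= rec_ids
    return [records[i] for i in sorted(ids)]

log_lines = [
    "INFO start",
    "ERROR disk full",
    "WARN slow",
    "ERROR disk full",
    "ERROR timeout",
]

# A small pipeline: keep errors, key by message, count occurrences.
tagged = with_ids(log_lines)
tagged = traced_filter(lambda line: line.startswith("ERROR"), tagged)
tagged = traced_map(lambda line: (line.split(" ", 1)[1], 1), tagged)
counts = traced_reduce_by_key(lambda a, b: a + b, tagged)

# Backward trace: which input lines produced any count greater than 1?
suspicious = trace_back(counts, lambda kv: kv[1] > 1, log_lines)
print(suspicious)
```

A real system has to capture this lineage across distributed stages and shuffles without materializing identifier sets in this naive way; the sketch only shows the backward-tracing idea on a single machine.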


Updated: 2017-08-07
