SEIZE: Runtime Inspection for Parallel Dataflow Systems, IEEE Transactions on Parallel and Distributed Systems

SEIZE: Runtime Inspection for Parallel Dataflow Systems
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2020-11-02, DOI: 10.1109/tpds.2020.3035170
Youfu Li, Matteo Interlandi, Fotis Psallidas, Wei Wang, Carlo Zaniolo

Many Data-Intensive Scalable Computing (DISC) systems provide easy-to-use functional APIs together with efficient scheduling and execution strategies, allowing users to build concise data-parallel programs. In these systems, data transformations are concealed behind the exposed APIs, and intermediate execution states are masked under dataflow transitions. Consequently, many crucial features and optimizations (e.g., debugging, data provenance, runtime skew detection) that require runtime dataflow states are not well supported. Inspired by our experience in implementing features and optimizations over DISC systems, we present SEIZE, a unified framework that enables dataflow inspection (wiretapping the data path with listening logic) in the MapReduce-style programming model. We generalize our lessons learned by providing a set of primitives defining dataflow inspection, orchestration options for different inspection granularities, and an operator decomposition and dataflow punctuation strategy for dataflow intervention. We demonstrate the generality and flexibility of the approach by deploying SEIZE in both Apache Spark and Apache Flink, and by implementing a prototype runtime query optimizer for Spark. Our experiments show that the overhead introduced by the inspection logic is negligible most of the time (less than 5 percent in Spark and 10 percent in Flink).
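
To make the idea of "wiretapping the data path with listening logic" concrete, here is a minimal sketch in plain Python. It is not SEIZE's API and does not use Spark or Flink: the names `wiretap`, `inspected_map`, and `count_keys` are hypothetical, chosen only for illustration. The sketch assumes that an operator's output can be viewed as a stream of records, and that a listener observes each record without modifying it. Here the listener gathers per-key counts, the kind of runtime state a skew detector might consume.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def wiretap(records: Iterable[T], listener: Callable[[T], None]) -> Iterator[T]:
    """Yield every record unchanged while also handing it to the listener."""
    for record in records:
        listener(record)  # listening logic observes the record
        yield record      # the data path itself is left untouched


def inspected_map(
    fn: Callable[[T], U],
    records: Iterable[T],
    listener: Callable[[U], None],
) -> Iterator[U]:
    """Map-style operator whose output stream is tapped by a listener."""
    return wiretap((fn(r) for r in records), listener)


# Hypothetical listener: count records per key as they flow past.
key_counts: Counter = Counter()


def count_keys(pair: Tuple[str, int]) -> None:
    key_counts[pair[0]] += 1


words = ["a", "b", "a", "c", "a"]
pairs = list(inspected_map(lambda w: (w, 1), words, count_keys))

# Report the heaviest key and how many records carried it.
print(key_counts.most_common(1))
```

Because the tap is a generator wrapped around the operator's output, the listener runs lazily as records are consumed, and the downstream consumer sees the same records it would see without inspection.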

Updated: 2020-11-02
