Representations and Optimizations for Embedded Parallel Dataflow Languages
ACM Transactions on Database Systems (IF 1.8), Pub Date: 2019-01-29, DOI: 10.1145/3281629
Alexander Alexandrov, Georgi Krastev, Volker Markl

Parallel dataflow engines such as Apache Hadoop, Apache Spark, and Apache Flink are an established alternative to relational databases for modern data analysis applications. A characteristic of these systems is a scalable programming model based on distributed collections and parallel transformations expressed by means of second-order functions such as map and reduce. Notable examples are Flink’s DataSet and Spark’s RDD programming abstractions. These programming models are realized as EDSLs, that is, domain-specific languages embedded in a general-purpose host language such as Java, Scala, or Python. This approach has several advantages over traditional external DSLs such as SQL or XQuery. First, syntactic constructs from the host language (e.g., anonymous function syntax, value definitions, and fluent syntax via method chaining) can be reused in the EDSL. This eases the learning curve for developers already familiar with the host language. Second, it allows for seamless integration of library methods written in the host language via the function parameters passed to the parallel dataflow operators. This reduces the effort for developing analytics dataflows that go beyond pure SQL and require domain-specific logic. At the same time, however, state-of-the-art parallel dataflow EDSLs exhibit a number of shortcomings. First, one of the main advantages of an external DSL such as SQL, namely the high-level, declarative Select-From-Where syntax, is either lost completely or mimicked in a non-standard way. Second, execution aspects such as caching, join order, and partial aggregation have to be decided by the programmer. Optimizing them automatically is very difficult due to the limited program context available in the intermediate representation of the DSL. In this article, we argue that the limitations listed above are a side effect of the adopted type-based embedding approach. As a solution, we propose an alternative EDSL design based on quotations.
We present a DSL embedded in Scala and discuss its compiler pipeline, intermediate representation, and some of the enabled optimizations. We promote the algebraic type of bags in union representation as a model for distributed collections, and its associated structural recursion scheme and monad as a model for parallel collection processing. At the source code level, Scala’s comprehension syntax over a bag monad can be used to encode Select-From-Where expressions in a standard way. At the intermediate representation level, maintaining comprehensions as a first-class citizen can be used to simplify the design and implementation of holistic dataflow optimizations that accommodate nesting and control flow. The proposed DSL design therefore reconciles the benefits of embedded parallel dataflow DSLs with the declarativity and optimization potential of external DSLs like SQL.
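
To make the terms in the abstract concrete, the following is a minimal sketch of a bag in union representation, its structural recursion scheme (a fold), and the monad operations derived from that fold. It is written in Python rather than Scala so that it stays short and self-contained. Every name in it (`Bag`, `Empty`, `Singleton`, `Union`, `fold`, `flat_map`, `with_filter`, `bag`) is a hypothetical choice made for illustration and is not taken from the article's DSL.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any


class Bag:
    """Bag in union representation: Empty | Singleton(value) | Union(left, right).

    Illustrative only; these names are assumptions, not the article's API.
    """

    def fold(self, zero, init, plus):
        """Structural recursion: substitute zero/init/plus for the three constructors."""
        if isinstance(self, Empty):
            return zero
        if isinstance(self, Singleton):
            return init(self.value)
        return plus(self.left.fold(zero, init, plus),
                    self.right.fold(zero, init, plus))

    # Monad operations, all expressed through fold.
    def flat_map(self, f):
        return self.fold(Empty(), f, Union)

    def map(self, f):
        return self.flat_map(lambda x: Singleton(f(x)))

    def with_filter(self, p):
        return self.flat_map(lambda x: Singleton(x) if p(x) else Empty())

    # Two more folds: an aggregate and a conversion used for inspection.
    def size(self):
        return self.fold(0, lambda _: 1, lambda a, b: a + b)

    def to_sorted_list(self):
        return sorted(self.fold([], lambda x: [x], lambda a, b: a + b))


@dataclass(frozen=True)
class Empty(Bag):
    pass


@dataclass(frozen=True)
class Singleton(Bag):
    value: Any


@dataclass(frozen=True)
class Union(Bag):
    left: Bag
    right: Bag


def bag(*xs):
    """Build a bag from the given elements by repeated union."""
    return reduce(Union, (Singleton(x) for x in xs), Empty())


# SELECT e.name, d.title FROM employees e, departments d WHERE e.dept = d.id
employees = bag(("ann", 1), ("bob", 2), ("cy", 1))    # (name, dept)
departments = bag((1, "db"), (2, "ml"))               # (id, title)

joined = employees.flat_map(
    lambda e: departments
        .with_filter(lambda d: e[1] == d[0])
        .map(lambda d: (e[0], d[1])))

print(joined.to_sorted_list())
```

The chain of `flat_map`, `with_filter`, and `map` calls is roughly what a Scala for-comprehension with two generators and a guard desugars to, which is why a comprehension over a bag monad can stand in for a Select-From-Where expression. Because `size` is also a fold whose `plus` is associative and commutative, an engine could in principle evaluate it on partitions independently and combine the partial results.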

Updated: 2019-01-29