Translation of Array-Based Loops to Distributed Data-Parallel Programs, arXiv - CS - Databases



arXiv - CS - Databases, Pub Date: 2020-03-21, DOI: arxiv-2003.09769
Leonidas Fegaras and Md Hasanuzzaman Noor

Large volumes of data generated by scientific experiments and simulations come in the form of arrays, while programs that analyze these data are frequently expressed in terms of array operations in an imperative, loop-based language. However, as datasets grow larger, new frameworks for distributed Big Data analytics have become essential tools for large-scale scientific computing. Scientists, who are typically comfortable with numerical analysis tools but unfamiliar with the intricacies of Big Data analytics, must now learn to convert their loop-based programs to distributed data-parallel programs. We present a novel framework for translating programs expressed as array-based loops to distributed data-parallel programs that is more general and efficient than related work. Although our translations are over sparse arrays, we extend our framework to handle packed arrays, such as tiled matrices, without sacrificing performance. We report on a prototype implementation on top of Spark and evaluate the performance of our system relative to hand-written programs.
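
To make the idea of the abstract concrete, the sketch below contrasts a loop-based matrix multiplication with an equivalent formulation over sparse arrays stored as collections of ((row, column), value) pairs, using a join on the shared index followed by a reduction by key. This is only an illustration of the general style of such a translation, not the authors' translation scheme; the function names (`matmul_loops`, `to_sparse`, `matmul_data_parallel`) are hypothetical, and plain Python collections stand in for a distributed dataset.

```python
from collections import defaultdict


def matmul_loops(A, B, n, m, l):
    """Imperative, loop-based product of an n x l and an l x m dense matrix."""
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(l):
                C[i][j] += A[i][k] * B[k][j]
    return C


def to_sparse(M):
    """Coordinate form: one ((row, col), value) pair per nonzero element."""
    return [((i, j), v)
            for i, row in enumerate(M)
            for j, v in enumerate(row)
            if v != 0]


def matmul_data_parallel(A_sparse, B_sparse):
    """Same product written as bulk operations: a join, then a reduce by key.

    Assumption: local Python collections play the role of distributed
    datasets; a real system would run these steps across a cluster.
    """
    # Index B by its row k so that A[i, k] can be joined with B[k, j].
    b_by_row = defaultdict(list)
    for (k, j), b in B_sparse:
        b_by_row[k].append((j, b))

    # Join on the shared index k and emit one partial product per match.
    products = [((i, j), a * b)
                for (i, k), a in A_sparse
                for (j, b) in b_by_row.get(k, [])]

    # Reduce by key (i, j): sum the partial products for each output cell.
    C = defaultdict(float)
    for key, value in products:
        C[key] += value
    return dict(C)


A = [[1, 2], [0, 3]]
B = [[4, 0], [5, 6]]
dense_result = matmul_loops(A, B, 2, 2, 2)
sparse_result = matmul_data_parallel(to_sparse(A), to_sparse(B))
```

In this formulation the loop nest disappears: the innermost loop over `k` becomes the join, and the incremental update `C[i][j] += ...` becomes a sum grouped by the destination index. Both steps are bulk operations that a framework such as Spark can distribute.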


Updated: 2020-03-24
