Bigflow: A General Optimization Layer for Distributed Computing Frameworks
Journal of Computer Science and Technology (IF 1.9), Pub Date: 2020-03-01, DOI: 10.1007/s11390-020-9702-3
Yun-Cong Zhang, Xiao-Yang Wang, Cong Wang, Yao Xu, Jian-Wei Zhang, Xiao-Dong Lin, Guang-Yu Sun, Gong-Lin Zheng, Shan-Hui Yin, Xian-Jin Ye, Li Li, Zhan Song, Dong-Dong Miao

As data volumes grow rapidly, distributed computation is widely employed in data centers as a cheap and efficient way to process large-scale datasets in parallel. Various computation models have been proposed to improve the abstraction of distributed datasets and hide the details of parallelism. However, most of them follow a single-layer partitioning method, which limits developers' ability to express multi-level partitioning operations succinctly. To overcome this problem, we present the NDD (Nested Distributed Dataset) data model. It is a more compact and expressive extension of the Spark RDD (Resilient Distributed Dataset) that relieves developers of the burden of manually writing the logic for multi-level partitioning cases. Based on the NDD model, we develop an open-source framework called Bigflow, which serves as an optimization layer over the computation engines of the most widely used processing frameworks. With the help of Bigflow, some advanced optimization techniques, which otherwise could be applied only manually by experienced programmers, are enabled automatically in a distributed data processing job. Currently, Bigflow processes about 3 PB of data daily in Baidu's data centers. According to customer experience, it can significantly reduce code length and improve performance compared with the intuitive programming style.
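
To make "multi-level partitioning" concrete, the following is a minimal single-machine sketch in plain Python. It is not the Bigflow or Spark API. The helper names `group_by` and `nested_group_by` and the sample page-view log are hypothetical, introduced only for illustration. The sketch partitions records first by site and then by day, and then applies one aggregation to each innermost partition, which is the kind of nested logic the abstract says developers otherwise write by hand.

```python
from collections import defaultdict


def group_by(records, key):
    """Partition records into a dict of lists keyed by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


def nested_group_by(records, keys):
    """Apply group_by once per key function, nesting each level in the previous one."""
    if not keys:
        return list(records)
    first, rest = keys[0], keys[1:]
    return {
        k: nested_group_by(part, rest)
        for k, part in group_by(records, first).items()
    }


# Hypothetical page-view log: (site, day, visitor)
logs = [
    ("a.com", "mon", "u1"),
    ("a.com", "mon", "u2"),
    ("a.com", "mon", "u1"),
    ("a.com", "tue", "u1"),
    ("b.com", "mon", "u3"),
]

# Two-level partitioning: by site, then by day within each site.
nested = nested_group_by(logs, [lambda r: r[0], lambda r: r[1]])

# Aggregate inside each innermost partition: distinct visitors per (site, day).
distinct_visitors = {
    site: {day: len({r[2] for r in rows}) for day, rows in by_day.items()}
    for site, by_day in nested.items()
}

print(distinct_visitors)
```

With a single-layer model, the same result typically requires building a composite key such as `(site, day)` and regrouping afterwards. The nested form keeps each level of the hierarchy explicit.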

Updated: 2020-03-01