Evaluation of distributed data processing frameworks in hybrid clouds, Journal of Network and Computer Applications

Evaluation of distributed data processing frameworks in hybrid clouds
Journal of Network and Computer Applications (IF 8.7), Pub Date: 2024-02-03, DOI: 10.1016/j.jnca.2024.103837
Faheem Ullah, Shagun Dhingra, Xiaoyu Xia, M. Ali Babar

Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among the computing nodes of a cloud. Recently, there have been increasing efforts to evaluate the performance of such frameworks hosted in private and public clouds. However, little research has evaluated their performance in a hybrid cloud, an emerging cloud model that integrates private and public clouds to combine the strengths of both. Therefore, in this paper, we evaluate the performance of Hadoop, Spark, and Flink in a hybrid cloud in terms of execution time, resource utilization, horizontal scalability, vertical scalability, and cost. For this study, our hybrid cloud consists of OpenStack (private cloud) and MS Azure (public cloud). We use both batch and iterative workloads for the evaluation. Our results show that in a hybrid cloud (i) the execution time increases as the private cloud borrows more nodes from the public cloud, (ii) Flink outperforms Spark, which in turn outperforms Hadoop, in terms of execution time, (iii) Hadoop transfers the largest amount of data among the nodes during workload execution while Spark transfers the least, (iv) all three frameworks scale better horizontally than vertically, and (v) Spark is the least expensive in terms of monetary cost for data processing while Hadoop is the most expensive.


Updated: 2024-02-03
