runData: Re-distributing Data via Piggybacking for Geo-distributed Data Analytics over Edges, IEEE Transactions on Parallel and Distributed Systems



IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2021-06-03, DOI: 10.1109/tpds.2021.3086274
Yibo Jin, Zhuzhong Qian, Song Guo, Sheng Zhang, Lei Jiao, Sanglu Lu

Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since the datasets are often generated in close proximity to end users, traditional works mainly focus on offloading proper tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient over the long term, since some datasets are used multiple times. Instead, optimizing the data distribution is much more efficient and can directly benefit forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the execution of data analytics. In order to minimize the overall completion time for a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute the data along with task offloading, and formulate the corresponding ε-bounded data-driven task scheduling problem over a wide area network under the consideration of edge heterogeneity. We design an online scheme, runData, which offloads proper tasks and related data via piggybacking to the datacenter based on delicately calculated probabilities. Through rigorous theoretical analysis, runData is proved to concentrate around its optimum with high probability. We implement runData based on Spark and HDFS. Both testbed results and trace-driven simulations show that runData re-distributes proper data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemes.
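
The piggybacking idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm or their Spark/HDFS implementation: the abstract does not say how the offloading probabilities are calculated, so here they are simply supplied as input, and the names Task, Datacenter, and schedule are hypothetical. The sketch also assumes that any task whose dataset already has a copy at the datacenter runs there. It only shows the difference from a throwaway fetcher: data carried along with an offloaded task stays at the datacenter, so a later task on the same dataset needs no second wide-area transfer.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Task:
    job_id: int
    dataset: str  # name of the dataset the task reads
    edge: str     # edge site where that dataset was generated


@dataclass
class Datacenter:
    # Datasets already re-distributed to the datacenter (hypothetical state).
    replicas: set = field(default_factory=set)


def schedule(tasks, offload_prob, dc, rng):
    """Place each task at its edge or at the datacenter.

    offload_prob maps an edge name to the probability of offloading a task
    from that edge. These values are an assumed input, not computed here.
    Returns a list of (task, site, moved_data) tuples.
    """
    placements = []
    for task in tasks:
        if task.dataset in dc.replicas:
            # An earlier offload piggybacked this data: no new transfer needed.
            placements.append((task, "datacenter", False))
        elif rng.random() < offload_prob.get(task.edge, 0.0):
            # Offload the task and keep the data it carries at the datacenter.
            dc.replicas.add(task.dataset)
            placements.append((task, "datacenter", True))
        else:
            # Run in place at the edge that holds the data.
            placements.append((task, task.edge, False))
    return placements


if __name__ == "__main__":
    dc = Datacenter()
    jobs = [
        Task(1, "logs", "edge-a"),
        Task(2, "logs", "edge-a"),
        Task(3, "clicks", "edge-b"),
    ]
    probs = {"edge-a": 0.8, "edge-b": 0.1}  # illustrative values only
    for task, site, moved in schedule(jobs, probs, dc, random.Random(7)):
        print(task.job_id, task.dataset, site, "transfer" if moved else "no transfer")
```

In this sketch the cost of moving a dataset is paid at most once per dataset, which mirrors the abstract's argument that re-distribution may slow the current job while benefiting forthcoming ones.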


Updated: 2021-06-03
