IPC: Resource and network cost-aware distributed stream scheduling on skewed streams
Advanced Engineering Informatics (IF 8.0), Pub Date: 2020-09-17, DOI: 10.1016/j.aei.2020.101165
Muhammad Mudassar Qureshi, Hanhua Chen, Fan Zhang, Hai Jin

The performance of distributed stream processing engines is significantly compromised when processing stream data with a skewed distribution. Current stream partitioning schemes are not able to meet the rigorous requirements of distributed stream processing. We show that network cost is an essential factor for partitioning data, and that this factor should be considered when designing a stream partitioning scheme. Additionally, resources should be utilized efficiently in the data partitioning process. Current stream partitioning schemes either use a shuffle grouping approach, which manages workload efficiently but faces scalability issues in terms of memory, or use hash-based key grouping schemes, which suffer from load balancing issues.

We argue that network cost and resource utilization are two crucial factors for stream partitioning schemes. We propose and implement a distributed stream partitioning scheme called IPC that minimizes the network cost and efficiently utilizes resources by leveraging two techniques: process near source and process at local. It also utilizes key splitting and local load estimation techniques to achieve load balancing. We implement IPC on top of Apache Storm. Experimental results using large-scale real-time datasets show that IPC achieves up to a 4.2x improvement in throughput and reduces processing latency by 97% compared to state-of-the-art designs.
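
The abstract names key splitting, local load estimation, and a preference for local processing, but gives no algorithmic detail. The following is only a minimal sketch of how those three ideas could fit together, not IPC's actual algorithm. It assumes each key has two hash-chosen candidate workers, that a source estimates load purely from the number of tuples it has sent itself, and that a remote worker costs a fixed extra penalty. The class `LocalityAwareSplitter` and the parameters `local_workers` and `remote_penalty` are hypothetical names introduced for illustration.

```python
import hashlib


def _hash(key: str, seed: int, n: int) -> int:
    """Deterministic seeded hash of a key into the range [0, n)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


class LocalityAwareSplitter:
    """Illustrative router (hypothetical, not the paper's scheme)."""

    def __init__(self, num_workers, local_workers, remote_penalty=1.0):
        self.num_workers = num_workers
        # Assumption: workers co-located with this source, so no network hop.
        self.local_workers = set(local_workers)
        # Assumption: network cost of a remote hop, expressed in "tuples of load".
        self.remote_penalty = remote_penalty
        # Local load estimate: tuples this source has sent to each worker.
        self.sent = [0] * num_workers

    def candidates(self, key):
        """Key splitting: each key may be routed to one of two workers."""
        n = self.num_workers
        return (_hash(key, 0, n), _hash(key, 1, n))

    def route(self, key):
        """Pick the candidate with the lowest estimated load plus network cost."""

        def cost(worker):
            penalty = 0.0 if worker in self.local_workers else self.remote_penalty
            return self.sent[worker] + penalty

        chosen = min(self.candidates(key), key=cost)
        self.sent[chosen] += 1
        return chosen
```

Under these assumptions, a hot key is spread over at most two workers instead of overloading one, and a local candidate is preferred until its estimated load exceeds the remote candidate's load by more than `remote_penalty`. Calling `route("some-key")` for every tuple of a skewed stream and then inspecting `sent` shows how the load was divided.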




Updated: 2020-09-17