Pre-filtering based summarization for data partitioning in distributed stream processing

Concurrency and Computation: Practice and Experience (IF 1.5), Pub Date: 2021-04-30, DOI: 10.1002/cpe.6338
Adeel Aslam¹, Hanhua Chen²,³, Hai Jin³

Load balancing among the processing elements (PEs) of a distributed stream processing system (DSPS) is a key issue in the presence of data skew. Existing data partitioning schemes for DSPSs suffer from poor scalability and system inefficiency. Non-key-based partitioning strategies incur prohibitively high memory overhead for stateful operations with a large number of keys and high data parallelism, while key-based schemes introduce load imbalance for highly skewed data. Predicting the nature of stream data in advance can help reduce the load imbalance among the PEs of a DSPS. For this purpose, heavy hitter algorithms approximate the hot items of streaming data. However, existing designs suffer from unsatisfactory prediction accuracy. In this work, we propose an efficient algorithm to filter hot items in a stream of incoming data. The proposed scheme dynamically monitors the items of a stream and greatly improves the accuracy of estimation by keeping the actual key-value pair for the frequent items. On the one hand, to ensure better load balancing for skewed data streams, the detected hot keys are directed randomly to more than two PEs drawn from a limited set of workers. On the other hand, for less frequent keys, the proposed scheme applies the principle of the power of two choices to distribute load. We conduct extensive experiments on both real-world and synthetic data sets. The results show that the proposed pre-filtering approach significantly outperforms existing designs in terms of prediction accuracy. The results also show that our design achieves a more balanced load than existing designs.
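The routing policy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the pre-filtering summary, so an exact `Counter` stands in for it here. The names `SkewAwarePartitioner`, `hot_fraction`, `hot_choices`, and `warmup` are hypothetical, and "load" is simply the number of tuples sent to each worker.

```python
import hashlib
import random
from collections import Counter


def _hash(key: str, seed: int, n: int) -> int:
    """Deterministically map a key to a worker index in [0, n), varied by seed."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


class SkewAwarePartitioner:
    """Illustrative partitioner: hot keys get d > 2 choices, other keys get two."""

    def __init__(self, num_workers, hot_fraction=0.05, hot_choices=4,
                 warmup=100, seed=0):
        if not 2 < hot_choices <= num_workers:
            raise ValueError("hot_choices must be > 2 and <= num_workers")
        self.num_workers = num_workers
        self.hot_fraction = hot_fraction  # assumed hotness threshold
        self.hot_choices = hot_choices    # assumed number of PEs per hot key
        self.warmup = warmup              # tuples to observe before flagging
        self.counts = Counter()           # exact counts: stand-in for the summary
        self.total = 0
        self.load = [0] * num_workers     # tuples routed to each worker so far
        self.rng = random.Random(seed)

    def is_hot(self, key):
        """A key is hot once it exceeds hot_fraction of all tuples seen."""
        if self.total < self.warmup:
            return False
        return self.counts[key] > self.hot_fraction * self.total

    def route(self, key):
        """Record one tuple for `key` and return the chosen worker index."""
        self.counts[key] += 1
        self.total += 1
        n = self.num_workers
        if self.is_hot(key):
            # Hot key: choose at random among hot_choices distinct workers.
            start = _hash(key, 0, n)
            candidates = [(start + i) % n for i in range(self.hot_choices)]
            worker = self.rng.choice(candidates)
        else:
            # Less frequent key: power of two choices, take the lighter worker.
            a, b = _hash(key, 1, n), _hash(key, 2, n)
            worker = a if self.load[a] <= self.load[b] else b
        self.load[worker] += 1
        return worker
```

Calling `route(key)` once per incoming tuple returns a worker index and updates the local load vector. Because a hot key's state is spread over several workers, a stateful operator built on such a scheme would still need a downstream step to merge partial results; that step is outside this sketch.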


Updated: 2021-04-30
