Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications
Mobile Networks and Applications ( IF 3.8 ) Pub Date : 2021-01-06 , DOI: 10.1007/s11036-020-01729-7
Zhan Zhang , Wenhao Li , Xiao Qing , Xian Liu , Hongwei Liu

Various distributed stream processing systems (DSPSs) are now employed to process ever-expanding volumes of real-time data. DSPSs are highly susceptible to system failures, so fault tolerance is a major problem that is receiving considerable attention. Flink is a popular stream computing framework that implements a lightweight, asynchronous checkpointing technique based on the barrier mechanism to keep data analysis efficient. In a checkpoint-based fault-tolerance mechanism, a shorter checkpoint interval increases the runtime cost of a streaming application, while a longer one increases the time needed to recover from a failure. Selecting an optimal checkpoint interval is therefore critical to the efficiency of streaming applications. Traditional methods for choosing the optimal checkpoint interval usually assume that the checkpointing delay and the fault recovery time are fixed. However, both factors depend strongly on the intensity of the application's workload. To obtain a better checkpoint interval under different workload intensities, this paper proposes a performance model that estimates tuple processing latency and a recovery model that estimates fault recovery time. With these two models, an optimal checkpoint interval can be derived. The models and the interval optimisation are verified experimentally on Flink. The results show that the proposed model can recommend an optimal checkpoint interval according to indicators related to system reliability. The proposed approach optimises recovery time and performs efficiently in applications with delay constraints.
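The trade-off described above can be illustrated with the classical fixed-cost baseline that the abstract contrasts itself with. The sketch below is not the paper's performance or recovery model. It uses Young's first-order approximation, which assumes a constant checkpoint cost and a constant mean time between failures (MTBF). The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal interval: sqrt(2 * C * MTBF).

    Assumes the checkpoint cost C and the MTBF are fixed, which is exactly
    the assumption the abstract says breaks down under varying workload.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order overhead per unit of run time.

    C / T is the share of time spent checkpointing;
    T / (2 * MTBF) is the expected share of work redone after a failure.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Hypothetical values: 2 s per checkpoint, one failure per hour on average.
cost_s = 2.0
mtbf_s = 3600.0
best_s = young_interval(cost_s, mtbf_s)

for interval_s in (best_s / 2, best_s, best_s * 2):
    share = overhead_fraction(interval_s, cost_s, mtbf_s)
    print(f"interval={interval_s:7.1f} s  overhead={share:.4f}")
```

Under these assumptions, halving or doubling the interval raises the estimated overhead relative to the computed optimum, which is the shorter-versus-longer tension the abstract describes. When checkpoint cost and recovery time vary with load, as the paper argues, a fixed-cost formula like this one is only a starting point.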




Updated: 2021-01-06