Abstract
Distributed stream processing systems (DSPSs) are widely employed to process ever-expanding volumes of real-time data. DSPSs are highly susceptible to system failure, so fault tolerance is a major problem that is receiving a lot of attention. Flink is a popular stream computing framework that implements a lightweight, asynchronous checkpoint technique based on the barrier mechanism to keep data analysis efficient. In a checkpoint-based fault-tolerance mechanism, a shorter checkpoint interval increases the runtime cost of a streaming application, while a longer one increases the time needed to recover from a failure. Selecting an optimal checkpoint interval is therefore critical to the efficiency of streaming applications. Traditional approaches to choosing the optimal checkpoint interval usually assume that the checkpointing delay and the fault recovery time are fixed. However, both factors are strongly related to the intensity of the application's workload. To obtain a better checkpoint interval under different workload intensities, this paper proposes a performance model to estimate the tuple processing latency and a recovery model to estimate the fault recovery time. With these two models, an optimal checkpoint interval can be derived. The models and the interval optimisation are verified experimentally on Flink. The results show that the proposed model can recommend an optimal checkpoint interval according to system reliability indicators. The proposed approach optimises recovery time and performs efficiently in applications with delay constraints.
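The trade-off described above can be illustrated with the classic fixed-cost baseline that the abstract contrasts itself with. The sketch below is not the paper's performance or recovery model; it uses Young's well-known first-order approximation, T = sqrt(2 · C · MTBF), where C is the checkpoint cost and MTBF is the mean time between failures. The function names and the numeric values are assumptions chosen purely for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    Assumes a fixed checkpoint cost and a fixed mean time between failures,
    which is exactly the assumption the abstract argues is too coarse.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order overhead fraction for a given interval.

    First term: time spent checkpointing per unit of work (grows as the
    interval shrinks). Second term: expected rework after a failure, about
    half an interval per failure (grows as the interval lengthens).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Illustrative values only (assumptions, not measurements from the paper).
MTBF_S = 3600.0
light_load_cost_s = 2.0   # hypothetical checkpoint cost under light workload
heavy_load_cost_s = 8.0   # hypothetical checkpoint cost under heavy workload

light_interval = young_interval(light_load_cost_s, MTBF_S)
heavy_interval = young_interval(heavy_load_cost_s, MTBF_S)
print(f"light load: {light_interval:.0f} s, heavy load: {heavy_interval:.0f} s")
```

Even in this simplified setting the recommended interval shifts when the checkpoint cost changes, which is why a workload-dependent estimate of that cost matters. In a Flink job, an interval chosen this way would typically be supplied in milliseconds to `StreamExecutionEnvironment.enableCheckpointing`.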
Cite this article
Zhang, Z., Li, W., Qing, X. et al. Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications. Mobile Netw Appl 26, 1950–1959 (2021). https://doi.org/10.1007/s11036-020-01729-7
DOI: https://doi.org/10.1007/s11036-020-01729-7