Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications

Mobile Networks and Applications

Abstract

Distributed stream processing systems (DSPSs) are widely used to process ever-growing volumes of real-time data. DSPSs are highly susceptible to system failures, so fault tolerance is a major concern that is attracting considerable attention. Flink is a popular stream computing framework that implements a lightweight, asynchronous checkpointing technique based on the barrier mechanism to keep data analysis efficient. In a checkpoint-based fault-tolerance mechanism, a shorter checkpoint interval increases the runtime cost of a streaming application, while a longer one increases the time needed to recover from a failure. Selecting an optimal checkpoint interval is therefore critical to the efficiency of streaming applications. Traditional methods for choosing the optimal checkpoint interval usually assume that the checkpointing delay and the fault recovery time are fixed. However, both factors depend strongly on the intensity of the application's workload. To obtain a better checkpoint interval under different workload intensities, this paper proposes a performance model that estimates tuple processing latency and a recovery model that estimates fault recovery time. With these two models, an optimal checkpoint interval can be derived. The models and the interval optimisation method are verified experimentally on Flink. The results show that the proposed model can recommend an optimal checkpoint interval according to system reliability indicators. The proposed approach optimises recovery time and performs efficiently in applications with delay constraints.
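
The trade-off described in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's performance or recovery models, which are not given in the abstract. Instead it uses Young's classic first-order approximation, one of the traditional fixed-cost approaches the abstract contrasts itself with, and simply re-evaluates it for different checkpoint costs to show why a workload-dependent cost shifts the optimum. The function names, the checkpoint costs, and the mean time between failures (MTBF) below are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum: T_opt = sqrt(2 * C * MTBF).

    C is the time to take one checkpoint; MTBF is the mean time between
    failures. Both are assumed to be constant, which is exactly the
    simplification the abstract criticises.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float,
                      mtbf_s: float) -> float:
    """First-order overhead fraction: C / T + T / (2 * MTBF).

    The first term is the runtime cost of checkpointing (grows as the
    interval shrinks); the second is the expected rework after a failure
    (grows as the interval lengthens).
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    mtbf_s = 10_000.0  # hypothetical MTBF in seconds
    # Hypothetical checkpoint costs under light and heavy workloads.
    for label, cost_s in [("light load", 2.0), ("heavy load", 8.0)]:
        t_opt = young_interval(cost_s, mtbf_s)
        overhead = expected_overhead(t_opt, cost_s, mtbf_s)
        print(f"{label}: interval = {t_opt:.0f} s, overhead = {overhead:.3f}")
```

With these assumed inputs, a checkpoint cost of 2 s gives an interval of 200 s and a cost of 8 s gives 400 s, so an interval tuned for one workload intensity is not optimal for another. A chosen value would then be passed to Flink, for example through `StreamExecutionEnvironment.enableCheckpointing`, which takes the interval in milliseconds.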

Author information

Corresponding author

Correspondence to Wenhao Li.

About this article

Cite this article

Zhang, Z., Li, W., Qing, X. et al. Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications. Mobile Netw Appl 26, 1950–1959 (2021). https://doi.org/10.1007/s11036-020-01729-7

  • DOI: https://doi.org/10.1007/s11036-020-01729-7
