Fault-Detection Managers: More May Not Be the Merrier

Zamani, Ghazal; Das, Olivia

doi:10.1007/s10723-021-09546-2

Published: 20 February 2021

Volume 19, article number 6, (2021)

Journal of Grid Computing

Abstract

A fault management system contains managers that detect faults and initiate recovery actions. Such systems are often built on an architecture that is both distributed and decoupled from the applications. Although this arrangement promotes scalability, it makes the recovery of applications dependent on the fault management system itself. This work introduces two novel equations to help meet the performance objectives of applications. We first derive an equation that estimates the maximum number of jobs an application instance can handle while meeting a given performance objective. An admission control mechanism then uses this formula to restrict the number of jobs (targeted at operational application instances) allowed to enter the system. Next, we derive a second equation that computes the response time distribution of an application. We then develop a simulation model that predicts the impact of the failure of four sample fault management architectures on an application’s performance. Using our equations, we compare the architectures under three distinct ways of handling affected jobs when application instances fail: allowing job loss; retrying jobs, which results in overload; and employing admission control to mitigate the overload. Our simulation results show that increasing the number of managers may not always be beneficial; rather, the interconnection topology of the management architecture (i.e. the layout of interconnects linking the architectural components), together with the model parameter values, may sometimes play a bigger role in the application’s performance.
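
The abstract does not reproduce the two equations, so the following is only an illustrative sketch of the general idea, not the authors' formulas. It assumes each application instance behaves like a single M/M/1 queue (Poisson arrivals, exponential service, first-come first-served), for which the steady-state response time is exponentially distributed with rate equal to the service rate minus the arrival rate. The function names, the parameters, and the choice of bounding the arrival rate (rather than a job count) are all hypothetical.

```python
import math


def response_time_cdf(arrival_rate, service_rate, t):
    """P(response time <= t) for a stable M/M/1 queue (assumed model).

    For an unstable queue (arrival_rate >= service_rate) no steady state
    exists; returning 0.0 is a convention chosen for this sketch.
    """
    if arrival_rate >= service_rate:
        return 0.0
    return 1.0 - math.exp(-(service_rate - arrival_rate) * t)


def max_arrival_rate(service_rate, deadline, target_prob):
    """Largest arrival rate with P(response time <= deadline) >= target_prob.

    Obtained by solving 1 - exp(-(mu - lam) * d) >= p for lam.
    target_prob must lie strictly between 0 and 1.
    """
    limit = service_rate + math.log(1.0 - target_prob) / deadline
    return max(limit, 0.0)  # 0.0 means the objective cannot be met at all


def admit(arrival_rate, service_rate, deadline, target_prob):
    """Admission check: accept the offered load only if the objective holds."""
    return arrival_rate <= max_arrival_rate(service_rate, deadline, target_prob)


if __name__ == "__main__":
    # Hypothetical numbers: 10 jobs/s service rate, 95% of jobs within 0.5 s.
    cap = max_arrival_rate(10.0, 0.5, 0.95)
    print(f"admissible arrival rate: {cap:.3f} jobs/s")
    print("admit 3 jobs/s:", admit(3.0, 10.0, 0.5, 0.95))
    print("admit 5 jobs/s:", admit(5.0, 10.0, 0.5, 0.95))
```

Under these assumptions, retried jobs that push the offered load above the cap would be rejected by `admit`, which is the role the abstract assigns to admission control in mitigating overload.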

Acknowledgements

We would like to thank the editors and anonymous reviewers for their valuable comments and suggestions, which helped improve our research paper. We would also like to thank NSERC Canada for their financial support.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Ryerson University, 350 Victoria Street, Toronto, Ontario, Canada
Ghazal Zamani & Olivia Das

Corresponding author

Correspondence to Olivia Das.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zamani, G., Das, O. Fault-Detection Managers: More May Not Be the Merrier. J Grid Computing 19, 6 (2021). https://doi.org/10.1007/s10723-021-09546-2

Received: 27 August 2019
Accepted: 10 December 2020
Published: 20 February 2021
DOI: https://doi.org/10.1007/s10723-021-09546-2
