Abstract
A fault management system comprises managers that detect faults and initiate recovery actions. Such systems are often distributed and decoupled from the applications they manage. Although this arrangement promotes scalability, it makes application recovery dependent on the fault management system itself. This work introduces two novel equations to help meet the performance objectives of applications. First, we derive an equation that estimates the maximum number of jobs an application instance can handle while meeting a given performance objective. An admission control mechanism then uses this equation to restrict the number of jobs (targeted at operational application instances) allowed to enter the system. Second, we derive an equation that computes the response time distribution of an application. We then develop a simulation model that predicts the impact of failures in four sample fault management architectures on an application's performance. Using our equations, we compare the architectures under three distinct ways of handling affected jobs when application instances fail: allowing job loss; retrying jobs, which results in overload; and employing admission control to mitigate the overload. Our simulation results show that increasing the number of managers is not always beneficial; rather, the interconnection topology of the management architecture (i.e., the layout of interconnects linking the architectural components), together with the model parameter values, may sometimes play a bigger role in the application's performance.
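As a rough illustration of how a response time distribution can drive admission control, the sketch below uses a textbook M/M/1 first-come-first-served queue. This is an assumed stand-in model, not the equations derived in the paper, and all function and parameter names are hypothetical. For M/M/1, the response time is exponentially distributed with rate (service rate minus arrival rate), so the largest arrival rate that still satisfies P(T <= deadline) >= target_prob follows in closed form.

```python
import math


def mm1_response_time_cdf(t, arrival_rate, service_rate):
    """P(T <= t) for an M/M/1 FCFS queue (assumed model, not the paper's)."""
    if t <= 0 or arrival_rate >= service_rate:
        # No mass at or below zero; an unstable queue never meets a deadline.
        return 0.0
    return 1.0 - math.exp(-(service_rate - arrival_rate) * t)


def max_admissible_rate(service_rate, deadline, target_prob):
    """Largest arrival rate for which P(T <= deadline) >= target_prob.

    Solves 1 - exp(-(mu - lam) * d) >= p for lam, giving
    lam <= mu + ln(1 - p) / d. Clamped at zero when the objective
    cannot be met even by an otherwise idle instance.
    """
    limit = service_rate + math.log(1.0 - target_prob) / deadline
    return max(0.0, limit)


def admit(offered_rate, service_rate, deadline, target_prob):
    """Hypothetical admission check: accept only while the objective holds."""
    return offered_rate <= max_admissible_rate(service_rate, deadline, target_prob)


if __name__ == "__main__":
    # Illustrative values: 10 jobs/s capacity, 95% of jobs within 1 s.
    cap = max_admissible_rate(service_rate=10.0, deadline=1.0, target_prob=0.95)
    print(f"max admissible arrival rate: {cap:.3f} jobs/s")
    print("admit at 6 jobs/s:", admit(6.0, 10.0, 1.0, 0.95))
    print("admit at 9 jobs/s:", admit(9.0, 10.0, 1.0, 0.95))
```

In this simplified setting the admission limit is an arrival rate rather than a job count; retried jobs from a failed instance would raise the offered rate seen by the surviving instances, which is where such a check would reject the excess.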
Acknowledgements
We thank the editors and anonymous reviewers for their valuable comments and suggestions, which helped improve this paper. We also thank NSERC Canada for its financial support.
Cite this article
Zamani, G., Das, O. Fault-Detection Managers: More May Not Be the Merrier. J Grid Computing 19, 6 (2021). https://doi.org/10.1007/s10723-021-09546-2
DOI: https://doi.org/10.1007/s10723-021-09546-2