Abstract
The growing number of companies migrating their IT infrastructure to cloud environments has motivated many studies on distributed backup strategies that improve the availability of these companies’ systems. In this scenario, it is essential to study mechanisms that evaluate network conditions in order to minimize transmission time and thereby improve system availability. The goal of this study is to build models to evaluate the availability of services running in cloud data center infrastructure, emphasizing the impact of throughput variation on data redundancy and, consequently, on service availability. Based on this evaluation, this research proposes smart models that can be deployed in each data center of a distributed arrangement of data centers to help the system administrator choose the best data center to restore the services of a faulty one. To analyze the impact of network throughput on service availability, we gathered the MTTF and MTTR metrics of the data center’s components and services, generated a reliability block diagram (RBD) to obtain the MTTF of the system as a whole, and developed a formalism to model the network component. Based on the results, we built a stochastic Petri net (SPN) model to represent the system and obtain its availability under many network conditions. We then analyzed these availability results to discuss how network conditions affect the system. A plot of the annual downtime under each network condition shows that network conditions have a substantial impact on the system’s availability. Using the models developed to study system availability, we built smart agents capable of predicting the transfer time of a bulk of data and, with that prediction, choosing the data center with the best network conditions to restore the services of a faulty one.
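The relationship between MTTF, MTTR, throughput, and annual downtime described above can be illustrated with a minimal Python sketch. It is not the SPN model of the paper: it assumes the standard steady-state formula A = MTTF / (MTTF + MTTR), a purely series block arrangement, decimal units (1 GB = 8000 megabits), and the simplification that the data transfer time is added directly to the repair time. All function names, data center names, and numeric values are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def availability(mttf_h, mttr_h):
    """Steady-state availability from MTTF and MTTR (both in hours)."""
    return mttf_h / (mttf_h + mttr_h)


def series_availability(avails):
    """Availability of blocks in series: the product of block availabilities."""
    result = 1.0
    for a in avails:
        result *= a
    return result


def annual_downtime_hours(a):
    """Expected downtime per year, in hours, for availability a."""
    return (1.0 - a) * HOURS_PER_YEAR


def transfer_time_hours(data_gb, throughput_mbps):
    """Time to move data_gb gigabytes at throughput_mbps megabits per second."""
    seconds = data_gb * 8000.0 / throughput_mbps  # 1 GB = 8000 megabits
    return seconds / 3600.0


def availability_with_restore(mttf_h, base_mttr_h, data_gb, throughput_mbps):
    """Assumption: the transfer time simply extends the repair time."""
    mttr_h = base_mttr_h + transfer_time_hours(data_gb, throughput_mbps)
    return availability(mttf_h, mttr_h)


def best_data_center(data_gb, throughput_by_dc):
    """Pick the data center with the shortest predicted transfer time."""
    return min(
        throughput_by_dc,
        key=lambda dc: transfer_time_hours(data_gb, throughput_by_dc[dc]),
    )


if __name__ == "__main__":
    # Hypothetical measured throughputs (Mbit/s) toward two candidate sites.
    candidates = {"dc_a": 100.0, "dc_b": 250.0}
    for name, mbps in candidates.items():
        a = availability_with_restore(1000.0, 2.0, 450.0, mbps)
        print(name, round(a, 6), round(annual_downtime_hours(a), 2))
    print("restore from:", best_data_center(450.0, candidates))
```

In this sketch a lower throughput lengthens the effective repair time, which lowers availability and raises annual downtime; the agents described in the abstract replace the fixed throughput values with predicted transfer times.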
Cite this article
Lima, P.A., Neto, A.S.B. & Maciel, P. Data centers’ services restoration based on the decision-making of distributed agents. Telecommun Syst 74, 367–378 (2020). https://doi.org/10.1007/s11235-020-00660-2
DOI: https://doi.org/10.1007/s11235-020-00660-2