Data centers’ services restoration based on the decision-making of distributed agents

Telecommunication Systems

Abstract

The increasing number of companies migrating their IT infrastructure to cloud environments has motivated many studies on distributed backup strategies to improve the availability of these companies’ systems. In this scenario, it is essential to study mechanisms that evaluate network conditions in order to minimize transmission time and thereby improve system availability. The goal of this study is to build models to evaluate the availability of services running in cloud data center infrastructure, emphasizing the impact of throughput variation on data redundancy and, consequently, on service availability. Based on these models, this research proposes smart models that can be deployed in each data center of a distributed arrangement of data centers to help the system administrator choose the best data center to restore the services of a faulty one. To analyze the impact of network throughput on service availability, we gathered the MTTF and MTTR metrics of the data center’s components and services, generated a reliability block diagram (RBD) to obtain the MTTF of the system as a whole, and developed a formalism to model the network component. Based on the results, we built a stochastic Petri net (SPN) model to represent the system and obtain its availability under many network conditions. We then analyze the availability of the system to discuss the impact of network conditions on it. A plot of the annual downtime under these conditions shows the enormous impact of network conditions on the system’s availability. Using the models developed to study system availability, we developed smart agents capable of predicting the transfer time of a bulk of data and, with that prediction, choosing the data center with the best network conditions to restore the services of a faulty one.
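The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' RBD, SPN, or agent implementation. It assumes exponentially distributed failure times, a purely series block diagram, and a naive transfer-time estimate (data size divided by measured throughput) in place of the learned predictor; all function names, site names, and numbers are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def steady_state_availability(mttf_h, mttr_h):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)


def series_mttf(component_mttfs_h):
    """MTTF of components in series, ASSUMING exponential failure times:
    the failure rates (1 / MTTF) add up."""
    return 1.0 / sum(1.0 / m for m in component_mttfs_h)


def annual_downtime_hours(availability):
    """Expected hours of downtime in one year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def transfer_time_s(data_bytes, throughput_bps):
    """Naive transfer-time estimate in seconds (ASSUMPTION: stands in for
    the paper's predictor, whose details are not given in the abstract)."""
    return data_bytes * 8 / throughput_bps


def choose_restore_site(data_bytes, throughput_bps_by_site):
    """Pick the candidate data center with the smallest estimated transfer time."""
    return min(
        throughput_bps_by_site,
        key=lambda site: transfer_time_s(data_bytes, throughput_bps_by_site[site]),
    )


if __name__ == "__main__":
    # Hypothetical component MTTFs (hours) and a hypothetical repair time.
    system_mttf = series_mttf([1000.0, 1000.0])
    availability = steady_state_availability(system_mttf, 1.0)
    print(f"system MTTF: {system_mttf:.1f} h")
    print(f"availability: {availability:.6f}")
    print(f"annual downtime: {annual_downtime_hours(availability):.2f} h")

    # Hypothetical measured throughputs (bits per second) to two candidate sites.
    candidates = {"dc_a": 1e8, "dc_b": 5e8}
    print("restore from:", choose_restore_site(10**9, candidates))
```

In this simplified view, a lower throughput lengthens the time needed to move backup data and therefore the effective repair time, which is how network conditions feed into availability and annual downtime.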

Author information

Corresponding author

Correspondence to Antônio Sá Barreto Neto.

About this article

Cite this article

Lima, P.A., Neto, A.S.B. & Maciel, P. Data centers’ services restoration based on the decision-making of distributed agents. Telecommun Syst 74, 367–378 (2020). https://doi.org/10.1007/s11235-020-00660-2

  • DOI: https://doi.org/10.1007/s11235-020-00660-2
