Modelling and verification of reconfigurable fault-tolerant and self-recovering systems in hybrid Clouds

https://doi.org/10.1016/j.simpat.2021.102331

Highlights

  • A verification methodology for self-recovery strategies in hybrid architectures.

  • Our methodology is based on a formal model built as a network of components.

  • Our methodology is applied to the analysis of different recovery strategies.

  • Using our verification methodology, stochastic analysis is also performed.

Abstract

Hybrid Cloud environments combine local resources in private Clouds with resources from public Clouds when needed. Such environments have high failure rates because they feature heterogeneous components and a large number of servers with intensive workloads, built as complex architectures. For these reasons, the availability of such systems can easily be compromised if the failure of these heterogeneous components is not handled correctly, which may cause request rejection and frequent performance degradation. Providing highly reliable Cloud applications, in particular in a hybrid Cloud environment, is a challenging and critical research problem. Therefore, the question we address in this paper is how to provision resources to user requests in the presence of failures in a hybrid Cloud environment. To this end, we propose a reconfigurable formal model of the hybrid Cloud architecture; we then use instantiations of this model, simulation and real-time execution runs to estimate different performance metrics related to fault detection and self-recovery strategies in hybrid Clouds. Our approach combines the model-based and the probabilistic approaches.

Introduction

Cloud computing is widely used to provide services following the pay-as-you-go approach, in which computing, storage, and networking resources are used over the Internet. It represents an architecture allowing convenient, on-demand network access to a shared pool of computing resources [1]. Resource autonomy, rapid elasticity and always-on availability are the primary characteristics of Cloud computing [2]. Generally, Cloud computing can be private, public or hybrid. Public Clouds provide shared resources through large-scale data centers, hosting a very large number of servers and storage systems. In contrast, private Clouds provide users with a private and flexible infrastructure to run workloads within their own domain. The hybrid Cloud is introduced as a combination of private and public infrastructures [3]. A hybrid Cloud significantly benefits its owner in terms of availability, reliability and cost reduction [4]. Such a combination results in a heterogeneous and complex architecture, so highly adaptive and dynamic behavior is needed to manage this heterogeneity. As a result, hybrid Clouds are always prone to faults and failures.

Indeed, hybrid Clouds have to use resources from both types of Clouds in an optimized way and consolidate a large and varied set of resources such as processors, memory units, disk drives and networking devices. Any system running applications with such a heterogeneous and intensive workload may be vulnerable to different types of failures. Failures in hybrid Clouds interrupt the normal delivery of services and adversely affect different performance metrics such as Quality of Service (QoS), availability, reliability and energy waste [5]. Moreover, improper handling of system failures may lead the system to an unworkable state [6].

There are different types of failures that may affect the reliability of hybrid Cloud services, including missing computing resources, software failure, hardware failure, and network failure [7], [8]. Therefore, we propose an approach that considers different types of failures, including software and hardware failures. A system is considered fault tolerant if it is capable of continuing to serve its intended requests, even in the presence of failures [9]. Fault tolerance is among the most important issues in delivering reliable Cloud services. Indeed, without a fault tolerance capability, even a well-designed system with the best components and services cannot be considered reliable [10].

Although several failure analyses for hybrid cloud environments have been proposed in the literature, there is no formal treatment that specifies when or how failure-aware policies in a hybrid cloud architecture satisfy functional and non-functional properties. Therefore, it is very important to formally specify various properties and check whether the system satisfies those properties under various combinations of failure scenarios. In this context, a key contribution of this paper is to integrate failure detection strategies in hybrid clouds into a formal, yet realistic, model that permits simulation and analysis at early stages of design. For this purpose, we propose a new formal model for hybrid cloud architecture along with the integration of different failure detection and self-recovery strategies. A further contribution is to show how awareness of the impact of different types of failures in hybrid cloud environments helps in analyzing and designing more efficient protocols. In particular, this paper makes the following contributions:

  • A formal model for hybrid cloud environments which is based on the use of a component-based framework that has proved suitable for modelling and analyzing distributed systems. This model is heterogeneous and scalable, which makes it suitable for running real-life configurations.

  • The model integrates non-functional aspects alongside the functional system behavior, which achieves a separation of concerns between the functional and non-functional aspects of failure detection and recovery in hybrid clouds.

  • The model is reconfigurable: reconfiguration consists in modifying the system recovery strategy, and thus the system behavior, to adapt it to changes in the failure type and rate. Three different case studies are presented, allowing the analysis of different recovery strategies in the context of hybrid clouds.

  • The model makes it possible to combine dynamic and static analysis to validate the model in a novel way. In particular, we have performed stochastic analysis related to failure rate distributions.
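
To give a feel for what stochastic analysis over a failure rate distribution can look like, the following is a minimal Monte Carlo sketch. It is not the authors' SBIP model: the exponential failure distribution, the property being checked ("no failure occurs within a time horizon") and the names `no_failure_within` and `estimate_probability` are all assumptions made for illustration.

```python
import math
import random


def no_failure_within(horizon, failure_rate, rng):
    """Simulate one random trace of a single (hypothetical) server.

    Returns True if the first failure happens after `horizon`.
    Assumption: times to failure are exponentially distributed.
    """
    if failure_rate == 0:
        return True  # a component that never fails trivially satisfies the property
    time_to_failure = rng.expovariate(failure_rate)
    return time_to_failure > horizon


def estimate_probability(horizon, failure_rate, runs=20000, seed=0):
    """Estimate P(no failure within horizon) as the fraction of satisfying traces."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = sum(no_failure_within(horizon, failure_rate, rng) for _ in range(runs))
    return hits / runs


if __name__ == "__main__":
    horizon, rate = 10.0, 0.05  # illustrative values, not taken from the paper
    estimate = estimate_probability(horizon, rate)
    exact = math.exp(-rate * horizon)  # closed form for the exponential case
    print(f"estimated={estimate:.3f} exact={exact:.3f}")
```

For this toy property the exact value is known in closed form, which makes the estimate easy to sanity-check. For properties of a full component model no closed form is usually available, and sampling traces becomes the practical route.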

The rest of the paper is organized as follows. In Section 2, we describe related work and present our main contributions. In Section 3, the background and preliminaries are presented. In Section 4, we give a detailed description of the architecture of our proposed model as well as the behavior of its different components. Section 5 presents the details of three case studies resulting from the instantiation of our model with different recovery strategies. A detailed performance evaluation and analysis of the requirements and properties of the case studies is also discussed in Section 5. Finally, we summarize our findings and present future directions in Section 6.

Section snippets

Related work

The increasing complexity of large-scale systems, together with the decreasing time to market, has forced designers to consider more elaborate strategies and methods for system design. To address such challenges, system-level design is one of the most widely used approaches. Such an approach is mainly based on the concept of high-level modelling, that is, capturing the system functionality at a high level of abstraction. Such high-level models are usually easy to elaborate and enable fast design, which

Preliminaries: BIP and SBIP frameworks

In this section, we provide a brief overview of the modelling and specification formalism used in this paper, namely the BIP (Behavior, Interaction, Priority) framework [44] and its stochastic version SBIP [40].

The BIP framework supports a methodology for building systems from atomic components (Definition 1). It uses connectors (Definition 5) to specify possible interactions (Definition 4) between components, and priorities to select among possible interactions. In SBIP, atomic
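
The interplay of atomic components, interactions and priorities can be sketched in a few lines of Python. This is an illustration only, not BIP syntax or the BIP toolset API. The `Atomic` and `Interaction` classes, the numeric priority ranks (with ties broken by list order) and the server/monitor example are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Atomic:
    """An atomic component: a transition system whose transitions are labeled by ports."""
    name: str
    state: str
    transitions: dict  # maps (state, port) -> next state

    def enabled(self, port):
        return (self.state, port) in self.transitions

    def fire(self, port):
        self.state = self.transitions[(self.state, port)]


@dataclass
class Interaction:
    """A set of ports, one per participating component, that must fire together."""
    name: str
    ports: dict  # maps component name -> port
    priority: int = 0  # assumption: a numeric rank stands in for priority rules


def enabled_interactions(components, interactions):
    """An interaction is enabled when every participating port is enabled."""
    return [i for i in interactions
            if all(components[c].enabled(p) for c, p in i.ports.items())]


def step(components, interactions):
    """Fire the highest-priority enabled interaction; return its name, or None."""
    ready = enabled_interactions(components, interactions)
    if not ready:
        return None
    chosen = max(ready, key=lambda i: i.priority)  # ties: first in list order
    for comp_name, port in chosen.ports.items():
        components[comp_name].fire(port)
    return chosen.name


def build_example():
    """Hypothetical two-component system: a server and a failure monitor."""
    server = Atomic("server", "up", {("up", "serve"): "up",
                                     ("up", "fail"): "down",
                                     ("down", "recover"): "up"})
    monitor = Atomic("monitor", "idle", {("idle", "detect"): "alarm",
                                         ("alarm", "repair"): "idle"})
    components = {"server": server, "monitor": monitor}
    interactions = [
        Interaction("serve", {"server": "serve"}),
        Interaction("crash", {"server": "fail", "monitor": "detect"}),
        Interaction("repair", {"server": "recover", "monitor": "repair"}, priority=1),
    ]
    return components, interactions
```

In the example, `crash` synchronizes the server's `fail` port with the monitor's `detect` port, so a failure and its detection happen as one atomic step. That is the kind of coordination connectors express.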

A formal model for hybrid clouds architecture

In this section, we detail the first phase of our approach (see Section 2.2), by describing how our formal generic model of the hybrid Cloud architecture is designed and checked as a component-based system. Components and their composition are defined with respect to the formal semantics provided by the BIP framework [44] and its statistical version SBIP [40]. Our model is built as a superposition of three layers (see Fig. 2), namely: a failure management layer, a hybrid Cloud architecture layer

Model instantiations and performance analysis

In this section, to show the applicability of our approach, we present formal verification, stochastic analysis, simulations and experimental results. To this end, we study three different case studies, showing how our model can be parameterized and instantiated to implement different recovery strategies in hybrid Cloud architectures. As already explained in Section 2.2 (see Fig. 1), implementing a given recovery strategy in our model consists of defining the corresponding

Conclusion and future works

In this paper, we have proposed a formal generic model describing a hybrid Cloud architecture. The proposed model allows the specification of different recovery strategies in hybrid Cloud environments. Furthermore, it makes it possible to define reconfigurable behaviors and thus to model complex recovery strategies (hybrid strategies). The purpose of our approach is to verify, analyze and study different recovery strategies based on three steps, including design, verification and

References (48)

  • Smara M. et al. Acceptance test for fault detection in component-based cloud computing and systems. Future Gener. Comput. Syst. (2017)

  • Armbrust M. et al. A view of cloud computing. Commun. ACM (2010)

  • Grobauer B. et al. Understanding cloud computing vulnerabilities. IEEE Secur. Priv. (2010)

  • Grozev N. et al. Inter-cloud architectures and application brokering: taxonomy and survey. Softw. - Pract. Exp. (2014)

  • Wang T. et al. Fd4c: Automatic fault diagnosis framework for web applications in cloud computing. IEEE Trans. Syst. Man Cybern.: Syst. (2015)

  • Sun D. et al. Modelling and evaluating a high serviceability fault tolerance strategy in cloud computing environments. Int. J. Secur. Netw. (2012)

  • Dai Y.-S. et al. Cloud service reliability: Modeling and analysis

  • Garraghan P. et al. An empirical failure-analysis of a large-scale cloud computing environment

  • Faragardi H.R. et al. An analytical model to evaluate reliability of cloud computing systems in the presence of QoS requirements

  • Fortino G. et al. Modeling and simulating internet-of-things systems: A hybrid agent-oriented approach. Comput. Sci. Eng. (2017)

  • Subramanian A.S.R. et al. Modeling and simulation of energy systems: A review. Processes (2018)

  • Pllana S. et al. Hybrid performance modeling and prediction of large-scale computing systems

  • Hafaiedh I.B. et al. A parameterized formal model for the analysis of preemption-threshold scheduling in real-time systems. IEEE Access (2020)

  • Schroeder B. et al. A large-scale study of failures in high-performance computing systems. IEEE Trans. Dependable Secure Comput. (2009)