Modelling and verification of reconfigurable fault-tolerant and self-recovering systems in hybrid Clouds
Introduction
Cloud computing is widely used to deliver services under a pay-as-you-go model in which computing, storage, and networking resources are consumed over the Internet. It represents an architecture allowing convenient, on-demand network access to a shared pool of computing resources [1]. Resource autonomy, rapid elasticity and always-on availability are the primary characteristics of Cloud computing [2]. Generally, a Cloud can be private, public or hybrid. Public Clouds provide shared resources through large-scale data centers hosting a very large number of servers and storage systems. Private Clouds, in contrast, provide users with a private and flexible infrastructure to run workloads within their own domain. The hybrid Cloud is introduced as a combination of private and public infrastructures [3]. A hybrid Cloud significantly benefits its owner in terms of availability, reliability and cost reduction [4]. Such a combination results in a heterogeneous and complex architecture, which requires highly adaptive and dynamic behavior to manage this heterogeneity. As a result, hybrid Clouds are prone to faults and failures.
Indeed, hybrid Clouds have to use resources from both types of Clouds in an optimized way and consolidate a large and varied set of resources such as processors, memory units, disk drives, and networking devices. Any system running applications with such a heterogeneous and intensive workload is vulnerable to different types of failures. Failures in hybrid Clouds interrupt the normal delivery of services, degrade performance metrics such as Quality of Service (QoS), availability and reliability, and increase energy waste [5]. Moreover, improper handling of system failures may lead the system to an unworkable state [6].
Different types of failures may affect the reliability of hybrid Cloud services, including missing computing resources, software failures, hardware failures, and network failures [7], [8]. We therefore propose an approach that considers several types of failures, including software and hardware failures. A system is considered fault tolerant if it is able to keep serving its intended requests even in the presence of failures [9]. Fault tolerance is among the most important issues in delivering reliable Cloud services. Indeed, without a fault tolerance capability, even a well-designed system built from the best components and services cannot be considered reliable [10].
Although several failure analyses for hybrid cloud environments have been proposed in the literature, there is no formal treatment that specifies when or how failure-aware policies in a hybrid cloud architecture satisfy given functional and non-functional properties. It is therefore important to formally specify such properties and to check whether the system satisfies them under various combinations of failure scenarios. In this context, a key contribution of this paper is to integrate failure detection strategies for hybrid clouds into a formal, yet realistic, model that permits simulation and analysis at early design stages. For this purpose, we propose a new formal model for hybrid cloud architectures that integrates different failure detection and self-recovering resource strategies. We also show how awareness of the impact of different types of failures in hybrid cloud environments helps in analyzing and designing more efficient protocols. In particular, this paper makes the following contributions:
- A formal model for hybrid cloud environments, based on a component-based framework that has proved suitable for modelling and analyzing distributed systems. This model is heterogeneous and scalable, which makes it suitable for running real-life configurations.
- The model integrates non-functional aspects alongside the functional system behavior, which achieves a separation of concerns between the functional and non-functional aspects of failure detection and recovery in hybrid clouds.
- The model is reconfigurable: reconfiguration consists in modifying the system's recovery strategy, and thus the system behavior, to adapt it to changes in failure type and rate. Three different case studies are presented, allowing the analysis of different recovery strategies in the context of hybrid clouds.
- The model allows dynamic and static analysis to be combined to validate the model in a novel way. In particular, we have performed stochastic analysis related to failure rate distributions.
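To give a flavour of what stochastic analysis over a failure rate distribution can look like, the following Python sketch estimates by simulation the probability that a job finishes before the first failure, and compares it with the closed-form value. It is only an illustration under stated assumptions: the exponential failure model, the function name `estimate_survival`, and the values of `rate` and `duration` are hypothetical and are not taken from the model presented in this paper.

```python
import math
import random


def estimate_survival(rate, duration, runs=20000, seed=0):
    """Monte Carlo estimate of P(no failure occurs during `duration`).

    Assumes (hypothetically) that the time to first failure follows an
    exponential distribution with the given failure rate.
    """
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    survived = 0
    for _ in range(runs):
        # Draw the time to first failure for one simulated execution.
        time_to_failure = rng.expovariate(rate)
        if time_to_failure > duration:
            survived += 1
    return survived / runs


rate = 0.1      # hypothetical failure rate (failures per hour)
duration = 5.0  # hypothetical job length (hours)

estimate = estimate_survival(rate, duration)
exact = math.exp(-rate * duration)  # closed form for the exponential model
print(f"estimated={estimate:.3f} exact={exact:.3f}")
```

Under this simple model the simulated estimate should approach the closed-form value as the number of runs grows. Other failure distributions can be explored by replacing the sampling line.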
The rest of the paper is organized as follows. Section 2 describes related work and positions our main contributions. Section 3 presents the background and preliminaries. Section 4 gives a detailed description of the architecture of our proposed model and of the behavior of its components. Section 5 presents three case studies resulting from the instantiation of our model with different recovery strategies, together with a detailed performance evaluation and an analysis of the requirements and properties of these case studies. Finally, Section 6 summarizes our findings and presents future directions.
Related work
The increasing complexity of large-scale systems, together with the decreasing time to market, has forced designers to consider more elaborate strategies and methods for system design. To address such challenges, system-level design is one of the most widely used approaches. Such an approach is mainly based on the concept of high-level modelling, that is, capturing the system functionality at a high level of abstraction. Such high-level models are usually easy to elaborate and enable fast design, which
Preliminaries: BIP and SBIP frameworks
In this section, we provide a brief overview of the modelling and specification formalisms used in this paper: the BIP (Behavior, Interaction, Priority) framework [44] and its stochastic version SBIP [40].
The BIP framework supports a methodology for building systems from atomic components (Definition 1). It uses connectors (Definition 5) to specify possible interactions (Definition 4) between components, and priorities to select among possible interactions. In SBIP, atomic
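As a rough illustration of how atomic components, interactions, and priorities fit together in a BIP-style composition, the following Python sketch models components as small automata whose transitions are labelled by ports. It is a simplified sketch, not the BIP tool's language or API: the names `Atom`, `enabled_interactions` and `step`, as well as the `server` and `monitor` components, are hypothetical, and data variables, guards and stochastic aspects are ignored.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Atomic component: an automaton whose transitions are labelled by ports."""
    name: str
    state: str
    transitions: dict  # maps (state, port) to the next state

    def enabled(self, port):
        return (self.state, port) in self.transitions

    def fire(self, port):
        self.state = self.transitions[(self.state, port)]


def enabled_interactions(atoms, interactions):
    """An interaction is a list of (component, port) pairs.

    It is enabled only when every port it involves is enabled.
    """
    return [name for name, ports in interactions.items()
            if all(atoms[c].enabled(p) for c, p in ports)]


def step(atoms, interactions, priority):
    """Fire one enabled interaction that no enabled interaction dominates.

    `priority` holds pairs (low, high): `low` is blocked while `high` is enabled.
    """
    enabled = enabled_interactions(atoms, interactions)
    maximal = [i for i in enabled
               if not any((i, j) in priority for j in enabled)]
    if not maximal:
        return None  # deadlock: nothing can fire
    chosen = maximal[0]
    for component, port in interactions[chosen]:
        atoms[component].fire(port)
    return chosen


# Hypothetical two-component system: a server that may fail and a monitor.
atoms = {
    "server": Atom("server", "up", {("up", "serve"): "up",
                                    ("up", "fail"): "down",
                                    ("down", "repair"): "up"}),
    "monitor": Atom("monitor", "idle", {("idle", "detect"): "alert",
                                        ("alert", "recover"): "idle"}),
}
interactions = {
    "serve": [("server", "serve")],
    "failure": [("server", "fail"), ("monitor", "detect")],
    "recovery": [("server", "repair"), ("monitor", "recover")],
}
priority = {("serve", "failure")}  # failure takes precedence over serve

first = step(atoms, interactions, priority)
second = step(atoms, interactions, priority)
print(first, second, atoms["server"].state)
```

In this toy system the `failure` interaction synchronizes the server's `fail` port with the monitor's `detect` port, and the priority rule filters out `serve` whenever `failure` is also enabled.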
A formal model for hybrid clouds architecture
In this section, we detail the first phase of our approach (see Section 2.2) by describing how our formal generic model of the hybrid Cloud architecture is designed and checked as a component-based system. Components and their composition are defined with respect to the formal semantics provided by the BIP framework [44] and its statistical version SBIP [40]. Our model is built as a superposition of three layers (see Fig. 2), namely: a failure management layer, a hybrid Cloud architecture layer
Model instantiations and performance analysis
To show the applicability of our approach, this section presents formal verification, stochastic analysis, simulations and experimental results. To this end, we examine three case studies that show how our model can be parameterized and instantiated to implement different recovery strategies in hybrid Cloud architectures. As already explained in Section 2.2 (see Fig. 1), implementing a given recovery strategy in our model consists of defining the corresponding
Conclusion and future works
In this paper, we have proposed a formal generic model describing a hybrid Cloud architecture. The proposed model allows the specification of different recovery strategies in hybrid Cloud environments. Furthermore, it makes it possible to define reconfigurable behaviors and thus to model complex recovery strategies (hybrid strategies). Our approach aims to verify, analyze and study different recovery strategies based on three steps, including design, verification and
References (48)
- et al. An automated implementation of hybrid cloud for performance evaluation of distributed databases. J. Netw. Comput. Appl. (2020)
- et al. A survey on reliability in distributed systems. J. Comput. System Sci. (2013)
- et al. Dynamic performance testing and implementation for static var compensator controller via hardware-in-the-loop simulation under large-scale power system with real-time simulators. Simul. Model. Pract. Theory (2021)
- et al. Computing on large-scale distributed systems: Xtremweb architecture, programming models, security, tests and convergence with grid. Future Gener. Comput. Syst. (2005)
- A generic formal model for the comparison and analysis of distributed job-scheduling algorithms in grid environment. J. Parallel Distrib. Comput. (2019)
- et al. Design and verification of a mobile robot based on the integrated model of cyber-physical systems. Simul. Model. Pract. Theory (2020)
- et al. Simulation of vehicle body spot weld failures due to fatigue by considering road roughness and vehicle velocity. Simul. Model. Pract. Theory (2020)
- et al. Cloud computing simulators: A comprehensive review. Simul. Model. Pract. Theory (2020)
- et al. Modelling and simulation of qos-aware service selection in cloud computing. Simul. Model. Pract. Theory (2020)
- et al. Failure-aware resource provisioning for hybrid cloud infrastructure. J. Parallel Distrib. Comput. (2012)
- Acceptance test for fault detection in component-based cloud computing and systems. Future Gener. Comput. Syst.
- A view of cloud computing. Commun. ACM
- Understanding cloud computing vulnerabilities. IEEE Secur. Priv.
- Inter-cloud architectures and application brokering: taxonomy and survey. Softw. - Pract. Exp.
- Fd4c: Automatic fault diagnosis framework for web applications in cloud computing. IEEE Trans. Syst. Man Cybern.: Syst.
- Modelling and evaluating a high serviceability fault tolerance strategy in cloud computing environments. Int. J. Secur. Netw.
- Cloud service reliability: Modeling and analysis
- An empirical failure-analysis of a large-scale cloud computing environment
- An analytical model to evaluate reliability of cloud computing systems in the presence of qos requirements
- Modeling and simulating internet-of-things systems: A hybrid agent-oriented approach. Comput. Sci. Eng.
- Modeling and simulation of energy systems: A review. Processes
- Hybrid performance modeling and prediction of large-scale computing systems
- A parameterized formal model for the analysis of preemption-threshold scheduling in real-time systems. IEEE Access
- A large-scale study of failures in high-performance computing systems. IEEE Trans. Dependable Secure Comput.