A novel CFN-Watchdog protocol for edge computing

https://doi.org/10.1016/j.asoc.2021.107873

Highlights

  • Propose CFN-Watchdog, a fault-detection protocol for tasks and VMs in CFN.

  • Develop a theoretical model to analyze the performance of the proposed CFN-Watchdog.

  • Run extensive simulations that verify the CFN-Watchdog and the theoretical model.

Abstract

Compute first networking (CFN) is a recent distributed framework that intelligently allocates computing resources for edge computing according to computing load and network status. It requires real-time visibility into the availability of local and remote computing resources. To the best of our knowledge, this paper is the first to propose a centralized fault-detection protocol, called CFN-Watchdog, that meets this requirement and promptly recycles resources occupied by faults. We then theoretically analyze the impact of various parameters (e.g., detection thresholds, task processing time, and network delay) on the Watchdog performance. Extensive simulations verify the effectiveness of the proposed protocol and the accuracy of our theoretical model. This study helps to optimize parameter configurations and to design better fault-detection protocols for edge computing.

Introduction

Edge computing, an extension of cloud computing, is a popular computing paradigm that brings computation closer to users, aiming to provide convenient and fast computing services. It adopts a cloud center for computing-resource allocation. As shown in Fig. 1(a), whenever a task arrives at an edge node, this edge node sends a resource request to a cloud center. The cloud center then determines which type of resources (cloud or edge) should be allocated to the task and sends resource allocation responses to the involved edge nodes. Upon receiving the allocation response, the requesting node offloads the task to the target edge nodes. However, this centralized allocation mechanism can introduce a long delay between the launch of a request and the start of its processing, owing to the network delay between edge and cloud. As a result, it works against the design objective of edge computing (e.g., fast response).

Recently, compute first networking (CFN) [1], as shown in Fig. 1(b), has been proposed to address the above issue in edge computing. CFN is a novel distributed framework for computing resource allocation that dynamically routes tasks to optimal computing nodes according to computing load and network status. CFN treats edge nodes and cloud nodes equally and integrates them into a distributed computing-resource pool. It then introduces a new control plane to manage the pool so as to intelligently allocate resources to meet users’ requirements. Roughly speaking, jointly taking into account network conditions and available computing resources, the control plane allocates nearby resources to a user if the user’s request is a real-time task; otherwise, it offloads the task to the cloud for processing.
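
The rough allocation rule above can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the CFN framework: the `Node` fields, the node names, and the cost function (network delay plus the time to drain the queued load and the new task).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    is_cloud: bool
    delay_ms: float   # network delay from the requesting edge node (assumed known)
    load: float       # pending work, in arbitrary units
    capacity: float   # work units processed per millisecond


def estimated_finish_ms(node: Node, work: float) -> float:
    # Hypothetical cost: network delay plus time to process the queue and the new task.
    return node.delay_ms + (node.load + work) / node.capacity


def choose_node(nodes: list[Node], work: float, real_time: bool) -> Node:
    # Real-time tasks are placed on an edge node; other tasks are offloaded to the cloud.
    # Within the chosen group, pick the node with the lowest estimated finish time.
    candidates = [n for n in nodes if n.is_cloud != real_time]
    return min(candidates, key=lambda n: estimated_finish_ms(n, work))


nodes = [
    Node("edge-1", False, delay_ms=2.0, load=40.0, capacity=4.0),
    Node("edge-2", False, delay_ms=5.0, load=4.0, capacity=4.0),
    Node("cloud", True, delay_ms=60.0, load=0.0, capacity=40.0),
]
print(choose_node(nodes, work=8.0, real_time=True).name)
print(choose_node(nodes, work=8.0, real_time=False).name)
```

In this toy setting the closest edge node is not chosen for the real-time task because it is heavily loaded, which is the kind of joint decision on load and network status that the paragraph describes.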

CFN adopts a distributed framework to allocate computing resources for task processing. It constructs a common computing resource pool over a number of geographically dispersed edge/cloud nodes with heterogeneous computing capacities. In CFN, a task may be processed locally or remotely in multiple nodes. The diversity of network connections among these nodes, and hence of their network delays, together with the large divergence of the nodes’ computing capacities, makes it very hard to monitor the status of task processing. Even if we can monitor the status, it often takes a long time to detect task or virtual machine (VM) faults in a distributed framework [2], [3], [4]. Here, a fault occurs when the program of a task or a VM throws a runtime exception. On the other hand, CFN requires that the availability of local and remote computing resources be visible in real time, in order to allocate resources from the computing pool efficiently. Therefore, we need a mechanism that can detect task or VM faults promptly and remotely, so as to quickly recycle resources occupied by these faults.

However, CFN is still in its infancy and does not yet provide a mechanism to acquire the real-time availability of local or remote computing resources. Popular fault detection methods in IP networks or cloud computing might not be applicable to CFN either. For example, the bidirectional forwarding detection protocol [5] in IP networks is mainly used to detect faults of network devices rather than task or VM faults; Hadoop’s JobTracker [6] in cloud computing is mainly used to detect local task faults and does not support remote detection. Therefore, the distributed framework of CFN, where tasks are processed locally or remotely in multiple nodes, calls for a novel remote fault-detection design that monitors the statuses of tasks and VMs in a timely manner.

This paper designs and analyzes a novel fault-detection protocol called CFN-Watchdog (Watchdog for short), which borrows the idea of the watchdog timer [7], to address the above visibility problem of local or remote computing resources in CFN. Our contributions are summarized as follows.

  • 1. Propose Watchdog, the first centralized fault-detection protocol for tasks and VMs in CFN, which remotely monitors the availability of distributed computing resources in edge computing. In our protocol, a number of Watchdog clients periodically report the statuses of their monitored VMs to one Watchdog server that is connected to the CFN. Benefiting from the fast response of the centralized protocol, the control plane of CFN can make the statuses of its managed resources visible in real time and promptly recycle resources occupied by faults.

  • 2. Develop a theoretical model to analyze the performance of the proposed Watchdog protocol. Our model factors in important parameters, including detection thresholds, task processing time, and network delay, and characterizes the impact of false alarms and missed detections of fault events on the system throughput. With our model, we can choose optimal parameter settings.

  • 3. Run extensive simulations that verify the effectiveness of the proposed design and the accuracy of our theoretical model.
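
The client/server reporting pattern described in the first contribution can be sketched as follows. This is a minimal illustration and not the paper's protocol: the class `WatchdogServer`, its methods `report`, `faulty_vms`, and `recycle`, and the single `timeout` threshold are hypothetical names, and timestamps are passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class WatchdogServer:
    """Toy watchdog server that suspects a VM once its reports stop arriving.

    Hypothetical sketch: `timeout` stands in for a detection threshold.
    """
    timeout: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def report(self, vm_id: str, now: float) -> None:
        # Called periodically by a watchdog client for each VM it monitors.
        self.last_seen[vm_id] = now

    def faulty_vms(self, now: float) -> list[str]:
        # A VM is suspected faulty if its last report is older than `timeout`.
        return sorted(vm for vm, t in self.last_seen.items() if now - t > self.timeout)

    def recycle(self, vm_id: str) -> None:
        # Stop tracking the VM so its resources can be returned to the pool.
        self.last_seen.pop(vm_id, None)


server = WatchdogServer(timeout=3.0)
server.report("vm-a", now=0.0)
server.report("vm-b", now=0.0)
server.report("vm-a", now=2.0)   # vm-a keeps reporting; vm-b goes silent
suspects = server.faulty_vms(now=4.0)
print(suspects)
```

In this sketch, a larger `timeout` tolerates more network delay before raising an alarm but postpones the detection of real faults, which is why the choice of detection threshold matters.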

The rest of this paper is organized as follows. Section 2 summarizes related work. Section 3 outlines CFN and designs the proposed CFN-Watchdog protocol. Section 4 theoretically analyzes the throughput of CFN. Section 5 evaluates the system performance. Section 6 concludes the paper. In addition, Table 1 lists the main notations and their meanings.

Section snippets

Related work

In this section, we list related work on fault detection in traditional networks, cloud computing, and edge computing. High availability is a key requirement for computing frameworks. A basic idea is to detect faults (or errors) and then perform recovery strategies.

CFN-Watchdog design

In this section, we describe the interaction logic in compute first networking (CFN) among edge nodes and then detail the proposed novel centralized fault-detection protocol called CFN-Watchdog (for short, Watchdog).

Throughput analysis

In this section, we theoretically analyze the throughput of the proposed CFN-Watchdog protocol. Below, we first specify the model assumptions, then define the interaction cases between a VM and the watchdog, and finally present the throughput.
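
To make the role of such parameters concrete, here is a toy calculation under assumptions that are ours and not the paper's model: time is slotted, a VM fails independently in each slot with probability `p_err`, a task needs `T` slots, and the network delay of a status report is exponentially distributed with mean `1/lam`. The threshold `theta` and both function names are hypothetical.

```python
import math


def prob_normal(p_err: float, T: int) -> float:
    # Probability that no error occurs in any of the T slots (independent slots assumed).
    return (1.0 - p_err) ** T


def prob_late_report(lam: float, theta: float) -> float:
    # P(delay > theta) for an exponentially distributed delay with rate lam (mean 1/lam).
    return math.exp(-lam * theta)


p_ok = prob_normal(p_err=0.01, T=50)
p_late = prob_late_report(lam=2.0, theta=3.0)
print(f"{p_ok:.3f} {p_late:.5f}")
```

Under these assumptions, raising `theta` makes a late report from a healthy VM exponentially less likely, while a longer task (larger `T`) lowers the chance of finishing without any error.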

Performance evaluation

In this section, we run extensive simulations to evaluate the effectiveness of our CFN-Watchdog design and the accuracy of the theoretical model. Let T denote the time of successfully executing a task, let 1/λ denote the mean network delay, and let p_err denote the probability that a VM error happens at each time. Let Prob-Normal denote the probability of a VM completing a task normally, let Prob-FA denote the false alarm when there is no error (i.e., case 2), let Prob-Det denote the successful

Conclusion and future work

In edge computing, compute first networking (CFN) has recently been proposed to accelerate resource scheduling and allocation for tasks. It requires real-time visibility into the availability of local and remote computing resources. This paper proposes a centralized fault-detection protocol named CFN-Watchdog to provide the required visibility for CFN. Our watchdog protocol can promptly recycle resources occupied by faults and significantly increase the system throughput. We then develop

CRediT authorship contribution statement

Hong Liang: Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review & editing. Li Feng: Formal analysis, Writing – original draft, Writing – review & editing, Supervision, Project administration, Funding acquisition. Fangxin Xu: Data curation, Writing – original draft. Guangcheng Li: Validation, Formal analysis, Writing – original draft. Jie Xu: Validation, Writing – review & editing. Yuqiang Chen: Validation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (30)

  • Smara, Mounya et al. Acceptance test for fault detection in component-based cloud computing and systems. Future Gener. Comput. Syst. (2017)
  • Gyamfi, Kojo Sarfo et al. Heartbeat design for energy-aware IoT: Are your sensors alive? Expert Syst. Appl. (2019)
  • Y. Li, draft-li-rtgwg-cfn-framework-00 - Framework of Compute First Networking (CFN), Available at...
  • Song, Yaozhong. An Approach to QoS-Based Task Distribution in Edge Computing Networks for IoT Applications (2018)
  • Varghese, Blesson et al. Challenges and opportunities in edge computing
  • Song, Yaozhong et al. An approach to QoS-based task distribution in edge computing networks for IoT applications
  • Katz, Dave et al. Bidirectional forwarding detection (BFD) (2010)
  • Zhu, Hao et al. Adaptive failure detection via heartbeat under Hadoop
  • Wikipedia. Watchdog timer (2020)
  • ICMP Router Discovery Messages (1991)
  • Shalunov, Stanislav et al. A One-Way Active Measurement Protocol (OWAMP) (2006)
  • Sridhar, Kamakshi et al. System and method for monitoring end nodes using ethernet connectivity fault management (CFM) in an access network (2010)
  • Michał Król, Spyridon Mastorakis, David Oran, Dirk Kutscher, Compute first networking: Distributed computing meets ICN,...
  • Armbrust, Michael et al. A view of cloud computing. Commun. ACM (2010)
  • Apache, Apache Storm, Available at...

    This work is funded in part by the National Natural Science Foundation of China (File no. 61872451 and 61872452) and in part by the Science and Technology Development Fund, Macau SAR (File no. 0098/2018/A3, 0037/2020/A1 and 0062/2020/A2).
