Improving the hybrid cloud performance through disk activity-aware data access

https://doi.org/10.1016/j.simpat.2021.102296

Abstract

Cloud computing has been making significant contributions to the Internet of Things, Big Data, and many other cutting-edge research areas in recent years. To deal with cloud bursting, on-premises private clouds often extend their service capacity with off-premises public clouds, which requires migrating jobs and their corresponding data from private clouds to public clouds. For jobs executed in public clouds, promptly transferring the data they need from private clouds is essential for quick completion, as the volume of data is often large for cloud applications. The Internet connection between private clouds and public clouds has limited bandwidth in most cases. Therefore, it would be valuable if the underlying operating system could expedite reading data from hard drives to speed up the process of moving data from private clouds to public clouds. Apache Hadoop is one of the most widely used cloud platforms in the cloud computing community. It keeps multiple replicas of data across its cluster nodes to improve data availability. We designed and implemented a new model that enables computing nodes in Hadoop to obtain requested data from the cluster node with the least amount of disk activity, regardless of its location, to speed up data access. Experimental results show that jobs reduce their execution time by up to 80.83% in our model. Accordingly, our model helps accelerate job execution in both private clouds and public clouds in a hybrid cloud environment.

Introduction

With cloud computing widely adopted in many areas, cloud systems need to handle and process larger volumes of data than ever. When the resources of on-premises private clouds cannot meet the requirements of workloads, utilizing off-premises public clouds to share the workload is an effective and economical alternative to extending the infrastructure of on-premises clouds [1], [2]. However, deploying jobs and data from private clouds to public clouds through the slow Internet connection can cause performance issues such as degraded quality of service or missed time constraints [3], [4]. Consequently, it is important to shorten the deployment time, particularly for data, as data sizes are often large for cloud applications. Deployment mainly consists of two parts: reading the required data from hard drives in private clouds, and transmitting them to public clouds through the Internet. As the instability of the Internet is beyond the control of both private and public clouds, expediting the process of reading required data in private clouds is what the cloud system can do to reduce deployment time. In addition, rapidly reading required data also decreases the execution time of jobs running in private clouds, since I/O time is part of the execution time.

Apache Hadoop is one of the most popular cloud platforms in the cloud community [5], [6], [7], [8], [9], [10]. It often comprises a large number of nodes and stores data with multiple replicas across those nodes to improve data availability. Hadoop realizes parallel execution through the MapReduce programming model, which divides individual jobs into smaller tasks and distributes them to multiple nodes to speed up their execution. The computing resources on each node are packaged and distributed in units of "containers", where a container is a logical resource bundle containing memory and CPU [11], [12], [13]. Each task requires a container to proceed with its execution. Once a task finishes, its container can be reclaimed and reused for other tasks. For each task, its hosting node retrieves the required data block from one of the nodes storing that block. The Hadoop Distributed File System (HDFS) is the default file system in Hadoop. HDFS generally consists of one NameNode and multiple DataNodes [5], [6], [14]. The NameNode administers and maintains the entire HDFS namespace, so any operation processing data must go through the NameNode to find the location of the data. DataNodes store cloud data and provide computing resources (containers). When reading data, the host node (a DataNode) of a task first contacts the NameNode to learn the nearest DataNode storing the required data and then fetches the data from that DataNode. In the vast majority of cases, the geographic distance between nodes in a cloud cluster is limited; with today's fast networks, the time required to transmit a data block from different nodes to the requesting node varies little from the network's point of view. Disk speed, however, lags CPU speed by roughly six orders of magnitude. The time required to finish individual tasks can be noticeably prolonged if requested data are obtained from nodes experiencing heavy disk I/O. As MapReduce jobs are often composed of numerous tasks, their progress can be delayed under such circumstances. Hence, rather than considering the limited distance between the node holding the data and the requesting node, individual tasks should obtain required data from nodes with less disk activity to speed up their execution.
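As a concrete illustration of this read path, the short sketch below uses the public HDFS client API to ask the NameNode for the replica locations of each block of a file. It is our own minimal example, not code from the paper, and assumes a reachable HDFS cluster configured via fs.defaultFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists, for each block of a file, the DataNodes holding a replica.
// The NameNode answers this query; the client then reads each block
// from one of the returned hosts (by default, the topologically nearest).
public class BlockLocationLister {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // connects to HDFS (fs.defaultFS)
        Path file = new Path(args[0]);
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d: offset=%d, hosts=%s%n",
                i, blocks[i].getOffset(),
                String.join(",", blocks[i].getHosts()));
        }
        fs.close();
    }
}
```

The hosts returned for one block are exactly the candidate replicas among which the approach described in this paper chooses by disk activity rather than by distance.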

Job scheduling in Hadoop is handled by Yet Another Resource Negotiator (YARN), which supports the Capacity Scheduler (CS), the Fair Scheduler (FS), and the First In First Out (FIFO) scheduler. Each scheduler picks jobs to receive system resources (CPU and/or memory) in its own way. Every time a job is selected by a scheduler, one container is assigned to the next task of that job. The container then starts executing its corresponding task, which often involves the aforementioned data-access process. As our approach shortens the time to complete individual tasks, the execution of MapReduce jobs can be expedited. Our model does not interfere with job schedulers, so in principle it can be applied to all current job schedulers and to new job schedulers developed in the future.
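For reference, the scheduler in use is selected through the yarn.resourcemanager.scheduler.class property, normally set in yarn-site.xml. The snippet below, our own illustration rather than anything from the paper, reads that property through Hadoop's public YarnConfiguration API; the Capacity Scheduler is the shipped default.

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Prints which scheduler class the ResourceManager is configured to use.
// YarnConfiguration.RM_SCHEDULER is the constant for the property
// yarn.resourcemanager.scheduler.class; CapacityScheduler is the default,
// with FairScheduler and FifoScheduler as the alternatives named above.
public class SchedulerCheck {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        System.out.println(conf.get(
            YarnConfiguration.RM_SCHEDULER,
            YarnConfiguration.DEFAULT_RM_SCHEDULER));
    }
}
```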

Like all other application software, Hadoop relies entirely on the underlying operating system to utilize hardware components such as the CPU, memory, and hard drives. Linux is among the most popular operating systems that Hadoop depends on to deliver its cloud service. As stated, Hadoop can speed up task execution if required data are acquired from the DataNode with the least amount of disk activity. We modified the Linux kernel so that it provides Hadoop with information on the busy degree of the disks on each node in the Hadoop cluster. Hadoop then uses this information to decide from which DataNode a client should fetch the required data.
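The paper obtains the busy degree through a modified kernel. As a rough userspace approximation (our assumption, not the authors' implementation), a device's in-flight request count can be read from /proc/diskstats, whose ninth statistics column after the major/minor/device fields reports the number of I/Os currently in progress:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Approximates a node's disk "busy degree" from userspace by parsing
// /proc/diskstats: after the major, minor, and device-name columns,
// the ninth statistics field is the number of I/Os currently in progress.
public class DiskBusyDegree {
    // Returns the in-flight request count for the given device, e.g. "sda".
    static long inFlightRequests(String device) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/diskstats"));
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length >= 12 && f[2].equals(device)) {
                return Long.parseLong(f[11]);   // I/Os currently in progress
            }
        }
        throw new IOException("device not found: " + device);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(inFlightRequests(args.length > 0 ? args[0] : "sda"));
    }
}
```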

We evaluated our design and implementation by executing the same set of benchmarks in the original Hadoop and in our version of Hadoop under various configurations. Experiments show that jobs shorten their execution time by up to 80.83% in our version of Hadoop. By accessing data swiftly, jobs executed on both private clouds and public clouds can speed up their execution in a hybrid cloud environment.

The remainder of this paper is organized as follows. Section 2 reviews previous work related to hybrid clouds, Hadoop scheduling, and the Hadoop Distributed File System. Section 3 describes the design and implementation of our model. Section 4 presents experimental results. Section 5 concludes the paper, and Section 6 discusses future work.

Section snippets

Related work

The constraints of cost and time are among the most important issues hybrid clouds face in practice. The key benefit of using public clouds is avoiding the upfront investment in private clouds when cloud bursting occurs. Researchers have proposed cost-performance models to evaluate the efficiency of using resources in public clouds [4], [15]. The time constraint mainly relates to maintaining quality of service and meeting the deadlines of jobs executed on public clouds, which often involves

Disk activity-aware data access

To realize our model, we need to address three issues. The first is how to evaluate and obtain the busy degree of disk activity on every node in Hadoop. The second is how Hadoop communicates with the Linux operating system to collect and update the busy degree of the disk on each node. The last is how Hadoop retrieves requested data from the node with the least amount of disk activity. The entire process is completely transparent to cloud users. We will discuss and explain our design
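To make the selection rule of the third issue concrete, here is a hypothetical sketch: among the DataNodes holding a replica of the requested block, read from the one whose last reported busy degree (pending disk requests) is smallest, ignoring network distance. All names below are ours for illustration; the paper's actual implementation lives inside Hadoop/HDFS and is not reproduced here.

```java
import java.util.Map;

// Hypothetical illustration of disk activity-aware replica selection:
// given the hosts holding a replica and each host's last reported
// busy degree, pick the least busy host regardless of its location.
public class LeastBusyReplicaSelector {
    // busyDegrees maps DataNode host -> last reported pending-request count;
    // hosts without a report are treated as maximally busy.
    static String pickReplica(String[] replicaHosts, Map<String, Long> busyDegrees) {
        String best = replicaHosts[0];
        for (String host : replicaHosts) {
            long d = busyDegrees.getOrDefault(host, Long.MAX_VALUE);
            if (d < busyDegrees.getOrDefault(best, Long.MAX_VALUE)) {
                best = host;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] hosts = {"dn1", "dn2", "dn3"};
        Map<String, Long> busy = Map.of("dn1", 12L, "dn2", 3L, "dn3", 7L);
        System.out.println(pickReplica(hosts, busy));   // prints dn2
    }
}
```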

Performance evaluation

We built a Hadoop cluster containing one NameNode and multiple DataNodes connected in a LAN through a 1 Gbps switch to evaluate the performance of our design and implementation in four experiments. For the first three experiments, the testing Hadoop cluster was composed of one NameNode and four DataNodes. For the fourth experiment, the cluster had one NameNode and six DataNodes. The Hadoop version installed was 3.1.0. The NameNodes used in the 4-DataNode and 6-DataNode Hadoop

Conclusions

As more and more computing applications move to cloud platforms, cloud performance affects their quality of service more than ever, particularly in hybrid cloud environments. Undoubtedly, it will be helpful if cloud systems can expedite the execution of the jobs they host. Like other software, cloud systems cannot function without accessing system hardware through their underlying operating systems. This explains why the performance of cloud systems could be greatly influenced by

Future work

Currently, our model evaluates the busy degree of hard drives mainly by the number of their pending disk requests. The time required to complete individual disk requests can vary among hard drives due to their hardware specifications. Our model can be improved by analyzing the average time required to serve a single disk request on each hard drive, to reveal its present busy degree more accurately. To further quicken the process of moving jobs and data deployed to public clouds, the
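A minimal sketch of the refinement just described, under our assumption that it can be approximated from /proc/diskstats rather than inside the kernel: sampling the statistics twice and dividing the change in time spent doing I/O (the tenth statistics field) by the change in completed requests gives an average per-request service time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Estimates the average time a device needs to serve one request by
// sampling /proc/diskstats twice. Statistics fields 1 and 5 count
// completed reads and writes; field 10 is milliseconds spent doing I/O.
public class AvgServiceTime {
    static long[] sample(String device) throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/diskstats"))) {
            String[] f = line.trim().split("\\s+");
            if (f.length >= 13 && f[2].equals(device)) {
                long completed = Long.parseLong(f[3]) + Long.parseLong(f[7]);
                long ioTicksMs = Long.parseLong(f[12]);
                return new long[] {completed, ioTicksMs};
            }
        }
        throw new IOException("device not found: " + device);
    }

    public static void main(String[] args) throws Exception {
        String dev = args.length > 0 ? args[0] : "sda";
        long[] a = sample(dev);
        Thread.sleep(1000);                       // sampling interval
        long[] b = sample(dev);
        long dIos = b[0] - a[0];
        double avgMs = dIos == 0 ? 0.0 : (double) (b[1] - a[1]) / dIos;
        System.out.printf("avg service time: %.2f ms/request%n", avgMs);
    }
}
```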

References (42)

  • Shvachko, K., et al. The Hadoop Distributed File System.
  • White, T. Hadoop: The Definitive Guide (2012).
  • Vavilapalli, V.K., et al. Apache Hadoop YARN: Yet Another Resource Negotiator.
  • Buyya, R., et al. Cloud Computing: Principles and Paradigms, Vol. 87 (2010).
  • ...
  • Renner, T., et al. CoLoc: Distributed data and container colocation for data-intensive applications.
  • Rista, C., et al. Improving the network performance of a container-based cloud environment for Hadoop systems.
  • Burns, B., et al. Design patterns for container-based distributed systems.
  • Zaharia, M., et al. Spark: Cluster computing with working sets. HotCloud (2010).
  • Bittencourt, L.F., et al. Scheduling in hybrid clouds. IEEE Commun. Mag. (2012).
  • Zhu, J., et al. Scheduling stochastic multi-stage jobs to elastic hybrid cloud resources. IEEE Trans. Parallel Distrib. Syst. (2018).