QuickDedup: Efficient VM deduplication in cloud computing environments

https://doi.org/10.1016/j.jpdc.2020.01.002

Highlights

  • We provide a list of major requirements for deduplication efficiency testing.

  • At the initial stage, we use byte-level comparisons to identify deduplicable blocks.

  • QuickDedup reduces the total number of hashes calculated by up to 95%.

  • We reduce the overall deduplication time by up to 97%.

  • We minimise the metadata overhead during the deduplication process.

Abstract

Deduplication is one of the major storage optimisation techniques for Virtual Machines (VMs) in cloud environments. Typically, hashing of blocks is used to identify duplicate data blocks. This paper proposes a novel deduplication approach, QuickDedup, which reduces the overall deduplication time, the metadata overhead, and the number of hash computations and subsequent comparisons for VM disk images. In addition to minimising the deduplication-related metadata, a necessary by-product of checking for duplicates, QuickDedup follows a novel byte-comparison scheme to partition blocks into classes. In this way, QuickDedup eliminates or minimises the need for hash calculation and subsequent comparisons, computing and comparing hashes only within the respective classes. QuickDedup thus saves the space required for hash storage during deduplication and makes deduplication of VM disk images much faster. We conducted a detailed evaluation of QuickDedup on various metrics with different kinds and sizes of VM images taken from publicly available datasets. The evaluation results show an improvement of up to 96% in the overall time required to deduplicate VM images, apart from significant savings in metadata and storage overhead.

Introduction

Cloud Computing adoption has grown rapidly due to several advantages and features, such as multi-tenancy, higher server utilisation, energy efficiency and elasticity, derived from on-demand utility computing services [32]. VMs can be created and deployed quickly and scaled easily. The growing popularity of VMs has paved the way for VM appliances [33], which are pre-configured VM images that run readily on a hypervisor. The benefits of VM appliances over traditional software packaging include simplified deployment and enhanced isolation [11]. As Cloud Computing and Big Data become more prevalent, an increasing amount of data is stored in the cloud. According to Gartner, enterprise IT's biggest challenge today is double-digit data growth. In fact, data in enterprise storage is growing at an alarming rate and is predicted to increase up to 50 times over the next decade.

Deduplication technology helps in handling data growth to a large extent. It is a data compression technique that eliminates redundant/duplicate data from storage [3], [19], [23]. VMs are large in size, which exacerbates the storage problem, and several works have provided storage optimisation solutions utilising deduplication techniques [14], [35]. The common element of many of these techniques is to calculate hashes of small chunks of data known as blocks; comparing the hashes identifies common blocks. Deduplication approaches are broadly categorised along three dimensions: location, level and time [13]. Based on the location of data storage, deduplication can follow either a source-based or a target-based approach. In source-based deduplication, redundant data is eliminated at the client side before the data is sent to the server, reducing the overall bandwidth cost. In target-based deduplication, the server performs the deduplication, which consumes more bandwidth because the redundant data is also transferred over the network. Based on the level at which similarity is exercised, deduplication can operate at the byte level, block level or file level. Byte-level deduplication compares bytes to eliminate redundant bytes, whereas block-level deduplication eliminates redundant blocks, with comparison taking place at the block level. On the basis of time, deduplication can be either post-process or inline. In post-process (offline) deduplication, there is no computation before storing the data, ensuring better storage performance; however, duplicate data is stored for a short time, which can be an issue if the storage system is near full capacity. Inline deduplication requires less storage as it does not store duplicate data, but the computation takes time, which may degrade storage performance. As the cloud follows a pay-as-you-go service model, time becomes an important factor, and cloud storage resources are large enough that temporarily storing duplicates is rarely an issue; therefore, an offline algorithm is preferable for VM deduplication in the cloud. From the perspective of deduplication, VM disk images are classified as flat or sparse [13], [46]. Popular hypervisors such as Xen and KVM support the sparse qcow2 format, and VirtualBox supports the sparse VDI format. Deduplication is one of the principal ways to optimise VM storage in the cloud [24], [46] and saves a large amount of storage space in VM disk images. It is also useful in many other areas, such as backup systems, databases, and networks [10], [19].
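To make the block-level, hash-based baseline concrete, a minimal Python sketch is given below. The block size, function name and data structures are illustrative only; SHA-1 is used here simply because it is the hash function employed in our evaluation (Section 5). Note that this baseline computes one hash per block unconditionally, which is exactly the cost that QuickDedup aims to avoid.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; several sizes are evaluated later


def hash_dedup(image_path):
    """Block-level, hash-based deduplication of one disk image (sketch).

    Every block is hashed and its digest compared against previously seen
    digests; only blocks with unseen digests are retained.
    """
    seen = {}           # digest -> index into unique_blocks
    unique_blocks = []  # retained block payloads
    block_map = []      # per-block reference into unique_blocks

    with open(image_path, "rb") as img:
        while True:
            block = img.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha1(block).digest()  # one hash per block, always
            if digest not in seen:
                seen[digest] = len(unique_blocks)
                unique_blocks.append(block)
            block_map.append(seen[digest])
    return unique_blocks, block_map
```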

Deduplication in a traditional file system leads to higher storage utilisation and improves disk cache efficiency [48]. From a virtualization perspective, deduplication offers additional benefits, such as supporting multi-tenancy and reducing the effects of the VM sprawl problem. For each user's VM, there exists a VM disk image that stores the OS image, application data, and other free blocks. Each VM, when stored individually, occupies a large amount of space on VM storage, which can be optimised using deduplication. Deduplication can be performed either within a VM image (intra-VM image deduplication) or across different VM images (inter-VM image deduplication).

Surveys by AFCOM [1] and COMMVAULT [2] show that over 63% of the data centres surveyed have seen tremendous growth in their storage costs. However, in the present scenario, deduplication ratios are not rising in proportion to data growth. Therefore, a deduplication technique should be efficient and fast enough to deduplicate storage even when the deduplication ratio is low or moderate.

In this paper, we focus on the important factors contributing to the overall deduplication process and identify that a major share of the deduplication effort is spent on finding exactly identical disk blocks. Calculating hashes of disk blocks and subsequently comparing them helps in identifying such blocks. We argue that the overall deduplication time can be minimised by minimising the number of hash calculations and, subsequently, the number of blocks eligible for comparison. Our proposed approach, QuickDedup, utilises a novel byte-to-byte comparison scheme to reduce the number of candidate blocks for hash calculation and subsequent comparison. We perform a number of performance evaluation experiments, which show that the proposed approach is much faster than a pure hash-based technique while maintaining the same deduplication ratio.
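As an illustration of this idea, the sketch below partitions blocks into classes using a few sampled byte positions and computes hashes only inside classes that still contain more than one block. This is a sketch under assumptions, not the exact QuickDedup procedure: the offsets in SAMPLE_OFFSETS and the function name are hypothetical.

```python
import hashlib
from collections import defaultdict

# Hypothetical byte offsets inspected in each block; QuickDedup's actual
# byte-comparison strategy may differ.
SAMPLE_OFFSETS = (0, 512, 1024, 2048, 4095)


def categorise_then_hash(blocks):
    """Partition blocks by a few sampled bytes, then hash only within
    categories that still hold more than one block (sketch)."""
    categories = defaultdict(list)
    for idx, block in enumerate(blocks):
        key = tuple(block[off] for off in SAMPLE_OFFSETS if off < len(block))
        categories[key].append(idx)

    duplicates = {}       # block index -> index of the block it duplicates
    hashes_computed = 0
    for members in categories.values():
        if len(members) < 2:
            continue      # a singleton category cannot contain duplicates: no hash needed
        seen = {}
        for idx in members:
            digest = hashlib.sha1(blocks[idx]).digest()
            hashes_computed += 1
            if digest in seen:
                duplicates[idx] = seen[digest]
            else:
                seen[digest] = idx
    return duplicates, hashes_computed
```

Singleton classes are skipped entirely, which is where the savings in hash computations and comparisons come from.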

The following are the major contributions of our work:

  1. Deduplication being an important performance optimisation for cloud and VM storage, we collate a primary list of important requirements for deduplication efficiency.

  2. Through extensive experiments on VM disk image datasets, we observe that the dominant cost in the deduplication process is the hash computation for each block and the subsequent comparisons among hashes to identify duplicate blocks. We design QuickDedup, in which, at the initial preprocessing stage, byte-to-byte comparisons identify deduplicable blocks, quickly reducing the total number of hash-based comparisons.

  3. We propose a set of byte-comparison strategies and discuss their suitability for different types of disk blocks, such as filled, zero, and partially filled blocks.

  4. QuickDedup, with the help of a novel metadata structure, the Deduplication Data Tree (DDT), reduces the number of hash computations by up to 95% and the overall deduplication time by up to 96%. These savings also help in reducing the metadata overhead during the deduplication process (a hypothetical sketch of such a categorisation structure follows this list).
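The internal layout of the DDT is not spelled out in this excerpt, so the sketch below uses a hypothetical multi-pass categorisation in which each pass refines the categories of the previous pass on one further byte position; reading the category keys as root-to-leaf paths gives a tree, which is one plausible way to realise such a metadata structure.

```python
def build_categories(blocks, byte_positions):
    """Hypothetical multi-pass categorisation sketch (not the paper's exact DDT).

    Pass k refines the categories of pass k-1 on the byte at byte_positions[k].
    A category that shrinks to a single block can never contain a duplicate,
    so it is frozen and no hash is ever computed for it.
    """
    categories = {(): list(range(len(blocks)))}   # key: tuple of bytes seen so far
    for pos in byte_positions:                    # one byte position per pass
        refined = {}
        for key, members in categories.items():
            if len(members) < 2:
                refined[key] = members            # singleton: no further splitting
                continue
            for idx in members:
                blk = blocks[idx]
                b = blk[pos] if pos < len(blk) else None
                refined.setdefault(key + (b,), []).append(idx)
        categories = refined
    return categories                             # leaf categories of the tree
```

Only the categories that still hold two or more blocks after the final pass remain candidates for hash calculation and comparison.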

The rest of the paper is organised as follows: Section 2 provides a deduplication background listing the necessary criteria for an efficient deduplication technique. Section 3 surveys related work and discusses various file systems that implement deduplication for VMs and traditional systems. In Section 4, we propose the QuickDedup algorithm and describe its design in detail. Section 5 presents experiments and their results evaluating the efficiency of our proposed technique. Section 6 discusses the cases in which QuickDedup provides its best- and worst-case performance. In Section 7, we conclude and propose future work for further improving deduplication in the context of different operating systems.

Section snippets

Deduplication: Background

A large number of deduplication techniques have been proposed in the literature. However, a unified set of requirements to test and verify the suitability of a deduplication technique is missing. Before proceeding to identify the key requirements, we provide an overview of the deduplication ratio calculation for inter- and intra-VM images. Let us consider that size(Vi) denotes the original size of a VM and dedup_size(Vi) represents the size of the VM after deduplication, where i = 1,
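The excerpt is cut off before the ratio itself is stated; one common formulation consistent with the definitions above, given here as an assumption rather than the paper's exact expression, is:

```latex
% Assumed form of the deduplication ratio over n VM images V_1, ..., V_n;
% the paper's exact expression is truncated in this excerpt.
\[
  \mathrm{dedup\_ratio} \;=\;
  \frac{\sum_{i=1}^{n} \mathit{size}(V_i)}
       {\sum_{i=1}^{n} \mathit{dedup\_size}(V_i)}
\]
```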

Related work

Various file systems implement deduplication strategies to improve storage efficiency. Deduplication is applied in several system domains, viz. databases, VM disk images, data files and network data, which in turn benefits both the user and the resource provider. LIQUID [46] and LiveDFS [21] are file systems which implement deduplication for VMs. LIQUID, a VM deduplication file system, offers, along with deduplication, features such as instant cloning, on-demand fetching, low storage

QuickDedup: A new approach for efficient deduplication

An efficient deduplication technique for VMs is one that meets the criteria listed in Section 2. Based on these, we propose QuickDedup, which is useful for VM appliance stores and VM disk backups/snapshots. QuickDedup is an offline algorithm: at the initial stage, the complete VM image is stored, and at a later stage the deduplication process is invoked, which checks for the presence of deduplicable blocks. Initially, we store the VM image sequentially. During the categorisation process, only the

Performance evaluation

To evaluate the QuickDedup algorithm, we conduct extensive experiments demonstrating its efficiency compared to traditional approaches on various parameters. These experiments also guide the selection of the optimal block size and number of passes for the QuickDedup technique to achieve the best results. The details of the experimental configuration used for evaluation are provided in Table 1. We used SHA-1 to showcase our experimental results owing to the popular and

Time complexity analysis

For the deduplication of a VM image consisting entirely of unique blocks, QuickDedup performs at its best. As QuickDedup categorises blocks based on byte comparisons, all blocks, being unique, fall into different categories, i.e. each category holds only one block. After the last pass, when each category is checked, every category contains only one block; therefore, the need for calculating and comparing any hash does not arise. In the best case, the number of hashes computed and compared will
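The sentence above is truncated in this excerpt, but the preceding reasoning already establishes that no hash needs to be calculated or compared in this case. A hedged summary of the best-case costs, using n for the number of blocks and p for the number of byte-comparison passes (symbols introduced here for illustration only), is:

```latex
% Best case: all n blocks are unique and every category ends as a singleton.
% n and p are illustrative symbols, not taken from the paper's notation.
\[
  \#\mathrm{hashes}_{\mathrm{best}} = 0,
  \qquad
  \#\mathrm{byte\ comparisons} = O(n \cdot p)
\]
```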

Conclusions and future work

Deduplication is an important storage optimisation technique for emerging virtualization-based cloud storage. In this work, we first collate a number of important performance requirements for an efficient deduplication process. Based on these essential factors, we propose a novel deduplication technique, "QuickDedup", which outperforms traditional hash-based deduplication approaches on various metrics. Instead of calculating hashes for every block on the disk, QuickDedup uses a novel

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (48)

  • Gerofi, B., et al. Utilizing memory content similarity for improving the performance of replicated virtual machines.

  • Hansen, J.G., et al. Lithium: Virtual machine storage for the cloud.

  • He, Q., et al. Data deduplication techniques.

  • Hyde, D. A survey on the security of virtual machines, Tech. Rep. (2009).

  • Jayaram, K., et al. An empirical analysis of similarity in virtual machine images.

  • Jin, K., et al. The effectiveness of deduplication on virtual machine disk images.

  • Koller, R., et al. I/O deduplication: Utilizing content similarity to improve I/O performance, ACM Trans. Storage (TOS) (2010).

  • Kruus, E., et al. Bimodal content defined chunking for backup streams, FAST (2010).

  • Lin, C., et al. HPDV: A highly parallel deduplication cluster for virtual machine images.

  • Mao, B., et al. Read-performance optimization for deduplication-based storage systems in the cloud, ACM Trans. Storage (TOS) (2014).

  • Meyer, D.T., et al. A study of practical deduplication, ACM Trans. Storage (TOS) (2012).

  • Ng, C.-H., et al. RevDedup: A reverse deduplication storage system optimized for reads to latest backups.

  • Ng, C.-H., et al. Live deduplication storage of virtual machine images in an open-source cloud.

  • Paulo, J., et al. A survey and classification of storage deduplication systems, ACM Comput. Surv. (2014).

Shweta Saharan is pursuing her PhD in the Department of Computer Science and Engineering, Malaviya National Institute of Technology Jaipur, India. She completed her master's degree (M.Tech) in Computer Science and Engineering (Information Security) from the Central University of Rajasthan, India, in 2015 and received her Bachelor of Technology (B.Tech) from Rajasthan Technical University in 2012. She has been a member of the publication committee of the ICISS 2016 conference and a program committee member of the ISEA-ISAP 2018 conference. She is an active reviewer for various conference series, viz. SIN, ISEA-ISAP and INDICON, and for the Journal of Supercomputing. Her research interests include Cloud Computing, Information and Network Security, and Networking.

    Dr Gaurav Somani is an assistant professor at Department of CSE at Central University of Rajasthan, India. Earlier, he served as a lecturer at the LNMIIT, Jaipur. He completed his PhD from MNIT, Jaipur, MTech from DAIICT, Gandhinagar, and BE from University of Rajasthan, India. His areas of research interests include distributed systems and security. He has published his research at prestigious venues such as IEEE TDSC, JPDC, ACM Computing Surveys, ComCom, ComNet, FGCS, and IEEE Cloud Computing. He is an associate editor of IEEE Access Journal and also served as a lead guest editor for special issue of Software: Practice and Experience Journal (Wiley) on “Integration of IoT, Cloud and Big Data Analytics”. He has been in TPC of various reputed conferences and served as reviewer of many IEEE, ACM, Elsevier, Springer, and Wiley journals. He served as keynote co-chair at Asia Security & Privacy Conference 2019 and keynote and tutorial chair for ICISS 2016. He has received IEI Young Engineer Award 2020. He is senior member of IEEE and a professional member of ACM.

Dr Gaurav Gupta is a Scientist E in the Ministry of Electronics and Information Technology, New Delhi, India. His research interests include digital forensics, privacy-preserving analytics and forensics, enhancing QR codes, and security and forensic aspects of emerging technologies.

    Dr Robin Verma is a postdoctoral researcher at Cyber Center for Security and Analytics in University of Texas at San Antonio, Texas, USA. His research interests include cyber security, cyber forensics, digital forensics, and application of privacy preserving technologies for digital forensic investigation process.

Dr Manoj Singh Gaur is the Director of the Indian Institute of Technology Jammu, India. Prior to joining IIT Jammu, he was a Professor and Head of the Department of Computer Science and Engineering at Malaviya National Institute of Technology (MNIT) Jaipur. He obtained his PhD from the University of Southampton, UK. He has supervised research in the areas of networks-on-chip and information security. He has published over 150 papers in peer-reviewed major conferences and journals and has coordinated national and international projects in the domains of information security and networks-on-chip. He has been an associate editor of CSI Transactions, IET Electronics and Digital Techniques, and the Journal of Information Security and Assurance. He was organising chair of SPACE 2015 and general co-chair of SINCONF 2012 and ICISS 2016. He is a member of IEEE and ACM.

    Dr. Rajkumar Buyya is a Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft, a spin-off company of the University, commercializing its innovations in Cloud Computing. He served as a Future Fellow of the Australian Research Council during 2012-2016. He has authored over 625 publications and seven text books including “Mastering Cloud Computing” published by McGraw Hill, China Machine Press, and Morgan Kaufmann for Indian, Chinese and international markets respectively. He also edited several books including ”Cloud Computing: Principles and Paradigms” (Wiley Press, USA, Feb 2011). He is one of the highly cited authors in computer science and software engineering worldwide (h-index=132, g-index=294, 93,000+ citations). ”A Scientometric Analysis of Cloud Computing Literature” by German scientists ranked Dr. Buyya as the World’s Top-Cited (#1) Author and the World’s Most-Productive (#1) Author in Cloud Computing. Dr. Buyya is recognized as a ”Web of Science Highly Cited Researcher” for four consecutive years since 2016, a Fellow of IEEE, and Scopus Researcher of the Year 2017 with Excellence in Innovative Research Award by Elsevier and recently (2019) received ”Lifetime Achievement Awards” from two Indian universities for his outstanding contributions to Cloud computing and distributed systems. Software technologies for Grid and Cloud computing developed under Dr. Buyya’s leadership have gained rapid acceptance and are in use at several academic institutions and commercial enterprises in 50 countries around the world. Dr. Buyya has led the establishment and development of key community activities, including serving as foundation Chair of the IEEE Technical Committee on Scalable Computing and five IEEE/ACM conferences. These contributions and international research leadership of Dr. Buyya are recognized through the award of “2009 IEEE Medal for Excellence in Scalable Computing” from the IEEE Computer Society TCSC. Manjrasoft’s Aneka Cloud technology developed under his leadership has received “2010 Frost & Sullivan New Product Innovation Award”. Recently, Dr. Buyya received “Mahatma Gandhi Award” along with Gold Medals for his outstanding and extraordinary achievements in Information Technology field and services rendered to promote greater friendship and India-International cooperation. He served as the founding Editor-in-Chief of the IEEE Transactions on Cloud Computing. He is currently serving as Co-Editor-in-Chief of Journal of Software: Practice and Experience, which was established 50 years ago. For further information on Dr.Buyya, please visit his cyberhome: www.buyya.com
