Abstract

With the proliferation of new mobile devices, mobile cloud computing technology has emerged to provide rich computing and storage functions for mobile users. The explosive growth of mobile data has led to an increased demand for solutions that conserve storage resources. Data deduplication is a promising technique that eliminates data redundancy for storage. For mobile cloud storage services, enabling the deduplication of encrypted data is of vital importance to reduce costs and preserve data confidentiality. However, recently proposed solutions for encrypted deduplication lack the desired level of security and efficiency. In this paper, we propose a novel scheme for serverless efficient encrypted deduplication (SEED) in mobile cloud computing environments. Without the aid of additional servers, SEED ensures confidentiality, data integrity, and collusion resistance for outsourced data. The absence of dedicated servers increases the effectiveness of SEED for mobile cloud storage services, in which user mobility is essential. In addition, noninteractive file encryption with the support of lazy encryption greatly reduces latency in the file-upload process. The proposed indexing structure (D-tree) supports the deduplication algorithm and thus makes SEED much more efficient and scalable. Security and performance analyses prove the efficiency and effectiveness of SEED for mobile cloud storage services.

1. Introduction

Most mobile devices, such as smartphones and Internet of things products, are constantly connected to the Internet thanks to advances in mobile wireless network technology. Mobile cloud computing (MCC) [1, 2], also referred to as mobile edge computing [3], has emerged to fulfill the need for ubiquitous, low-latency services and applications for mobile users. Through the combination of cloud computing, mobile computing, and wireless networks, MCC provides a rich array of computing and storage options for mobile users [4].

With the explosive growth in the volume of data outsourced from mobile devices, it is crucial for mobile cloud service providers (MCSPs) to minimize the costs of storing outsourced data. Data deduplication, a technique that eliminates data redundancy, can achieve this goal and reduce resource use, including disk space and network bandwidth, by more than 90% [5].

To maintain confidentiality of the outsourced data, it is essential to devise a technique to conduct deduplication over encrypted data. As a first attempt at encrypted deduplication, convergent encryption (CE) [6, 7] was proposed. CE computes an encryption key from the hash of the data, thus generating identical ciphertexts from identical plaintexts. Although the method is quite simple, it is vulnerable to brute-force attacks [8, 9] because encryption keys are deterministically computed from plaintext, which makes them predictable. For example, given a CE ciphertext $C$ (of plaintext $M$) and a dictionary $D$ of possible plaintexts, an adversary might derive an encryption key for each plaintext $M'$ in $D$ and encrypt $M'$ under that key until $C$ is found.

Server-aided encryption [8, 10–12] addresses this problem and aims to mitigate brute-force attacks on encrypted deduplication. This approach uses a dedicated key server for the generation of encryption keys. The key server possesses its own secret key and performs an oblivious key generation protocol [13] with users: for each request, it generates an encryption key, using the secret key and a blinded hash computed from the data, and then returns it to the user. By doing so, the randomness of the key server’s secret key contributes to the encryption keys, which makes brute-force attacks infeasible while the secret key is kept hidden from adversaries.

Despite its resistance to brute-force attacks, server-aided encryption has several limitations when applied to an MCC environment. First, security comes at the cost of managing key servers, which are single points of failure and targets of server-compromise attacks [8]. Second, dedicated key servers, which usually reside within on-premises networks, severely reduce user mobility, which significantly degrades the effectiveness and performance of the MCC technology. This restriction on mobility could be relieved by deploying multiple key servers over geographically separated areas. However, this not only incurs high deployment costs but also exposes the system to a wide variety of security threats.

For the successful provisioning of ubiquitous, low-latency, and secure storage services in a mobile cloud environment, it is necessary to devise serverless encryption that enables brute-force-resistant encrypted deduplication without the aid of additional servers.

In this paper, we propose a novel scheme for serverless and efficient encrypted deduplication (SEED) in MCC environments. Instead of using key servers, users perform bilinear pairing-based encryption on files using their own public and secret keys. The bilinearity of file encryption allows an equality test to be conducted for the ciphertexts generated under different secret keys and thus enables cross-user deduplication of encrypted data. The encryption algorithm randomizes all ciphertexts and the corresponding tags, which are susceptible to exposure to adversaries, using a random source supplied from the users’ secret keys. The provable security ensures that no information about the plaintext is revealed from either ciphertexts or tags. Furthermore, file encryption allows the tags to be computed independently of the ciphertexts, which makes it possible for the ciphertexts and tags to be generated in parallel. This property of SEED enables lazy encryption, a novel feature in which ciphertext generation, a computationally expensive component of encryption, can be delayed or even omitted in the case of client-side deduplication.

In addition to bilinear pairing-based file encryption, SEED is based on an efficient deduplication algorithm. For this, we propose D-tree, a new indexing data structure that supports deduplication. D-tree is a random binary tree, which is a binary search tree that is formed from the random permutation of nodes. Each node in a D-tree contains a tag for an outsourced file as deduplication information within the storage. The cloud server (i.e., MCSP) can perform a binary search over the D-tree for identical files within the storage by running equality tests on each node. Because nodes are balanced in a random binary tree, D-tree preserves logarithmic computational complexity in the worst case for the deduplication algorithm.

SEED is significantly more practical for mobile cloud storage services than existing solutions because of the following advantages:
(i) It eliminates the need for key servers, which severely restrict user mobility. The absence of key servers also allows noninteractive file encryption: users can generate encryption keys directly without server interaction. In combination with lazy encryption, efficient and low-latency file uploading to a mobile cloud is realized.
(ii) The random binary-tree-based deduplication algorithm reduces the run time complexity of finding duplicates to $O(\log n)$, where $n$ is the number of outsourced files in the storage. This makes the scheme much more efficient and scalable, especially when very large numbers of data items are outsourced.
(iii) The use of users’ secret keys for file encryption ensures strong data confidentiality even for predictable data, while also guaranteeing data integrity and resistance against collusion attacks.
(iv) Noninteractive file encryption with the support of lazy encryption greatly reduces latency in the file uploading process.

1.1. Contribution

We make several contributions in this paper:
(i) We address the challenge of encrypted deduplication in an MCC environment and propose a novel serverless and efficient encrypted deduplication scheme, called SEED, suitable for this environment.
(ii) The security of SEED is rigorously analyzed in terms of data confidentiality, data integrity, and collusion resistance.
(iii) The effectiveness of SEED is validated by an extensive analysis of its efficiency and performance.

1.2. Organization

The remainder of this paper is organized as follows. Section 2 reviews existing approaches to encrypted deduplication. In Sections 3 and 4, we present the system model and background knowledge, respectively. In Section 5, we describe the proposed scheme in detail. We analyze the security of the scheme in Section 6 and present a comparative and performance analysis in Section 7. Finally, we conclude the paper in Section 8.

2.1. Convergent Encryption

CE is a cryptographic algorithm that generates identical ciphertexts from identical plaintexts [6, 7]. In CE, a convergent key $K$ is derived by computing $K = H(M)$, where $M$ is data (or a file) and $H$ is a cryptographic hash function. The ciphertext $C = E(K, M)$ is then computed with a conventional symmetric encryption algorithm $E$ and the convergent key $K$. A given plaintext $M$ will always produce an identical ciphertext $C$. Bellare et al. [14–16] presented message-locked encryption (MLE), which is a generalized framework for CE, and attempted to formalize security. MLE essentially follows the CE approach in the sense that it derives encryption keys deterministically from $M$.

Despite the novel nature of encrypted deduplication, CE and MLE are insufficiently secure for two reasons [17]. First, they cannot preserve semantic security due to their deterministic nature. Second, the distribution of the message space is the only source of entropy in the convergent key. Thus, the effective key space is reduced to the message space, which can be far smaller than the nominal key space. This ultimately renders CE and MLE susceptible to brute-force attacks [8].
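
The determinism of CE and the resulting dictionary attack can be illustrated with a minimal sketch. It assumes SHA-256 for the hash function and, purely for self-containment, a toy hash-based keystream in place of a real symmetric cipher; the function names and the example plaintexts are hypothetical and are not taken from any of the cited schemes.

```python
import hashlib

def convergent_key(message: bytes) -> bytes:
    """K = H(M): the key depends only on the plaintext."""
    return hashlib.sha256(message).digest()

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode); a stand-in, not a vetted cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def ce_encrypt(message: bytes) -> bytes:
    """C = E(K, M) with K = H(M); deterministic by construction."""
    key = convergent_key(message)
    return bytes(m ^ s for m, s in zip(message, keystream(key, len(message))))

def dictionary_attack(ciphertext: bytes, candidates):
    """Offline brute force: re-encrypt every candidate and compare with C."""
    for guess in candidates:
        if ce_encrypt(guess) == ciphertext:
            return guess
    return None

# A predictable plaintext drawn from a small message space.
target = ce_encrypt(b"salary: 52000")
guesses = [b"salary: %d" % s for s in range(50000, 55000, 1000)]
recovered = dictionary_attack(target, guesses)
```

Because no secret other than the plaintext enters the key, the attacker needs only the ciphertext and a candidate list; this is the weakness that server-aided and serverless schemes aim to remove.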

Many secure deduplication solutions have been built on CE and MLE. While addressing data confidentiality as a primary goal, these previous solutions also attempted to meet other security goals, such as ownership management [18], authentication [19, 20], authorization [21], reliability [22–24], and access control [9, 25, 26]. Recently proposed lattice-based cryptographic schemes for cloud storage [27, 28] are possible candidates for secure deduplication solutions.

2.2. Server-Aided Encryption

To overcome the weaknesses of CE, it is necessary to strengthen the generation of convergent keys so that the key space has high min-entropy. Several solutions have been proposed to achieve this goal. The approach used in server-aided encryption is to generate convergent keys through interacting with key servers. By doing so, the probability distribution of the convergent keys becomes independent of the distribution of message space, and thus brute-force attacks can be mitigated.

DupLESS [8] was the first attempt at server-aided encryption. In this approach, users run an interactive key generation protocol with a key server to compute convergent keys. The protocol is built on an RSA-based Oblivious Pseudorandom Function (OPRF) [13], and thus it guarantees that the convergent keys can be computed without revealing any information about the message or the secret of the key server. In this way, adversaries, such as the MCSP or users, cannot recover plaintext (i.e., messages) with offline brute-force attacks on ciphertext, even if the plaintext is easily predictable.
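
The idea of oblivious key generation can be sketched with a textbook blind-RSA exchange. This is an illustrative approximation rather than the DupLESS protocol itself: the RSA modulus is built from two small Mersenne primes (far too small for real use), and the helper names are hypothetical.

```python
import hashlib
import math
import secrets

# Toy RSA parameters; illustration only.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # the key server's secret exponent

def hash_to_group(message: bytes) -> int:
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def server_sign(blinded: int) -> int:
    """Key server: signs a blinded value and never sees H(M) itself."""
    return pow(blinded, D, N)

def derive_key(message: bytes) -> bytes:
    """User side: blind H(M), obtain a signature, unblind, and hash it into a key."""
    h = hash_to_group(message)
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    blinded = (h * pow(r, E, N)) % N
    signature = (server_sign(blinded) * pow(r, -1, N)) % N
    if pow(signature, E, N) != h:
        raise ValueError("key server returned an invalid signature")
    return hashlib.sha256(signature.to_bytes((N.bit_length() + 7) // 8, "big")).digest()
```

The derived key is the same for every user holding the same message, whatever blinding value is drawn, yet it cannot be recomputed offline without the server's secret exponent.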

The security of the DupLESS scheme requires the aid of a key server, which is inherently vulnerable to the single-point-of-failure problem. That is, data confidentiality cannot be retained if the server is compromised.

Subsequent attempts at server-aided encryption have been made to overcome the drawbacks of DupLESS. Miao et al. [11] proposed multiserver-aided encryption, which uses several key servers rather than just one. In this approach, key servers cooperate with each other to process convergent key generation requests. More specifically, convergent keys are generated by executing a threshold blind-signature-based protocol [29] with the aid of the group of key servers. Each key server uses a share of a secret key to generate a partial blind signature for the message a user requested. The partial blind signatures are then combined, and, in turn, a convergent key is computed from the blind signature. Unlike DupLESS, multiserver-aided encryption can resist server-compromise attacks unless the attackers gain access to more than a threshold number of key servers.

Another solution, proposed by Duan [10], addressed the single point of failure inherent in server-aided encryption. Similar to multiserver-aided encryption, in this approach, multiple entities are involved in key generation using an RSA threshold signature. However, it differs in that the tasks of the key servers are distributed to a number of signers (i.e., a qualified subset of users). A key server participates in the system only during the setup phase: it generates a secret key and disperses shares of the secret key across the signers. Convergent keys can be acquired if more than a threshold number of signers participate in the interactive key generation protocol. Zhang et al. [30] proposed a server-aided encrypted deduplication scheme for electronic health systems.

The aforementioned schemes achieved the goal of mitigating server-compromise attacks on a DupLESS system. However, all server-aided encryption schemes fundamentally require key servers. The necessity of dedicated servers severely restricts user mobility, limiting its application in MCC environments.

2.3. Serverless Encryption

Another approach has been proposed to achieve high levels of data confidentiality in encrypted deduplication without the need for additional servers.

Liu et al. [31] proposed serverless encryption that uses Password Authenticated Key Exchange (PAKE) [32]. Instead of interacting with key servers, it allows convergent keys to be derived in cooperation with online checkers (i.e., a subset of uploaders) through a PAKE-based protocol. However, despite the advantages of removing the servers, this scheme suffers from lower performance, including high latency, because many PAKE steps are required when conducting file encryption.

Several schemes for serverless encryption use pairing-based cryptography. Abadi et al. [15] proposed a scheme that deviates from MLE by fully randomizing all components of the ciphertexts. A study precedent to SEED [33] is also built on bilinear pairing encryption algorithms to make the ciphertexts indistinguishable from a random distribution.

In these pairing-based schemes, a test algorithm that checks for equality among the ciphertexts is necessary because the ciphertexts are fully randomized. However, deduplication using an equality test algorithm inherently has a time complexity that is linear in the number of files in the storage. Without a tree-based indexing structure, this seriously degrades the performance of the cloud storage service.

3. System Model and Design Goals

3.1. System Model

In this paper, we consider a general architecture of mobile cloud storage services where multiple mobile users outsource their data to remote storage.
User: this is an entity who owns data (or files; we use the terms “file” and “data” interchangeably in this paper) and wishes to outsource the data to the cloud storage. A user who has uploaded data is referred to as an uploader: he/she is the initial uploader of a file if it is the first time that the file has been uploaded to the storage, or a subsequent uploader otherwise.
MCSP: this is an entity that is equipped with abundant storage and computing resources and provides cloud storage services to mobile users. It has an interest in saving storage costs, so it performs deduplication of the outsourced data.

3.2. Threat Model and Security Goals

We consider honest-but-curious adversaries in our threat model. That is, for assigned tasks, MCSP and users will faithfully perform their work within the system. However, they have an interest in obtaining as much information as possible about the outsourced data, beyond their privileges. Thus, our primary security goal is to prevent them from accessing the plaintext version of encrypted data.

In this study, we consider two types of adversaries: (i) an outside adversary, who makes an effort to learn useful information about the outsourced data by playing the role of a user, and (ii) an inside adversary, who may be an honest-but-curious MCSP or an intruder that has compromised the storage server. Specifically, we aim to achieve the following security goals in the proposed scheme:
(i) Data confidentiality: no adversary can acquire information from the outsourced data using brute-force attacks unless they obtain the corresponding key.
(ii) Data integrity: any valid user should be able to check whether the data downloaded from cloud storage has been kept intact.
(iii) Collusion resistance: any adversaries without valid ownership of the data should be blocked from obtaining useful information from the data even if they collude with each other.

4. Preliminaries

4.1. Server-Side and Client-Side Deduplication

Data deduplication can be classified into two kinds of approaches according to the location where the deduplication occurs. In server-side deduplication, the MCSP performs deduplication once files have been uploaded to the storage. On the other hand, client-side deduplication is executed on the user’s side. That is, before outsourcing a file, a user sends a corresponding tag to the MCSP to check whether the file already exists and, if so, to omit the further upload.

4.2. Bilinear Pairings and Hard Problem

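Bilinear Map. Let $\mathbb{G}$ and $\mathbb{G}_T$ be two multiplicative cyclic groups of prime order $q$. Let $g$ be a generator of $\mathbb{G}$. A bilinear map $e: \mathbb{G} \times \mathbb{G} \to \mathbb{G}_T$ is a function with the following properties:
(i) Bilinearity: for all $u, v \in \mathbb{G}$ and all $a, b \in \mathbb{Z}_q$, we have $e(u^a, v^b) = e(u, v)^{ab}$
(ii) Nondegeneracy: $e(g, g) \neq 1$
(iii) Computability: there is an efficient algorithm to compute $e(u, v)$ for all $u, v \in \mathbb{G}$

Bilinear Diffie–Hellman (BDH) Problem. Let $a, b, c \in \mathbb{Z}_q^*$ be chosen at random and let $g$ be a generator of $\mathbb{G}$. The BDH problem is to compute $e(g, g)^{abc} \in \mathbb{G}_T$ given $(g, g^a, g^b, g^c)$ as input. The BDH assumption [34] states that no probabilistic polynomial time algorithm can solve the BDH problem with nonnegligible advantage.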
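
The bilinearity property can be checked numerically with a deliberately insecure toy model. In this sketch, an element $g^x$ of the source group is represented directly by its exponent $x$, and the target group is the order-1019 subgroup of the integers modulo 2039. All parameters and function names are assumptions made for illustration; a real deployment would use an elliptic-curve pairing library, and in this toy model the BDH problem is trivially easy because exponents are visible.

```python
import secrets

Q = 1019          # prime order of both toy groups
P = 2 * Q + 1     # 2039, a safe prime
GT_GEN = 4        # generates the order-Q subgroup of Z_P^*; plays the role of e(g, g)

def g_pow(x: int) -> int:
    """Toy source group: g^x is represented by its exponent x mod Q (insecure)."""
    return x % Q

def g_exp(u: int, a: int) -> int:
    """Exponentiation u^a in the toy source group."""
    return (u * a) % Q

def pairing(u: int, v: int) -> int:
    """e(g^x, g^y) = e(g, g)^(xy)."""
    return pow(GT_GEN, (u * v) % Q, P)

g = g_pow(1)
a, b, c = (secrets.randbelow(Q - 1) + 1 for _ in range(3))
u = g_pow(secrets.randbelow(Q - 1) + 1)
v = g_pow(secrets.randbelow(Q - 1) + 1)

# Bilinearity: e(u^a, v^b) == e(u, v)^(ab)
bilinear = pairing(g_exp(u, a), g_exp(v, b)) == pow(pairing(u, v), a * b, P)
# Nondegeneracy: e(g, g) != 1
nondegenerate = pairing(g, g) != 1
# The value a BDH solver must output from (g, g^a, g^b, g^c)
bdh_answer = pow(pairing(g, g), a * b * c, P)
```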

4.3. Random Binary Tree

A binary tree is referred to as a random binary tree if it is constructed at random from a probability distribution (e.g., a uniform distribution) of binary trees. A random binary tree of size $n$ is formed in the following way. First, a random permutation of $n$ elements is chosen, and the elements are added one by one, in the permuted order, into a binary tree. The addition of elements is similar to the way that elements are inserted into a binary search tree. The root node of a random binary tree is obtained from the first element in the permutation. Each subsequent element is then evaluated on the tree from the root until it reaches a leaf. The result of each evaluation determines the child node for the next evaluation.
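
The construction described above can be sketched as follows. The sketch assumes integer keys and Python's `random` module as the source of the permutation; the class and function names are hypothetical.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Plain binary-search-tree insertion; returns the root of the tree."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right

def build_random_tree(keys, rng):
    """Insert the keys in a uniformly random order."""
    order = list(keys)
    rng.shuffle(order)
    root = None
    for key in order:
        root = insert(root, key)
    return root

def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)

tree = build_random_tree(range(1000), random.Random(7))
tree_height = height(tree)
```

Inserting the same keys in sorted order would produce a degenerate tree of height $n$; the random insertion order is what keeps the expected height logarithmic in $n$.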

5. Serverless and Efficient Encrypted Deduplication

SEED consists of two building blocks: file encryption and deduplication. We first present these building blocks in Section 5.1 and then describe a data outsourcing protocol constructed upon them in Section 5.2.

5.1. Building Blocks

5.1.1. File Encryption

We introduce some notations prior to giving details on our file encryption algorithm. Let and be two multiplicative groups with the prime order , and let be a hash function family. Let be a symmetric encryption algorithm with an encryption key , where is a key space of the underlying block cipher (e.g., AES), and let be a key derivation function.

KeyGen (). Given global information , this algorithm runs as follows:(1)Pick a random value and compute (2)Set as its public key and as its secret key, then return

Encrypt(). Given a secret key and a message , this algorithm runs as follows:(1)Compute a decryption key and a tag , where and (2)Pick a random value , and compute , where (3)Pick a reencryption key , and compute and (4)Return a ciphertext , a message-derived key , and a tag

ReEnc. Given a reencryption key in a message-derived key and a part of a ciphertext , this algorithm computes and returns .

Decrypt. Given a secret key , a message-derived key , and a ciphertext , this algorithm runs as follows:
(1) If $E$ is not reencrypted (without loss of security, we assume that the information about whether $E$ is reencrypted or not is implicitly attached to the ciphertext $C$), then recover by computing.
(2) If $E$ is reencrypted with another reencryption key , recover by computing.
(3) Compute a symmetric decryption key and recover by decrypting with .
(4) (Integrity check) Compute , where , and check whether . If both values are the same, return the recovered message as output. Otherwise, return $\perp$.

Test. Parse and as and , respectively. Then, given public keys and and tags and , this algorithm runs as follows:
(1) Check whether the following equation holds.
(2) If the equation holds, return True. Otherwise, return False.

5.1.2. Deduplication

The performance of deduplication depends on the computational complexity of the algorithm used to find file duplicates in the storage. To achieve efficient deduplication that is as fast as a binary search algorithm with logarithmic complexity, we define a D-tree, a binary-tree-based data structure for deduplication. A D-tree is a random binary tree of size $n$, where $n$ is the number of all the distinct outsourced files in the storage. Each node in a D-tree contains deduplication information for one outsourced file.

A D-tree is an index structure for the storage of the MCSP. Once a file has been uploaded to the storage, the MCSP checks whether it has a duplicate. For this, it performs a binary search over a D-tree using the Test algorithm given in the previous section. The search path for the file is determined at random based on a globally publicized random seed. If a node that contains information on the file is found, then the MCSP performs deduplication. Otherwise, it creates a new node for the file and inserts it at the leaf node on the search path.

We introduce here some notation for our deduplication scheme. Let $\Delta$ be a D-tree of size $n$, and let $N_k$ denote its nodes. $\delta_k$ denotes the deduplication information assigned to node $N_k$, which indicates the initial uploader and the corresponding file. Let $\pi$ be the maximum height of $\Delta$, and let a random seed be chosen from a uniform distribution. Let $\psi$ be the global publicly known information. $H_2$ denotes a hash function family, and $P$ denotes a digest function. Let $p$ be a binary vector of length $\pi$, and let $b_i$ be the $i$th element of $p$. Vector $p$ denotes a search path from the root node of $\Delta$: the bit value of $b_i$ indicates the left or right child of the node at the $i$th level of $\Delta$.

Figure 1 shows an instance of a D-tree of size and and its storage structure. Nodes in the tree are traversed from a root node with respect to a search path , in which the bit information of each element indicates the next child node: bit 0 directs the traversal to the left child and bit 1 to the right child. For example, nodes traversed along a path include , and , with which the corresponding deduplication information , and are sequentially evaluated using the Test algorithm.

Details of the D-tree based deduplication algorithms are given below.

InsertNode(). Given a node and a path , this algorithm inserts at the leaf node of on .

DeleteNode(). Given a node , this algorithm deletes the node from . If is a non–leaf node, then the deletion is performed by replacing it with one of its child nodes.

GetChildNode. Given a node $N$ and a direction (Left or Right), this algorithm returns the left or right child node of $N$ according to the direction. If the node does not have such a child, then it returns $\perp$.

Path generation. Given message $M$ and global information $\psi$, this algorithm runs as described in Algorithm 1. It outputs a path vector $p = (b_1, \ldots, b_\pi)$, where each $b_i$ ($1 \le i \le \pi$) is 0 or 1.

Algorithm 1
Input: M, ψ
Output: p = (b_1, ⋯, b_π)
h_0 ⟵ H_2(ψ || M)
b_0 ⟵ 0
for each i ∈ [1, π] do
    h_i ⟵ H_2(h_{i−1} || b_{i−1})
    b_i ⟵ P(h_i)
end for
return (b_1, ⋯, b_π)
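
Algorithm 1 can be rendered as a short runnable sketch. The concrete choices here are assumptions that the text does not fix: SHA-256 stands in for $H_2$, the least significant bit of the hash stands in for the digest function $P$, and $\psi$ is passed as an opaque byte string.

```python
import hashlib

def h2(data: bytes) -> bytes:
    """Stand-in for H_2 (assumed: SHA-256)."""
    return hashlib.sha256(data).digest()

def digest_bit(h: bytes) -> int:
    """Stand-in for P (assumed: least significant bit of the hash)."""
    return h[-1] & 1

def generate_path(message: bytes, psi: bytes, pi: int) -> list:
    """Return the path vector (b_1, ..., b_pi) for a message."""
    h = h2(psi + message)          # h_0 = H_2(psi || M)
    b = 0                          # b_0 = 0
    path = []
    for _ in range(pi):
        h = h2(h + bytes([b]))     # h_i = H_2(h_{i-1} || b_{i-1})
        b = digest_bit(h)          # b_i = P(h_i)
        path.append(b)
    return path

path = generate_path(b"holiday-photo.jpg", psi=b"public-seed", pi=16)
```

Because the path depends only on the message and public information, every uploader of the same file derives the same search path, which is what lets the MCSP locate an existing node without learning the plaintext.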

FindDuplicate($p$, $\delta_i$). Given the path vector $p$ of a message and its corresponding deduplication information $\delta_i$, this algorithm runs as follows (the detailed procedure is presented in Algorithm 2).
(1) Get the root node $N_0$ of a D-tree $\Delta$.
(2) If $N_k = \perp$, return (False, $k$) (initially $k = 0$). Otherwise, run the test algorithm Test($\delta_i$, $\delta_k$) for the deduplication information $\delta_k$ of node $N_k$.
(3) If the result of Test is True, then return (True, $k$) and halt.
(4) If the result of Test is False, then get the child node $N_c$ of $N_k$ by running GetChildNode. The choice of child node depends on $p[k]$: the left child is selected for $p[k] = 0$, and the right child otherwise. Set $k \leftarrow k + 1$ and $N_k \leftarrow N_c$, and then repeat step 2.

Algorithm 2
Input: p, δ_i
Output: DuplicateFound, k
Get the root node N_0 of a D-tree ∆
DuplicateFound ⟵ False
k ⟵ 0
while N_k ≠ ⊥ do
    Get deduplication information δ_k assigned to N_k
    if Test(δ_i, δ_k) = True then
        DuplicateFound ⟵ True
        break
    else
        if p[k] = 0 then
            N_c ⟵ GetChildNode(N_k, Left)
        else
            N_c ⟵ GetChildNode(N_k, Right)
        end if
        k ⟵ k + 1
        N_k ⟵ N_c
    end if
end while
return (DuplicateFound, k)
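
The search-and-insert behaviour of Algorithm 2 can be sketched as below. The sketch is a simplification under stated assumptions: the equality test is passed in as a callable (here plain equality instead of the pairing-based Test), the deduplication information is reduced to a byte string, and the path is taken from the bits of a SHA-256 hash. The names `DTree`, `upload`, and `path_for` are hypothetical.

```python
import hashlib

class DNode:
    def __init__(self, info):
        self.info = info                 # deduplication information
        self.children = [None, None]     # index 0 = left child, 1 = right child

class DTree:
    def __init__(self, test):
        self.root = None
        self.test = test                 # equality test on deduplication information

    def find_duplicate(self, path, info):
        """Walk from the root along the path; return (found, k, parent)."""
        node, parent, k = self.root, None, 0
        while node is not None:
            if self.test(info, node.info):
                return True, k, parent
            if k >= len(path):
                raise OverflowError("search path exhausted")
            parent, node = node, node.children[path[k]]
            k += 1
        return False, k, parent

    def upload(self, path, info):
        """Deduplicate if a match exists; otherwise insert at the end of the path."""
        found, k, parent = self.find_duplicate(path, info)
        if not found:
            new_node = DNode(info)
            if parent is None:
                self.root = new_node
            else:
                parent.children[path[k - 1]] = new_node
        return found

def path_for(data: bytes, pi: int = 32):
    digest = hashlib.sha256(data).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(pi)]

tree = DTree(test=lambda a, b: a == b)
files = [b"report.pdf", b"photo.jpg", b"report.pdf", b"notes.txt", b"photo.jpg"]
results = [tree.upload(path_for(f), f) for f in files]
```

Each lookup runs the test once per level of the tree, so the number of tests is bounded by the tree height rather than by the number of stored files.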

5.2. Data Outsourcing Protocol

5.2.1. Data Outsourcing in Server-Side Deduplication

We first present the data outsourcing protocol in server-side deduplication. It consists of four operations: system setup, file upload, file download, and file deletion. For clarity, we denote the public key and secret key that belong to user as and , respectively. We also denote a message-derived key calculated from as . The details of the proposed protocol are given as follows.

System Setup. Given security parameter , the system generates public information . consists of the generator of and the order , and consists of the randomly generated integer and the maximum height of a D-tree . Each user generates a pair of public key and secret key by running . Then, is made public, while is kept secret.

File Upload. Suppose that a user wishes to upload a file to the MCSP. performs a file uploading operation as follows:(1) encrypts by running with his/her secret key to get its ciphertext , a message-derived key , and a tag (2) computes a path vector by running (3)Then, sends to the MCSP, where is the identifier of , and keeps secret for later use

Once the encrypted file , as well as its corresponding tag and are uploaded, the MCSP tries to eliminate the duplicate of by running deduplication as follows:
(1) Given , the MCSP runs the FindDuplicate algorithm, where .
(2) If the result is , then has not been previously uploaded to the storage. is the position of a new node where deduplication information of file will be assigned. is on (if ) or a root node if the D-tree is empty (i.e., ). The MCSP inserts in the position in and stores with a link to . The user is then assigned as the initial uploader of .
(3) If the result is , then is a subsequent uploader of . indicates a position of node that has stored the deduplication information of . Hence, the MCSP does not have to store in but . Prior to storing , the MCSP finds the initial uploader assigned to and asks to reencrypt . Upon receipt of the request, computes with his/her own key and returns . The MCSP appends to the end of the stored tuple , where is the identifier of another subsequent uploader.

File Download. User interacts with the MCSP to download an outsourced file . The details are as follows:(1) sends a request to download the outsourced file to the MCSP(2)Upon receiving the request, the MCSP sends the corresponding ciphertext to (3)Given message-derived key and secret key , recovers by running (4)If the result is , then drops the ciphertext

File Deletion. Upon receiving a deletion request for from user , the MCSP runs the following steps:
(1) If is the only user who owns the file , the MCSP removes in the storage. It also deletes the corresponding node in by running the DeleteNode algorithm.
(2) Otherwise, the MCSP only removes in .

5.2.2. Data Outsourcing in Client-Side Deduplication

The previously described protocol of SEED is based on server-side deduplication. We can easily modify the protocol to operate in a client-side deduplication mode. Specifically, it can be modified such that instead of fully uploading , user first sends to the MCSP. The encrypted files and will be uploaded later only if FindDuplicate () returns Nil.

Lazy Encryption. SEED takes advantage of lazy encryption to further enhance the computational efficiency of the file uploading process in client-side deduplication. Lazy encryption is a novel technique that delays file encryption until the MCSP requests to upload subsequent ciphertexts as a result of the FindDuplicate function. It allows a user to omit the job of file encryption when a duplicate is found in the remote storage. In the file uploading process, the task of encryption (i.e., executing SE in the Encrypt algorithm) comprises the majority of computation. Hence, lazy encryption significantly reduces the computational burden of the client. This is a crucial performance factor in mobile devices because it is directly related to a reduction of power consumption.

The lazy encryption technique is enabled in SEED by the concurrency property of the Encrypt algorithm (in Section 5.1.1). More specifically, in the encryption algorithm, a tag can be computed concurrently with, and independently of, the computation of a ciphertext. Figure 2(a) intuitively depicts the concurrent processing of the file encryption. The concurrency property is only found in the proposed scheme. In existing schemes, including MLE [14] and DupLESS [8], processing is sequential: encryption and tag generation are inherently performed one after the other (see Figure 2(b)).
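
The control flow of lazy encryption in client-side deduplication can be sketched as follows. This is only a schematic under loud assumptions: the tag is a plain hash and the server looks it up in a dictionary, whereas SEED uses randomized tags compared through the Test algorithm over a D-tree; the XOR "cipher" is a placeholder that merely counts how often encryption runs. All class and method names are hypothetical.

```python
import hashlib

class Server:
    def __init__(self):
        self.store = {}

    def has_duplicate(self, tag: str) -> bool:
        return tag in self.store

    def put(self, tag: str, ciphertext: bytes) -> None:
        self.store[tag] = ciphertext

class Client:
    def __init__(self, server: Server):
        self.server = server
        self.encryptions = 0     # counts executions of the expensive step

    def make_tag(self, data: bytes) -> str:
        """Cheap step: computed without touching the ciphertext."""
        return hashlib.sha256(b"tag|" + data).hexdigest()

    def encrypt(self, data: bytes) -> bytes:
        """Expensive step (placeholder transformation, not real encryption)."""
        self.encryptions += 1
        return bytes(x ^ 0x5A for x in data)

    def upload(self, data: bytes) -> str:
        tag = self.make_tag(data)
        if self.server.has_duplicate(tag):
            return "deduplicated"            # encryption is skipped entirely
        self.server.put(tag, self.encrypt(data))
        return "uploaded"

server = Server()
alice, bob = Client(server), Client(server)
first = alice.upload(b"large video file")
second = bob.upload(b"large video file")
```

In this run the second uploader never executes `encrypt`, which is the saving that lazy encryption targets on battery-constrained devices.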

Side-Channel Prevention. Client-side deduplication is inherently vulnerable to a side-channel attack [35], by which adversaries can infer information about the existence of a specific file in the cloud storage. To defend against such an attack, we use a randomized-threshold approach [35]. In this technique, a randomly chosen threshold (drawn from a range determined by a security parameter) is assigned to each file in the storage, along with a counter that counts the number of previous uploads of that file. Unless the counter reaches the threshold, a user is required to fully upload the file, as in server-side deduplication, even if the file already exists in the storage.
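
A minimal sketch of the randomized-threshold bookkeeping is given below. It assumes that the threshold is drawn uniformly from the integers 2 to `bound`, where `bound` is a hypothetical name for the security parameter, and that files are identified by an opaque tag string; neither detail is specified in the text above.

```python
import secrets

class ThresholdStore:
    def __init__(self, bound: int):
        if bound < 2:
            raise ValueError("bound must be at least 2")
        self.bound = bound
        self.entries = {}    # tag -> [threshold, number of previous uploads]

    def must_upload_fully(self, tag: str) -> bool:
        """Record one upload attempt; True means the client must send the whole file."""
        if tag not in self.entries:
            threshold = 2 + secrets.randbelow(self.bound - 1)   # uniform in [2, bound]
            self.entries[tag] = [threshold, 0]
        threshold, count = self.entries[tag]
        self.entries[tag][1] = count + 1
        return count < threshold

store = ThresholdStore(bound=5)
decisions = [store.must_upload_fully("file-tag") for _ in range(6)]
```

An observer who uploads a candidate file and is asked for the full content cannot tell whether the file is absent or merely below its secret threshold.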

6. Security Analysis

In this section, we analyze the security of SEED regarding data confidentiality, data integrity, and collusion resistance.

6.1. Data Confidentiality

As mentioned in the previous section, our primary security goal with SEED is to guarantee the confidentiality of users’ outsourced data. In our threat model, we consider an MCSP that is no longer fully trusted although it is faithful. Therefore, any leakage of users’ data to adversaries, including the MCSP and unauthorized users, should be prevented. Because our threat model considers various types of attacks from both internal and external adversaries, we analyze data confidentiality with respect to each of these attacks. In the analysis, we assume that all public information, including the public keys of users, is known a priori to the adversaries.

6.1.1. Security against Offline Brute-Force Attacks

Definition 1. An adversary runs the following security game: a challenger picks a random bit . makes multiple encryption queries with the restriction that only distinct messages are permitted. On each query , if , the challenger computes the ciphertext for and returns it to . If , the challenger simply returns a random value to . At the end of the game, outputs . An encryption scheme is D-IND$-CPA secure if the advantage is negligible.

Theorem 1. SEED is D-IND$-CPA secure in the random oracle model assuming that underlying symmetric encryption algorithm is semantically secure and the BDH problem is intractable.

Proof. In the security game, the adversary will be given a correct ciphertext for each query in the case of . We will show that even in such a case, cannot get any information about from the ciphertext and cannot distinguish from random with nonnegligible advantage. Suppose that the challenger responds to ’s queries as follows: for -random oracle query of , the challenger picks a random and returns it to . For Encrypt oracle query, the challenger returns the corresponding ciphertext and the tag to .

Because the underlying symmetric encryption algorithm is semantically secure, the ciphertext is indistinguishable from random data. That is, because is chosen at random, the symmetric encryption key , which is derived from , as well as , are made pseudorandom. Therefore, cannot get any useful information from except a negligible advantage, unless is known to .

Recovering from and is as hard as solving the BDH problem. Suppose that can compute from and in polynomial time with nonnegligible probability . We can construct an algorithm that solves the BDH problem using : given a BDH instance , sets up the instance of such that , where is chosen at random from , and runs . For an -query of , responds to with . From the view of , the instance is a valid ciphertext of , such thatwhere and are random values from and , respectively. If terminates and returns as its output, then outputs as the solution of the BDH problem. With nonnegligible probability , the output is the correct answer of the BDH problem, which contradicts the BDH assumption. Therefore, computing from and is infeasible.

Moreover, because the ciphertexts and are blended with two random values and , these ciphertexts are indistinguishable from random data, except with a negligible probability. With regard to a tag , also cannot distinguish it from random, because for any distinct messages the random oracle makes randomized.

Therefore, SEED makes ciphertexts and tags indistinguishable from random data, which implies that has a negligible advantage in winning the security game.

6.1.2. Security against Online Brute-Force Attacks

Now, we analyze the security of SEED against online brute-force attacks. We consider outside adversaries (e.g., unauthorized users) with a dictionary that contains candidates for a file of interest . The attack proceeds as follows: the adversary repeatedly performs a file upload operation for each candidate until he/she observes a deduplication event, which indicates the candidate file matches in the storage.

If the proposed scheme runs in server-side deduplication mode, such an attack cannot succeed. This is because every candidate in the dictionary is eventually sent to the MCSP during the operation, and thus the adversary can infer nothing about whether deduplication takes place. In the case of client-side deduplication, the upload of a file may be omitted if it already exists in the storage, which may leak information to the adversary. However, the randomized-threshold strategy makes the adversary fully upload the file even if it exists in the storage and thus obfuscates the information about the file. As analyzed in [35], the adversary cannot obtain the information with probability , where is a security parameter.
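The randomized-threshold idea can be illustrated with a minimal sketch. It assumes one common form of the strategy, in which the server draws a secret per-file threshold and reveals a deduplication event only after the number of uploads reaches it; whether the variant analyzed in [35] matches this exactly is an assumption. The class name, the `max_threshold` parameter, and the method name are hypothetical.

```python
import secrets


class RandomizedThresholdStore:
    """Toy server that hides deduplication events below a secret per-file threshold."""

    def __init__(self, max_threshold):
        if max_threshold < 2:
            raise ValueError("max_threshold must be at least 2")
        self.max_threshold = max_threshold  # hypothetical security parameter
        self._thresholds = {}  # tag -> secret threshold in [2, max_threshold]
        self._counts = {}      # tag -> number of upload requests seen so far

    def upload_request(self, tag):
        """Return True if the client may skip the upload, False if it must send the file."""
        if tag not in self._thresholds:
            # The threshold is drawn once per file and never shown to clients.
            self._thresholds[tag] = 2 + secrets.randbelow(self.max_threshold - 1)
            self._counts[tag] = 0
        self._counts[tag] += 1
        # Below the threshold the full file is always uploaded, so an adversary
        # testing a dictionary candidate cannot tell whether a copy already exists.
        return self._counts[tag] >= self._thresholds[tag]
```

With this sketch, the first upload of any file is never deduplicated on the client side, and an observed full upload does not reveal whether the file is absent or merely below its threshold.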

6.2. Data Integrity

The integrity of outsourced data can be compromised by data corruption due to defects in the storage system or adversaries’ intentional attacks. SEED provides users with the ability to detect alteration in the outsourced data easily. Say that a user has downloaded an outsourced ciphertext from the MCSP. While running the Decrypt algorithm, the user can restore the plain data from the ciphertext and then compute . If and a decryption key are different, Decrypt outputs , and the user knows that the outsourced file has been modified. Notice that the probability of Decrypt yielding an output other than is negligible for , thanks to the collision-resistant property of the cryptographic hash function . Thus, SEED offers an integrity model that allows users to validate the outsourced data effectively.
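The check performed at the end of decryption can be sketched as follows. This is a simplified illustration that assumes the decryption key equals the SHA-256 hash of the plain data; the function name is hypothetical, and `None` stands in for the failure output of Decrypt.

```python
import hashlib
import hmac
from typing import Optional


def verify_decrypted(plaintext: bytes, decryption_key: bytes) -> Optional[bytes]:
    """Return the plaintext if its hash matches the key, otherwise None.

    Assumption: the decryption key is the SHA-256 digest of the original file.
    """
    recomputed = hashlib.sha256(plaintext).digest()
    # Constant-time comparison avoids leaking how many leading bytes match.
    if hmac.compare_digest(recomputed, decryption_key):
        return plaintext
    return None  # the outsourced file was modified or corrupted
```

Because the hash function is collision resistant, a modified file passes this check only with negligible probability.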

6.3. Collusion Resistance

SEED also provides security against collusion attacks. Let us consider collusion among unauthorized users who do not have valid ownership of a file of interest . Although they have access to the ciphertext of the file, they need the correct decryption key to decrypt the ciphertext. Suppose that the colluding users have obtained sufficiently many decryption keys for other files. Even with these decryption keys, it is impossible to compute the correct decryption key for unless they know both and secret key .

We also consider an attack in which unauthorized users collude with an MCSP. In addition to decryption keys for other files, they would have access to ciphertexts other than on the storage. However, because other ciphertexts contain no information about , these adversaries learn nothing about . This is the same as in the former case that requires the adversary to compute the correct decryption key of to succeed in the attack. Therefore, the proposed scheme resists attacks by colluding adversaries.

7. Evaluation

7.1. Comparative Analysis

We comparatively analyzed secure deduplication schemes regarding attack resistance, mobility support, file encryption, and deduplication cost. The result is summarized in Table 1.

CE (or MLE) has the lowest computational cost among deduplication schemes, because no math operations, such as exponentiation or group multiplication, are required to perform file encryption. However, because of its weak security against brute-force attacks, this scheme cannot guarantee strong confidentiality for the outsourced data. This implies that CE is also vulnerable to server-compromise attacks, because attackers who have compromised cloud servers can easily recover plaintexts from CE ciphertexts by brute force.
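The weakness of CE against brute-force attacks can be made concrete with a toy sketch. It derives the key from the hash of the message, as CE does, but replaces the block cipher with a SHA-256 counter-mode keystream so that the example needs only the standard library; this stand-in cipher and all function names are assumptions for illustration, not the implementation used in our experiments.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key || counter (illustration only, not for real use)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def ce_encrypt(message: bytes) -> Tuple[bytes, bytes]:
    """Convergent encryption: the key is the hash of the message itself."""
    key = hashlib.sha256(message).digest()
    stream = _keystream(key, len(message))
    ciphertext = bytes(m ^ s for m, s in zip(message, stream))
    return key, ciphertext


def dictionary_attack(target: bytes, candidates: Iterable[bytes]) -> Optional[bytes]:
    """Offline brute force: encrypt each candidate and compare with the target ciphertext."""
    for candidate in candidates:
        _, ciphertext = ce_encrypt(candidate)
        if ciphertext == target:
            return candidate
    return None
```

Because encryption is deterministic and needs no secret, identical files yield identical ciphertexts (which enables deduplication), but anyone holding a ciphertext and a dictionary of likely files can run `dictionary_attack` offline.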

Server-aided encryption schemes achieve resistance against brute-force attacks using an OPRF protocol (and its variants) with key servers. However, they are also vulnerable to server-compromise attacks: if one of the key servers is compromised and its secret key is leaked, the security of the whole system is downgraded to the level of CE. This implies that such schemes fail to guarantee strong confidentiality of outsourced data. Miao et al. [11] and Duan [10] tried to alleviate the risk of such attacks. However, these approaches still fail if more than a threshold number of servers are compromised.

In addition, in server-aided encryption, the cost of key computation is larger than in other solutions, and clients are required to interact with key servers to generate convergent keys. This inevitably adds a nonnegligible latency to file outsourcing operations. Such intrinsic latency and the need for key servers, which usually reside in central data centers, make server-aided encryption solutions less attractive in MCC environments, where the support of low-latency service and mobility is critically important.

Liu et al.’s scheme [31] eliminates the need for key servers. Instead of OPRF, this scheme uses a PAKE protocol to achieve security against brute-force attacks. The lack of additional servers inherently leads to improved security that prevents server-compromise attacks. In this scheme, however, executing many PAKE protocols with online checkers (i.e., users) is mandatory for each file encryption. Like server-aided encryption, this will incur high latency in performing file encryption, which degrades the effectiveness of the scheme and makes it unsuitable for MCC environments.

Abadi et al.’s scheme [15] requires neither additional servers nor interactive protocols with any entities in file encryption. However, full randomization in file encryption incurs an extremely high computational cost for ciphertext computation. In addition, randomized ciphertexts lose the ordering information that is necessary to allow deduplication using a tree-based index structure. Jiang et al. [36] addressed this problem of Abadi et al.’s scheme and proposed a method that achieves logarithmic complexity in searching for duplicate files in fully randomized deduplication. Their method uses a tree-based data structure called a decision tree, which is similar to a D-tree. Although the two schemes share the underlying tree-based approach, there is a significant difference: in Jiang et al.’s scheme, a user is required to interactively query the cloud server for each node on a path to find duplicates, whereas this can be achieved in a noninteractive manner in the proposed scheme.

As analyzed in Section 6.1, SEED guarantees strong confidentiality against brute-force attacks without using any additional key servers. Even if the MCSP is compromised, plain data cannot be recovered because the success probability of a brute-force attack is negligible. Therefore, SEED offers further security against server-compromise attacks. Although more math operations for ciphertext computation are needed than in server-aided encryption schemes, any interactions with servers are unnecessary while conducting file encryption. In addition, SEED achieves low latency in file encryption because it supports the novel property of lazy encryption, which is infeasible for other client-side deduplication schemes that require full ciphertext computation for tag generation. Using a random binary tree reduces the complexity of the deduplication algorithm to , which makes SEED much more efficient and scalable in MCC environments.

7.2. Experiments

To evaluate the computational efficiency, we implemented SEED and other deduplication schemes using Charm [37], a Python-based framework for prototyping cryptosystems. Charm provides useful math operations, such as group multiplication, exponentiation, and bilinear pairing, through Python wrapper modules for the native C libraries GNU Multiple Precision Arithmetic Library (GMP) and Pairing-Based Cryptography Library (PBC). Therefore, the performance overhead caused by the use of Python is limited to less than 1% [37]. We selected the SS501 curve in our experiment, which is a supersingular elliptic curve with symmetric Type 1 pairing. We chose SHA-256 as the cryptographic hash function and AES-CBC with 128-bit keys as the symmetric block cipher algorithm.

Our implementation consists of two modules: a client-side program simulating a file-uploading user and a server-side program simulating the MCSP, which oversees deduplication. In all our experiments, the client-side program was executed on a PC with an Intel Core i7-4770 3.4 GHz CPU and 4 GB of RAM, and the server-side program was executed on a server with an Intel Xeon E5-2676 2.4 GHz CPU and 8 GB of RAM. Both the PC and the server ran 64-bit Ubuntu 14.04 LTS. For server-aided encryption schemes, we used a LAN with a 100 Mbps Ethernet link to execute interactive protocols with a remote key server.

7.3. File Encryption

In secure deduplication schemes, file encryption makes up the majority of a user’s computational burden for the file uploading phase. Therefore, we measured the execution time of file encryption in SEED and other schemes. For server-aided encryption, we chose DupLESS as a comparative scheme because its computational cost is the cheapest of its kind [17].

We conducted the experiment for both deduplication architectures (i.e., client-side and server-side deduplication) with sample files whose size varied from 1 MB to 1 GB. Regarding client-side deduplication, we assumed that deduplication always happens for all the sample files. For each experiment, the measurement was repeated 1,000 times. The results of the experiments are shown in Figure 3. The term “Execution time” on the y-axis refers to the elapsed time to compute a ciphertext from a corresponding file . For client-side deduplication, it actually means the required time to generate a tag , because of the above assumption.

As shown in Figure 3, SEED shows better computational performance than DupLESS [8], Liu et al.’s scheme [31], and Abadi et al.’s scheme [15], which essentially require large computational tasks or high-latency interactions with remote entities during file encryption. Among server-side deduplication schemes, CE shows the shortest execution time because of its simplicity. However, when operated as client-side deduplication (Figure 3(b)), SEED shows the best computational performance owing to the novel property of lazy encryption. This is because the encryption of a file (i.e., operation) can be omitted when deduplication takes place. All other schemes, including CE, must generate a full ciphertext regardless of the deduplication result, because the ciphertext is required for computing the corresponding tag. Hence, those schemes show no difference in file-encryption performance between client-side and server-side deduplication.

7.4. Deduplication

In our second experiment, we measured the computational efficiency of the D-tree-based deduplication algorithm. The data set for the experiment consisted of files sampled from Windows system files, media files, Office files, and so on. The number of files varied from 100 to 20,000. The maximum height of the D-tree was set to be .

For the comparison, we also implemented the deduplication algorithms of other schemes. We chose a red-black tree as the indexing structure of our implementations for CE, DupLESS, and Liu et al.’s scheme. A red-black tree is a type of self-balancing binary search tree that guarantees search in O(log n) time for n stored items [38]. For Abadi et al.’s scheme, we used sequential search, because the equality test algorithm does not support a binary search tree.

Figure 4 shows the result of the experiment. The term “Number of test operations” refers to the number of operations to test equality between ciphertexts for each deduplication. For SEED, it means the number of executions of the Test algorithm. Because D-tree allows binary search for tags, the number of Test executions for each data set is almost the same as in the other schemes using red-black trees. SEED achieves 2-3 orders of magnitude higher performance (i.e., fewer equality-test operations) than Abadi et al.’s scheme.
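The gap between the two search strategies can be reproduced with a small counting sketch. It is an illustration under simplifying assumptions: a sorted Python list stands in for a balanced tree index, plain integer comparisons stand in for the equality test between ciphertexts (the Test algorithm in SEED), and both function names are hypothetical.

```python
def sequential_tests(index, target):
    """Count equality tests made by a linear scan until the target is found."""
    tests = 0
    for item in index:
        tests += 1
        if item == target:
            break
    return tests


def binary_search_tests(sorted_index, target):
    """Count equality tests made by binary search over a sorted index."""
    tests = 0
    lo, hi = 0, len(sorted_index) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        tests += 1  # one equality test per visited position
        if sorted_index[mid] == target:
            break
        if sorted_index[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return tests
```

For an index of 20,000 entries, the linear scan needs up to 20,000 tests, whereas binary search needs at most 15, which is in line with the difference of 2-3 orders of magnitude reported above.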

We also measured the actual elapsed time during the execution of deduplication algorithms. Figure 5 presents the execution time to complete deduplication for each data set. SEED needed slightly more execution time than the other schemes that use red-black trees, because the Test algorithm includes bilinear pairing operations, which incur high computational costs. Despite the computational overhead, however, the execution time does not exceed 150 ms even for the data set with the maximum number of files. We believe that the computational overhead can be further reduced using high-performance computing technologies, such as distributed and concurrent processing.

8. Discussion

8.1. Reliability of the Initial Uploader

In the proposed scheme, an initial uploader contributes to subsequent file upload processes by reencrypting a part of a ciphertext. Because the reencryption is crucial for subsequent uploaders to access the encrypted content, it is required that the initial uploader remains online to serve requests without interruption. In a mobile environment, for which the proposed scheme is intended, mobile devices are likely to be connected to the Internet most of the time. Hence, we reasonably assume that the reliability of the participation of an initial uploader (i.e., a mobile device) will be acceptable in most cases.

However, we should consider the possibility that the initial uploader might not be available due to various reasons (e.g., temporary loss of the connection). For the sake of more reliable service, we may relax the protocol, so that the first users that uploaded a file are regarded as the initial uploaders. Subsequent uploaders will be able to successfully conduct the file upload process if at least one initial uploader responds to the reencrypting request.

We analyzed the reliability of file uploading for subsequent uploaders with regard to the number of initial uploaders. Consider the case where an initial uploader is not available when a reencryption request has been sent. Suppose that each initial uploader is unavailable with the same probability, independently of the others. Then, the probability that at least one initial uploader successfully responds to the request is one minus the probability that all of them are unavailable. Figure 6 shows this probability with regard to the unavailability probability and the number of initial uploaders. Commercial cloud services such as AWS and Azure usually provide a service-level agreement (SLA) that guarantees more than 95% service availability. With this information, we can choose an appropriate number of initial uploaders. For instance, we choose for the case of and for .
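The reliability computation can be sketched in a few lines. Here `p_unavailable` denotes the probability that a single initial uploader is unavailable and `k` the number of initial uploaders; these names, the function names, and the numeric values in the usage note are illustrative assumptions, not the parameters used for Figure 6.

```python
def availability(p_unavailable, k):
    """Probability that at least one of k independent initial uploaders responds."""
    if not 0.0 <= p_unavailable <= 1.0:
        raise ValueError("p_unavailable must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    # All k uploaders are unavailable with probability p^k.
    return 1.0 - p_unavailable ** k


def min_uploaders(p_unavailable, target):
    """Smallest k whose availability reaches the target (e.g., an SLA level)."""
    if not 0.0 <= p_unavailable < 1.0:
        raise ValueError("p_unavailable must be in [0, 1)")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    k = 1
    while availability(p_unavailable, k) < target:
        k += 1
    return k
```

For example, if each uploader were unavailable half of the time, `min_uploaders(0.5, 0.95)` shows how many initial uploaders would be needed to reach a 95% target; a lower unavailability probability requires fewer.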

8.2. Storage Overhead due to Ciphertext Expansion

The proposed scheme relies on a bilinear pairing-based cryptosystem. Therefore, a ciphertext generated under the proposed scheme consists of several components whose size is directly related to the pairing. More specifically, the Encrypt algorithm, described in Section 5.1.1, generates a ciphertext , among which is an element of and T is an element of , where and are multiplicative groups that form a pairing . Hence, the ciphertext size expands exactly by . Regarding the storage overhead for the cloud service provider (i.e., MCSP), the ciphertext expansion may cause a certain level of performance degradation.

However, the storage overhead can be minimized due to the deduplication feature of the proposed scheme. That is, it is not necessary to store all the ciphertext components for deduplicated files in the storage. As described in Section 5.2.1, only in needs to be stored in the case where a duplicate file is found.

9. Conclusion

In this paper, we addressed the problem of deduplication over encrypted data in MCC environments by proposing SEED, a serverless and efficient encrypted deduplication scheme. The novelty of SEED originates from the elimination of key servers, which severely restrict user mobility, while not losing effective data confidentiality. The computational efficiency of file encryption is achieved through noninteractive file encryption and support for lazy encryption. As a result, SEED offers efficient, low-latency file uploading for mobile cloud storage.

Furthermore, a D-tree-based deduplication algorithm successfully reduces the time complexity of deduplication to . This makes SEED much more efficient and scalable, even in the case of large data items being outsourced in the storage.

The security of SEED was rigorously analyzed in this paper, and it was shown that the proposed scheme strongly guarantees security against brute-force attacks without the help of any key servers. The analysis showed that other desired security properties, such as data integrity and collusion resistance, were also achieved by SEED.

Extensive comparative analysis and experiments were conducted to evaluate the performance of SEED. We showed that SEED has advantages in security and efficiency compared to other encrypted deduplication solutions.

Data Availability

The experimental results used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they do not have any conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was extended from the poster presented at IEEE CloudCom [33]. This research was conducted under a Research Grant from Kwangwoon University in 2020. This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korean Government (MSIT) (No. 2019-0-00533, Research on CPU Vulnerability Detection and Validation), (No. 2019-0-00426, Development of Active Kill-Switch and Biomarker-Based Defense System for Life-Threatening Internet of Things Medical Devices), and (No. 2020-0-00325, Traceability Assurance Technology Development for Full Lifecycle Data Safety of Cloud Edge).