SecDedoop: Secure Deduplication with Access Control of Big Data in the HDFS/Hadoop Environment.
Big Data (IF 4.6) Pub Date: 2020-04-17, DOI: 10.1089/big.2019.0120
P. Ramya, C. Sundar

With the rapid growth of storage providers, data deduplication has become an essential storage optimization technique that greatly reduces storage costs by keeping a single copy of duplicate data. At the same time, deduplication raises new challenges, such as security and insufficient storage space. Hence, in this article, we propose SecDedoop, a secure data deduplication scheme with access control for big data in the HDFS (Hadoop Distributed File System)/Hadoop environment. First, the system achieves data confidentiality through a third-party vendor using elliptic curve cryptography (ECC); two keys (a public key and a private key) are generated for data retrieval. Second, we consider data deduplication: the user's original file is divided into equal-sized chunks, each chunk (e.g., 1.txt) is tokenized into words, and word weights are computed using TF-IDF. A SHA-3 hash is computed over the user's original file; if the hash value is not a duplicate, the data is stored in HDFS. A PSO (particle swarm optimization)-based MapReduce model is proposed for selecting the best data nodes: the MapReduce process is first run on the user's original file to produce a candidate set of data nodes, and PSO then computes a fitness value to select the best one. Further, we use MongoDB for fast indexing of the user's original files and apply FCM (fuzzy C-means) clustering to group them. We adopt modified versions of PSO and FCM to eliminate the open issues of their conventional counterparts. The performance of the proposed SecDedoop has been evaluated using various performance metrics and shown to outperform previous approaches.
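The abstract does not spell out the ECC construction. A minimal sketch of one common way to obtain ECC-based confidentiality is an ECIES-style hybrid: an ECDH shared secret run through HKDF to key AES-GCM, here using the Python `cryptography` package. The curve choice, KDF parameters, and `info` label are illustrative assumptions, not details from the paper.

```python
# ECIES-style hybrid encryption sketch: the file key is derived via ECDH
# between an ephemeral key pair and the recipient's public key.
# Curve (SECP256R1) and KDF parameters are illustrative assumptions.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def generate_keypair():
    private_key = ec.generate_private_key(ec.SECP256R1())
    return private_key, private_key.public_key()

def encrypt(recipient_public, plaintext: bytes):
    ephemeral = ec.generate_private_key(ec.SECP256R1())
    shared = ephemeral.exchange(ec.ECDH(), recipient_public)
    key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
               info=b"secdedoop-file-key").derive(shared)
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return ephemeral.public_key(), nonce, ciphertext

def decrypt(recipient_private, ephemeral_public, nonce, ciphertext):
    shared = recipient_private.exchange(ec.ECDH(), ephemeral_public)
    key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
               info=b"secdedoop-file-key").derive(shared)
    return AESGCM(key).decrypt(nonce, ciphertext, None)

priv, pub = generate_keypair()
eph_pub, nonce, ct = encrypt(pub, b"user file contents")
assert decrypt(priv, eph_pub, nonce, ct) == b"user file contents"
```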
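A minimal sketch of the chunking and TF-IDF weighting step, using only the Python standard library. The chunk size, the alphanumeric tokenizer, and the log-based IDF form are assumptions, since the abstract does not specify them.

```python
# Sketch: split a file into equal-sized chunks, tokenize each chunk into
# words, and weight words by TF-IDF across the chunks.
import math
import re
from collections import Counter

def split_into_chunks(data: bytes, chunk_size: int = 4096):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def tokenize(chunk: bytes):
    return re.findall(r"[a-z0-9]+", chunk.decode("utf-8", errors="ignore").lower())

def tf_idf_weights(chunks):
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    # Document frequency: number of chunks each word appears in.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1
        weights.append({w: (tf[w] / total) * math.log(n_docs / df[w])
                        for w in tf})
    return weights

data = b"big data deduplication keeps a single copy of duplicate big data"
for w in tf_idf_weights(split_into_chunks(data, chunk_size=16)):
    print(w)
```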
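The SHA-3 duplicate check itself is straightforward; a sketch with `hashlib` follows, where an in-memory dict stands in for the paper's MongoDB hash index and the HDFS path is hypothetical.

```python
# Sketch: file-level duplicate detection via SHA-3. A hash index maps
# digests to stored files; only files with unseen digests are written.
import hashlib

hash_index = {}  # digest -> HDFS path (stand-in for the MongoDB index)

def store_if_new(name: str, data: bytes) -> bool:
    digest = hashlib.sha3_256(data).hexdigest()
    if digest in hash_index:
        return False                               # duplicate: keep single copy
    hash_index[digest] = f"/secdedoop/{name}"      # hypothetical HDFS path
    # ... write `data` to HDFS here ...
    return True

print(store_if_new("1.txt", b"hello"))     # True: first copy stored
print(store_if_new("copy.txt", b"hello"))  # False: duplicate detected
```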
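The abstract names PSO-based fitness computation for data node selection without giving the fitness formula. The sketch below runs a plain (unmodified) PSO that minimizes an invented node fitness over load, used space, and latency, then chooses the real node nearest the swarm optimum; all weights, node attributes, and PSO constants are illustrative.

```python
# Sketch: a minimal PSO minimizing a data-node fitness, then picking the
# candidate node closest to the swarm optimum. Not the paper's formula.
import random

nodes = {                    # (cpu_load, used_space, latency), all in [0, 1]
    "dn1": (0.7, 0.6, 0.8),
    "dn2": (0.3, 0.2, 0.2),
    "dn3": (0.5, 0.4, 0.5),
}

def fitness(pos):            # lower is better
    load, used, latency = pos
    return 0.4 * load + 0.4 * used + 0.2 * latency

def pso(dim=3, particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    xs = [[random.random() for _ in range(dim)] for _ in range(particles)]
    vs = [[0.0] * dim for _ in range(particles)]
    pbest = [x[:] for x in xs]
    gbest = min(pbest, key=fitness)
    for _ in range(iters):
        for i, x in enumerate(xs):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - x[d])
                            + c2 * r2 * (gbest[d] - x[d]))
                x[d] = min(1.0, max(0.0, x[d] + vs[i][d]))
            if fitness(x) < fitness(pbest[i]):
                pbest[i] = x[:]
        gbest = min(pbest, key=fitness)
    return gbest

best = pso()
chosen = min(nodes, key=lambda n: sum((a - b) ** 2
                                      for a, b in zip(nodes[n], best)))
print("best data node:", chosen)   # dn2 for these illustrative numbers
```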
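For the MongoDB indexing step, a sketch assuming `pymongo` and a local MongoDB instance; the database, collection, and field names are invented for illustration.

```python
# Sketch: a unique MongoDB index over file digests for fast duplicate
# lookups. Requires a running MongoDB; names below are hypothetical.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")        # assumed local instance
files = client["secdedoop"]["files"]                     # invented db/collection
files.create_index([("sha3", ASCENDING)], unique=True)   # one copy per digest

def lookup(digest: str):
    return files.find_one({"sha3": digest})              # indexed lookup
```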
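For the file clustering step, a sketch of plain (unmodified) fuzzy C-means with NumPy; the paper's modified FCM would replace the standard membership/center updates. The feature vectors, cluster count c=2, and fuzzifier m=2 are illustrative assumptions.

```python
# Sketch: standard fuzzy C-means over per-file feature vectors
# (e.g., TF-IDF summaries), with the usual alternating updates.
import numpy as np

def fcm(X, c=2, m=2.0, iters=100, eps=1e-9):
    n = X.shape[0]
    rng = np.random.default_rng(0)
    U = rng.dirichlet(np.ones(c), size=n)         # fuzzy memberships, rows sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        ratio = dist[:, :, None] / dist[:, None, :]   # ratio[i, k, j] = d_ik / d_ij
        U_new = 1.0 / (ratio ** (2 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < 1e-6:
            U = U_new
            break
        U = U_new
    return centers, U

X = np.array([[0.10, 0.20], [0.15, 0.25], [0.90, 0.80], [0.85, 0.90]])
centers, U = fcm(X)
print(np.round(U, 2))   # soft membership of each file in each cluster
```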
