DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark, The Journal of Supercomputing



DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark
The Journal of Supercomputing (IF 3.3), Pub Date: 2021-07-05, DOI: 10.1007/s11227-021-03958-3
Farough Ashkouti ₁, Keyhan Khamforoosh ₁, Amir Sheikhahmadi ₁, Hana Khamfroush ₂


One of the main steps in the data lifecycle is publishing data so that analysts can discover hidden patterns. However, data publishing may lead to unwanted disclosure of personal information and cause privacy problems. Data anonymization techniques enforce privacy models to prevent the disclosure of individuals’ private information in published data. In this paper, a distributed in-memory method is proposed on the Apache Spark framework to satisfy the ℓ-diversity privacy model. This method anonymizes large-scale data in a three-phase process: seed selection, data clustering for ℓ-diversity, and a finalizing phase. For this method, a hierarchical k-means-based data clustering algorithm has been designed for data anonymization. One of the major challenges of anonymization methods is to establish a better trade-off between data utility and privacy. Therefore, to calculate the distance between records and form more cohesive ℓ-diverse clusters, the authors have designed two distance functions, one Manhattan-based and one Euclidean-based, that satisfy the requirements of the ℓ-diversity model. Given the reported 100-fold speedup of Spark over MapReduce, the proposed method is implemented using in-memory RDD programming in Apache Spark to address the runtime, scalability, and performance problems of previous MapReduce-based algorithms for large-scale data anonymization. The method provides general knowledge on using Spark's parallel in-memory computation for big data anonymization. In experiments, this method obtained lower information loss and lost only about 1% to 2% in the accuracy and F-measure criteria; it therefore establishes a better trade-off than the state-of-the-art MapReduce-based Mondrian methods.
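
The abstract does not spell out the distance functions or the clustering steps, so the following is only a rough, single-machine sketch of the underlying ideas, not the authors' algorithm. It assumes numeric quasi-identifiers, uses the plain Manhattan and Euclidean distances (the paper's versions are adapted to ℓ-diversity in ways not described here), and checks the simple "distinct" form of ℓ-diversity, where every cluster must contain at least ℓ different sensitive values. All function names and the sample records are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def manhattan(a, b):
    """Plain Manhattan (L1) distance between two numeric tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Plain Euclidean (L2) distance between two numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(records, seeds, dist):
    """Assign each (quasi_identifiers, sensitive_value) record to its nearest seed."""
    clusters = defaultdict(list)
    for qi, sensitive in records:
        nearest = min(range(len(seeds)), key=lambda i: dist(qi, seeds[i]))
        clusters[nearest].append((qi, sensitive))
    return dict(clusters)


def is_l_diverse(clusters, l):
    """Distinct l-diversity: every cluster holds at least l distinct sensitive values."""
    return all(
        len({sensitive for _, sensitive in recs}) >= l
        for recs in clusters.values()
    )


# Hypothetical records: (age, weight) as quasi-identifiers, diagnosis as sensitive value.
records = [
    ((25, 50), "flu"),
    ((27, 52), "asthma"),
    ((60, 90), "flu"),
    ((62, 95), "diabetes"),
]
seeds = [(26, 51), (61, 92)]  # assumed output of a seed-selection phase

clusters = assign_to_nearest(records, seeds, manhattan)
print(is_l_diverse(clusters, 2))
```

In a Spark setting the assignment step would presumably be expressed as a transformation over an RDD of records, but that mapping is an assumption and is not taken from the abstract.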


Updated: 2021-07-05
