Heterogeneous data release for cluster analysis with differential privacy

https://doi.org/10.1016/j.knosys.2020.106047

Abstract

Many models have been proposed to preserve data privacy in different data publishing scenarios. Among these models, ϵ-differential privacy has drawn increasing attention in recent years due to its rigorous privacy guarantee. While many existing solutions based on ϵ-differential privacy handle relational data and set-valued data separately, most real-life data, such as electronic health records, are heterogeneous. Privacy protection for heterogeneous data has not been widely studied. Furthermore, many existing works on privacy protection preserve utility for frequent itemset mining or classification analysis, but few have focused on data publication for cluster analysis. In this paper, we propose the first differentially-private solution for releasing heterogeneous data for cluster analysis. The challenge is how to mask the raw data without any explicit guidance. Our approach addresses this challenge by converting the clustering problem into a classification problem, in which class labels encode the cluster structure of the raw data and guide the masking process. The approach generalizes the raw data probabilistically and adds noise to satisfy ϵ-differential privacy. Through extensive experiments on real-life datasets, we validate the performance of our approach.

Introduction

As information becomes a strategic resource in the era of big data, many organizations, such as government agencies and hospitals, release their data (e.g., census data or medical records) to third parties in order to reveal the hidden value of the data [1], [2]. However, directly releasing raw data may leak private information and may even violate privacy laws [3], [4]. To address this problem, privacy-preserving data publishing (PPDP) [5] has been studied extensively, with the goal of protecting private information by distorting the raw data before publication while preserving as much utility as possible for subsequent data analysis.

Because of its strong privacy guarantee, ϵ-differential privacy [6], [7] has received increasing attention in the literature. As the structure of collected data becomes richer, many differentially-private approaches [8], [9], [10], [11] that handle only relational data or only set-valued data become ineffective. Relational data refer to data in which each record has a single value for each attribute, whereas set-valued data refer to data in which each record has one or more values per attribute. Much real-life data are composed of both relational and set-valued parts; such data are called heterogeneous data. For example, a patient who goes to a hospital for the first time may be asked to fill out a form with his/her gender (relational), age (relational), medical history (set-valued), etc. This information is stored as a heterogeneous data record in the hospital’s database to assist physicians in diagnosis and treatment. For heterogeneous data publishing, one naive approach is to vertically divide the raw data into subsets such that each subset contains only one type of data, and then to apply existing approaches to these subsets independently. However, most data publishing scenarios require that the entire dataset be released together so that the associations among different data types are retained. On the other hand, many privacy-preserving works preserve utility for frequent itemset mining [12], [13], [14] or classification analysis [15], [16], [17], but very few have focused on privacy protection for cluster analysis. We close these gaps with a differentially-private approach to releasing heterogeneous data for cluster analysis.
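
To make the data model concrete, the following is a minimal sketch, not taken from the paper, of how such a heterogeneous record might be represented: relational attributes carry exactly one value, while set-valued attributes carry an unordered set of items. The class and attribute names are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class HeteroRecord:
    """One heterogeneous record: relational attributes hold a single
    value; set-valued attributes hold an unordered set of items."""
    gender: str                                             # relational
    age: int                                                # relational
    history: frozenset = field(default_factory=frozenset)   # set-valued

# The first-visit patient example from the text:
patient = HeteroRecord(gender="F", age=34,
                       history=frozenset({"asthma", "diabetes"}))
```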

Consider the following data release scenario. The data owner wants to release heterogeneous data (e.g., Table 1) to a data recipient for clustering. If the data owner releases the raw data directly, individuals’ privacy may be compromised, so private information should be masked before release. Note that the data owner wants to release data records, not clustering results: unlike association rules and classifiers, clustering results (e.g., clusters with their centroids and sizes) may not provide enough information for further analysis. For example, the data recipient may want to browse the records within a cluster to discover their inherent relationships. Releasing data records not only satisfies the demand for clustering, but also gives the data recipient greater flexibility in conducting his or her specific data analysis.

In this paper, we present a differentially-private algorithm that protects individual privacy while preserving as much information as possible for cluster analysis. To tackle the lack of explicit guidance for the masking process, our approach converts the clustering problem into a classification problem: it groups the raw data into clusters and uses the cluster/class labels to encode the cluster structure of the data. It then generalizes the raw data iteratively while preserving the cluster structure. At each iteration, the approach selects a general value in a probabilistic manner and specializes it into more specific values. The process repeats until a stopping condition is met. Finally, noise is added to guarantee ϵ-differential privacy. The contributions of this paper are summarized as follows:

  • We formally define the problem of differentially-private heterogeneous data release for cluster analysis. This paper is the first work that tackles this problem and addresses the challenges of heterogeneity and lack of guidance in the anonymization process for cluster analysis.

  • We propose a customizable approach to heterogeneous data anonymization for cluster analysis. Users can choose different clustering algorithms and algorithmic parameters to obtain their desired results. Also, a distance metric that considers both relational and set-valued attributes is tailored for heterogeneous data clustering (a sketch of one such metric appears after this list).

  • To satisfy the differential privacy principle, we propose an algorithm that handles relational and set-valued data simultaneously in a non-deterministic fashion. Both data types are anonymized in a uniform way, which keeps the computation efficient.

  • We extensively evaluate the performance of the proposed cluster-oriented approach on real-life datasets. The results suggest that our approach generates anonymized data of better utility than a general-purpose method that does not consider the task of cluster analysis during anonymization.
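
As a concrete, hedged illustration of the distance metric mentioned in the second contribution, the sketch below combines a range-normalized difference on numeric relational attributes, a mismatch indicator on categorical relational attributes, and the Jaccard distance on set-valued attributes, averaged over all attributes. The paper’s actual metric is defined in Section 4; the components and equal weighting here are assumptions.

```python
def mixed_distance(x, y, numeric_ranges):
    """Illustrative distance between two heterogeneous records.

    x, y: dicts mapping attribute -> value; numeric relational values
    are int/float, categorical relational values are strings, and
    set-valued attributes are (frozen)sets.
    numeric_ranges: attribute -> (min, max) over the dataset, used to
    scale each numeric difference into [0, 1].
    """
    total = 0.0
    for attr, xv in x.items():
        yv = y[attr]
        if isinstance(xv, (set, frozenset)):     # set-valued: Jaccard distance
            union = xv | yv
            total += 1.0 - len(xv & yv) / len(union) if union else 0.0
        elif isinstance(xv, (int, float)):       # numeric relational attribute
            lo, hi = numeric_ranges[attr]
            total += abs(xv - yv) / (hi - lo) if hi > lo else 0.0
        else:                                    # categorical relational attribute
            total += 0.0 if xv == yv else 1.0
    return total / len(x)                        # mean over all attributes

# Example with the patient schema from the Introduction:
a = {"gender": "F", "age": 34, "history": {"asthma", "diabetes"}}
b = {"gender": "M", "age": 40, "history": {"asthma"}}
print(mixed_distance(a, b, {"age": (0, 100)}))   # (1 + 0.06 + 0.5) / 3 = 0.52
```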

The rest of the paper is organized as follows. Related work is discussed in Section 2. Preliminaries including the problem statement are presented in Section 3. The proposed approach is described in Section 4, and experimental results are presented in Section 5. A discussion of the approach is given in Section 6. Section 7 concludes the paper.

Section snippets

Anonymization of different types of data

Relational data anonymization. Many privacy models have been proposed to anonymize relational data, such as k-anonymity [18], [19], l-diversity [20], and t-closeness [21]. Recently, researchers have extended these models to provide stricter privacy protection. Amiri et al. [22] hide the correlations between identifying attributes and sensitive attributes and generate k-anonymous β-likeness data to prevent identity and attribute disclosures. Agarwal et al. [23] propose a privacy model called (P, U…

Preliminaries

Table 2 summarizes the notation used in the remainder of the paper.

Proposed approach

In this section, we first present an overview of our approach to the problem of heterogeneous data anonymization for cluster analysis. We then elaborate on the details of the proposed differentially-private algorithm. Finally, we analyze the privacy guarantee and the time complexity of the algorithm.
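
The algorithm’s details follow in this section. Purely as a minimal sketch of the loop described in the Introduction, the code below pairs an exponential-mechanism choice of which general value to specialize with Laplace noise on the released counts. The even budget split, the scoring function, and all helper names are illustrative assumptions, not the paper’s algorithm.

```python
import math
import random

def exponential_mechanism(candidates, score, eps, sensitivity=1.0):
    """Pick one candidate with probability proportional to
    exp(eps * score / (2 * sensitivity))."""
    top = max(score(c) for c in candidates)      # subtract max for stability
    weights = [math.exp(eps * (score(c) - top) / (2.0 * sensitivity))
               for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

def laplace_noise(scale):
    """Laplace(0, scale) as the difference of two exponential draws."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def specialize(cut, taxonomy, score, eps, rounds):
    """Top-down specialization: start from the most general values and,
    for a fixed number of rounds, probabilistically pick one value in
    the current cut and replace it with its children in the taxonomy.
    Half of the budget is assumed for selection, half for perturbation."""
    eps_round = eps / (2.0 * rounds)             # assumed per-round budget
    for _ in range(rounds):
        expandable = [v for v in cut if v in taxonomy]
        if not expandable:
            break
        v = exponential_mechanism(expandable, score, eps_round)
        cut.remove(v)
        cut.update(taxonomy[v])
    return cut

def perturb_counts(counts, eps):
    """Output perturbation: each count has sensitivity 1, so Laplace
    noise with scale 2/eps spends the remaining eps/2 of the budget."""
    return {group: n + laplace_noise(2.0 / eps) for group, n in counts.items()}
```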

Experimental evaluation

In this section, we evaluate the performance of our approach. First, we study the quality of the clusters under different differential privacy budgets. Second, we compare the quality of the clusters of the anonymized dataset generated by our approach against those generated by a general method that does not focus on cluster analysis during anonymization. Third, we investigate the impact of using different clustering algorithms before and after anonymization. Fourth, we evaluate the scalability of…
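
The quality measure used in the experiments is defined later in this section. As a hedged stand-in for readers who want something runnable, the sketch below scores the agreement between the clustering of the raw data and the clustering of its anonymized counterpart using the pairwise F-measure; the choice of this particular metric is an assumption, not necessarily the paper’s.

```python
from itertools import combinations

def pairwise_f_measure(labels_raw, labels_anon):
    """Agreement between two clusterings of the same records: a pair is
    a true positive if it is co-clustered in both partitions."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_raw)), 2):
        same_raw = labels_raw[i] == labels_raw[j]
        same_anon = labels_anon[i] == labels_anon[j]
        if same_raw and same_anon:
            tp += 1
        elif same_anon:
            fp += 1
        elif same_raw:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Two relabelings of the same partition agree perfectly:
print(pairwise_f_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```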

Discussion

Adaptability of DPHeter. Although only k-means and bisecting k-means were used in Section 5 to evaluate the performance of DPHeter, other clustering algorithms, such as DBSCAN [56], can be integrated into our approach; namely, other clustering algorithms can be applied in steps ① and ③ in Fig. 1. Our proposed approach provides a flexible framework in which the clustering algorithms can be viewed as “plug-in” components. DPHeter utilizes the clustering results to anonymize the raw data, not the…
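
To illustrate the “plug-in” view described above, the sketch below shows one way a framework might accept any callable that maps records to cluster labels; the wrapper names are hypothetical, the masking step is elided, and the scikit-learn calls in the comment are an illustration rather than the DPHeter implementation.

```python
from typing import Callable, Sequence

# A clustering "plug-in" is any callable mapping records to integer labels.
ClusterFn = Callable[[Sequence], Sequence[int]]

def mask_guided_by(records, labels):
    """Placeholder for the label-guided masking step (hypothetical)."""
    raise NotImplementedError

def anonymize_with(cluster_fn: ClusterFn, records):
    """Framework seam: cluster the raw records, then let the resulting
    labels guide the masking; only the plug-in point is shown here."""
    labels = cluster_fn(records)
    return mask_guided_by(records, labels)

# Any algorithm of this shape slots in, e.g. (illustrative only):
#   from sklearn.cluster import KMeans, DBSCAN
#   anonymize_with(lambda X: KMeans(n_clusters=5).fit_predict(X), data)
#   anonymize_with(lambda X: DBSCAN(eps=0.5).fit_predict(X), data)
```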

Conclusions and future work

In this paper, we introduced an approach to releasing heterogeneous data for cluster analysis. The proposed approach utilizes cluster labels to encode the cluster structure and combines generalization with output perturbation to mask the raw data. The experimental results showed that the utility of the anonymized data produced by our cluster-oriented approach was significantly better than that of the anonymized data produced by a method that does not consider cluster analysis during anonymization.

CRediT authorship contribution statement

Rong Wang: Methodology, Software, Validation, Investigation, Data curation, Writing - original draft, Visualization, Funding acquisition. Benjamin C.M. Fung: Conceptualization, Methodology, Formal analysis, Resources, Writing - review & editing. Yan Zhu: Resources, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The research was supported by the Sichuan Province Science and Technology Program (2019YFSY0032) and the China Scholarship Council (201807000083). We would also like to thank the editors and anonymous reviewers for their helpful comments, which have led to an improved version of this paper.

References (56)

  • Janssen, M., et al., Benefits, adoption barriers and myths of open data and open government, Inf. Syst. Manage. (2012).
  • Doshi, P., et al., The imperative to share clinical study reports: Recommendations from the Tamiflu experience, PLoS Med. (2012).
  • The HIPAA privacy rule (2019).
  • The PIPEDA privacy law (2019).
  • Rashid, A.H., et al., Privacy preserving data publishing, Int. J. Phys. Sci. (2015).
  • Dwork, C., Differential privacy.
  • Zhu, T., et al., Differentially private data publishing and analysis: A survey, IEEE Trans. Knowl. Data Eng. (2017).
  • Friedman, A., et al., Data mining with differential privacy.
  • Jia, O., et al., An effective differential privacy transaction data publication strategy, J. Comput. Res. Dev. (2014).
  • Chen, R., et al., Publishing set-valued data via differential privacy, Proc. VLDB Endow. (2011).
  • Lee, J., et al., Top-k frequent itemsets via differentially private FP-trees.
  • Wang, T., et al., Locally differentially private frequent itemset mining.
  • Maruseac, M., et al., Precision-enhanced differentially-private mining of high-confidence association rules, IEEE Trans. Dependable Secure Comput. (2018).
  • Sun, Z., et al., Differential privacy for data and model publishing of medical data, IEEE Access (2019).
  • Su, D., et al., PrivPfC: Differentially private data publication for classification, VLDB J. (2018).
  • Zhang, Y., et al., A differential privacy support vector machine classifier based on dual variable perturbation, IEEE Access (2019).
  • Samarati, P., Protecting respondents’ identities in microdata release, IEEE Trans. Knowl. Data Eng. (2001).
  • Sweeney, L., k-Anonymity: A model for protecting privacy, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. (2002).
1. The first author conducted this research during a visit to McGill University.
