Heterogeneous data release for cluster analysis with differential privacy
Introduction
As information has become a strategic resource in the era of big data, many organizations, such as government agencies and hospitals, release their data (e.g., census data or medical records) to third parties in order to reveal the hidden value of the data [1], [2]. However, directly releasing raw data may leak private information and may even violate privacy laws [3], [4]. To address this problem, privacy-preserving data publishing (PPDP) [5] has been studied extensively; its goal is to protect private information by distorting the raw data before publication while preserving as much utility as possible for subsequent data analysis.
Because of its strong privacy guarantee, ε-differential privacy [6], [7] has received increasing attention in the literature. As the structure of collected data becomes richer, many differentially-private approaches [8], [9], [10], [11] that handle relational data or set-valued data individually become ineffective. Relational data refer to data in which each record has a single value for each attribute, whereas set-valued data refer to data in which each record has one or more values for each attribute. Many real-life datasets are composed of both relational and set-valued data; such data are called heterogeneous data. For example, a patient who goes to the hospital for the first time may be asked to fill out a form that requires his/her gender (relational), age (relational), medical history (set-valued), etc. This information is stored as a heterogeneous data record in the hospital’s database to assist physicians in diagnosis and treatment. For heterogeneous data publishing, one naive approach is to vertically divide the raw data into subsets such that each subset has only one type of data structure, and then to apply existing approaches to these subsets independently. However, most data publishing scenarios require that the entire dataset be released together so that the associations among different data types are retained. On the other hand, many privacy-preserving works consider preserving utility for frequent itemset mining [12], [13], [14] or classification analysis [15], [16], [17], but very few works have focused on privacy protection for cluster analysis. Thus, we address these gaps with a differentially-private approach for releasing heterogeneous data for cluster analysis.
Consider the following data release scenario. The data owner wants to release heterogeneous data (e.g., Table 1) to the data recipient for clustering. If the data owner releases the raw data directly, individual privacy may be compromised. Thus, private information should be masked before release. Note that the data owner wants to release data records rather than clustering results because, unlike association rules and classifiers, clustering results (e.g., clusters with their centroids and sizes) may not provide enough information for further analysis. For example, the data recipient may examine the clustered records to discover their inherent relationships. Releasing data records not only satisfies the demand for clustering but also gives the data recipient greater flexibility in conducting specific data analyses.
In this paper, we present a differentially-private algorithm that protects individual privacy while preserving as much information as possible for cluster analysis. To tackle the lack of proper guidance for the masking process, our approach converts the clustering problem into a classification problem. That is, it groups the raw data into clusters and uses cluster/class labels to encode the cluster structure of the data. It then generalizes the raw data iteratively while preserving the cluster structure. At each iteration, the approach selects a general value in a probabilistic manner and specializes it into a more specific one. The process is repeated until the stopping conditions are met. Finally, noise is added to further guarantee ε-differential privacy. The contributions of this paper are summarized as follows:
- We formally define the problem of differentially-private heterogeneous data release for cluster analysis. This paper is the first work that tackles this problem and addresses the challenges of heterogeneity and the lack of guidance in the anonymization process for cluster analysis.
- We propose a customizable approach to heterogeneous data anonymization for cluster analysis. Users can choose different clustering algorithms and algorithmic parameters to get their desired results. Also, a distance metric that considers both relational and set-valued attributes is tailored for heterogeneous data clustering.
- To satisfy the differential privacy principle, we propose an algorithm that simultaneously handles relational and set-valued data in a non-deterministic fashion. Data of different types are anonymized in a similar way, which is computationally efficient.
- We extensively evaluate the proposed cluster-oriented approach on real-life datasets. The results suggest that our approach generates anonymized data of better utility than a general method that does not consider the task of cluster analysis during anonymization.
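To illustrate the kind of distance metric the second contribution refers to, here is a minimal sketch of a mixed-attribute distance for heterogeneous records. The specific combination (range-normalized difference for numeric attributes, 0/1 mismatch for categorical ones, Jaccard distance for set-valued ones) and the equal attribute weighting are assumptions for illustration, not the paper's exact metric.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two sets: 1 - |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def heterogeneous_distance(x: dict, y: dict, numeric_ranges: dict) -> float:
    """Average per-attribute distance over a heterogeneous record pair.

    x, y map attribute name -> value (number, string, or set);
    numeric_ranges maps each numeric attribute to its value range.
    """
    total = 0.0
    for attr, xv in x.items():
        yv = y[attr]
        if isinstance(xv, set):                      # set-valued attribute
            total += jaccard_distance(xv, yv)
        elif isinstance(xv, (int, float)):           # relational numeric
            rng = numeric_ranges[attr]
            total += abs(xv - yv) / rng if rng else 0.0
        else:                                        # relational categorical
            total += 0.0 if xv == yv else 1.0
    return total / len(x)

# Hypothetical records in the style of Table 1.
patient_a = {"age": 30, "gender": "F", "history": {"asthma", "flu"}}
patient_b = {"age": 40, "gender": "M", "history": {"flu"}}
d = heterogeneous_distance(patient_a, patient_b, {"age": 100})
```

A weighted sum could replace the plain average if some attributes matter more for a given clustering task.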
The rest of the paper is organized as follows. Related work is discussed in Section 2. Preliminaries including the problem statement are presented in Section 3. The proposed approach is described in Section 4, and experimental results are presented in Section 5. A discussion of the approach is given in Section 6. Section 7 concludes the paper.
Anonymization of different types of data
Relational data anonymization. Many privacy models have been proposed to anonymize relational data, such as k-anonymity [18], [19], ℓ-diversity [20], and t-closeness [21]. Recently, researchers have extended these models to provide stricter privacy protection. Amiri et al. [22] hide the correlations between identifying attributes and sensitive attributes and generate anonymous data satisfying a likeness requirement to prevent identity and attribute disclosures. Agarwal et al. [23] propose a privacy model called (,
Preliminaries
Table 2 summarizes the notation used in the remainder of the paper.
Proposed approach
In this section, we first present an overview of our approach to the problem of heterogeneous data anonymization for cluster analysis. We then elaborate details of the proposed differentially-private algorithm. Finally, we analyze the privacy guarantee and the time complexity of the algorithm.
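The probabilistic selection and noise addition sketched in the introduction typically rely on two standard differential privacy building blocks: the exponential mechanism for choosing which general value to specialize, and Laplace noise for perturbing released counts. The sketch below illustrates these primitives together with a top-down specialization loop; the taxonomy encoding, the score function, and the privacy budget split are assumptions for illustration, not the paper's exact algorithm.

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity=1.0):
    """Pick a candidate with probability proportional to exp(eps*score/(2*sens))."""
    weights = [math.exp(epsilon * score(c) / (2 * sensitivity)) for c in candidates]
    r = random.random() * sum(weights)
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def specialize(taxonomy_roots, children, score, epsilon, n_iters):
    """Top-down specialization sketch: start from the most general values
    (the taxonomy roots) and repeatedly replace one chosen value on the
    'cut' with its children.  The per-iteration budget split is an assumption."""
    cut = set(taxonomy_roots)
    eps_step = epsilon / (2 * n_iters)
    for _ in range(n_iters):
        candidates = [v for v in cut if children.get(v)]
        if not candidates:
            break
        v = exponential_mechanism(candidates, score, eps_step)
        cut.remove(v)
        cut.update(children[v])
    return cut
```

In a full pipeline, the score function would measure how well a candidate specialization preserves the cluster (class-label) structure, and the remaining budget would be spent adding `laplace_noise` to the released group counts.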
Experimental evaluation
In this section, we evaluate the performance of our approach. First, we study the quality of the clusters under different differential privacy requirements. Second, we compare the quality of the clusters in the anonymized dataset generated by our approach with those generated by a general method that does not focus on cluster analysis during anonymization. Third, we investigate the impact of using different clustering algorithms before and after anonymization. Fourth, we evaluate the scalability of
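One common way to quantify how well anonymization preserves a cluster structure is to compare the clustering of the anonymized data against the clustering of the raw data with a pairwise F-measure. The helper below is an illustrative sketch under that assumption, not necessarily the exact measure used in the experiments.

```python
from itertools import combinations

def pairwise_f_measure(labels_raw, labels_anon):
    """F-measure over record pairs: a pair is 'positive' if the raw
    clustering puts both records in the same cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_raw)), 2):
        same_raw = labels_raw[i] == labels_raw[j]
        same_anon = labels_anon[i] == labels_anon[j]
        if same_raw and same_anon:
            tp += 1          # pair kept together
        elif not same_raw and same_anon:
            fp += 1          # pair wrongly merged
        elif same_raw and not same_anon:
            fn += 1          # pair wrongly split
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means the anonymized data yields exactly the same co-clustering of records as the raw data.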
Discussion
Adaptability of DPHeter. Although only k-means and bisecting k-means were used in Section 5 to evaluate the performance of DPHeter, other clustering algorithms, such as DBSCAN [56], can be integrated into our approach; namely, other clustering algorithms can be applied in steps ① and ③ in Fig. 1. Our proposed approach provides a flexible framework in which the clustering algorithms can be viewed as “plug-in” components. DPHeter utilizes the clustering results to anonymize the raw data, not the
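The “plug-in” design described above can be sketched as a pipeline that accepts any clustering algorithm as a callable mapping records to cluster labels. The function names and the trivial threshold clusterer below are hypothetical illustrations of the interface, not part of DPHeter itself.

```python
from typing import Callable, Sequence

# Any algorithm with the signature records -> labels can be plugged in.
ClusterFn = Callable[[Sequence], Sequence[int]]

def anonymize_with(cluster_fn: ClusterFn, records, mask_fn):
    """Cluster the raw records, then mask them; returns (masked, labels).

    mask_fn stands in for the differentially-private masking step."""
    labels = cluster_fn(records)             # step 1: cluster the raw data
    masked = [mask_fn(r) for r in records]   # placeholder for DP masking
    return masked, labels

# Example: a trivial "clusterer" that thresholds a single numeric value,
# and a masking function that generalizes values to multiples of ten.
def threshold_clusterer(records):
    return [0 if r < 50 else 1 for r in records]

masked, labels = anonymize_with(threshold_clusterer, [10, 60, 30],
                                lambda r: r // 10 * 10)
```

Swapping `threshold_clusterer` for a k-means or DBSCAN wrapper with the same signature would leave the rest of the pipeline unchanged, which is the point of the plug-in design.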
Conclusions and future work
In this paper, we introduced an approach to release heterogeneous data for cluster analysis. The proposed approach utilizes cluster labels to encode the cluster structure and combines the generalization technique with output perturbation to mask raw data. The experimental results showed that the utility of the anonymized data produced by our cluster-oriented approach was significantly better than that of the anonymized data produced by the method without initially considering cluster analysis.
CRediT authorship contribution statement
Rong Wang: Methodology, Software, Validation, Investigation, Data curation, Writing - original draft, Visualization, Funding acquisition. Benjamin C.M. Fung: Conceptualization, Methodology, Formal analysis, Resources, Writing - review & editing. Yan Zhu: Resources, Writing - review & editing, Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The research was supported by the Sichuan Province Science and Technology Program (2019YFSY0032) and the China Scholarship Council (201807000083). We would also like to thank the editors and anonymous reviewers for their helpful comments, which have led to an improved version of this paper.
References (56)
- Utility-preserving differentially private data releases via individual ranking microaggregation, Inf. Fusion (2016)
- Hierarchical anonymization algorithms against background knowledge attack in data releasing, Knowl.-Based Syst. (2016)
- Privacy preserving publication of relational and transaction data: survey on the anonymization of patient data, Comp. Sci. Rev. (2019)
- A new approach for anonymizing relational and transaction data
- A graph-based multifold model for anonymizing data with attributes of multiple types, Comput. Secur. (2018)
- Anonymizing 1:M microdata with high utility, Knowl.-Based Syst. (2017)
- Privacy-preserving model and generalization correlation attacks for 1:M data with multiple sensitive attributes, Inform. Sci. (2019)
- Differential privacy preservation in regression analysis based on relevance, Knowl.-Based Syst. (2019)
- Privacy-preserving mechanisms for k-modes clustering, Comput. Secur. (2018)
- Differentially private classification with decision tree ensemble, Appl. Soft Comput. (2018)
- Benefits, adoption barriers and myths of open data and open government, Inf. Syst. Manage.
- The imperative to share clinical study reports: recommendations from the Tamiflu experience, PLoS Med.
- The HIPAA privacy rule
- The PIPEDA privacy law
- Privacy preserving data publishing, Int. J. Phys. Sci.
- Differential privacy
- Differentially private data publishing and analysis: a survey, IEEE Trans. Knowl. Data Eng.
- Data mining with differential privacy
- An effective differential privacy transaction data publication strategy, J. Comput. Res. Dev.
- Publishing set-valued data via differential privacy, Proc. VLDB Endow.
- Top-k frequent itemsets via differentially private FP-trees
- Locally differentially private frequent itemset mining
- Precision-enhanced differentially-private mining of high-confidence association rules, IEEE Trans. Dependable Secure Comput.
- Differential privacy for data and model publishing of medical data, IEEE Access
- PrivPfC: differentially private data publication for classification, VLDB J.
- A differential privacy support vector machine classifier based on dual variable perturbation, IEEE Access
- Protecting respondents' identities in microdata release, IEEE Trans. Knowl. Data Eng.
- k-Anonymity: a model for protecting privacy, Int. J. Uncertain. Fuzziness Knowl.-Based Syst.
1. The first author conducted the research during a visit to McGill University.