Heterogeneous data release for cluster analysis with differential privacy

https://doi.org/10.1016/j.knosys.2020.106047

Abstract

Many models have been proposed to preserve data privacy in different data publishing scenarios. Among these models, ϵ-differential privacy has drawn increasing attention in recent years due to its rigorous privacy guarantee. While many existing solutions based on ϵ-differential privacy handle relational data and set-valued data separately, most real-life data, such as electronic health records, are heterogeneous. Privacy protection for heterogeneous data has not been widely studied. Furthermore, many existing works on privacy protection preserve utility for frequent itemset mining or classification analysis, but few have focused on data publication for cluster analysis. In this paper, we propose the first differentially-private solution for releasing heterogeneous data for cluster analysis. The challenge is how to mask the raw data without any explicit guidance. Our approach addresses this challenge by converting the clustering problem into a classification problem, in which class labels encode the cluster structure of the raw data and guide the masking process. The approach generalizes the raw data probabilistically and adds noise to satisfy ϵ-differential privacy. Through extensive experiments on real-life datasets, we validate the performance of our approach.

Introduction

As information becomes a strategic resource in the era of big data, many organizations, such as government agencies and hospitals, release their data (e.g., census data or medical records) to third parties in order to reveal the hidden value of the data [1], [2]. However, directly releasing raw data may leak private information and may even violate privacy laws [3], [4]. To address this problem, privacy-preserving data publishing (PPDP) [5] has been studied extensively, with the goal of protecting private information by distorting the raw data before publication while preserving as much utility as possible for subsequent data analysis.

Because of its strong privacy guarantee, ϵ-differential privacy [6], [7] has received increasing attention in the literature. As the structure of collected data becomes richer, many differentially-private approaches [8], [9], [10], [11] that handle only relational data or only set-valued data become ineffective. Relational data refer to data in which each record has a single value for each attribute, whereas set-valued data refer to data in which each record has one or more values per attribute. Much real-life data are composed of both relational and set-valued parts; such data are called heterogeneous data. For example, a patient who goes to a hospital for the first time may be asked to fill out a form with his/her gender (relational), age (relational), medical history (set-valued), etc. This information is stored as a heterogeneous data record in the hospital’s database to assist physicians in diagnosis and treatment. For heterogeneous data publishing, one naive approach is to vertically divide the raw data into subsets such that each subset contains only one type of data, and then to apply existing approaches to these subsets independently. However, most data publishing scenarios require that the entire dataset be released together so that the associations among different data types are retained. On the other hand, many privacy-preserving works preserve utility for frequent itemset mining [12], [13], [14] or classification analysis [15], [16], [17], but very few have focused on privacy protection for cluster analysis. We close these gaps with a differentially-private approach to releasing heterogeneous data for cluster analysis.
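
To make the data model concrete, the following is a minimal sketch, not taken from the paper, of how such a heterogeneous record might be represented: relational attributes carry exactly one value, while set-valued attributes carry an unordered set of items. The class and attribute names are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class HeteroRecord:
    """One heterogeneous record: relational attributes hold a single
    value; set-valued attributes hold an unordered set of items."""
    gender: str                                             # relational
    age: int                                                # relational
    history: frozenset = field(default_factory=frozenset)   # set-valued

# The first-visit patient example from the text:
patient = HeteroRecord(gender="F", age=34,
                       history=frozenset({"asthma", "diabetes"}))
```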

Consider the following data release scenario. The data owner wants to release heterogeneous data (e.g., Table 1) to a data recipient for clustering. If the data owner releases the raw data directly, individuals’ privacy may be compromised, so private information should be masked before release. Note that the data owner wants to release data records, not clustering results: unlike association rules and classifiers, clustering results (e.g., clusters with their centroids and sizes) may not provide enough information for further analysis. For example, the data recipient may want to browse the records within a cluster to discover their inherent relationships. Releasing data records not only satisfies the demand for clustering, but also gives the data recipient greater flexibility in conducting his or her specific data analysis.

In this paper, we present a differentially-private algorithm that protects individual privacy while preserving as much information as possible for cluster analysis. To tackle the lack of explicit guidance for the masking process, our approach converts the clustering problem into a classification problem: it groups the raw data into clusters and uses the cluster/class labels to encode the cluster structure of the data. It then generalizes the raw data iteratively while preserving the cluster structure. At each iteration, the approach selects a general value in a probabilistic manner and specializes it into more specific values. The process repeats until a stopping condition is met. Finally, noise is added to guarantee ϵ-differential privacy. The contributions of this paper are summarized as follows:

  • We formally define the problem of differentially-private heterogeneous data release for cluster analysis. This paper is the first work that tackles this problem and addresses the challenges of heterogeneity and lack of guidance in the anonymization process for cluster analysis.

  • We propose a customizable approach to heterogeneous data anonymization for cluster analysis. Users can choose different clustering algorithms and algorithmic parameters to obtain their desired results. Also, a distance metric that considers both relational and set-valued attributes is tailored for heterogeneous data clustering (a sketch of one such metric appears after this list).

  • To satisfy the differential privacy principle, we propose an algorithm that handles relational and set-valued data simultaneously in a non-deterministic fashion. Both data types are anonymized in a uniform way, which keeps the computation efficient.

  • We extensively evaluate the performance of the proposed cluster-oriented approach on real-life datasets. The results suggest that our approach generates anonymized data of better utility than a general-purpose method that does not consider the task of cluster analysis during anonymization.
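
As a concrete, hedged illustration of the distance metric mentioned in the second contribution, the sketch below combines a range-normalized difference on numeric relational attributes, a mismatch indicator on categorical relational attributes, and the Jaccard distance on set-valued attributes, averaged over all attributes. The paper’s actual metric is defined in Section 4; the components and equal weighting here are assumptions.

```python
def mixed_distance(x, y, numeric_ranges):
    """Illustrative distance between two heterogeneous records.

    x, y: dicts mapping attribute -> value; numeric relational values
    are int/float, categorical relational values are strings, and
    set-valued attributes are (frozen)sets.
    numeric_ranges: attribute -> (min, max) over the dataset, used to
    scale each numeric difference into [0, 1].
    """
    total = 0.0
    for attr, xv in x.items():
        yv = y[attr]
        if isinstance(xv, (set, frozenset)):     # set-valued: Jaccard distance
            union = xv | yv
            total += 1.0 - len(xv & yv) / len(union) if union else 0.0
        elif isinstance(xv, (int, float)):       # numeric relational attribute
            lo, hi = numeric_ranges[attr]
            total += abs(xv - yv) / (hi - lo) if hi > lo else 0.0
        else:                                    # categorical relational attribute
            total += 0.0 if xv == yv else 1.0
    return total / len(x)                        # mean over all attributes

# Example with the patient schema from the Introduction:
a = {"gender": "F", "age": 34, "history": {"asthma", "diabetes"}}
b = {"gender": "M", "age": 40, "history": {"asthma"}}
print(mixed_distance(a, b, {"age": (0, 100)}))   # (1 + 0.06 + 0.5) / 3 = 0.52
```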

The rest of the paper is organized as follows. Related work is discussed in Section 2. Preliminaries including the problem statement are presented in Section 3. The proposed approach is described in Section 4, and experimental results are presented in Section 5. A discussion of the approach is given in Section 6. Section 7 concludes the paper.

Section snippets

Anonymization of different types of data

Relational data anonymization. Many privacy models have been proposed to anonymize relational data, such as k-anonymity [18], [19], l-diversity [20], and t-closeness [21]. Recently, researchers have extended these models to provide stricter privacy protection. Amiri et al. [22] hide the correlations between identifying attributes and sensitive attributes and generate k-anonymous β-likeness data to prevent identity and attribute disclosures. Agarwal et al. [23] propose a privacy model called (P, U…

Preliminaries

Table 2 summarizes the notation used in the remainder of the paper.

Proposed approach

In this section, we first present an overview of our approach to the problem of heterogeneous data anonymization for cluster analysis. We then elaborate on the details of the proposed differentially-private algorithm. Finally, we analyze the privacy guarantee and the time complexity of the algorithm.
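
The algorithm’s details follow in this section. Purely as a minimal sketch of the loop described in the Introduction, the code below pairs an exponential-mechanism choice of which general value to specialize with Laplace noise on the released counts. The even budget split, the scoring function, and all helper names are illustrative assumptions, not the paper’s algorithm.

```python
import math
import random

def exponential_mechanism(candidates, score, eps, sensitivity=1.0):
    """Pick one candidate with probability proportional to
    exp(eps * score / (2 * sensitivity))."""
    top = max(score(c) for c in candidates)      # subtract max for stability
    weights = [math.exp(eps * (score(c) - top) / (2.0 * sensitivity))
               for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

def laplace_noise(scale):
    """Laplace(0, scale) as the difference of two exponential draws."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def specialize(cut, taxonomy, score, eps, rounds):
    """Top-down specialization: start from the most general values and,
    for a fixed number of rounds, probabilistically pick one value in
    the current cut and replace it with its children in the taxonomy.
    Half of the budget is assumed for selection, half for perturbation."""
    eps_round = eps / (2.0 * rounds)             # assumed per-round budget
    for _ in range(rounds):
        expandable = [v for v in cut if v in taxonomy]
        if not expandable:
            break
        v = exponential_mechanism(expandable, score, eps_round)
        cut.remove(v)
        cut.update(taxonomy[v])
    return cut

def perturb_counts(counts, eps):
    """Output perturbation: each count has sensitivity 1, so Laplace
    noise with scale 2/eps spends the remaining eps/2 of the budget."""
    return {group: n + laplace_noise(2.0 / eps) for group, n in counts.items()}
```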

Experimental evaluation

In this section, we evaluate the performance of our approach. First, we study the quality of the clusters under different differential privacy budgets. Second, we compare the quality of the clusters of the anonymized dataset generated by our approach against those generated by a general method that does not focus on cluster analysis during anonymization. Third, we investigate the impact of using different clustering algorithms before and after anonymization. Fourth, we evaluate the scalability of…
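
The quality measure used in the experiments is defined later in this section. As a hedged stand-in for readers who want something runnable, the sketch below scores the agreement between the clustering of the raw data and the clustering of its anonymized counterpart using the pairwise F-measure; the choice of this particular metric is an assumption, not necessarily the paper’s.

```python
from itertools import combinations

def pairwise_f_measure(labels_raw, labels_anon):
    """Agreement between two clusterings of the same records: a pair is
    a true positive if it is co-clustered in both partitions."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_raw)), 2):
        same_raw = labels_raw[i] == labels_raw[j]
        same_anon = labels_anon[i] == labels_anon[j]
        if same_raw and same_anon:
            tp += 1
        elif same_anon:
            fp += 1
        elif same_raw:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Two relabelings of the same partition agree perfectly:
print(pairwise_f_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```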

Discussion

Adaptability of DPHeter. Although only k-means and bisecting k-means were used in Section 5 to evaluate the performance of DPHeter, other clustering algorithms, such as DBSCAN [56], can be integrated into our approach; namely, other clustering algorithms can be applied in steps ① and ③ in Fig. 1. Our proposed approach provides a flexible framework in which the clustering algorithms can be viewed as “plug-in” components. DPHeter utilizes the clustering results to anonymize the raw data, not the…
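
To illustrate the “plug-in” view described above, the sketch below shows one way a framework might accept any callable that maps records to cluster labels; the wrapper names are hypothetical, the masking step is elided, and the scikit-learn calls in the comment are an illustration rather than the DPHeter implementation.

```python
from typing import Callable, Sequence

# A clustering "plug-in" is any callable mapping records to integer labels.
ClusterFn = Callable[[Sequence], Sequence[int]]

def mask_guided_by(records, labels):
    """Placeholder for the label-guided masking step (hypothetical)."""
    raise NotImplementedError

def anonymize_with(cluster_fn: ClusterFn, records):
    """Framework seam: cluster the raw records, then let the resulting
    labels guide the masking; only the plug-in point is shown here."""
    labels = cluster_fn(records)
    return mask_guided_by(records, labels)

# Any algorithm of this shape slots in, e.g. (illustrative only):
#   from sklearn.cluster import KMeans, DBSCAN
#   anonymize_with(lambda X: KMeans(n_clusters=5).fit_predict(X), data)
#   anonymize_with(lambda X: DBSCAN(eps=0.5).fit_predict(X), data)
```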

Conclusions and future work

In this paper, we introduced an approach to releasing heterogeneous data for cluster analysis. The proposed approach utilizes cluster labels to encode the cluster structure and combines generalization with output perturbation to mask the raw data. The experimental results showed that the utility of the anonymized data produced by our cluster-oriented approach was significantly better than that of the anonymized data produced by a method that does not consider cluster analysis during anonymization.

CRediT authorship contribution statement

Rong Wang: Methodology, Software, Validation, Investigation, Data curation, Writing - original draft, Visualization, Funding acquisition. Benjamin C.M. Fung: Conceptualization, Methodology, Formal analysis, Resources, Writing - review & editing. Yan Zhu: Resources, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The research was supported by the Sichuan Province Science and Technology Program (2019YFSY0032) and the China Scholarship Council (201807000083). We would also like to thank the editors and anonymous reviewers for their helpful comments, which have led to an improved version of this paper.

References (56)

  • Janssen, M., et al., Benefits, adoption barriers and myths of open data and open government, Inf. Syst. Manage. (2012).
  • Doshi, P., et al., The imperative to share clinical study reports: Recommendations from the Tamiflu experience, PLoS Med. (2012).
  • The HIPAA privacy rule (2019).
  • The PIPEDA privacy law (2019).
  • Rashid, A.H., et al., Privacy preserving data publishing, Int. J. Phys. Sci. (2015).
  • Dwork, C., Differential privacy.
  • Zhu, T., et al., Differentially private data publishing and analysis: A survey, IEEE Trans. Knowl. Data Eng. (2017).
  • Friedman, A., et al., Data mining with differential privacy.
  • Jia, O., et al., An effective differential privacy transaction data publication strategy, J. Comput. Res. Dev. (2014).
  • Chen, R., et al., Publishing set-valued data via differential privacy, Proc. VLDB Endow. (2011).
  • Lee, J., et al., Top-k frequent itemsets via differentially private FP-trees.
  • Wang, T., et al., Locally differentially private frequent itemset mining.
  • Maruseac, M., et al., Precision-enhanced differentially-private mining of high-confidence association rules, IEEE Trans. Dependable Secure Comput. (2018).
  • Sun, Z., et al., Differential privacy for data and model publishing of medical data, IEEE Access (2019).
  • Su, D., et al., PrivPfC: Differentially private data publication for classification, VLDB J. (2018).
  • Zhang, Y., et al., A differential privacy support vector machine classifier based on dual variable perturbation, IEEE Access (2019).
  • Samarati, P., Protecting respondents’ identities in microdata release, IEEE Trans. Knowl. Data Eng. (2001).
  • Sweeney, L., k-Anonymity: A model for protecting privacy, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. (2002).
1. The first author conducted this research during a visit to McGill University.
