A Probabilistic Procedure for Anonymisation, for Assessing the Risk of Re-identification and for the Analysis of Perturbed Data Sets, Journal of Official Statistics

Journal of Official Statistics (IF 0.5), Pub Date: 2020-03-01, DOI: 10.2478/jos-2020-0005
Harvey Goldstein, Natalie Shlomo

Abstract

The requirement to anonymise data sets that are to be released for secondary analysis should be balanced by the need to allow their analysis to provide efficient and consistent parameter estimates. The proposal in this article is to integrate the process of anonymisation and data analysis. The first stage uses the addition of random noise with known distributional properties to some or all variables in a released (already pseudonymised) data set, in which the values of some identifying and sensitive variables for data subjects of interest are also available to an external ‘attacker’ who wishes to identify those data subjects in order to interrogate their records in the data set. The second stage of the analysis consists of specifying the model of interest so that parameter estimation accounts for the added noise. Where the characteristics of the noise are made available to the analyst by the data provider, we propose a new method that allows a valid analysis. This is formally a measurement error model and we describe a Bayesian MCMC algorithm that recovers consistent estimates of the true model parameters. A new method for handling categorical data is presented. The article shows how an appropriate noise distribution can be determined.
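The two-stage idea in the abstract (perturb with noise of known variance, then fit a model that accounts for that noise) can be illustrated with a small simulation. The sketch below is not the article's Bayesian MCMC algorithm; it swaps in a simple moment-based correction for classical measurement error in a single continuous predictor, and every name in it (`simulate`, `naive_slope`, `corrected_slope`, the chosen variances and sample size) is an assumption made for illustration only.

```python
import numpy as np


def simulate(n=20000, beta=2.0, noise_sd=1.0, seed=0):
    """Simulate a linear model and release a noise-perturbed predictor.

    Illustrative assumption: only the predictor is perturbed, with
    independent zero-mean Gaussian noise of known standard deviation.
    """
    rng = np.random.default_rng(seed)
    x_true = rng.normal(0.0, 1.0, n)                      # unreleased true values
    y = 1.0 + beta * x_true + rng.normal(0.0, 0.5, n)     # outcome of interest
    x_released = x_true + rng.normal(0.0, noise_sd, n)    # stage 1: add known noise
    return x_released, y


def naive_slope(x, y):
    """Ordinary least-squares slope that ignores the added noise."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)


def corrected_slope(x_released, y, noise_sd):
    """Stage 2 (simplified): divide the naive slope by the reliability ratio.

    The reliability ratio is var(true x) / var(released x); the numerator is
    estimated by subtracting the known noise variance from the observed one.
    """
    var_released = np.var(x_released, ddof=1)
    reliability = (var_released - noise_sd ** 2) / var_released
    return naive_slope(x_released, y) / reliability


x_released, y = simulate()
naive = naive_slope(x_released, y)
corrected = corrected_slope(x_released, y, noise_sd=1.0)
print(f"naive slope: {naive:.3f}, corrected slope: {corrected:.3f}")
```

With these settings the naive slope is attenuated towards roughly half the true value of 2.0, while the corrected slope lands close to it. This is only possible because the analyst is told the noise variance, which is the condition the abstract places on the data provider.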

Updated: 2020-03-01
