Synthetic microdata for establishment surveys under informative sampling, The Journal of the Royal Statistical Society, Series A (Statistics in Society)

The Journal of the Royal Statistical Society, Series A (Statistics in Society) (IF 1.5), Pub Date: 2020-11-10, DOI: 10.1111/rssa.12622
Hang J. Kim, Jörg Drechsler, Katherine J. Thompson

Many agencies are investigating whether releasing synthetic microdata could be a viable dissemination strategy for highly sensitive data, such as business data, for which disclosure avoidance regulations otherwise prohibit the release of public use microdata. However, existing methods assume that the original data either cover the entire population or comprise a simple random sample, which limits the application of these methods in the context of survey data with unequal weights. This paper discusses synthetic data generation under informative sampling. To utilise design information in survey weights, we rely on the pseudo likelihood approach when building a hierarchical Bayesian model to estimate the distribution of the finite population. Then, synthetic populations are randomly drawn from the estimated finite population density. We present the full conditional distributions of the Markov chain Monte Carlo algorithm for posterior inference with the pseudo likelihood function. Using simulation studies, we show that the suggested synthetic data approach offers high utility for design‐based and model‐based analyses while offering a high level of disclosure protection. We apply the proposed method to a subset of the 2012 U.S. Economic Census and evaluate results with utility metrics and disclosure avoidance metrics under data attacker scenarios commonly used for business data.
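The pseudo likelihood idea mentioned in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' hierarchical Bayesian model or their Markov chain Monte Carlo algorithm. It assumes a single normal model for a hypothetical log-revenue variable, Poisson sampling whose inclusion probabilities increase with that variable (so the sampling is informative), and a point estimate that maximises the weighted log-likelihood, where each sampled unit's contribution is multiplied by its survey weight. All names (`pseudo_mle_normal`, `simulate`) and all numeric settings are illustrative assumptions.

```python
import math
import random


def pseudo_mle_normal(values, weights):
    """Maximise the pseudo log-likelihood sum_i w_i * log N(y_i | mu, sigma^2).

    For a normal model the maximiser has a closed form: the weighted mean
    and the weighted standard deviation (divisor = sum of the weights).
    """
    total_w = sum(weights)
    mu = sum(w * y for w, y in zip(weights, values)) / total_w
    var = sum(w * (y - mu) ** 2 for w, y in zip(weights, values)) / total_w
    return mu, math.sqrt(var)


def simulate(seed=0, pop_size=20000):
    """Compare unweighted and weighted fits under informative sampling."""
    rng = random.Random(seed)

    # Hypothetical finite population: log-revenue of establishments.
    population = [rng.gauss(10.0, 1.0) for _ in range(pop_size)]

    # Informative Poisson sampling: larger units are more likely to be
    # selected, so the sample over-represents large values of y.
    sample, weights = [], []
    for y in population:
        pi = min(1.0, 0.02 * math.exp(0.5 * (y - 10.0)))
        if rng.random() < pi:
            sample.append(y)
            weights.append(1.0 / pi)  # survey weight = inverse inclusion prob.

    pop_mean = sum(population) / pop_size
    naive_mean = sum(sample) / len(sample)  # ignores the design
    mu_hat, sigma_hat = pseudo_mle_normal(sample, weights)

    # One synthetic population drawn from the fitted population model.
    synthetic = [rng.gauss(mu_hat, sigma_hat) for _ in range(pop_size)]
    synthetic_mean = sum(synthetic) / pop_size

    return {
        "pop_mean": pop_mean,
        "naive_mean": naive_mean,
        "pseudo_mle_mean": mu_hat,
        "synthetic_mean": synthetic_mean,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting the unweighted sample mean is pulled toward large establishments, whereas the weighted fit targets the finite-population distribution, which is the property that makes the synthetic draws usable for population-level analyses. A full treatment along the lines of the abstract would replace the closed-form fit with posterior sampling under the pseudo likelihood and a richer model.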


Updated: 2020-11-10
