Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
Journal of Medical Internet Research (IF 7.4), Pub Date: 2020-11-16, DOI: 10.2196/23139
Khaled El Emam, Lucy Mosquera, Jason Bass

Background: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them.

Objective: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.

Methods: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.

Results: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively.

Conclusions: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.

This is the abstract only. Read the full article on the JMIR site. JMIR is the leading open access journal for eHealth and healthcare in the Internet age.


Updated: 2020-11-16