Fidelity and Privacy of Synthetic Medical Data
arXiv - CS - Cryptography and Security, Pub Date: 2021-01-18, DOI: arxiv-2101.08658
Ofer Mendelevitch, Michael D. Lesh

The digitization of medical records ushered in a new era of big data to clinical science, and with it the possibility that data could be shared, to multiply insights beyond what investigators could abstract from paper records. The need to share individual-level medical data to accelerate innovation in precision medicine continues to grow, and has never been more urgent, as scientists grapple with the COVID-19 pandemic. However, enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy. That is, the ability to extract private or confidential information about an individual, in practice, renders it difficult to share data, since significant infrastructure and data governance must be established before data can be shared. Although HIPAA provided de-identification as an approved mechanism for data sharing, linkage attacks were identified as a major vulnerability. A variety of mechanisms have been established to avoid leaking private information, such as field suppression or abstraction, strictly limiting the amount of information that can be shared, or employing mathematical techniques such as differential privacy. Another approach, which we focus on here, is creating synthetic data that mimics the underlying data. For synthetic data to be a useful mechanism in support of medical innovation and a proxy for real-world evidence, one must demonstrate two properties of the synthetic dataset: (1) any analysis on the real data must be matched by analysis of the synthetic data (statistical fidelity) and (2) the synthetic data must preserve privacy, with minimal risk of re-identification (privacy guarantee). In this paper we propose a framework for quantifying the statistical fidelity and privacy preservation properties of synthetic datasets and demonstrate these metrics for synthetic data generated by Syntegra technology.

Updated: 2021-01-22