HDDA: DataSifter: statistical obfuscation of electronic health records and other sensitive datasets, Journal of Statistical Computation and Simulation



Journal of Statistical Computation and Simulation (IF 1.1), Pub Date: 2018-11-11, DOI: 10.1080/00949655.2018.1545228
Simeone Marino (1), Nina Zhou (1, 2), Yi Zhao (1), Lu Wang (2), Qiucheng Wu (1), Ivo D Dinov (1, 3, 4, 5)


ABSTRACT: There are no practical and effective mechanisms for sharing high-dimensional data that include sensitive information, in fields such as health, financial intelligence, or socioeconomics, without either compromising the utility of the data or exposing private personal or secure organizational information. Excessive scrambling or encoding of the information makes it less useful for modelling or analytical processing. Insufficient preprocessing may compromise sensitive information and introduce a substantial risk of re-identification of individuals by various stratification techniques. To address this problem, we developed a novel statistical obfuscation method (DataSifter) for on-the-fly de-identification of structured and unstructured sensitive high-dimensional data, such as clinical data from electronic health records (EHR). DataSifter provides complete administrative control over the balance between the risk of data re-identification and the preservation of the data information. Simulation results suggest that DataSifter can provide privacy protection while maintaining data utility for different types of outcomes of interest. The application of DataSifter to a large autism dataset provides a realistic demonstration of its promise for practical applications.


Updated: 2018-11-11
