Application of Bayesian networks to generate synthetic health data
Journal of the American Medical Informatics Association (IF 4.7) Pub Date: 2020-12-23, DOI: 10.1093/jamia/ocaa303
Dhamanpreet Kaur 1, Matthew Sobiesk 1, Shubham Patil 2, Jin Liu 3, Puran Bhagat 3, Amar Gupta 1, Natasha Markuzon 3
Abstract
Objective
This study seeks to develop a fully automated method for generating synthetic data from a real dataset. Medical organizations could use the method to distribute health data to researchers, reducing the need for access to real data. We hypothesize that Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data.
Materials and Methods
We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data.
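As a rough illustration of the generation step described above, the following minimal sketch fits conditional probability tables for a small discrete Bayesian network and draws synthetic records by ancestral sampling. It is not the authors' implementation: the network structure is fixed by hand rather than learned, and the variable names, the toy data generator, and all probabilities are hypothetical assumptions chosen only to make the example runnable.

```python
import random
from collections import Counter, defaultdict

# Hypothetical structure (an assumption, fixed by hand, NOT learned from data):
# each variable maps to its parents, and keys are listed in topological order.
STRUCTURE = {
    "diabetes": [],
    "hypertension": ["diabetes"],
    "heart_disease": ["diabetes", "hypertension"],
}


def fit_cpts(records, structure, alpha=1.0):
    """Estimate P(var = 1 | parents) from binary records with Laplace smoothing."""
    counts = {var: defaultdict(Counter) for var in structure}
    for rec in records:
        for var, parents in structure.items():
            key = tuple(rec[p] for p in parents)
            counts[var][key][rec[var]] += 1

    def prob_one(var, key):
        c = counts[var][key]
        return (c[1] + alpha) / (c[0] + c[1] + 2 * alpha)

    return prob_one


def sample_record(prob_one, structure, rng):
    """Ancestral sampling: draw each variable after its parents have been drawn."""
    rec = {}
    for var, parents in structure.items():  # dicts keep insertion order
        key = tuple(rec[p] for p in parents)
        rec[var] = 1 if rng.random() < prob_one(var, key) else 0
    return rec


def make_toy_records(n, rng):
    """Toy stand-in for a real dataset; every probability here is invented."""
    out = []
    for _ in range(n):
        d = int(rng.random() < 0.10)
        h = int(rng.random() < (0.60 if d else 0.20))
        hd = int(rng.random() < (0.05 + 0.15 * d + 0.10 * h))
        out.append({"diabetes": d, "hypertension": h, "heart_disease": hd})
    return out


def prevalence(records, var):
    """Fraction of records in which the binary variable equals 1."""
    return sum(r[var] for r in records) / len(records)


rng = random.Random(0)
real = make_toy_records(5000, rng)
prob_one = fit_cpts(real, STRUCTURE)
synthetic = [sample_record(prob_one, STRUCTURE, rng) for _ in range(5000)]

# Compare how often each (possibly rare) variable occurs in real vs. synthetic data.
for var in STRUCTURE:
    print(var, round(prevalence(real, var), 3), round(prevalence(synthetic, var), 3))
```

Comparing per-variable prevalence is only the simplest of the checks listed above; a fuller evaluation would also cover association rules, disclosure risk, and a classifier trained to tell real from synthetic records.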
Results
Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules.
Discussion
Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools.
Conclusion
We conclude that Bayesian networks are a promising option for generating realistic synthetic health data that preserve the features of the original data without compromising data privacy.


Updated: 2020-12-23