Optimizing the synthesis of clinical trial data using sequential trees
Journal of the American Medical Informatics Association (IF 4.7), Pub Date: 2020-11-13, DOI: 10.1093/jamia/ocaa249
Khaled El Emam 1, 2, 3 , Lucy Mosquera 3 , Chaoyi Zheng 3

Abstract
Objective
With the growing demand for sharing clinical trial data, scalable methods that enable privacy-protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data depends on the variable order. No assessments of the impact of variable order on synthesized clinical trial data have been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as the variable order is randomly shuffled, and to implement an optimization algorithm that finds a good order if the variability is too high.
Materials and Methods
Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics were computed comparing real and synthetic data: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm was implemented to optimize variable order, and was compared with a curriculum learning approach to ordering variables.
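As a rough illustration of how a particle swarm can search over variable orders, the sketch below uses a random-key encoding: each particle is a vector of real-valued keys, and sorting the keys yields a permutation of the variables. The abstract does not say how orders were encoded or how the swarm was configured, so the encoding, the parameter values, and every name here (`decode_order`, `pso_variable_order`, `toy_loss`) are assumptions. The toy loss only stands in for the real objective, which would synthesize data in the candidate order and score its utility.

```python
import random
from typing import Callable, List, Sequence, Tuple


def decode_order(keys: Sequence[float]) -> List[int]:
    """Random-key decoding: variable indices sorted by ascending key value."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


def pso_variable_order(
    loss: Callable[[List[int]], float],
    n_vars: int,
    n_particles: int = 10,
    n_iters: int = 30,
    seed: int = 0,
    inertia: float = 0.7,      # assumed value, not from the paper
    c_personal: float = 1.5,   # assumed value, not from the paper
    c_global: float = 1.5,     # assumed value, not from the paper
) -> Tuple[List[int], float]:
    """Minimize loss(order) over permutations of range(n_vars)."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(n_vars)] for _ in range(n_particles)]
    vel = [[0.0] * n_vars for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_loss = [loss(decode_order(p)) for p in pos]
    g = min(range(n_particles), key=lambda k: best_loss[k])
    g_pos, g_loss = best_pos[g][:], best_loss[g]

    for _ in range(n_iters):
        for k in range(n_particles):
            for j in range(n_vars):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + pull toward the
                # particle's own best + pull toward the swarm's best.
                vel[k][j] = (
                    inertia * vel[k][j]
                    + c_personal * r1 * (best_pos[k][j] - pos[k][j])
                    + c_global * r2 * (g_pos[j] - pos[k][j])
                )
                pos[k][j] += vel[k][j]
            current = loss(decode_order(pos[k]))
            if current < best_loss[k]:
                best_pos[k], best_loss[k] = pos[k][:], current
                if current < g_loss:
                    g_pos, g_loss = pos[k][:], current
    return decode_order(g_pos), g_loss


def toy_loss(order: List[int]) -> float:
    """Hypothetical stand-in objective: count of variables out of place
    relative to the identity order. A real objective would run the
    synthesizer with this order and measure utility."""
    return float(sum(1 for pos_idx, var in enumerate(order) if pos_idx != var))


if __name__ == "__main__":
    order, value = pso_variable_order(toy_loss, n_vars=6)
    print(order, value)
```

Random keys keep the swarm's arithmetic in continuous space while always decoding to a valid permutation, which is why they are a common choice for order-based problems; other encodings would work as well.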
Results
As the number of variables in a clinical trial dataset increases, the variability of data utility with variable order increases markedly. Particle swarm with a distinguishability hinge loss ensured adequate utility across all 6 datasets. The hinge threshold was selected to avoid overfitting, which can create a privacy problem. This approach was superior to curriculum learning in terms of utility.
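The role of the hinge threshold can be sketched in a few lines. The abstract gives neither the formula for the distinguishability metric nor the threshold value, so this example assumes a propensity-score mean squared error as the metric and a threshold of 0.05. Both are illustrative assumptions, as are the function names.

```python
from typing import Sequence


def propensity_mse(propensities: Sequence[float]) -> float:
    """Assumed distinguishability metric: mean squared deviation from 0.5 of
    a classifier's predicted probability that a record is synthetic.
    0.0 means real and synthetic records are indistinguishable."""
    if not propensities:
        raise ValueError("need at least one propensity score")
    return sum((p - 0.5) ** 2 for p in propensities) / len(propensities)


def hinge_loss(distinguishability: float, threshold: float = 0.05) -> float:
    """Zero at or below the threshold, growing linearly above it.
    The threshold value 0.05 is an assumption for illustration."""
    return max(0.0, distinguishability - threshold)


if __name__ == "__main__":
    scores = [0.45, 0.55, 0.60, 0.40]
    print(hinge_loss(propensity_mse(scores)))
```

Because the loss is flat below the threshold, an optimizer gains nothing by pushing distinguishability toward zero, which is the behavior that would otherwise encourage synthetic records to replicate real ones.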
Conclusions
The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.


Updated: 2021-01-16