Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival.
Artificial Intelligence in Medicine ( IF 6.1 ) Pub Date : 2020-05-24 , DOI: 10.1016/j.artmed.2020.101875
Mireia Vilardell 1, Maria Buxó 2, Ramon Clèries 3, José Miguel Martínez 4, Gemma Garcia 1, Alberto Ameijide 5, Rebeca Font 6, Sergi Civit 1, Rafael Marcos-Gragera 7, Maria Loreto Vilardell 8, Marià Carulla 5, Josep Alfons Espinàs 6, Jaume Galceran 5, Angel Izquierdo 9, Josep Ma Borràs 3

Background

Two common issues may arise in certain population-based breast cancer (BC) survival studies: I) missing values in a predictor of survival, such as “Stage” at diagnosis, and II) small sample sizes in certain subsets of patients due to the “class imbalance problem”. Both issues call for data modeling/simulation methods.

Methods

We present a procedure, ModGraProDep, based on graphical modeling (GM) of a dataset to overcome these two issues. The performance of the models derived from ModGraProDep is compared with a set of frequently used classification and machine learning algorithms (Missing Data Problem) and with oversampling algorithms (Synthetic Data Simulation). For the Missing Data Problem we assessed two scenarios: missing completely at random (MCAR) and missing not at random (MNAR). Two validated BC datasets provided by the cancer registries of Girona and Tarragona (northeastern Spain) were used.
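The difference between the two missingness scenarios can be illustrated with a minimal sketch. It is not the authors' procedure: the toy records, the `age_group` variable, the function names, and the missingness rates are all assumptions made for illustration. Under MCAR every record has the same chance of losing its “Stage” value; under MNAR that chance depends on the (then unobserved) “Stage” value itself.

```python
import random


def mask_mcar(records, column, rate, seed=0):
    """Blank out `column` in a fixed fraction of records chosen uniformly at random."""
    rng = random.Random(seed)
    n_missing = round(rate * len(records))
    chosen = set(rng.sample(range(len(records)), n_missing))
    return [
        {**rec, column: None} if i in chosen else dict(rec)
        for i, rec in enumerate(records)
    ]


def mask_mnar(records, column, rate_by_value, seed=0):
    """Blank out `column` with a probability that depends on the value being hidden."""
    rng = random.Random(seed)
    out = []
    for rec in records:
        p = rate_by_value.get(rec[column], 0.0)
        out.append({**rec, column: None} if rng.random() < p else dict(rec))
    return out


# Hypothetical toy data: 3 age groups x 4 stages, repeated 50 times (600 records).
records = [
    {"age_group": a, "stage": s}
    for a in ("<50", "50-69", "70+")
    for s in ("I", "II", "III", "IV")
] * 50

# MCAR: 20% of all records lose "stage", regardless of any value.
mcar = mask_mcar(records, "stage", rate=0.2)

# MNAR: advanced stages are assumed (for illustration only) to go missing more often.
mnar = mask_mnar(records, "stage", {"I": 0.05, "II": 0.1, "III": 0.2, "IV": 0.5})
```

Both functions return copies, so the complete records remain available as ground truth when scoring an imputation method against the masked versions.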

Results

In both the MCAR and MNAR scenarios, all other models showed poorer prediction performance than three GM models: the saturated one (GM.SAT) and two with penalty factors on the partial likelihood (GM.K1 and GM.TEST). However, GM.SAT predictions could lead to unreliable conclusions in BC survival analysis. Simulating a “synthetic” dataset from GM.SAT could be the worst strategy, but using the remaining GM models could be better than oversampling.

Conclusion

Our results support the use of the presented GM procedure for one-variable imputation/prediction of missing data and for simulating “synthetic” BC survival datasets. The “synthetic” datasets derived from GMs could also be used in clinical applications of cancer survival data, such as predictive risk analysis.
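The two tasks named here, one-variable imputation and synthetic data simulation, can be sketched with a deliberately simple stand-in. This is not ModGraProDep: it uses plain conditional frequency tables over discrete variables, with no graph structure or penalty factors, and the variable names `age_group` and `grade`, the function names, and the toy records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_conditional_table(records, target, predictors):
    """Count target values within each observed combination of predictor values."""
    table = defaultdict(Counter)
    for rec in records:
        if rec[target] is not None:
            key = tuple(rec[p] for p in predictors)
            table[key][rec[target]] += 1
    return table


def impute(records, target, predictors, table):
    """Fill each missing target with the most frequent value in its predictor cell."""
    overall = Counter()
    for counts in table.values():
        overall.update(counts)
    fallback = overall.most_common(1)[0][0]  # used when a cell was never observed
    out = []
    for rec in records:
        rec = dict(rec)
        if rec[target] is None:
            counts = table.get(tuple(rec[p] for p in predictors))
            rec[target] = counts.most_common(1)[0][0] if counts else fallback
        out.append(rec)
    return out


def simulate(records, n, seed=0):
    """Draw n synthetic records from the empirical joint distribution of complete rows."""
    rng = random.Random(seed)
    complete = [r for r in records if None not in r.values()]
    return [dict(rng.choice(complete)) for _ in range(n)]


# Hypothetical toy records; the last two have a missing "stage".
records = [
    {"age_group": "<50", "grade": "low", "stage": "I"},
    {"age_group": "<50", "grade": "low", "stage": "I"},
    {"age_group": "<50", "grade": "low", "stage": "II"},
    {"age_group": "70+", "grade": "high", "stage": "III"},
    {"age_group": "70+", "grade": "high", "stage": "III"},
    {"age_group": "<50", "grade": "low", "stage": None},
    {"age_group": "70+", "grade": "high", "stage": None},
]

predictors = ["age_group", "grade"]
table = fit_conditional_table(records, "stage", predictors)
completed = impute(records, "stage", predictors, table)
synthetic = simulate(completed, n=100)
```

Sampling from the full empirical joint distribution, as `simulate` does, can only reproduce combinations that already occur in the data. A model that assumes some conditional independencies between variables can generate plausible combinations that were never observed, which is one reason a sparser model might be preferred for simulation.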




Updated: 2020-05-24