Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival

Artificial Intelligence in Medicine (IF 6.1), Pub Date: 2020-05-24, DOI: 10.1016/j.artmed.2020.101875
Mireia Vilardell ₁, Maria Buxó ₂, Ramon Clèries ₃, José Miguel Martínez ₄, Gemma Garcia ₁, Alberto Ameijide ₅, Rebeca Font ₆, Sergi Civit ₁, Rafael Marcos-Gragera ₇, Maria Loreto Vilardell ₈, Marià Carulla ₅, Josep Alfons Espinàs ₆, Jaume Galceran ₅, Angel Izquierdo ₉, Josep Ma Borràs ₃

Affiliations

₁ Sección de Estadística del Departamento de Genética, Microbiología y Estadística de la Facultad de Biología, Universidad de Barcelona, 08028, Spain.
₂ Institut d'Investigació Biomèdica de Girona, IDIBGI, C/Dr.Castany s/n. Edifici M2, Parc Hospitalari Martí i Julià, 17190 Salt, Spain; Registre de Cáncer de Girona - Unitat d'Epidemiologia, Pla Director d'Oncologia, Institut Català d'Oncología, Grup d'Epidemiologia Descriptiva, Genètica i Prevenció del Càncer de Girona-IDIBGI, Girona 17005, Spain.
₃ Pla Director d'Oncología. IDIBELL, Hospitalet de Llobregat, Av Gran Vía 199-203 08908, Spain; Department de Ciències Clíniques de la Universitat de Barcelona, 08907, Spain.
₄ Departamento Análisis y Planificación Recursos Sanitarios, MC Mutual, 08037, Barcelona, Spain; Department of Statistics, Technical University of Catalonia, 08028 Barcelona, Spain; Public Health Research Group, University of Alicante, 03690 Alicante, Spain.
₅ Registre de Càncer de Tarragona, Servei d'Epidemiologia i Prevenció del Càncer, Hospital Universitari Sant Joan de Reus, IISPV, Reus, Spain.
₆ Pla Director d'Oncología. IDIBELL, Hospitalet de Llobregat, Av Gran Vía 199-203 08908, Spain.
₇ School of Medicine, University of Girona (UdG), Girona, Spain; Centro de Investigación Biomédica en Red: Epidemiología y Salud Pública (CIBERESP), Madrid, Spain; Institut d'Investigació Biomèdica de Girona, IDIBGI, C/Dr.Castany s/n. Edifici M2, Parc Hospitalari Martí i Julià, 17190 Salt, Spain; Registre de Cáncer de Girona - Unitat d'Epidemiologia, Pla Director d'Oncologia, Institut Català d'Oncología, Grup d'Epidemiologia Descriptiva, Genètica i Prevenció del Càncer de Girona-IDIBGI, Girona 17005, Spain.
₈ Registre de Cáncer de Girona - Unitat d'Epidemiologia, Pla Director d'Oncologia, Institut Català d'Oncología, Grup d'Epidemiologia Descriptiva, Genètica i Prevenció del Càncer de Girona-IDIBGI, Girona 17005, Spain.
₉ Servei d'Oncología Médica. Institut Català d'Oncología, Hospital Universitari de Girona Doctor Josep Trueta, Girona 17005, Spain; Registre de Cáncer de Girona - Unitat d'Epidemiologia, Pla Director d'Oncologia, Institut Català d'Oncología, Grup d'Epidemiologia Descriptiva, Genètica i Prevenció del Càncer de Girona-IDIBGI, Girona 17005, Spain.

Background

Two common issues may arise in population-based breast cancer (BC) survival studies: (i) missing values in a variable that predicts survival, such as “Stage” at diagnosis, and (ii) small sample sizes in certain subsets of patients due to the “class imbalance problem”. Both call for data modeling/simulation methods.

Methods

We present a procedure, ModGraProDep, based on graphical modeling (GM) of a dataset to overcome these two issues. The performance of the models derived from ModGraProDep is compared with a set of frequently used classification and machine learning algorithms (for the missing data problem) and with oversampling algorithms (for synthetic data simulation). For the missing data problem we assessed two scenarios: missing completely at random (MCAR) and missing not at random (MNAR). Two validated BC datasets provided by the cancer registries of Girona and Tarragona (northeastern Spain) were used.
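
The difference between the two missingness scenarios can be illustrated with a minimal sketch. It is not the authors' simulation design: the toy stage distribution, the function names `mask_mcar` and `mask_mnar`, and all missingness rates are assumptions chosen only for illustration. Under MCAR every value is hidden with the same probability; under MNAR the probability depends on the unobserved value itself.

```python
import random

def mask_mcar(stages, rate, seed=0):
    """MCAR: hide each value with the same probability, whatever its value."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else s for s in stages]

def mask_mnar(stages, rate_by_stage, seed=0):
    """MNAR: hide each value with a probability that depends on the value itself."""
    rng = random.Random(seed)
    return [None if rng.random() < rate_by_stage[s] else s for s in stages]

def missing_share(original, masked, stage):
    """Fraction of records with the given true stage that ended up missing."""
    hidden = sum(1 for o, m in zip(original, masked) if o == stage and m is None)
    return hidden / sum(1 for o in original if o == stage)

# Hypothetical toy data: stage at diagnosis for 10,000 patients.
data_rng = random.Random(42)
stages = data_rng.choices(["I", "II", "III", "IV"], weights=[40, 35, 15, 10], k=10_000)

# Illustrative rates only; the abstract does not state the rates used in the study.
mcar = mask_mcar(stages, rate=0.2)
mnar = mask_mnar(stages, rate_by_stage={"I": 0.05, "II": 0.1, "III": 0.3, "IV": 0.6})

for stage in ["I", "II", "III", "IV"]:
    print(stage,
          round(missing_share(stages, mcar, stage), 3),
          round(missing_share(stages, mnar, stage), 3))
```

In the MCAR column the missing share is roughly constant across stages, whereas in the MNAR column it rises with stage, which is what makes MNAR harder for any imputation method.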

Results

In both the MCAR and MNAR scenarios, all comparison models showed poorer prediction performance than three GM models: the saturated one (GM.SAT) and two with penalty factors on the partial likelihood (GM.K1 and GM.TEST). However, GM.SAT predictions could lead to unreliable conclusions in BC survival analysis. Simulating a “synthetic” dataset from GM.SAT could be the worst strategy, but using the remaining GM models could be better than oversampling.
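
To make the idea of one-variable imputation from a saturated model concrete, here is a minimal sketch under strong assumptions. It treats the saturated model for categorical data as the full table of observed frequencies and predicts the most frequent value of the target within the matching cell. This is one simple reading, not the ModGraProDep implementation; the variable names, toy records, and the fallback to the marginal distribution are all hypothetical.

```python
from collections import Counter, defaultdict

def cell_key(record, target):
    """Identify a cell by the values of every variable except the target."""
    return tuple(sorted((k, v) for k, v in record.items() if k != target))

def fit_saturated(records, target):
    """Count target values within every observed combination of the other variables."""
    table = defaultdict(Counter)
    marginal = Counter()
    for rec in records:
        if rec[target] is None:
            continue  # only complete records contribute to the counts
        table[cell_key(rec, target)][rec[target]] += 1
        marginal[rec[target]] += 1
    return table, marginal

def impute(record, target, table, marginal):
    """Predict the most frequent target value in the matching cell.

    Falls back to the marginal distribution when the cell was never observed
    (an assumption of this sketch, not something stated in the abstract).
    """
    counts = table.get(cell_key(record, target)) or marginal
    return counts.most_common(1)[0][0]

# Hypothetical toy records; None marks a missing stage.
records = [
    {"age": "<50", "grade": "low", "stage": "I"},
    {"age": "<50", "grade": "low", "stage": "I"},
    {"age": "<50", "grade": "low", "stage": "II"},
    {"age": "50+", "grade": "high", "stage": "III"},
    {"age": "50+", "grade": "high", "stage": None},
]
table, marginal = fit_saturated(records, "stage")
seen_cell = impute(records[4], "stage", table, marginal)
unseen_cell = impute({"age": "50+", "grade": "low", "stage": None},
                     "stage", table, marginal)
print(seen_cell, unseen_cell)
```

In this toy data the prediction for the first record rests on a single complete case. Sparse cells of this kind are one plausible reason a saturated model can look accurate yet support unreliable conclusions, and why penalized alternatives may be preferable.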

Conclusion

Our results support using the presented GM procedure for one-variable imputation/prediction of missing data and for simulating “synthetic” BC survival datasets. The “synthetic” datasets derived from GMs could also be used in clinical applications of cancer survival data, such as predictive risk analysis.
