A two-stage modeling approach for breast cancer survivability prediction
International Journal of Medical Informatics (IF 3.7), Pub Date: 2021-03-11, DOI: 10.1016/j.ijmedinf.2021.104438
Zahra Sedighi-Maman, Alexa Mondello

Background

Despite the increasing number of studies on breast cancer survival prediction, little attention has been paid to deceased patients and their survival lengths. Moreover, developing a model that is both accurate and interpretable remains a challenge.

Objective

This paper proposes a two-stage data analytic framework: Stage I classifies patients as survived or deceased, and Stage II predicts the number of survival months for deceased females with cancer. Because medical data are neither entirely clean nor prepared for model development, we aim to show that data preparation can strengthen a simple Generalized Linear Model (GLM) so that it predicts as accurately as complex models such as Extreme Gradient Boosting (XGB) and the Multilayer Perceptron based on Artificial Neural Networks (MLP-ANNs) in both stages.
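The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0.5 threshold, and every coefficient are hypothetical, and Stage II is assumed to be a GLM with an identity link. At prediction time the true status is unknown, so the sketch routes every patient that Stage I predicts as deceased to Stage II.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

LinearPredictor = Callable[[Sequence[float]], float]


def make_linear_predictor(intercept: float, weights: Sequence[float]) -> LinearPredictor:
    """Return x -> intercept + w . x, the linear predictor of a GLM."""
    def predict(features: Sequence[float]) -> float:
        if len(features) != len(weights):
            raise ValueError("feature/weight length mismatch")
        return intercept + sum(w * v for w, v in zip(weights, features))
    return predict


def two_stage_predict(
    features: Sequence[float],
    stage1: LinearPredictor,
    stage2: LinearPredictor,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[float, Optional[float]]:
    """Stage I: probability of death via a logit link.
    Stage II: predicted survival months, only for patients predicted deceased."""
    p_death = 1.0 / (1.0 + math.exp(-stage1(features)))
    if p_death < threshold:
        return p_death, None  # predicted survivor: Stage II is skipped
    return p_death, stage2(features)


# Hypothetical coefficients, chosen only to make the example run.
stage1_model = make_linear_predictor(-2.0, [1.5, 0.5])
stage2_model = make_linear_predictor(30.0, [-5.0, -2.0])

print(two_stage_predict([2.0, 1.0], stage1_model, stage2_model))
print(two_stage_predict([0.0, 0.0], stage1_model, stage2_model))
```

Keeping the two stages as separate linear predictors mirrors the interpretability argument: each stage exposes its own coefficients for inspection.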

Methods

In Stage I, we use recent Surveillance, Epidemiology, and End Results (SEER) data from 2004 to 2016 to predict short-term survival status at time points from 6 months to 3 years, in 6-month increments. Three re-sampling techniques, Synthetic Minority Over-sampling Technique (SMOTE), Relocating Safe-Level SMOTE (RSLS), and Adaptive Synthetic (ADASYN), and two feature selection methods, Least Absolute Shrinkage and Selection Operator (LASSO) and Random Forest (RF), along with integer and one-hot encoding, are combined with three popular data mining methods: GLM, XGB, and MLP. In Stage II, we predict the number of survival months for patients who are correctly predicted as deceased within 3 years. Again, we employ GLM, XGB, and MLP for regression, along with LASSO and RF for feature selection and one-hot encoding for the categorical features.
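To make the re-sampling step concrete, the sketch below shows the core idea of basic SMOTE: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. It is a simplified, standard-library illustration under assumed defaults (`k=2`, a fixed seed, toy data), not the configuration used in the study; RSLS and ADASYN differ in how they choose where to generate samples and are not shown.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two numeric feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(
    minority: Sequence[Sequence[float]],
    n_synthetic: int,
    k: int = 2,      # assumed neighbourhood size, for illustration only
    seed: int = 0,   # fixed seed so the sketch is reproducible
) -> List[List[float]]:
    """Generate n_synthetic points by interpolating between a randomly chosen
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[List[float]] = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: squared_distance(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical numeric features).
minority_samples = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority_samples, n_synthetic=4)
print(new_points)
```

Because the interpolation is numeric, categorical features would need to be encoded before this step, which is one reason the choice of encoding matters for the re-sampling techniques.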

Results

We obtain Area Under the Receiver Operating Characteristic Curve (AUC) values of 0.900, 0.898, 0.877, 0.852, 0.852, and 0.858 for the 6-month, 1-, 1.5-, 2-, 2.5-, and 3-year survival time points, respectively, using OneHotEncoding-GLM-LASSO-ADASYN. We use the change in the odds ratio values in GLM to show the impact of individual categorical levels and numerical features on the odds of death. In Stage II, we obtain a Mean Absolute Error (MAE) of 7.960 months using OneHotEncoding-GLM-LASSO when predicting the number of survival months for deceased patients. We present the top contributing features and their coefficient values to illustrate how the presence of each feature alters the predicted number of survival months.
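For readers less familiar with the reported metrics, the sketch below computes them from their definitions: AUC as the probability that a randomly chosen deceased case is scored above a randomly chosen survivor (ties count one half), MAE as the mean absolute difference in months, and the odds ratio as the exponential of a logistic GLM coefficient. The function names and toy inputs are illustrative assumptions and do not come from the study.

```python
import math
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference, e.g. in survival months."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def odds_ratio(coefficient: float) -> float:
    """Multiplicative change in the odds for a one-unit increase in a feature
    (or for the presence of a one-hot level) in a logistic GLM."""
    return math.exp(coefficient)


# Toy values, for illustration only.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
print(mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
print(odds_ratio(math.log(2.0)))
```

An odds ratio above 1 means the feature raises the odds of death and a value below 1 means it lowers them, which is the reading applied to the GLM coefficients above.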

Conclusion

To the best of our knowledge, this is the first study that implements both breast cancer survival classification and regression in a two-stage approach. All data-driven findings are presented to help clinicians make better care decisions using GLM, an interpretable and computationally efficient method that predicts survival status and survival lengths for deceased patients, and to help foster human and machine interactions.



Updated: 2021-03-15