Predict multicategory causes of death in lung cancer patients using clinicopathologic factors, Computers in Biology and Medicine

Computers in Biology and Medicine (IF 7.7), Pub Date: 2020-12-01, DOI: 10.1016/j.compbiomed.2020.104161
Fei Deng, Haijun Zhou, Yong Lin, John A Heim, Lanlan Shen, Yuan Li, Lanjing Zhang

Background

Random forests (RF) is a widely used machine-learning algorithm that outperforms many other machine-learning algorithms in prediction accuracy, but it is rarely used to predict causes of death (COD) in cancer patients. Multicategory COD are also difficult to classify in lung cancer patients, largely because the outcome has multiple labels rather than a binary label.

Methods

We tuned RF algorithms to classify 5-category COD among lung cancer patients in the Surveillance, Epidemiology, and End Results (SEER)-18 database. We restricted the cohort to lung cancers diagnosed in 2004 so that follow-up would be complete. The patients were randomly divided into training and validation sets, using both 1:1 and 4:1 sample splits. We then compared the prediction accuracy of the tuned RF model with that of a multinomial logistic regression (MLR) model.
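The comparison described above can be sketched in Python with scikit-learn. This is only an illustration under stated assumptions, not the authors' code: the data are synthetic (generated with `make_classification`, not SEER-18 records), the 12 features are placeholders, and the mapping of "300 iterations and 10 variables" onto `n_estimators=300` and `max_features=10` is an assumed reading of the abstract. The helper name `compare_models` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 5-class data standing in for the real cohort (assumption).
# Class weights mimic the COD distribution reported in the abstract.
X, y = make_classification(
    n_samples=2000,
    n_features=12,
    n_informative=6,
    n_classes=5,
    weights=[0.7241, 0.1443, 0.0685, 0.0535, 0.0096],
    random_state=0,
)


def compare_models(validation_fraction):
    """Fit RF and MLR on one random split; return their validation accuracies."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=validation_fraction, random_state=0
    )
    # Assumed reading of the abstract: 300 trees, 10 candidate variables per split.
    rf = RandomForestClassifier(n_estimators=300, max_features=10, random_state=0)
    rf.fit(X_train, y_train)
    # With the default lbfgs solver, multiclass targets are fit as a multinomial model.
    mlr = LogisticRegression(max_iter=1000)
    mlr.fit(X_train, y_train)
    return (
        accuracy_score(y_valid, rf.predict(X_valid)),
        accuracy_score(y_valid, mlr.predict(X_valid)),
    )


# 1:1 split -> half of the samples held out; 4:1 split -> one fifth held out.
results = {"1:1": compare_models(0.5), "4:1": compare_models(0.2)}
for split, (rf_acc, mlr_acc) in results.items():
    print(f"{split} split: RF accuracy = {rf_acc:.3f}, MLR accuracy = {mlr_acc:.3f}")
```

The accuracies printed here come from synthetic data and say nothing about the values reported in the paper. The sketch only shows the shape of the workflow: one random split per ratio, both models fit on the same training set, and accuracy measured on the same validation set.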

Results

We included 42,257 qualified lung cancers from the database. The COD were lung cancer (72.41%), other causes or alive (14.43%), non-lung cancer (6.85%), cardiovascular disease (5.35%), and infection (0.96%). The tuned RF model with 300 iterations and 10 variables outperformed the MLR model (accuracy = 69.8% vs 64.6%, 1:1 sample split), and the 4:1 sample split produced lower prediction accuracy than the 1:1 sample split. The top-10 important factors in the RF model included sex, chemotherapy status, age (65+ vs < 65 years), radiotherapy status, nodal status, T category, histology type, and laterality. All of these except T category and laterality were also important in the MLR model.
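As a reading aid for the class distribution above, the following short Python sketch converts the reported percentages into approximate patient counts and derives the share of the most common class. The counts are rounded estimates computed from the percentages, not figures from the paper, and the variable names are illustrative. The abstract does not state how accuracy was computed per split, so the majority-class share is offered only as context for reading accuracy on imbalanced classes.

```python
# Figures taken from the abstract: cohort size and COD shares in percent.
N_TOTAL = 42_257
COD_SHARE_PCT = {
    "lung cancer": 72.41,
    "other causes or alive": 14.43,
    "non-lung cancer": 6.85,
    "cardiovascular disease": 5.35,
    "infection": 0.96,
}

# Approximate counts per category (rounded, so they may not sum exactly to N_TOTAL).
approx_counts = {
    cod: round(N_TOTAL * pct / 100) for cod, pct in COD_SHARE_PCT.items()
}

# Sanity check that the reported shares cover the whole cohort.
total_pct = round(sum(COD_SHARE_PCT.values()), 2)

# Share of the most common class: the accuracy that a rule always predicting
# that class would reach on a sample with this same distribution.
majority_cod = max(COD_SHARE_PCT, key=COD_SHARE_PCT.get)
majority_share_pct = COD_SHARE_PCT[majority_cod]

for cod, count in approx_counts.items():
    print(f"{cod}: about {count} patients")
print(f"shares sum to {total_pct}%")
print(f"most common class: {majority_cod} ({majority_share_pct}%)")
```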

Conclusion

We tuned RF models to predict 5-category COD in lung cancer patients and showed that RF outperformed MLR in prediction accuracy. We also identified the factors associated with these COD.
