Accelerated and interpretable oblique random survival forests
arXiv - STAT - Methodology. Pub Date: 2022-08-01, DOI: arxiv-2208.01129
Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, Jaime L. Speiser, Matthew Segar, Ambarish Pandey, Nicholas M. Pajewski

The oblique random survival forest (RSF) is an ensemble supervised learning method for right-censored outcomes. Trees in the oblique RSF are grown using linear combinations of predictors to create branches, whereas in the standard RSF, a single predictor is used. Oblique RSF ensembles often have higher prediction accuracy than standard RSF ensembles. However, assessing all possible linear combinations of predictors induces significant computational overhead that limits applications to large-scale data sets. In addition, few methods have been developed for interpretation of oblique RSF ensembles, and they remain more difficult to interpret compared to their axis-based counterparts. We introduce a method to increase computational efficiency of the oblique RSF and a method to estimate importance of individual predictor variables with the oblique RSF. Our strategy to reduce computational overhead makes use of Newton-Raphson scoring, a classical optimization technique that we apply to the Cox partial likelihood function within each non-leaf node of decision trees. We estimate the importance of individual predictors for the oblique RSF by negating each coefficient used for the given predictor in linear combinations, and then computing the reduction in out-of-bag accuracy. In general benchmarking experiments, we find that our implementation of the oblique RSF is approximately 450 times faster with equivalent discrimination and superior Brier score compared to existing software for oblique RSFs. We find in simulation studies that 'negation importance' discriminates between relevant and irrelevant predictors more reliably than permutation importance, Shapley additive explanations, and a previously introduced technique to measure variable importance with oblique RSFs based on analysis of variance. Methods introduced in the current study are available in the aorsf R package.
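
The negation importance idea described in the abstract can be illustrated with a small Python sketch. This is not the aorsf implementation (aorsf is an R package) and makes strong simplifying assumptions: the whole ensemble is replaced by a single linear combination of predictors, held-out data stand in for out-of-bag data, and accuracy is measured with Harrell's concordance index. The function names (`concordance`, `linear_risk`, `negation_importance`, `simulate`) and the coefficient values are hypothetical and chosen only for illustration.

```python
import math
import random


def concordance(times, events, risk):
    """Harrell's C-index: share of comparable pairs in which the subject
    with the earlier observed event has the higher predicted risk.
    Ties in predicted risk count as half-concordant."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


def linear_risk(rows, coefs):
    """Risk score from one linear combination of predictors (a stand-in
    for the linear combinations used in an oblique tree's nodes)."""
    return [sum(c * x for c, x in zip(coefs, row)) for row in rows]


def negation_importance(rows, times, events, coefs):
    """For each predictor, flip the sign of its coefficient and report
    the resulting drop in concordance relative to the baseline."""
    baseline = concordance(times, events, linear_risk(rows, coefs))
    importance = []
    for j in range(len(coefs)):
        flipped = list(coefs)
        flipped[j] = -flipped[j]
        score = concordance(times, events, linear_risk(rows, flipped))
        importance.append(baseline - score)
    return importance


def simulate(n=300, seed=1):
    """Simulated right-censored data: x[0] and x[1] affect the hazard,
    x[2] is irrelevant. All settings here are illustrative assumptions."""
    rng = random.Random(seed)
    rows, times, events = [], [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        hazard = math.exp(1.0 * x[0] + 0.5 * x[1])
        event_time = rng.expovariate(hazard)
        censor_time = rng.expovariate(0.3)
        rows.append(x)
        times.append(min(event_time, censor_time))
        events.append(event_time <= censor_time)
    return rows, times, events


if __name__ == "__main__":
    rows, times, events = simulate()
    # Hypothetical "fitted" coefficients; the last predictor is nearly unused.
    for name, value in zip(["x0", "x1", "x2"],
                           negation_importance(rows, times, events, [1.0, 0.5, 0.05])):
        print(name, round(value, 3))
```

In this simplified setting, a predictor whose coefficient is exactly zero gets an importance of exactly zero, because negating it leaves every prediction unchanged. In the method the abstract describes, the negation is applied to the coefficients for the given predictor in the linear combinations used to grow the trees, and the drop is measured in out-of-bag accuracy.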

Updated: 2022-08-03