Discrete-time survival forests with Hellinger distance decision trees
Data Mining and Knowledge Discovery (IF 2.8), Pub Date: 2020-03-14, DOI: 10.1007/s10618-020-00682-z
Matthias Schmid, Thomas Welchowski, Marvin N. Wright, Moritz Berger

Random survival forests (RSF) are a powerful nonparametric method for building prediction models with a time-to-event outcome. RSF do not rely on the proportional hazards assumption and can be readily applied to both low- and higher-dimensional data. A remaining limitation of RSF, however, is that the method is almost entirely focussed on continuously measured event times. This becomes problematic in studies where time is measured on a discrete scale \(t = 1, 2, \ldots\), referring to time intervals \([0,a_1), [a_1,a_2), \ldots\). In this situation, applying methods designed for continuous time-to-event data while ignoring discreteness may lead to biased estimators and inaccurate predictions. To address this issue, we develop an RSF algorithm that is specifically designed for the analysis of (possibly right-censored) discrete event times. The algorithm is based on an ensemble of discrete-time survival trees that operate on transformed versions of the original time-to-event data using tree methods for binary classification. As the outcome variable in these trees is typically highly imbalanced, our algorithm implements a node splitting strategy based on Hellinger’s distance, which is a skew-insensitive alternative to classical split criteria such as the Gini impurity. The new algorithm thus provides flexible nonparametric predictions of individual-specific discrete hazard and survival functions. Our numerical results suggest that node splitting by Hellinger’s distance improves predictive performance when compared to the Gini impurity. Furthermore, when the number of time intervals is small, discrete-time RSF improve prediction accuracy when compared to RSF approaches that treat discrete event times as continuous.
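
The three building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the function names are hypothetical. It makes two assumptions: a subject censored at time t is treated as at risk through interval t with outcome 0 in every row (conventions for censored observations differ), and the split criterion is the Hellinger distance between the class-conditional branch distributions, in the form commonly used for binary splits in Hellinger distance decision trees. The survival function is taken as S(t) = P(T > t), the cumulative product of one minus the discrete hazards.

```python
import numpy as np


def to_person_period(times, events):
    """Expand (time, event) pairs into binary person-period rows.

    A subject observed up to discrete time t contributes one row per
    interval 1..t. The binary outcome is 1 only in the last row, and only
    if the event (not censoring) was observed there. This censoring
    convention is an assumption of the sketch.
    """
    rows = []
    for subject, (t, d) in enumerate(zip(times, events)):
        for s in range(1, t + 1):
            rows.append((subject, s, int(d == 1 and s == t)))
    return np.array(rows)  # columns: subject id, interval, binary outcome


def hellinger_split(y, go_left):
    """Hellinger distance of a candidate binary split.

    y: binary outcomes in the node; go_left: boolean mask of rows sent to
    the left child. The value ranges from 0 (both classes are distributed
    identically over the children) to sqrt(2) (perfect separation).
    """
    y = np.asarray(y)
    go_left = np.asarray(go_left, dtype=bool)
    pos, neg = (y == 1), (y == 0)
    n_pos, n_neg = pos.sum(), neg.sum()
    if n_pos == 0 or n_neg == 0:
        return 0.0  # pure node: no split can separate the classes
    pos_left = (pos & go_left).sum() / n_pos  # share of events sent left
    neg_left = (neg & go_left).sum() / n_neg  # share of non-events sent left
    return float(np.sqrt(
        (np.sqrt(pos_left) - np.sqrt(neg_left)) ** 2
        + (np.sqrt(1 - pos_left) - np.sqrt(1 - neg_left)) ** 2
    ))


def survival_from_hazard(hazard):
    """Turn discrete hazards lambda(1), lambda(2), ... into S(1), S(2), ..."""
    return np.cumprod(1.0 - np.asarray(hazard, dtype=float))


# Two subjects: an event at t = 2 and a censored observation at t = 1.
augmented = to_person_period(times=[2, 1], events=[1, 0])
split_value = hellinger_split(augmented[:, 2], augmented[:, 1] >= 2)
survival = survival_from_hazard([0.1, 0.2])
```

Because the criterion depends only on the within-class proportions sent to each child, it does not change when the ratio of events to non-events changes, which is the sense in which it is skew-insensitive.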

Updated: 2020-03-14