1 Introduction

Survival analysis considers the problem of modeling the distribution of the time to an event. A plethora of statistical methods for analyzing time-to-event data, and especially right-censored survival data, have been developed over the last fifty years or so. Most of these methods, like Cox regression, assume continuous-time models, but methods based on discrete-time models are sometimes used as well. Statistical methods for continuous-time survival data are discussed in a number of textbooks, such as Klein and Moeschberger (2003). The literature on discrete-time models and methods is more limited, but the recent book by Tutz and Schmid (2016) provides a nice presentation.

An important part of survival analysis is the topic of time-to-event prediction, also denoted survival prediction. This generally concerns predicting when an event will occur for new individuals (not part of our training set), where each individual is defined by a vector of covariates. Time-to-event prediction is most commonly approached by predicting the survival function for each individual, meaning we provide an estimate of the event time distribution conditioned on each individual’s covariates. In standard survival analysis, Cox regression is often used for this purpose (Klein and Moeschberger 2003, Chapter 8.6).

As a result of the rapid development in machine learning, and in particular neural networks, a number of new methods for time-to-event predictions have been developed in the last few years. This development has benefited from the excellent frameworks for neural network development, such as TensorFlow, PyTorch, Theano, Keras, and CNTK, which have simplified the application of neural networks to existing likelihood-based methodologies. Thus, novel methods for time-to-event predictions have been developed based on Cox’s partial likelihood (e.g., Katzman et al. 2018; Luck et al. 2017; Yousefi et al. 2017; Kvamme et al. 2019) and the likelihood for discrete-time survival data (e.g., Lee et al. 2018; Fotso 2018; Gensheimer and Narasimhan 2019).

In survival analysis, continuous-time models are arguably more commonly applied than their discrete-time counterparts. However, as neural networks are parametric models, fully parametric models are typically more straightforward to integrate with the neural network frameworks than semi- and non-parametric models. This makes discrete-time survival models, which are fully parametric, interesting candidates to extend with neural networks when developing methods for time-to-event predictions.

To the best of our knowledge, Lee et al. (2018) were the first to apply modern neural networks to the discrete-time likelihood for right-censored data. Their prediction method, denoted DeepHit, parameterizes the probability mass function (PMF) of the survival distribution with a neural network and combines the log-likelihood for right-censored data with a ranking loss for improved discriminative performance. In statistical survival analysis, it is, however, more common to express the likelihood by the hazard rate. Gensheimer and Narasimhan (2019) used this form of the likelihood and parameterized the discrete-time hazard rate with a neural network. They showed that their prediction method performs well, both in terms of discrimination and calibration of the survival predictions. However, they did not compare their methodology, which they refer to as Nnet-survival, with methods that parameterize the PMF.

One aim of the present paper is to perform a systematic study of the use of neural nets in conjunction with discrete-time likelihoods for right-censored time-to-event data. In particular, we perform a systematic comparison of methods that parameterize the PMF and the discrete hazard rate.

More importantly, we show how methods developed for discrete-time survival data using neural networks may be applied as approximations for continuous-time data. In this way, we circumvent the problem that continuous-time survival models are not so easily adapted to the neural net frameworks. To this end, we have to perform a discretization of the continuous time scale, a subject that has received little attention in the literature. At first glance, it might seem reasonable to use a very fine-grained discretization scheme, allowing the approximate discretized event times to be very close to the true event times. However, the number of parameters in the neural networks typically increases with finer discretization schemes, possibly making the network overfit. Therefore, there is a balance to be struck between the number of parameters in the network and the approximation error introduced by the discretization. We consider two discretization schemes, corresponding to equidistant times or equidistant survival probabilities, and conduct a simulation study to better understand the effect of the discretization scheme and the number of time points used for the discrete-time methods.

Closely related to the discretization of a continuous time scale is the subject of interpolation. A coarse discretization grid has the benefit of reducing the number of parameters in a neural network, but the approximation error incurred when a discrete-time method is used for continuous-time data becomes smaller with a finer discretization grid. By interpolating the discrete-time survival predictions, one may use a coarser discretization grid with less of an impact on the approximation error of the predictions. For this reason, two interpolation schemes are investigated in this paper. The first assumes constant density functions between the time points in the discretization grid, and the second assumes constant hazard rates between the grid points. As a modification of the latter, we also propose a continuous-time method obtained by assuming that the continuous-time hazard rate is piecewise constant, and we compare this method with the aforementioned discrete-time methods with and without interpolation.

The paper is organized as follows. First, in Sect. 2, we consider the discrete-time likelihood for right-censored event times and discuss how the likelihood may be parameterized with neural networks. Then, in Sect. 3, continuous-time models for time-to-event data are considered, and we discuss how discretization of the continuous time scale enables the use of discrete-time survival methods for continuous-time data. Here we also present the two schemes for interpolating discrete survival functions, and we consider our continuous-time method obtained by assuming piecewise constant hazards. In Sect. 4, a simulation study is conducted to understand the impact the discretization and interpolation schemes have on the methods, and in Sect. 5, we compare the methods with existing methods for time-to-event predictions using five real-world data sets. Finally, we summarize and discuss our findings in Sect. 6. Some additional material on the simulations and the implementation of the methods is provided in the Appendix. The code for all methods, data sets, and simulations presented in this paper is available at https://github.com/havakv/pycox.

2 Discrete-time models

In this section, we will restrict ourselves to models in discrete time. Then, in Sect. 3, we will discuss how discrete-time models may be used as approximations of models in continuous time.

2.1 The discrete-time survival likelihood

Consider an individual described by its covariate vector \({\mathbf {x}}\in {\mathbb {R}}^q\). Assume that time is discrete with values \(0 = \tau _0< \tau _1 < \ldots \), and let \({\mathbb {T}} = \{\tau _1, \tau _2, \dots \}\) denote the set of positive \(\tau _j\)’s. The time of an event is denoted \(T^* \in {\mathbb {T}}\), and our goal is to model the conditional distribution of this event time given the covariate vector \({\mathbf {x}}\). The probability mass function (PMF) and the survival function for the event time are defined as

$$\begin{aligned} f(\tau _j \,|\,{\mathbf {x}})&= \text {P}(T^* = \tau _j \,|\,{\mathbf {x}}),\nonumber \\ S(\tau _j \,|\,{\mathbf {x}})&= \text {P}(T^*> \tau _j \,|\,{\mathbf {x}}) = \sum _{k > j} f(\tau _k \,|\,{\mathbf {x}}). \end{aligned}$$
(1)

In survival analysis, models are often expressed in terms of the hazard rate rather than the PMF. For discrete time, the hazard rate is defined as

$$\begin{aligned} h(\tau _j \,|\,{\mathbf {x}})&= \text {P}(T^* = \tau _j \,|\,T^* > \tau _{j-1}, {\mathbf {x}}) = \frac{f(\tau _j \,|\,{\mathbf {x}})}{S(\tau _{j-1} \,|\,{\mathbf {x}})} , \end{aligned}$$

and it follows that

$$\begin{aligned} f(\tau _j \,|\,{\mathbf {x}})&= h(\tau _j \,|\,{\mathbf {x}})\, S(\tau _{j-1} \,|\,{\mathbf {x}}), \end{aligned}$$
(2)
$$\begin{aligned} S(\tau _j \,|\,{\mathbf {x}})&= [1 - h(\tau _j \,|\,{\mathbf {x}})]\, S(\tau _{j-1} \,|\,{\mathbf {x}}). \end{aligned}$$
(3)

Note further that from (3) it follows that the survival function can be expressed as

$$\begin{aligned} S(\tau _j \,|\,{\mathbf {x}})&= \prod _{k=1}^j [1 - h(\tau _k \,|\,{\mathbf {x}})]. \end{aligned}$$
(4)
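For illustration, given a vector of estimated discrete hazards, the survival function in (4) is simply a cumulative product. The following minimal Python snippet, with hypothetical hazard values, shows the computation.

```python
# Illustration of (4): survival probabilities as a cumulative product of one minus
# the discrete hazards. The hazard values below are hypothetical.
import numpy as np

hazards = np.array([0.10, 0.20, 0.15])   # h(tau_1 | x), h(tau_2 | x), h(tau_3 | x)
surv = np.cumprod(1.0 - hazards)         # S(tau_1 | x), S(tau_2 | x), S(tau_3 | x)
print(surv)                              # approximately [0.9, 0.72, 0.612]
```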

In most studies, we do not observe all event times. For some individuals, we only have a right-censored observation. To allow for censoring, we let \(C^* \in {\mathbb {T}}_C = \{\tau _1, \tau _2, \ldots , \tau _m\}\) be a right-censoring time. Here \(\tau _m\) defines the maximum follow-up time, at which all individuals still at risk are administratively censored. The random variables \(T^*\) and \(C^*\) are typically not observed directly, but instead we observe a potentially right-censored event time \(T = \min \{T^*,\, C^*\}\) and an event indicator \(D = \mathbbm {1}\{T^* \le C^*\}\). We here follow the common convention in survival analysis that when an event and censoring time coincide, we observe the occurrence of the event. Note that, as \(C^* \le \tau _m\), we are not able to observe event times \(T^*\) larger than \(\tau _m\). Hence, we are restricted to modeling the distribution of the event times in \({\mathbb {T}}_C\).

We assume that \(T^*\) and \(C^*\) are conditionally independent given \({\mathbf {x}}\), and that their distributions have no parameters in common. Then we can consider, separately, the contribution to the likelihood of the event time distribution and the censoring distribution. We are, however, typically only interested in modeling the event time distribution.

Now, considering a set of n independent individuals, each with covariates \({\mathbf {x}}_i\), event or censoring time \(t_i\), and event indicator \(d_i\), the likelihood contribution of each individual i is given by

$$\begin{aligned} L_i&= {f(t_i \,|\,{\mathbf {x}}_i)}^{d_i} {S(t_i \,|\,{\mathbf {x}}_i)}^{1-d_i}. \end{aligned}$$
(5)

Using this, we can fit models by minimizing the mean negative log-likelihood

$$\begin{aligned} \text {loss}&= - \frac{1}{n} \sum _{i=1}^n \big \{ d_i \log [f(t_i \,|\,{\mathbf {x}}_i)] + (1-d_i) \log [S(t_i \,|\,{\mathbf {x}}_i)] \big \}. \end{aligned}$$
(6)

A useful reformulation of the loss function (6) is obtained by rewriting it in terms of the discrete hazards. To this end, let \(\kappa (t) \in \{0, \ldots , m\}\) define the index of the discrete time t, meaning \(t = \tau _{\kappa (t)}\). Using (2), (3), and (4), we can then rewrite the likelihood contribution (5) as

$$\begin{aligned} L_i = {h(t_i \,|\,{\mathbf {x}}_i)}^{d_i} \, {[1 - h(t_i \,|\,{\mathbf {x}}_i)]}^{1-d_i} \, \prod _{j=1}^{{\kappa (t_i)}-1} [1 - h(\tau _j \,|\,{\mathbf {x}}_i)]. \end{aligned}$$

With this formulation, the mean negative log-likelihood in (6) can be rewritten as

$$\begin{aligned} \text {loss} = - \frac{1}{n} \sum _{i=1}^n \sum _{j=1}^{\kappa (t_i)}\big \{y_{ij} \log [h(\tau _{j} \,|\,{\mathbf {x}}_i)] + (1 - y_{ij}) \log [1 - h(\tau _{j} \,|\,{\mathbf {x}}_i)] \big \}. \end{aligned}$$
(7)

Here, \(y_{ij} = \mathbbm {1}\{t_i = \tau _j,\, d_i = 1\}\), so \(\mathbf{y}_i = (y_{i1}, \ldots , y_{i{\kappa (t_i)}})\) is a vector of zeros with a single 1 at the event index \(\kappa (t_i)\) when \(t_i\) corresponds to an observed event (\(d_i = 1\)). We recognize this as the negative log-likelihood for Bernoulli data, or binary cross-entropy, a connection first noted by Brown (1975).

With the two loss functions (6) and (7), we can now make survival models by parameterizing the PMF or the discrete hazard rate and minimizing the corresponding loss. For classical statistical models, these approaches are equivalent and have been used to obtain maximum likelihood estimates for the parameters in the PMF/hazard rate; see Tutz and Schmid (2016) for a review. We will, however, not consider classical maximum likelihood estimates, but focus on the part of the literature that fits neural networks for the purpose of time-to-event prediction, in which case the two loss functions may give different results.

2.2 Parameterization with neural networks

A neural network \(\phi ({\mathbf {x}}) \in {\mathbb {R}}^m\) is a parametric, differentiable function of a covariate vector \({\mathbf {x}}\in {\mathbb {R}}^q\), whose parameters are fitted by minimizing a loss function with a gradient-based optimization method. While networks typically contain thousands or millions of parameters, simple models such as linear and logistic regression can also be considered neural networks. When the number of parameters is large, we are usually not interested in the parameter estimates themselves, but only in the network’s predictive capabilities. While there is a vast literature on various ways to parameterize neural networks, the internal structure of the networks is not that relevant for this paper, as we only consider the most standard multilayer perceptron networks, or MLPs. So, for the purposes of this paper, we think of the network \(\phi ({\mathbf {x}}) \in {\mathbb {R}}^m\) as a very flexible parametric function of the covariates \({\mathbf {x}}\). For more on MLPs and neural networks in general see, e.g., the book by Goodfellow et al. (2016).

In the previous subsection, we saw that the survival likelihood can be expressed in terms of the PMF or the hazard rate. In the following, we describe how to use this to create survival prediction methods by parameterizing the PMF or the hazard with neural networks. In theory, as both approaches minimize the same negative log-likelihood, the methods should yield the same results. But due to the nature of neural networks, this might not be the case in practice. Contrary to most parametric statistical models, neural networks are typically overparameterized, and the training loss is not minimized to convergence. Instead, a held-out validation set is monitored, and the iterative optimization procedure is stopped when performance on this validation set starts to deteriorate. Also, as neural networks are well known to be sensitive to numerical instabilities, some parameterizations of a likelihood might result in better performance than others.

First, considering the hazard parametrization of the likelihood, let \(\phi ({\mathbf {x}}) \in {\mathbb {R}}^m\) represent a neural network that takes the covariates \({\mathbf {x}}\) as input and gives m outputs. Each output \(\phi _j({\mathbf {x}})\) corresponds to a discrete time-point \(\tau _j\), so \(\phi ({\mathbf {x}}) = {\{\phi _1({\mathbf {x}}), \ldots , \phi _m({\mathbf {x}})\}}\). As the discrete hazards are (conditional) probabilities, we apply the logistic function (sigmoid function) to the output of the network

$$\begin{aligned} h(\tau _j \,|\,{\mathbf {x}}) = \frac{1}{1 + \exp [-\phi _j({\mathbf {x}})]}, \end{aligned}$$

to ensure that \(h(\tau _j \,|\,{\mathbf {x}}) \in (0, 1)\). We can estimate the hazard rate by minimizing the loss (7), and survival estimates can be obtained from (4). To the best of our knowledge, this method was first proposed by Gensheimer and Narasimhan (2019). However, if one considers the special case where \(\phi _j({\mathbf {x}}) = \varvec{\beta }^T {\mathbf {x}}\), the approach is well known in the survival literature and seems to have been first addressed by Cox (1972) and Brown (1975); see also Allison (1982). The book by Tutz and Schmid (2016) gives a review of the approach.

The implementation we use in the experiments in Sects. 4 and 5 differs slightly from that of Gensheimer and Narasimhan (2019), as it was found to be numerically more stable (see Appendix B). In this paper, we will refer to the method as Logistic-Hazard, as coined by Brown (1975), but one can also find the term Logistic Discrete Hazard used in the statistical literature. Gensheimer and Narasimhan (2019) referred to it as Nnet-survival, but we will refrain from using that name as we find Logistic-Hazard to be more descriptive.
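To make the parametrization concrete, the following is a minimal PyTorch sketch of the loss (7). It is not the exact implementation used in our experiments (a numerically more stable variant is described in Appendix B), and the tensor names are our own.

```python
# A minimal sketch of the Logistic-Hazard loss (7) in PyTorch. Not the exact
# implementation used in the experiments (see Appendix B); tensor names are ours.
import torch
import torch.nn.functional as F

def logistic_hazard_loss(phi, idx, d):
    """phi: (n, m) network outputs (logits), one per time point tau_1, ..., tau_m.
    idx: (n,) long tensor, zero-based index of t_i, i.e., kappa(t_i) - 1.
    d:   (n,) float tensor of event indicators."""
    n, m = phi.shape
    # y_ij is 1 only at the event index when d_i = 1, cf. Sect. 2.1.
    y = torch.zeros_like(phi)
    y[torch.arange(n), idx] = d
    # Bernoulli negative log-likelihood at every time point j = 1, ..., kappa(t_i).
    bce = F.binary_cross_entropy_with_logits(phi, y, reduction='none')
    mask = (torch.arange(m, device=phi.device)[None, :] <= idx[:, None]).float()
    return (bce * mask).sum(1).mean()
```

Here the binary cross-entropy at each time point up to and including \(\kappa (t_i)\) is summed per individual and averaged over the batch, exactly as in (7).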

We can obtain a survival model by parameterizing the PMF in a similar manner to the Logistic-Hazard method. As for the hazards, the PMF \(f(\tau _j \,|\,{\mathbf {x}})\) represents probabilities, but, contrary to the conditional probabilities that define the hazard, we now require the PMF to sum to 1. As we only observe event times in \({\mathbb {T}}_C\), we fulfill this requirement indirectly through the probability of surviving past \(\tau _m\). Thus we have

$$\begin{aligned} \sum _{k=1}^m f(\tau _k \,|\,{\mathbf {x}}) + S(\tau _m \,|\,{\mathbf {x}}) = 1. \end{aligned}$$
(8)

Now, again with \(\phi ({\mathbf {x}}) \in {\mathbb {R}}^{m}\) denoting a neural network, the PMF can be expressed as

$$\begin{aligned} f(\tau _j \,|\,{\mathbf {x}}) = \frac{\exp [\phi _j({\mathbf {x}})]}{1 + \sum _{k=1}^{m} \exp [\phi _k({\mathbf {x}})]}, \quad \quad \text {for } j = 1, \ldots , m. \end{aligned}$$
(9)

Note that (9) is equivalent to the softmax function (also used in multinomial logistic regression) with a fixed \(\phi _{m+1}({\mathbf {x}}) = 0\). Alternatively, one could let \(\phi _{m+1}({\mathbf {x}})\) vary freely, something that is quite common in machine learning, but we chose to follow the typical conventions in statistics. By combining (1) and (8), we can express the survival function as

$$\begin{aligned} S(\tau _j \,|\,{\mathbf {x}}) = \sum _{k=j+1}^{m} f(\tau _k \,|\,{\mathbf {x}}) + S(\tau _m \,|\,{\mathbf {x}}) \end{aligned}$$
(10)

for \(j=1,\ldots ,m-1\), and

$$\begin{aligned} S(\tau _m \,|\,{\mathbf {x}}) = \frac{1}{1 + \sum _{k=1}^m \exp [\phi _k({\mathbf {x}})]}. \end{aligned}$$

Now, let \(\sigma _j[\phi ({\mathbf {x}})]\), for \(j=1,\ldots ,m+1\), denote the softmax in (9), meaning \(\sigma _{m+1}[\phi ({\mathbf {x}})] = S(\tau _m \,|\,{\mathbf {x}})\). Notice the similarities to classification with \(m+1\) classes, as we are essentially classifying whether the event is happening at either time \(\tau _1, \ldots , \tau _m\) or later than \(\tau _m\). However, due to censoring, the likelihood is not the cross-entropy. Instead, by inserting (9) and (10) into (6), we get the mean negative log-likelihood

$$\begin{aligned} \text {loss}&= -\frac{1}{n} \sum _{i=1}^n \left( d_i \log [\sigma _{\kappa (t_i)}(\phi ({\mathbf {x}}_i)) ] + (1-d_i) \log \left[ \sum _{k={\kappa (t_i)}+1}^{m+1} \sigma _k(\phi ({\mathbf {x}}_i)) \right] \right) , \end{aligned}$$
(11)

where \({\kappa (t_i)}\) still denotes the index of individual i’s event or censoring time, that is, \(t_i = \tau _{\kappa (t_i)}\). This is essentially the same negative log-likelihood as presented by Lee et al. (2018). Note, however, that contrary to the work by Lee et al. (2018), the negative log-likelihood in (11) allows for survival past time \(\tau _m\). Some numerical improvements of the implementation are addressed in Appendix B. We will refer to this method simply as PMF, as this term is unambiguously discrete, contrary to the term hazard, which is used both for discrete and continuous time.
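As an illustration, a minimal PyTorch sketch of the loss (11) could look as follows. This is again not the exact implementation used in the experiments (see Appendix B), and the tensor names are our own.

```python
# A minimal sketch of the PMF loss (11) in PyTorch, with the fixed phi_{m+1}(x) = 0
# appended for the probability of surviving past tau_m. Not the implementation used
# in the experiments (see Appendix B).
import torch

def pmf_loss(phi, idx, d, eps=1e-7):
    """phi: (n, m) network outputs; idx: (n,) zero-based index of t_i; d: (n,) event indicators."""
    n, m = phi.shape
    logits = torch.cat([phi, torch.zeros(n, 1, device=phi.device, dtype=phi.dtype)], dim=1)
    log_sigma = logits.log_softmax(1)                   # log sigma_k for k = 1, ..., m+1
    sigma = log_sigma.exp()
    log_f = log_sigma[torch.arange(n), idx]             # log sigma_{kappa(t_i)}
    # Survival past t_i: sum of sigma_k over k = kappa(t_i) + 1, ..., m + 1.
    mask = (torch.arange(m + 1, device=phi.device)[None, :] > idx[:, None]).float()
    log_S = (sigma * mask).sum(1).clamp_min(eps).log()
    return -(d * log_f + (1 - d) * log_S).mean()
```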

As a side note, the Multi-task logistic regression (Yu et al. 2011), and the neural network extension of this method (Fotso 2018), can be shown to be a PMF model by considering a cumulative sum of the linear predictor, or in the neural network case, the cumulative sum of the output of the network. Details are given in Appendix C.

3 Continuous-time models

In the following, we no longer consider the time scale to be discrete, but instead consider continuous-time models, where \(T^*, C^* > 0\), and we let \(T = \min \{T^*, C^*\}\) and \(D = \mathbbm {1}\{T^*\le C^*\}\) be as before. Let \(\tau \) denote the maximum possible value of \(C^*\), so that \(P(C^* \le \tau ) = 1\). Hence, a potentially right-censored observation T is in the interval \(\mathopen {(} 0,\, \tau \mathclose {]}\). Instead of a PMF, we now have the density function \(f(t \,|\,{\mathbf {x}})\) and the continuous-time survival function

$$\begin{aligned} S(t \,|\,{\mathbf {x}}) = \text {P}(T^* > t \,|\,{\mathbf {x}}) = \int _t^\tau f(z \,|\,{\mathbf {x}})\, dz + S(\tau \,|\,{\mathbf {x}}). \end{aligned}$$

Furthermore, the continuous-time hazard rate is a non-negative function of time (no longer restricted to [0, 1]),

$$\begin{aligned} h(t \,|\,{\mathbf {x}}) = \frac{f(t \,|\,{\mathbf {x}})}{S(t \,|\,{\mathbf {x}})} = \lim _{\Delta t \rightarrow 0} \frac{\text {P}(t \le T^* < t + \Delta t \,|\,T^* \ge t, {\mathbf {x}})}{\Delta t}. \end{aligned}$$
(12)

As a result, we can express the survival function in terms of the cumulative hazard \(H(t \,|\,{\mathbf {x}}) = \int _{0}^t h(z \,|\,{\mathbf {x}})\, dz\),

$$\begin{aligned} S(t \,|\,{\mathbf {x}}) = \exp [-H(t \,|\,{\mathbf {x}})]. \end{aligned}$$
(13)

This yields the continuous-time version of the likelihood contribution in (5),

$$\begin{aligned} L_i = {f(t_i \,|\,{\mathbf {x}}_i)}^{d_i}\, {S(t_i \,|\,{\mathbf {x}}_i)}^{1-d_i} = {h(t_i \,|\,{\mathbf {x}}_i)}^{d_i}\, \exp [-H(t_i \,|\,{\mathbf {x}}_i)]. \end{aligned}$$
(14)

In what follows, we will first discuss how we can apply the discrete-time methods from Sect. 2.2 for continuous-time data. We will here address how time can be discretized to fit the discrete-time model formulation, and how to interpolate an estimated discrete survival function for continuous-time predictions. Then, we will propose a new continuous-time method by assuming that the hazard in (12) is piecewise constant.

3.1 Discretization of the time scale

Both the PMF and Logistic-Hazard methods require the time scale to be discrete, of the form \(0 = \tau _0< \tau _1< \cdots < \tau _m\). Hence, to apply the methods to continuous-time data, we need to perform some form of discretization of the time scale. Possibly the most obvious way to discretize time is to make an equidistant grid in \([0, \tau ]\) with m grid points. An alternative, which we explore in this paper, is to make a grid based on the distribution of the event times. By disregarding covariates and estimating the marginal survival function S(t) with the Kaplan–Meier estimator, we capture the general trend of the event times. With \({{\hat{S}}}(t)\) denoting the Kaplan–Meier survival estimates, we can make a grid from the quantiles of the estimates, \(1 = {{\hat{S}}}(0) = \zeta _0> \zeta _1> \cdots > \zeta _m = {{\hat{S}}}(\tau )\). We assume that each interval has the same decrease in the survival estimate, so that \(\zeta _j - \zeta _{j+1} = (1 - {{\hat{S}}}(\tau )) / m\). The corresponding time grid, \(\tau _1< \cdots < \tau _m\), is obtained by letting \(\tau _j\) be the smallest value of t such that \({{\hat{S}}}(t) \le \zeta _j\). We then obtain a denser grid in intervals with more events, and a less dense grid in intervals with fewer events. This is illustrated in Fig. 1, where we can see that the grid becomes coarser as the slope of the survival curve becomes less steep.

Fig. 1: Illustration of the Kaplan–Meier based discretization scheme. The quantiles of the Kaplan–Meier curve are used as the grid points.

The discrete-time methods assume that all events and censorings occur at the \(\tau _j\)’s, so, when performing the discretization, we move all event times in an interval to the end of that interval while censored times are moved to the end of the previous interval. This means that for \(\tau _{j-1} < T_i \le \tau _j\), we replace \(T_i\) by \(\tau _j\) if \(D_i = 1\), and by \(\tau _{j-1}\) if \(D_i = 0\). Our reason for this choice is that this is typically how the times are recorded. Consider a study where we are only able to make observations at times \(\tau _1< \tau _2< \cdots < \tau _m\). For a censored observation, \(\tau _{j-1}\) is the last point in time where the individual was recorded alive, while for an observed event, \(\tau _j\) is the first point in time for which the individual was recorded with the event.
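A sketch of this discretization in numpy, following the conventions above, could look as follows. The function names are our own, and tie handling in the Kaplan–Meier step is simplified.

```python
# Sketch of the Kaplan-Meier quantile grid (Sect. 3.1) and of moving observed times
# onto the grid as described above. Function names are ours, and the Kaplan-Meier
# step handles ties in a simplified manner.
import numpy as np

def km_quantile_grid(durations, events, m):
    """Grid tau_1 < ... < tau_m with (approximately) equal drops in the KM estimate."""
    order = np.argsort(durations)
    t, d = durations[order], events[order]
    at_risk = len(t) - np.arange(len(t))
    surv = np.cumprod(1.0 - d / at_risk)                     # Kaplan-Meier estimate at each t
    zeta = 1.0 - (1.0 - surv[-1]) * np.arange(1, m + 1) / m  # target survival quantiles
    # tau_j is the smallest observed t with S_hat(t) <= zeta_j.
    return np.array([t[np.searchsorted(-surv, -z)] for z in zeta])

def discretize(durations, events, grid):
    """Move events to the end of their interval, censorings to the previous grid point."""
    idx = np.searchsorted(grid, durations)    # zero-based index of tau_j with tau_{j-1} < t <= tau_j
    idx = np.minimum(idx, len(grid) - 1)
    cens = np.where(idx > 0, grid[np.maximum(idx - 1, 0)], 0.0)
    return np.where(events == 1, grid[idx], cens)
```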

3.2 Interpolation for continuous-time predictions

When discrete-time survival methods are applied to continuous-time data, the estimated survival curves become step functions with steps at the grid points; see the blue curve in Fig. 2. Consequently, for coarser grids, it might be beneficial to interpolate the discrete survival estimates. To this end, we propose two simple interpolation schemes that fulfill the monotonicity requirement of the survival function. The first assumes that the probability density function is constant in each time interval \(\mathopen {(} \tau _{j-1},\, \tau _{j} \mathclose {]}\), while the second scheme assumes a constant hazard in each time interval. We will refer to the schemes as constant density interpolation (CDI) and constant hazard interpolation (CHI). Note that the two interpolation schemes correspond to piecewise linear and piecewise exponential survival estimates, respectively, as illustrated in Fig. 2. A short code sketch of the two schemes is given after the figure caption below.

Fig. 2: Survival estimates by a discrete model for 5 grid points. The three lines represent the discrete survival estimates and the two interpolation schemes in Sect. 3.2: the constant density interpolation (CDI) and the constant hazard interpolation (CHI).
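The following is a minimal numpy sketch of the two schemes, evaluated at an arbitrary time \(t \in \mathopen {(} 0,\, \tau _m \mathclose {]}\). The function name is our own; `surv` holds the discrete estimates \(S(\tau _0 \,|\,{\mathbf {x}}), \ldots , S(\tau _m \,|\,{\mathbf {x}})\) with \(S(\tau _0 \,|\,{\mathbf {x}}) = 1\), and `grid` holds \(\tau _0 = 0, \tau _1, \ldots , \tau _m\).

```python
# Sketch of the CDI and CHI interpolation schemes for a single time t. The discrete
# estimates and the grid include tau_0 = 0 with S(tau_0 | x) = 1.
import numpy as np

def interpolate_survival(t, grid, surv, scheme="chi"):
    j = np.searchsorted(grid, t)                # interval index with tau_{j-1} < t <= tau_j
    rho = (t - grid[j - 1]) / (grid[j] - grid[j - 1])
    if scheme == "cdi":                         # constant density: linear in t
        return surv[j - 1] + rho * (surv[j] - surv[j - 1])
    # constant hazard: piecewise exponential in t
    return surv[j - 1] * (surv[j] / surv[j - 1]) ** rho
```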

3.3 A piecewise constant continuous-time hazard parametrization

We now propose a continuous-time method by parameterizing the hazard in (14). As for the constant hazard interpolation, we let the continuous-time hazard be piecewise constant. Disregarding neural networks, this model was first proposed by Holford (1976), and further developed by Friedman (1982), who found that piecewise constant hazards yield a likelihood proportional to a Poisson likelihood.

Consider a partition of the time scale \(0 = \tau _0< \tau _1< \cdots < \tau _m = \tau \), and let \(\kappa (t)\) denote the interval index of time t such that \(t \in \mathopen {(} \tau _{{\kappa (t)}-1},\, \tau _{\kappa (t)} \mathclose {]}\) (this is slightly different from the discrete case where we had \(t = \tau _{\kappa (t)}\)). If we assume that the hazard is constant within each interval, we can express the hazard as the step function \(h(t \,|\,{\mathbf {x}}) = \eta _{\kappa (t)}({\mathbf {x}})\) for a set of non-negative functions \(\{\eta _1({\mathbf {x}}), \ldots , \eta _m({\mathbf {x}})\}\). For \(\Delta \tau _j = \tau _j - \tau _{j-1}\), we can now express the cumulative conditional hazard as

$$\begin{aligned} H(t \,|\,{\mathbf {x}}) = \left( \sum _{j=1}^{{\kappa (t)}- 1} \eta _j({\mathbf {x}})\, \Delta \tau _j\right) + \eta _{\kappa (t)}({\mathbf {x}})\, (t - \tau _{{\kappa (t)}-1 }). \end{aligned}$$

Inserting this into (14) yields the likelihood contribution for individual i,

$$\begin{aligned} L_i = {\eta _{\kappa (t_i)}({\mathbf {x}}_i)}^{d_i}\, \exp \left[ -\eta _{\kappa (t_i)}({\mathbf {x}}_i)\, (t_i - \tau _{{\kappa (t_i)}- 1})\right] \prod _{j=1}^{{\kappa (t_i)}- 1} \exp \left[ -\eta _j({\mathbf {x}}_i)\, \Delta \tau _{j}\right] . \end{aligned}$$

What remains is to parameterize the hazard with a neural network. However, to avoid passing all the \(\tau _j\)’s to the loss function, we let the network instead parameterize the functions \(\tilde{\eta }_j({\mathbf {x}}) = \eta _j({\mathbf {x}})\, \Delta \tau _j\). This allows us to rewrite the likelihood contribution as

$$\begin{aligned} L_i&= {\left( \frac{{{{\tilde{\eta }}}_{\kappa (t_i)}({\mathbf {x}}_i)}}{\Delta \tau _{\kappa (t_i)}}\right) }^{d_i}\, \exp \left[ -{{\tilde{\eta }}}_{\kappa (t_i)}({\mathbf {x}}_i)\, \rho (t_i) \right] \prod _{j=1}^{{\kappa (t_i)}- 1} \exp \left[ -{{\tilde{\eta }}}_j({\mathbf {x}}_i)\right] \\&\propto {{{\tilde{\eta }}}_{\kappa (t_i)}({\mathbf {x}}_i)}^{d_i}\, \exp \left[ -{{\tilde{\eta }}}_{\kappa (t_i)}({\mathbf {x}}_i)\, \rho (t_i) \right] \prod _{j=1}^{{\kappa (t_i)}- 1} \exp \left[ -{{\tilde{\eta }}}_j({\mathbf {x}}_i)\right] , \end{aligned}$$

where

$$\begin{aligned} \rho (t) = \frac{t - \tau _{{\kappa (t)}-1}}{\Delta \tau _{\kappa (t)}}, \end{aligned}$$
(15)

is the proportion of interval \({\kappa (t)}\) that has elapsed by time t.

As before, let \(\phi ({\mathbf {x}}) \in {\mathbb {R}}^m\) denote a neural network. To ensure that \({\tilde{\eta }}_j({\mathbf {x}})\) is non-negative, we could have used \({\tilde{\eta }}_j({\mathbf {x}}) = \exp [\phi _j({\mathbf {x}})]\). However, for better numerical stability we prefer to use the softplus function

$$\begin{aligned} {\tilde{\eta }}_j({\mathbf {x}}) = \log (1 + \exp [\phi _j({\mathbf {x}})]). \end{aligned}$$
(16)

Now, again considering n independent individuals, each with covariates \({\mathbf {x}}_i\), observed event or censoring time \(t_i\), and event indicator \(d_i\), our model can be fitted by minimizing the mean negative log-likelihood

$$\begin{aligned} \text {loss}&= - \frac{1}{n}\sum _{i=1}^n \left( d_i\, \log {\tilde{\eta }}_{\kappa (t_i)}({\mathbf {x}}_i) - {\tilde{\eta }}_{\kappa (t_i)}({\mathbf {x}}_i)\, \rho (t_i) - \sum _{j=1}^{{\kappa (t_i)}- 1} {\tilde{\eta }}_j ({\mathbf {x}}_i) \right) , \end{aligned}$$

and estimates for the survival function can be obtained by

$$\begin{aligned} S(t \,|\,{\mathbf {x}}) = \exp [- H(t \,|\,{\mathbf {x}})] = \exp [-{\tilde{\eta }}_{\kappa (t)}({\mathbf {x}})\, \rho (t)] \prod _{j=1}^{{\kappa (t)}-1} \exp [-{\tilde{\eta }}_j({\mathbf {x}})], \end{aligned}$$
(17)

where \(\rho (t)\) is given by (15). We will refer to this method as the piecewise constant hazard method, or PC-Hazard. Even though this is a continuous-time method, we still need to decide the set of \(\tau _j\)’s that define the intervals. Therefore, the discretization techniques discussed in Sect. 3.1 are also relevant for this method.
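The following is a minimal PyTorch sketch of the PC-Hazard loss and of the survival estimates (17) evaluated at the grid points. It is not the exact pycox implementation, and the tensor names are our own.

```python
# Sketch of the PC-Hazard loss and of the survival estimates (17) at the grid points.
# eta_tilde_j(x) = softplus(phi_j(x)), cf. (16). Not the exact pycox implementation.
import torch
import torch.nn.functional as F

def pc_hazard_loss(phi, idx, rho, d, eps=1e-7):
    """phi: (n, m) network outputs; idx: (n,) zero-based interval index kappa(t_i) - 1;
    rho: (n,) proportion of the interval elapsed at t_i, cf. (15); d: (n,) event indicators."""
    eta = F.softplus(phi)                                   # eta_tilde_j(x_i) >= 0
    n, m = eta.shape
    eta_k = eta[torch.arange(n), idx]                       # eta_tilde_{kappa(t_i)}(x_i)
    # Sum of eta_tilde_j over the completed intervals j = 1, ..., kappa(t_i) - 1.
    mask = (torch.arange(m, device=phi.device)[None, :] < idx[:, None]).float()
    cum = (eta * mask).sum(1)
    log_lik = d * eta_k.clamp_min(eps).log() - eta_k * rho - cum
    return -log_lik.mean()

def pc_hazard_surv_grid(phi):
    """Survival estimates (17) at tau_1, ..., tau_m, where rho(tau_j) = 1."""
    return torch.exp(-F.softplus(phi).cumsum(1))
```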

4 Simulations

To get a better understanding of the methodologies discussed in Sects. 2 and 3, we perform a simulation study where we vary the size of the training sets, the discretization scheme, and the number of grid points used for discretization. Gensheimer and Narasimhan (2019) performed a similar study to evaluate the effect of discretization on their Logistic-Hazard method with the conclusion that there were no differences in performance. However, their simulations were quite simple (only one binary covariate), their only performance metric was the Harrell Jr et al. (1982) concordance at 1-year survival, and they did not include any interpolation of the survival estimates. For this reason, we find that further investigations are warranted.

We generate simulated survival times by sequentially sampling from discrete-time hazards defined on a fine grid of time points. The hazards are specified through their logit transforms, as this enables us to use functions in \({\mathbb {R}}\) while still obtaining hazards in (0, 1). The logit hazards, \(g(t \,|\,{\mathbf {x}}) = \text {logit}[h(t \,|\,{\mathbf {x}})]\), are defined as

$$\begin{aligned} g(t \,|\,{\mathbf {x}})&= \alpha _1({\mathbf {x}})\, g_\text {sin}(t \,|\,{\mathbf {x}}) + \alpha _2({\mathbf {x}})\, g_\text {con}(t \,|\,{\mathbf {x}}) + \alpha _3({\mathbf {x}})\, g_\text {acc}(t \,|\,{\mathbf {x}}), \end{aligned}$$
(18)

where

$$\begin{aligned} g_\text {sin}(t \,|\,{\mathbf {x}})&= \gamma _1({\mathbf {x}}) \sin \big (\gamma _2({\mathbf {x}}) [ t + \gamma _3({\mathbf {x}})]\big ) + \gamma _4({\mathbf {x}}),\\ g_\text {con}(t \,|\,{\mathbf {x}})&= \gamma _5({\mathbf {x}}), \\ g_\text {acc}(t \,|\,{\mathbf {x}})&= \gamma _6({\mathbf {x}}) \cdot t - 10, \end{aligned}$$

and

$$\begin{aligned} \alpha _i({\mathbf {x}})&= \frac{\exp (\gamma _{i+6}({\mathbf {x}}))}{\sum _{j=1}^3 \exp (\gamma _{j+6}({\mathbf {x}}))}, \quad \text {for } i=1, 2, 3. \end{aligned}$$

Each of the three functions \(g_\text {sin}(t \,|\,{\mathbf {x}})\), \(g_\text {con}(t \,|\,{\mathbf {x}})\), and \(g_\text {acc}(t \,|\,{\mathbf {x}})\) is constructed to give a specific contribution to the hazards: \(g_\text {con}(t \,|\,{\mathbf {x}})\) gives a constant hazard for a set of covariates, \(g_\text {acc}(t \,|\,{\mathbf {x}})\) allows for a hazard that increases with time, and \(g_\text {sin}(t \,|\,{\mathbf {x}})\) enables periodic patterns in the hazards. With this combination, we are able to represent a variety of event time distributions. Each function \(\gamma _k({\mathbf {x}})\) uses five of the covariates in \({\mathbf {x}}\), giving a covariate vector \({\mathbf {x}}\) of size 45. The functions \(\gamma _k({\mathbf {x}})\) are defined in Appendix A. We let the discrete time scale consist of 1,000 equidistant points between 0 and 100, so that \(\tau _0=0\), \(\tau _1 = 0.1\), \(\tau _2 = 0.2\), ..., \(\tau _{1000} = 100\). Knowing the hazards, the true survival function can be obtained with (4). In Fig. 3 we show five examples of logit hazard rates and their corresponding survival functions. Note that even though we simulate our data using a discrete-time model, the time grid is so fine that this mimics simulation from a continuous-time model. The full details of this simulation study are given in Appendix A, and a short code sketch of the sampling procedure is given after Fig. 3.

Fig. 3: Examples from the simulation study in Sect. 4. The left figure shows examples of 5 simulated survival curves, while the right figure shows the corresponding logit hazards. The examples are selected to illustrate the richness of event time distributions that are expressed by the covariates in the simulated data.
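For illustration, sampling an event time from such discrete-time hazards amounts to a sequence of Bernoulli draws, one per grid point, conditional on the event not having occurred earlier. A minimal sketch (the function name and the example hazard values are our own):

```python
# Minimal sketch of sampling a discrete event time from given hazards h(tau_j | x).
# At each grid point the event occurs with probability equal to the hazard,
# conditional on no earlier event. The function name is ours.
import numpy as np

def sample_event_time(hazards, grid, rng):
    """hazards: h(tau_1 | x), ..., h(tau_m | x); grid: tau_1, ..., tau_m."""
    for h, tau in zip(hazards, grid):
        if rng.random() < h:
            return tau
    return np.inf   # no event within the grid

rng = np.random.default_rng(0)
t = sample_event_time(np.full(1000, 0.01), np.linspace(0.1, 100, 1000), rng)
```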

4.1 Simulation setup, hyperparameter tuning, and evaluation

We created three training sets of size 3000, 10,000, and 50,000, a validation set of size 10,000 (for hyperparameter tuning), and a test set of size 100,000. For the training and validation sets, we included a censoring distribution with constant hazard resulting in 37% censoring. The full uncensored test set is used for evaluation. For the discretization of the time scale, we applied both the equidistant scheme and the Kaplan–Meier quantiles, each with 5, 25, 100, and 250 grid points.

The three prediction methods under investigation, described in Sects. 2.2 and 3.3, all consist of the same “base” neural network \(\phi ({\mathbf {x}}) \in {\mathbb {R}}^m\), but have individual output layers and functions transforming \(\phi ({\mathbf {x}})\) to survival estimates. These final transforms do not, however, include any parameters or hyperparameters. For a fair comparison, we therefore perform the same hyperparameter search of \(\phi ({\mathbf {x}})\) for all models. As an extensive search is infeasible for such models, we only consider MLP networks with ReLU activation functions, batch normalization, and dropout between each layer. The size of the networks is controlled by the number of layers and the number of nodes in each layer, which we assume to be the same for all layers. Our choices are motivated by an attempt to represent the most standard MLPs.
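For concreteness, a sketch of such a base network in PyTorch is shown below. The function name is our own, and the exact ordering of the batch normalization and dropout layers is one common choice, not a prescription.

```python
# Sketch of the "base" MLP phi(x): fully connected layers with ReLU activations,
# batch normalization, and dropout, and m output nodes (one per tau_j). The number
# of layers, nodes per layer, and dropout rate are the hyperparameters tuned below.
import torch.nn as nn

def make_mlp(in_features, num_nodes, num_layers, m_out, dropout=0.5):
    layers, prev = [], in_features
    for _ in range(num_layers):
        layers += [nn.Linear(prev, num_nodes), nn.ReLU(),
                   nn.BatchNorm1d(num_nodes), nn.Dropout(dropout)]
        prev = num_nodes
    layers.append(nn.Linear(prev, m_out))   # output layer: one node per time point
    return nn.Sequential(*layers)
```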

We performed a hyperparameter grid search over 1, 2, 4, and 8 hidden layers; 16, 64, and 256 nodes; and dropout of 0 (no dropout) and 0.5. Each net was trained with a batch size of 256 and the AdamWR optimizer (Loshchilov and Hutter 2019) with cycle length 1, where, at each restart, the cycle length was doubled and the learning rate was multiplied by 0.8. Learning rates were found using the methods proposed by Smith (2017). The methods’ respective loss functions (negative log-likelihoods), computed on the validation set, were used for selecting the best set of hyperparameters. The hyperparameter tuning was repeated 10 times, giving 10 fitted models for each combination of method, grid size, discretization scheme, and training set size. In general, the discretization scheme, that is, both the granularity and the type of grid, is a hyperparameter. In the simulation experiments, however, we are interested in comparing the performance across discretization schemes, and we therefore obtain a model for each scheme. But when we compare performance on real data in Sect. 5, we need to include the discretization scheme in the hyperparameter tuning.

For evaluating the predictive performance of the methods, we consider two metrics on the uncensored held-out test set. The first metric is the average mean squared error (MSE) between the survival estimates and the true survival function at all 1000 time points \(\tau _1, \ldots , \tau _{1000}\)

$$\begin{aligned} \text {MSE} = \frac{1}{100,000} \sum _{i=1}^{100,000} \frac{1}{1000}\sum _{j=1}^{1,000} {\left( {{\hat{S}}}(\tau _j \,|\,{\mathbf {x}}_i) - S(\tau _j \,|\,{\mathbf {x}}_i) \right) }^2. \end{aligned}$$
(19)

Here \({{\hat{S}}}(\tau _j \,|\,{\mathbf {x}}_i)\) and \(S(\tau _j \,|\,{\mathbf {x}}_i)\) are the estimated and true survival functions, respectively, for individual i (in the test set) at time \(\tau _j\). So, in this regard, the discrete-time survival estimates are represented by step functions, as illustrated in Fig. 2. Note that the MSE (19) is only applicable for simulated data, as the true survival functions \(S(\tau _j \,|\,{\mathbf {x}}_i)\) are not known in real-world applications.
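Since the weights over individuals and time points are uniform, (19) reduces to an overall mean of squared differences. A minimal sketch, assuming the estimated and true survival probabilities are stored as arrays of shape (n_test, 1000):

```python
# The MSE (19) as an overall mean of squared differences. S_hat and S_true are
# assumed to be numpy arrays of shape (n_test, 1000) holding the estimated and true
# survival probabilities at tau_1, ..., tau_1000.
import numpy as np

def survival_mse(S_hat, S_true):
    return np.mean((S_hat - S_true) ** 2)
```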

The second metric is the time-dependent concordance (Antolini et al. 2005), which evaluates a method’s ability to correctly rank individuals’ survival estimates according to their event times. This is achieved by estimating the probability of correctly ranking two arbitrary individuals

$$\begin{aligned} \text {P}({\hat{S}}(T_i \,|\,{\mathbf {x}}_i)< {\hat{S}}(T_i \,|\,{\mathbf {x}}_j) \,|\,T_i < T_j, D_i = 1), \end{aligned}$$

where \(T_i\) and \(T_j\) are the known potentially right-censored event times of individuals i and j. In other words, if individual i experiences the event of interest while individual j is still at risk, a good set of predictions should give a lower survival estimate for individual i than for individual j at the event time of individual i. Note that, contrary to the MSE, larger values for the concordance are considered better.

4.2 Comparison of discrete-time methods

We start by comparing the two discrete methods from Sect. 2.2, that parameterize the PMF and the discrete-time hazards. We refer to them as PMF and Logistic-Hazard, respectively.

Fig. 4: Median MSE and concordance for each grid size in the simulation study in Sect. 4.2. The number above each plot gives the size of the training set. The full lines use an equidistant grid, while the dotted lines use Kaplan–Meier quantiles for discretization. Note that the plots are not on the same scale.

In Fig. 4 we plot the median test scores of the two methods versus the grid size used for discretization. The number above each plot gives the size of the training set used to fit the methods. The full lines represent equidistant grids, while the dotted lines are from grids obtained with quantiles from Kaplan–Meier survival estimates. We have also included the constant hazard interpolation (CHI) of the survival estimates from the Logistic-Hazard method (see Sect. 3.2).

For the smallest training set of size 3000, we see that the best performance (smallest MSE and highest concordance) is obtained with a grid of size 25, and that the finer grids of size 100 and 250 result in much worse performance (higher MSE and lower concordance). For the two larger training sets, the finer grids generally perform on par with or better than the grid of size 25. This is reasonable, as coarser grids require fewer parameters in the neural networks, and the networks with very fine grids are therefore more likely to overfit the data. Nevertheless, the coarsest grid of size 5 seems to only work well for the interpolated estimates, and does very poorly for the discrete estimates. The discretization grids from Kaplan–Meier quantiles seem to give slightly better scores than the equidistant grids for the Logistic-Hazard with the coarsest grids, in particular for the smaller training sets. This difference is, however, quite small. Comparing the discrete survival estimates from Logistic-Hazard (blue lines) with the CHI estimates (orange lines), we see that the two lines overlap for finer grids. This is expected, as the effect of interpolation decreases as the grids become finer.

In general, the PMF method does not perform as well as the Logistic-Hazard.

4.3 Comparison of interpolation schemes

Fig. 5: MSE and concordance from the simulation study in Sect. 4.3. The scores are plotted from best to worst. The number above each plot gives the size of the training set. Note that the plots are not on the same scale.

In the following, we compare the interpolation schemes for the discrete-time hazard method Logistic-Hazard. The experiments are not shown for the PMF method as the results are very similar.

In Sect. 3.2 we presented two methods for interpolation of discrete survival estimates. The first assumes constant density in each interval (denoted CDI for constant density interpolation), while the second assumes constant hazard in each interval (denoted CHI for constant hazard interpolation). In our simulation study, we have four grid sizes and two discretization schemes. As the hyperparameter tuning was repeated 10 times, this gives 80 fitted models for each method on each data set. In Fig. 5, we plot the scores of these 80 models sorted from best to worst, as this shows both the best-case performance and the stability of the methods. The figure contains results from the discrete survival estimates (Logistic-Hazard), the constant density interpolation (CDI), and the constant hazard interpolation (CHI).

Clearly, there is almost no difference in performance between the two interpolation schemes, while the discrete estimates have slightly worse best-case performance and much worse worst-case performance. So the interpolation primarily helps with stability in performance across discretization schemes, but similar performance can be obtained for discrete predictions given careful hyperparameter tuning.

As the two interpolation schemes perform equally well, the remaining simulations only include the CHI estimates, since CHI and the continuous-time PC-Hazard method both assume a constant hazard rate within each interval, which simplifies the comparison between the methods.

4.4 Comparison with PC-Hazard

Fig. 6: Median MSE and concordance for each grid size of the simulation study in Sect. 4.4. The number above each plot gives the size of the training set. The full lines use an equidistant grid, while the dotted lines use Kaplan–Meier quantiles for discretization. Note that the plots are not on the same scale.

Finally, we compare the previous methods with our proposed continuous-time hazard method from Sect. 3.3, PC-Hazard. In Fig. 6 we plot the MSE and concordance for the interpolated Logistic-Hazard (CHI) method and the continuous-time PC-Hazard method. Note that the y-axis is compressed compared to the previous figures. First, we notice that PC-Hazard does better for the coarsest grid with only five grid points, while Logistic-Hazard (CHI) typically performs best with 25 grid points. In general, the differences in performance between the two models are very small. We note, however, that the best performing model for the smallest training set is the PC-Hazard with a Kaplan–Meier grid of size 5, both in terms of MSE and concordance. Finally, we again see that the Kaplan–Meier quantiles seem to give slightly better performance than the equidistant discretization when the grids are coarse.

In Fig. 7 in Appendix A, we have included a plot of the same type as Fig. 5 for the Logistic-Hazard (CHI) method, the Logistic-Hazard method, the PMF method, and the PC-Hazard method. The figure again shows that the PMF method performs slightly worse than the other methods, while the PC-Hazard method performs similarly to the Logistic-Hazard (CHI) estimates.

4.5 Summary of simulations

To summarize the results of the simulations, we have shown that the size of the discretization grid (number of \(\tau _j\)’s) has a large impact on the performance of the methods, and therefore needs to be carefully tuned. Finer grids enable the methods to reduce bias in the predictions but require more parameters in the neural networks (higher variance). By defining the discretization grid with Kaplan–Meier quantiles, the performance for the coarser grids typically improves, while it has no apparent effect for finer grids.

Interpolation of the discrete-time survival estimates alleviates some of the sensitivity to the discretization scheme. For the coarser grids, interpolation was generally found to improve performance, while it had little effect on the performance for finer grids. The performance of the two proposed interpolation schemes, CHI and CDI, was more or less indistinguishable.

Comparing the three methods, we found that PMF did not perform as well as the Logistic-Hazard, both in terms of best-case performance and stability to discretization-grid configurations. PC-Hazard was found to be competitive with the interpolated Logistic-Hazard method and even performed better for the smallest training set. But the differences between all methods were small, and the size of the training sets and the grid size were shown to have a much larger impact on the performance than the choice of method.

5 Experiments with real data

We now compare the methods discussed in this paper to other methods in the literature, in particular DeepHit (Lee et al. 2018), DeepSurv (Katzman et al. 2018), Cox-Time (Kvamme et al. 2019), CoxCC (Kvamme et al. 2019), Random Survival Forests (RSF, Ishwaran et al. 2008), and a regular Cox regression.

We conduct the comparison on five common real-world data sets: the Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT), the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC), the Rotterdam tumor bank and German Breast Cancer Study Group (Rot. & GBSG), the Assay Of Serum Free Light Chain (FLCHAIN), and the National Wilms Tumor Study (NWTCO). Katzman et al. (2018) made the first three data sets available in their python package DeepSurv, and we have done no further preprocessing of the data. FLCHAIN and NWTCO are available in the survival package in R (Therneau 2015), but we use the same version of FLCHAIN as Kvamme et al. (2019). No alterations were made to the NWTCO data set. The size, the number of covariates, and the proportion of censored individuals in each data set are given in Table 1.

Table 1 Data sets for comparing survival methods

5.1 Hyperparameter tuning and evaluation

The experiments were conducted by five-fold cross-validation. For each split, a separate hyperparameter search was conducted, meaning that for each method we may end up with separate hyperparameters for each split.

For hyperparameter tuning, we set aside 20% of the data in each split as a validation set (corresponding to 16% of the full data set). For each method considered, we used the tuning criteria proposed in the original papers for choosing the best hyperparameters.

For the methods presented in this paper, we cannot use the validation loss for the hyperparameter search, as was done in the simulations in Sect. 4.1. This is because we now need to include the discretization scheme as part of the hyperparameter search, and the losses (negative log-likelihoods) depend on the granularity of the discretization (through the number of output nodes), so they are not comparable across schemes. Instead, we use the integrated Brier score (IBS) by Graf et al. (1999), computed over 100 equidistant points between the minimum and maximum observed times in the validation set. The IBS considers both discrimination and calibration of the survival estimates, and accounts for censored individuals by weighting the score by the inverse of the estimated censoring distribution.

For evaluation, we consider the IBS in addition to the time-dependent concordance (Antolini et al. 2005). In contrast to the IBS, the concordance only evaluates the discriminative capabilities of a method’s predictions. It is interesting to study both metrics, as there might be a trade-off between well-calibrated estimates and good discriminative performance. For these real data sets, we do not know the true survival functions, and we can therefore not use the MSE (19) reported in the simulation studies.

The experiments were conducted using the same hyperparameter search and training strategy as presented in Section 6.1 of the paper by Kvamme et al. (2019), except that we multiplied the learning rate by 0.8 at the start of each cycle, as this was found to give more stable training. For the methods requiring discretization of the time scale, the hyperparameter search considered grid sizes of 5, 25, 50, 100, and 200, both with equidistant spacing and by using the Kaplan–Meier quantiles. The best parameter configuration for each method in each of the five cross-validation splits was fitted 10 times, and we calculated the median concordance and integrated Brier score (IBS) of the 10 repetitions and averaged these over the five folds.

5.2 Results

The results are presented in Tables 2 and 3. In terms of concordance, we see that DeepHit and PC-Hazard perform very well. The three Logistic-Hazard methods and Cox-Time all perform close to PC-Hazard, while the PMF, RSF, and the other Cox methods perform slightly worse. The concordances of the two proposed interpolation schemes, CHI and CDI, are very similar, but the CDI method tends to give slightly better scores. There does not, however, seem to be much gain in concordance from interpolation.

Table 2 Concordance from 5-fold cross-validation on real-world data sets
Table 3 Integrated Brier score from 5-fold cross-validation on real-world data sets

Examining the IBS in Table 3 (smaller is better) we again find that PC-Hazard performs very well. But now, DeepHit does quite poorly. This is expected as DeepHit is designed for discrimination rather than well-calibrated estimates (see Kvamme et al. 2019). In general, the PMF, the RSF, and the three proportional Cox methods seem to have slightly higher IBS than the Hazard methods, but again the differences are quite small. Cox-Time performs quite well on all data sets except for FLCHAIN and NWTCO. Comparing the interpolation schemes of Logistic-Hazard, it seems that CDI still performs slightly better than CHI, although both are quite close to the discrete estimates of Logistic-Hazard.

In summary, all three methods discussed in this paper are competitive with existing survival methodology. However, the interpolated Logistic-Hazard and the PC-Hazard seem to give the most stable high performance considering both discrimination and calibration.

6 Discussion

In this paper, we have explored survival methodology built on neural networks for discrete-time data, and how it can be applied for continuous-time prediction. We have compared two existing discrete-time survival methods that minimize the negative log-likelihood of right-censored event times, where the first method (Lee et al. 2018) parameterizes the event time probability mass function (PMF), while the second method (Gensheimer and Narasimhan 2019) parameterizes the discrete hazard rate (Logistic-Hazard). Through empirical studies of simulated and real data sets, we found that the Logistic-Hazard method typically performs better than the PMF parametrization, both in terms of discrimination and calibration of the survival predictions.

We proposed two interpolation schemes for the discrete methods. Both schemes were found to improve predictions for methods with a coarse discretization of the time scale. In particular, as coarser discretization reduces the number of network parameters, the interpolation schemes gave the largest improvements when applied to smaller data sets. The two interpolation schemes were found to perform very similarly.

We also proposed a new continuous-time method that assumes constant hazard in predefined time intervals (PC-Hazard). The method was found to perform very well compared to existing methods, both in terms of discrimination and calibration. Furthermore, in the simulation study, we found that PC-Hazard performed better than the interpolated Logistic-Hazard method for the coarsest discretization grids, which was particularly beneficial for the smallest training set.

All three methods investigated in this paper need some form of discretization or coarsening of the time scale. In that regard, we proposed a simple scheme that uses the quantiles of the event time distribution estimated by Kaplan–Meier, and showed through simulations that the quantile-based grids typically outperformed equidistant grids for coarser grids.

In summary, we found that all three methods perform quite similarly, and the choice of discretization has a larger impact on performance than the choice of method. For the discrete methods, interpolation of the survival predictions will typically make the performance less sensitive to the discretization scheme.

Some interesting further development of our work would be to extend the models to allow for left-truncated event times and competing risks. While these are common topics in survival analysis, the literature on neural network extensions is quite limited.

Finally, we have only considered time-to-event prediction given the information that is available at the outset of a study. One could also be interested in predicting the remaining time until an event given observation of time-dependent covariates and that the event has not yet happened by a time \(t_0\). The landmarking approach of van Houwelingen and Putter (2011) offers one possible framework for such dynamic predictions.