Median Analysis of Repeated Measures Associated with Recurrent Events in Presence of Terminal Event

Rajeshwari Sundaram; Ling Ma; Subhashis Ghoshal

doi:10.1515/ijb-2016-0057

Publicly Available Published by De Gruyter April 28, 2017

Median Analysis of Repeated Measures Associated with Recurrent Events in Presence of Terminal Event

Rajeshwari Sundaram , Ling Ma and Subhashis Ghoshal

From the journal The International Journal of Biostatistics

https://doi.org/10.1515/ijb-2016-0057

Abstract

Recurrent events are often encountered in medical follow up studies. In addition, such recurrences have other quantities associated with them that are of considerable interest, for instance medical costs of the repeated hospitalizations and tumor size in cancer recurrences. These processes can be viewed as point processes, i.e. processes with arbitrary positive jump at each recurrence. An analysis of the mean function for such point processes have been proposed in the literature. However, such point processes are often skewed, leading to median as a more appropriate measure than the mean. Furthermore, the analysis of recurrent event data is often complicated by the presence of death. We propose a semiparametric model for assessing the effect of covariates on the quantiles of the point processes. We investigate both the finite sample as well as the large sample properties of the proposed estimators. We conclude with a real data analysis of the medical cost associated with the treatment of ovarian cancer.

Keywords: recurrent events; quantile regression; informative censoring; estimating equations; survival analysis

1 Introduction

Recurrent event data frequently arise in medical and epidemiological studies, where each subject may experience a number of “failures” over the course of follow up. Examples of recurrent failure events include tumor recurrences, recurrent infections, and repeated hospitalizations. In most of these examples, an additional process denoting the size of the tumor or cost associated with treating infection or hospitalization, is also of interest. These processes are no more simple recurrent events with jumps of 0 or 1 but are point processes with arbitrary positive jumps. An additional challenge in such studies is that the follow-up time is also informatively censored by terminal events, like death. Statistical methods for analyzing recurrent events has received considerable attention starting with the works of Andersen and Gill [1], Prentice et al. [2], Pepe and Cai [3], Lin et al. [4] and Wang et al. [5]. Approaches to analyze recurrent events data in presence of terminal events based on correlated marginal modeling approach to model these two processes was proposed by Ghosh and Lin [6] where they used a partially specified correlation structure for the dependence of the two processes. Huang and Wang [7], Sun et al. [8] and Huang et al. [9] proposed a joint modeling approach where the dependence between the recurrent times and failure times was captured through a frailty term with unknown distribution function. Other approaches have been proposed by Lancaster and Intrator [10], Liu et al. [11], Ye et al. [12] and Rondeau et al. [13], where the dependence condition between the recurrent event process and survival terms was relaxed, but the frailty term was assumed to be parametric. However, very limited work has been done in dealing with point processes in presence of terminal event. Strawderman [14] proposed nonparametric estimate for the mean of a point process. Lin et al. [15] proposed a semiparametric transformation model for the point processes based on transformation models and Sundaram [16] proposed modeling in presence of terminal events.

Motivated by medical costs data to assess the cost over survival time, methods for assessing mean cost using a linear regression was proposed by Lin [17] as well as proportional means model by Lin [18]. Modeling of cost-accumulation process has also been studied by many in the literature using a joint modeling approach accounting for informative observation times, for example, Liu et al. [19] and references therein.

Medical costs are usually skewed to the right and may have excess of near-zero costs. Consequently, quantiles may be more appropriate for assessing costs. Motivated by these issues, Huang [20] proposed a calibration regression model for censored lifetime medical costs. Quantile regression for censored medical cost has been studied by Bang and Tsiatis [21]. Recently, quantile regression for the gap times associated with recurrent events has been studied by Luo et al. [22]. However, assessing cost associated with recurrent hospitalizations have not been studied using quantile regression. We propose a joint modeling approach for assessing the quantiles of the cost process associated with a recurrent event process with a frailty that captures the dependence between the recurrent event process and the survival times. We further model the cost associated with each recurrent event using a gamma distribution.

The outline of the paper is as follows. In Section 2, we present our notation and the proposed modeling approach for complete data as well as for data in presence of a terminal event. We also discuss the large sample properties in Section 3. The finite sample properties are investigated through simulations in Section 4 and an analysis of the cost associated with ovarian cancer patients from the SEER-Medicare medical costs is presented in Section 5. In this paper, we considered the incident part of the database.

2 Notation, semiparametric model

2.1 Notation

Let {N(t):0≤t≤τ,} denote the observed counting process for an individual up to time τ, which denotes the end of the study. At this stage, we assume it to be fixed. The event times, i.e., the times of occurrence of the recurrent events are denoted by 0<t1<t2<⋯tm<τ. We assume that the recurrent event process N(⋅) follows a non-homogeneous Poisson process with intensity/rate function λ0(t)eU′γ, where U is a vector of covariates and γ the corresponding vector of regression coefficients and λ0(⋅) denotes the baseline intensity function. The cumulative rate function Λ(t) is given by

Λ(t)=Λ0(t)eU′γ=∫0tλ0(s)eU′γds.

A medical cost ωj is associated to each event time tj. We will assume for now that ωj are i.i.d. Gamma(α,β) and moreover, ωj are independent of N(⋅). Recall that the gamma density for the cost ωj is given by

f(ω)=βαΓ(α)e−βωωα−1, where ω>0 and Γ(α)=∫xα−1e−xdx.

Thus, the total cost S(t)=∑j=1N(t)ωj given N(t)=k follows Gamma(αk,β).

2.2 Modeling in presence of independent right censored data

Note that S(t)=0 if and only if N(t)=0. Consequently, for x≥0, we have

P(S(t)≤x|U)=P(N(t)=0|U)+∑k=1∞e−Λ0(t)eU′γ(Λ0(t)eU′γ)kk!βαkΓ(αk)∫0xe−βyyαk−1dy=e−Λ0(t)eU′γ+∑k=1∞e−Λ0(t)eU′γ(Λ0(t)eU′γ)kk!1Γ(αk)∫0βxe−yyαk−1dy.

Using the convention that for k=0,

1Γ(αk)∫0xe−βyyαk−1dy=1 for any x≥0,

one can write

P(S(t)≤xβ|U)=∑k=0∞e−Λ0(t)eU′γ(Λ0(t)eU′γ)kk!1Γ(αk)∫0xe−yyαk−1dy.

This implies that β acts like a scale parameter in the distribution of S(t).

We are interested in inference concerning the median of S(t), i.e., the quantity

Q(t;Λ0(⋅),γ,α,β|U)={0 if e−Λ0(t)eU′γ≥12solution ofeq.(1) below if e−Λ0(t)eU′γ<12,

where the equation to be solved (for the variable x) is given by

(1)∑k=0∞e−Λ0(t)eU′γ(Λ0(t)eU′γ)kk!βαkΓ(αk)∫0xe−βyyαk−1dy=12.

Consequently,

Q(t;Λ0(⋅),γ,α,β|U)=βQˉ(Λ0(t)eU′γ,α),

where

Qˉ(s,α)={0 if s≤ln(2),solution of eq.(2) below if s>ln(2),

where the equation to be solved (for x) is given by

(2)∑k=0∞e−sskk!1Γ(αk)∫0xe−yyαk−1dy=12.

Let

G(x,y,α)=∑k=0∞e−yykk!1Γ(αk)∫0xe−ssαk−1ds−12.

Note that G(x,y,α) is strictly increasing and continuous in x and when y>ln2, for all α, we have

limx→0G(x,y,α)=e−y−12<0 and limx→∞G(x,y,α)=12>0.

Consequently, eq. (2) has unique solution in x when y>ln2.

2.3 Modeling in presence of terminal events

Let us fix notation first. As before, let U denote the covariate vector of interest and Z the underlying frailty capturing the dependence across various processes. Let (U,Z,C,D,N(⋅),S(⋅)) denote various processes of interest, where D denotes death and C denotes the independent censoring variable. We assume that given (U,Z), the processes (C,D,N(⋅)) are mutually independent.

The total observation period is denoted by [0,τ]. Recall that the intensity λ for the recurrent event process is given by

λ(t)=zλ0(t)eU′γ, Λ0(t)=∫0tλ0(s)ds with Λ0(τ)=1.

Moreover, we also assume

E(Z|U)=E(Z)

and the model for the death process is given by

h(t)=zh0(t)eU′η.

We also denote

D˜=min{C,D,τ},Δi=I(Di≤D˜),μZ=E(Z) and Uˉ=(1,U).

We adapt the Huang and Wang [7] approach for estimation for (Λ0,H0,γ,η). We present them below for sake of completeness. They proposed the following estimate Λˆ0(t) for Λ0(t),namely,

Λˆ0(t)=∏s(l)\gt1−d(l)R(l),

where {s(l)} are the ordered and distinct values of event times {tij} and

d(l)=# of events at s(l) and R(l)=∑i,jI{tij≤s(l)≤D˜i}.

Adapting Huang and Wang [7] approach, estimates for (logμZ,γ)=γ˜ are obtained through solving

1n∑i=1nwiUˉi′[mi{Λˆ0(D˜i)}−1−eUˉi′γ˜]=0,

where wi are weights associated with each subject chosen to improve efficiency and mi=Ni(D˜i) is the total number of observed events for individual i. We next present the estimates for the death model as proposed in Huang and Wang [7]. Denote

Zˆi=miΛˆ0(D˜i)eUi′γˆ.

Estimate of η1 is given by the solution of the following estimating equation

1n∑i=1nΔiUi−∑j=1nUjZˆjeUj′ηI(D˜j≥D˜i)∑j=1nZˆjeUj′ηI(D˜j≥D˜i)=0

and

Hˆ0(t)=∑1≤i≤n,D˜i≤tΔi∑k=1nZˆkeUk′η˜I(D˜k≥D˜i).

We will now present our estimate for the median cost in the presence of death,

where only S(t∧C∧D) is observable. Note that incurring medical cost beyond death is not possible. So, to identify the median cost in this setup, the appropriate cost process is S(t∧D). So,

P(S(t∧D)≤x|U)=P(N(t∧D)=0)+∑k=1∞P(N(t∧D))=k|U)FS(x;αk,β),

where FS(x;αk,β)=βαk/Γ(αk)∫0xe−βyyαk−1dy, is the cumulative distribution function of a Gamma(αk,β) distributed random variable. We want the median to be dependent on the covariates and so we take

P(S(t∧D)≤x|U)=∫P(S(t∧D)≤x|U,Z)dFZ(z)=∫P(S(t)≤x,t≤D|U,Z)dFZ(z)+∫P(S(t)≤x,t>D|U,Z)dFZ(z)=I+II.

Note now that due to the assumption of independence of (C,D,N(⋅)) given (U,Z), we have

Similarly,

II=∫∫0tP(S(D)≤x|D=d,U,Z)dFD(d|U,Z)dFZ(z)=∫∫0tP(S(d)≤x|U,Z)dFD(d|U,Z)dFZ(z).

So, the median at time t is given by the solution to the equation

∫P(S(t)≤x|U,Z)SD(t|U,Z)dFZ(z)+∫∫0tP(S(d)≤x|U,Z)dFD(d|U,Z)dFZ(z)=12.

Focusing on P(S(t)≤x|U,Z) first, we note

P(S(t)≤x|U,Z)=P(N(t)=0|U,Z)+∑k=1∞P(N(t)=k|U,Z)FS(x;αk,β,U)=e−ZΛ0(t)eU′γ+∑k=1∞e−ZΛ0(t)eU′γ(ZΛ0(t)eU′γ)kk!FS(x;αk,β,U).

Taking expected value we have the equation

(3)EZ{e−ZΛ0(t)eU′γ+∑k=1∞e−ZΛ0(t)eU′γ(ZΛ0(t)eU′γ)kk!FS(x;αk,β,U)}SD(t|U,Z)+EZ∫0t{e−ZΛ0(d)eU′γ+∑k=1∞e−ZΛ0(d)eU′γ(ZΛ0(d)eU′γ)kk!FS(x;αk,β,U)}dFD(d|U,Z)=12.

Finally, we replace all quantities by their respective estimates to obtain

1n∑i=1n{e−ZˆiΛˆ0(t)eU′γˆ+∑k=1∞e−ZˆiΛˆ0(t)eU′γˆ(ZˆiΛˆ0(t)eU′γˆ)kk!FS(x;αˆk,βˆ,U)}SˆD(t|U,Zˆi)+1n∑i=1n[∑j=1n{e−ZˆiΛˆ0(D˜j)eU′γˆ+]∑k=1∞e−ZˆiΛˆ0(D˜j)eU′γˆ(ZˆiΛˆ0(D˜j)eU′γˆ)kk!FS(x;αˆk,βˆ,U)}I(D˜j≤t)dFˆD(D˜j|U,Zˆi))=12.

Consequently, solving the above equation for x yields the estimate for the median Q(t;Λ0(⋅),H0(⋅),γ,η,α,β|U).

Remark: In practice, direct plugin of Zˆ1,⋯,Zˆn did not yield good numerical results. However, computing the expectation with respect to the distribution of Z in eq. (3) by estimating its density fZ by using Kernel density estimation of Zˆ1,⋯,Zˆn yielded much improved results. We applied the methods by Sheather and Jones [23] for bandwidth selection and Wand and Jones [24] for kernel density estimation.

3 Asymptotic results

We first focus on the independent right censoring case discussed in Section 2.2. Let Λˆ0,αˆ,βˆ,γˆ be the estimates of the corresponding parameters based on observed data. We can then estimates Q(t;Λ0(⋅),α,β,γ|U) by

Qˆ(t)=Q(t;Λˆ0(⋅),αˆ,βˆ,γˆ|U)=βˆQˉ(Λˆ0(t)eU′γ,αˆ).

In order to find the asymptotic distribution of Qˆ, we apply the delta method to the joint asymptotic distribution of (Λˆ,αˆ,βˆ,γˆ). An estimating equation approach for estimating (Λˆ0(⋅),γˆ) was proposed in Wang et al. [5], where they also obtained asymptotic normality. Following the same proof, one can deduce joint asymptotic normality provided one finds the corresponding covariance term. The estimators (αˆ,βˆ) are separately obtained from the cost data and they are also asymptotically normal. Thus, (Λˆ0,αˆ,βˆ,γˆ) are jointly asymptotically normal with the cross-covariance between (Λˆ0(⋅),γˆ) and (αˆ,βˆ) being zero. Thus we need to calculate the partial derivatives ∂Q∂l,∂Q∂γ,∂Q∂α,∂Q∂β where l=Λ0(t) and evaluate at the true values of (Λ0(t),γ,α,β) to obtain asymptotic variance of Qˆ. Since the true values of these parameters are unknown, we will finally substitute the corresponding estimates Λˆ0,γˆ,αˆ,βˆ. A calculation of derivatives yield

∂Q∂l=β∂Qˉ(y,α)∂y|y=leU′γeU′γ,∂Q∂γ=β∂Qˉ(y,α)∂y|y=leU′γleU′γz∂Q∂α=β∂Qˉ(y,α)∂α|y=leU′γ and ∂Q∂β=Qˉ(leU′γ,α).

Thus, we would like to compute ∂Qˉ(y,α)∂y,∂Qˉ(y,α)∂α. As Qˉ(y,α) is defined implicitly as the solution of eq. (2) (or equivalently, G(x,y,α)=0), we apply the implicit function theorem to calculate the derivatives. This yields

∂Q‾∂y=∂G∂y|(Q‾(y,α),y,α)∂G∂x|(Q‾(y,α),y,α) and ∂Q‾∂α=∂G∂α|(Q‾(y,α),y,α)∂G∂x|(Q‾(y,α),y,α).

Now we can easily compute ∂G∂y as

∂G∂y=∑k=0∞e−yykk!1Γ(α(k+1))I(x;α(k+1))−1Γ(αk)I(x;αk),

where I(x;r)=∫0xe−uur−1du is the incomplete gamma integral and I(x;r)Γ(r)=1 if r=0. Likewise,

∂G∂α=∑k=1∞e−yykk!1Γ(αk)k∫0xe−uuαk−1lnudu−Γ′(αk)Γ(αk)I(x;αk)

and

∂G∂x=∑k=1∞e−yykk!1Γ(αk)e−xxαk−1.

Plugging in, and recalling l=Λ0(t) and denoting yˆ=Λˆ0(t)eU′γˆ, we obtain

4Jˆ1Jˆ2Jˆ3Jˆ4:=∂Q∂l∂Q∂γ∂Q∂α∂Q∂β(Λˆ0(t),γˆ,αˆ,βˆ)=βˆeU′γˆ∑k=0∞yˆkk!I(Qˉ(yˆ,αˆ);αˆ(k+1))Γ(αˆ(k+1)−I(Qˉ(yˆ,αˆ);αˆk)Γ(αˆk∑k=1∞yˆkk!e−Qˉ(yˆ,αˆ)Γ(αˆk)[Qˉ(yˆ,αˆ)]αˆk−1zΛˆ0(t)βˆeU′γˆ∑k=0∞yˆkk!I(Qˉ(yˆ,αˆ);αˆ(k+1))Γ(αˆ(k+1)−I(Qˉ(yˆ,αˆ);αˆk)Γ(αˆk∑k=1∞yˆkk!e−Qˉ(yˆ,αˆ)Γ(αˆk)[Qˉ(yˆ,αˆ)]αˆk−1βˆ∑k=1∞yˆkk!αˆΓ′(αˆk)I(Q(yˆ,αˆ))Γ2(αˆk)−∫0Qˉ(yˆ,αˆ)e−uuαˆk−1lnuduΓ(αˆk)∑k=1∞yˆkk!1Γ(αˆk)e−Qˉ(yˆ,αˆ)[Qˉ(yˆ,αˆ)]αˆk−1Qˉ(yˆ,αˆ)

Let as in Wang et al. [5],

b(t)=∑j=1N(t)∫tT01(tj≤u≤Y)dQ(u)R2(u)−1{t<tj≤T0}R(tj)

and

e=∫wxˉ′mb(y)dV(w,xˉ,m,y)F(y)+wxˉ′[mF−1(y)−exˉ′γ.

Let Σ=D(e) be the dispersion matrix and let Ψ=E−∂e∂γ. Then,

n(γˆ−γ)=1n∑i=1nΨ−1ei+op(1).

Consequently, the asymptotic dispersion matrix for Λˆ0(t)γˆ is given by

E(d2(t))E(d(t)e(t))′(Ψ′)−1Ψ−1E(d(t)e(t))Ψ−1Σ(Ψ′)−1, where d(t)=F(t)[c+Λ0(T0)b(t).

If N i.i.d. samples from Gamma(α,β) distribution are R1,⋯,RN and (αˆ,βˆ) is the maximum likelihood estimate of (α,β), then (αˆ,βˆ) has asymptotic dispersion matrix

1N(Γ′(α))2−Γ′′(α)Γ(α)Γ2(α)1β1βαβ2.

In our case, for n individual, we have N=∑i=1nNi(T0) and the cost data is

ω11,⋯,ω1m1;⋯,ωn1,⋯,ωnmn.

Let Nn→ξ in probability and let ξˆ=Nn. The variance-covariance matrix can then be approximated by

1nnN(Γ′(αˆ))2−Γ′′(αˆ)Γ(αˆ)Γ2(αˆ)1βˆ1βˆαˆβˆ.

Thus, the asymptotic dispersion matrix of (Λˆ0(t),γˆ,αˆ,βˆ) can be approximated by

1nEˆ(d2(t))Eˆ[d(t)e(t)′(Ψ′)−100Eˆ[d(t)Ψ−1e(t)Ψˆ−1ΣˆΨˆ−10000n{(Γ′(αˆ))2−Γ′′(αˆ)Γ(αˆ)}NΓ2(αˆ)nNβˆ00nNβˆnαˆNβˆ2,

where Eˆ(⋅),Ψˆ,Σˆ are bootstrap estimates as defined in Wang et al. [5]. Thus, Qˆ(t) is asymptotically normal with mean Q(t) and its asymptotic variance can be approximated by

1nJˆ12Eˆ(d2(t))+2Jˆ1Eˆ[d(t)e(t)′(Ψ′)−1]Jˆ2+Jˆ2′Ψˆ−1ΣˆΨˆ−1+1NJˆ32Γ′(αˆ)2−Γ′′(αˆ)Γ(αˆ)Γ2(αˆ)+2Jˆ3Jˆ4βˆ+Jˆ42αˆβˆ2.

The above can be used to test hypothesis and construct confidence interval about Q(⋅). All this holds provided Λ0(t)eU′γ>ln2, or equivalently, median is the unique solution of eq. (2). This will hold provided |z| is bounded and t is appropriately large, which we will assume.

Remark 1: The asymptotic distribution in presence of terminal events follows similar steps to the independent censoring case. Using the asymptotic properties of the underlying parameters (Λˆ0,αˆ,βˆ,γˆ,ηˆ,H0ˆ) established in Huang and Wang [7] and the delta method, we can use similar steps as before and establish the asymptotic results under reasonable regularity conditions.

Remark 2: In practice, the covariance structure established through asymptotic theory has been shown to perform poorly due to the complexity of the structure, which requires various numerical approximations reducing the efficiency, see Wang et al. [5] and Huang and Wang [7]. Consequently, we used bootstrapping to estimate the variance in our inference procedure.

4 Simulation studies

We conducted simulations to investigate the finite sample properties of the proposed estimates for the median cost process. We generated data from a recurrent event process with the following intensity process

λ(t)=zλ0(t)exp(U′γ) withλ0(t)=14t−(1/2),

with the covariate U∼N(0,1) and z∼Gamma(α=2,β=0.2) such that the mean is 10. We generated the survival times for the terminal event from a proportional hazards model with baseline hazards as constant and a covariate effect 0.5 and an administrative censoring of τ=4. The estimation was investigated for 30% and 50% censoring of the survival time. Further, we generated the individual costs ω∼Gamma(α,β) with α=25,β=0.5. We generated samples of size n=250 and n=500. For each of the samples, we used m=200 number of bootstraps with 1000 replicates. We are not presenting the estimates for the recurrent events and the survival times model as they have been previously investigated in Huang and Wang [7]. Also, the estimates of the parameters for the cost model is obtained by maximization of the gamma likelihood based on complete data. Hence, their finite sample performance is also well-studied in the literature. We focus on the performance of the estimate for the median cost. In Table 1, we present the quantile estimates for the two sample sizes in absence of death, ie. only censoring by τ=4.

Table 1

Estimated median cost at 5-, 10-, 25-, 50-, 75-, 90- and 95-th percentiles of event times in absence of terminal event.

		n=250				n=500
t	Q(t)	Bias	SSD	BSD	CP	Bias	SSD	BSD	CP
0.01	0	0	0	0.30	1	0	0	0.01	1
0.04	45.84	−0.06	1.54	1.60	0.971	−0.09	1.09	1.11	0.954
0.25	112.83	0.50	5.16	5.15	0.943	0.23	3.69	3.66	0.949
1.00	241.35	−0.19	7.23	7.05	0.945	0.04	4.95	5.00	0.953
2.25	365.89	−0.16	8.93	8.97	0.948	0.06	6.22	6.37	0.958
3.24	440.98	−0.28	10.01	10.01	0.954	0.04	6.91	7.09	0.959
3.61	465.95	−0.18	10.29	10.34	0.952	0.08	7.14	7.32	0.954

We do find the biases to be small for all sample sizes investigated. The coverage probability based on Wald type confidence intervals appears closer to the nominal level with increasing sample size. In Table 2, we present the results for simulation in presence of terminal events. We compared the performance based on two differing censoring percentages of death, 30% and 50%, ie, 70% were observed dead during the follow-up and 50% were observed dead for the two censoring setups. Again, the biases for estimating the median costs are small irrespective of sample size, with coverage probability getting closer to the nominal level as the sample size increases. Furthermore, the coverage probabilities for the confidence intervals appear reasonable for both the 30% and 50% censoring.

Table 2

Estimated median cost at 5-, 10-, 25-, 50-, 75-, 90- and 95-th percentiles of event times with 30% and 50% censoring of death.

30%censoring of death
		n=250				n=500
t	Q(t)	Bias	SSD	BSD	CP	Bias	SSD	BSD	CP
0.002	0	0	0	0	1	0	0	0	1
0.010	0	0	0	0.00	1	0	0	0	1
0.064	47.17	−0.06	2.08	2.18	0.983	−0.32	1.46	1.46	0.957
0.300	99.19	−0.22	5.36	5.85	0.997	−0.98	3.66	3.76	0.979
0.930	156.35	−0.91	9.69	9.84	0.969	−2.37	6.71	6.63	0.942
1.960	195.90	−3.12	12.34	12.05	0.938	−4.66	8.69	8.48	0.929
2.680	208.19	−2.88	12.69	12.49	0.950	−4.70	8.80	8.65	0.925
50%censoring of death
		n=250				n=500
t	Q(t)	Bias	SSD	BSD	CP	Bias	SSD	BSD	CP
0.005	0	0	0	0	1	0	0	0	1
0.019	0	0.43	3.75	4.78	0.996	0.03	1.00	1.48	1
0.120	57.50	1.10	4.43	4.78	0.960	0.24	2.57	2.88	0.959
0.533	138.69	0.58	9.13	9.35	0.952	−0.58	6.46	6.69	0.957
1.478	214.01	1.83	13.33	13.24	0.949	−0.44	9.06	9.34	0.955
2.637	264.11	0.89	16.01	15.79	0.949	−1.60	10.89	11.16	0.955
3.228	281.48	0.28	17.08	16.64	0.943	−2.29	11.65	11.83	0.942

We also conducted simulation studies to assess the robustness of the proposed method on the specification of parametric model for cost. Specifically, we conducted analysis using Gamma distribution while the underlying true model is a mixture of Gamma and Half-Normal distributions, with proportion of half-normal equalling 0.1. The estimation results under this mis-specification with informative terminal event are given in Table 3. The results appear reasonable with coverage probabilties close to the nominal level and no significant bias.

In running our simulations, we also investigated the computational time for calculating the proposed estimate. They appear reasonable, with the running time of one-sample estimation of Q(t) in presence of terminal event with 30% censoring and n=250 takes around 2 min in Windows 10, Intel i7-6700 CPU @3.40GHz, 64.0 GB RAM using R x64 3.3.1.

Table 3

Estimated median cost at 5-, 10-, 25-, 50-, 75-, 90- and 95-th percentiles of event times with 30% and 50% censoring of death when parametric model of cost is mis-specified.

30%censoring of death
		n=250				n=500
t	Q(t)	Bias	SSD	BSD	CP	Bias	SSD	BSD	CP
0.002	0	0	0	0	1	0	0	0	1
0.010	0	0	0	0.00	1	0	0	0	1
0.064	47.50	−0.45	2.11	2.21	0.980	−0.68	1.48	1.47	0.926
0.300	98.94	0.05	5.40	5.88	0.995	−0.74	3.68	3.79	0.978
0.930	153.51	2.07	9.60	9.86	0.966	0.45	6.69	6.65	0.963
1.960	191.43	1.42	12.25	12.05	0.951	−0.18	8.66	8.50	0.940
2.680	204.15	1.20	12.61	12.50	0.956	−0.71	8.76	8.68	0.954
50%censoring of death
		n=250				n=500
t	Q(t)	Bias	SSD	BSD	CP	Bias	SSD	BSD	CP
0.005	0	0	0	0	1	0	0	0	1
0.019	0	0.38	3.37	4.67	0.997	0	0	1.51	1
0.120	58.09	0.51	4.24	4.79	0.951	−0.30	2.55	2.88	0.941
0.533	138.34	0.84	9.31	9.28	0.944	−0.28	6.61	6.69	0.949
1.478	211.98	3.73	13.36	13.20	0.945	1.65	9.17	9.35	0.954
2.637	260.94	4.03	16.15	15.77	0.937	1.61	10.97	11.17	0.954
3.228	278.24	3.42	17.16	16.62	0.940	0.97	11.77	11.84	0.953

5 Application to the SEER-Medicare data

The Surveillance, Epidemiology and End-Results (SEER) -Medicare linked data are population-based data for studying cancer epidemiology and quality of cancer-related health services. The SEER-Medicare linked data consist of a linkage of two large population-based databases, SEER and Medicare. We considered data of cancer incidence diagnosed between 1991 and 2010. The Medicare data contain information on medical costs between 1986 and 2004. The linked data consist of cancer patients in the SEER data who were enrolled in Medicare during the study period of the Medicare data. Details of each data and linkage are discussed in Warren et al. [25]. Although the linkage criterion sounds simple, it creates a left truncated and right censored sample because the two data sets have different starting times. In the SEER-Medicare linked data, patients diagnosed with cancer before 1991 form a prevalent cohort, because only those patients who survived through 1991 were included. Patients diagnosed with cancer after January 1, 1991 form an incident cohort, because those patients were recruited at the onset of disease. Patients survived through 2010 were considered censored. Our sample consisted of n=33,417 patients with an average observation window of 1,774 days. These patients had in total 87,533 hospitalizations with an average number of 2.6 hospitalizations per patient.

We present our analysis of repeated hospitalization and accumulated cost over these hospitalizations. We focused on the association between the stage at diagnosis, as determined by AJCC and age at diagnosis on the repeated hospitalization, as well as the accumulated cost over these repeated hospitalizations. We corrected the time from diagnosis for the length of stay in the hospital for each hospitalization as the patient was not at risk for hospitalization during that time. The cost was modeled using Gamma distribution such that the mean of the distribution was modeled as

logμ=ξ1Age+∑j=14ξ(j+1)I{Stage=j}+ξ6LOS,

where LOS refers to the length of stay in a hospital, measured in days. The repeated hospitalization was modeled using non-homogenous Poisson process with the following covariates U=(Age,I{Stage=2},I{Stage=3},I{Stage=4}) and the time to death being modeled as the proportional hazards model with covariate U. We present our results for the cost model, recurrent hospitalizations model as well as the time to death model analysis in Tables 4 and 5, respectively. We used 1,000 bootstrap samples in our analysis.

Table 4

Analysis of the SEER-Medicare data: cost associated with ovarian cancer.

Par	Est	SD	95% C.I.
Shape: α	1.186	0.009	(1.169, 1.204)
Age: ξ1	−0.008	0.001	(−0.009, −0.007)
Stage 1: ξ2	10.286	0.046	(10.195, 10.377)
Stage 2: ξ3	10.315	0.049	(10.220, 10.410)
Stage 3: ξ4	10.367	0.047	(10.275, 10.458)
Stage 4: ξ5	10.247	0.048	(10.153, 10.340)
LOS: ξ6	1.377	0.024	(1.330, 1.425)

Table 5

Estimation results for medical costs.

(a) Event process
Par	Est	SD	95% C.I.
Age: γ1	0.032	0.001	(0.030, 0.034)
Stage 2: γ2	0.401	0.046	(0.310, 0.491)
Stage 3: γ3	0.901	0.034	(0.835, 0.968)
Stage 4: γ4	1.172	0.034	(1.105, 1.239)
(b) Death process
Par	Est	SD	95% C.I.
Age: η1	0.054	0.001	(0.052, 0.057)
Stage 2: η2	0.745	0.060	(0.628, 0.861)
Stage 3: η3	1.546	0.044	(1.461, 1.632)
Stage 4: η4	2.083	0.045	(1.994, 2.172)

The estimated effects of the covariates in Table 4 indicates that the stage of diagnosis significantly increased the mean cost of hospitalization. Furthermore, the length of stay in hospital has significant positive association with the cost of hospitalization as expected. We also found that age had a small negative impact on cost of hospitalization after adjusting for the stage and length of stay.

In our analysis of the repeated hospitalization, we find each of the covariates age, as well as Stage at diagnosis to have significantly positively association with intensity of repeat hospitalization indicating that older women were at higher risk for more number of hospitalization, as well as that Stages 2 through 4 had significantly more number of hospitalizations than Stage 1. Our model for the time to death indicates that the higher stages were associated with increased risk for death, so also was age of diagnosis significantly associated with increased risk of death.

In Figure 1, we present the histogram of the cost associated with each stage of illness. As can be seen, the cost data has highly right skewed, thus indicating a need for looking at more robust measures of centrality than the mean. We made comparisons of the estimated cumulative median cost with the approach of Lin [17], which models the cumulative mean cost under proportional means model. We applied their method assuming each recurrent event as a terminal event for the subject.

Figure 1

Histogram of costs by stage: (a): Stage 1; (b) Stage 2; (c) Stage 3; (d): Stage 4.

Figure 2

Comparison of the mean cost process [17] and the median cost process (ours) by stage.

Next, in Figure 2 we compared the Lin estimates for the mean cumulative cost with the median cumulative cost proposed here for a woman diagnosed at average age of 70 years. We also provide the survival distribution for each stage of cancer. Observe that the mean cumulative cost estimate for each stage is much higher than the cumulative median cost estimate, thus confirming the observation that the costs are sensitive to the high values observed in Figure 1. We further observe that the cumulative median cost for the Stages 3 and 4 are much higher than Stages 1 and 2 earlier on. We observe that the cost for Stage 3 crosses over that for Stage 4 at early time, whereas the cumulative median cost for Stages 1 and 2 crosses over that for Stages 3 and 4 at a later time point. This may be a consequence of the difference in survival times seen in Figure 2(c) with the probability of survival longer is much higher among individuals diagnosed with Stages 1 and 2 as compared to individuals diagnosed with Stages 3 and 4. The estimates of Lin [17] does not capture this cross over, as an underlying assumption of their method is the proportionality across means with respect to the covariates. Finally, we present in Table 6 the median cumulative cost with 95% confidence interval for each stage evaluated at various percentiles of observed times of hospitalizations. The estimates do indicate overall increasing costs by stage. They further indicate a significantly higher cost than Stage 1 for Stage 4 from 1.4 years and that of Stages 2 and 3 from approximately 4 years.

Table 6

Estimated median cost at 5-, 10-, 25-, 50-, 75-, 90- and 95-th percentiles of event times for patients diagnosed at age 70 at stages 1–4.

t	Qˆ(t)	BSD	Wald C.I.
Diagnosed at stage 1
0.01	0	0	(0, 0)
0.03	0	0	(0, 0)
0.20	0	0	(0, 0)
1.38	0	0	(0, 0)
3.98	5,467.89	1,140.37	(3,232.76, 7,703.21)
7.71	21,783.21	1,435.39	(18,969.85, 24,596.58)
10.44	29,883.30	1,651.10	(26,647.15, 33,119.45)
Diagnosed at stage 2
0.01	0	0	(0, 0)
0.03	0	0	(0, 0)
0.20	0	0	(0, 0)
1.38	0	0	(0, 0)
3.98	11,185.42	1,305.14	(8,627.36, 13,743.48)
7.71	24,189.58	1,612.19	(21,029.69, 27,349.48)
10.44	29,577.06	1,781.50	(26,085.32, 33,068.81)
Diagnosed at stage 3
0.01	0	0	(0, 0)
0.03	0	0	(0, 0)
0.20	0	0	(0, 0)
1.38	2,216.994	1,131.396	(−0.543, 4,434.053)
3.98	17,631.72	1,264.64	(15,153.02, 20,110.42)
7.71	26,798.57	1,461.89	(23,933.27, 29,663.86)
10.44	30,063.22	1,555.36	(27,014.71, 33,111.73)
Diagnosed at stage 4
0.01	0	0	(0, 0)
0.03	0	0	(0, 0)
0.20	0	0	(0, 0)
1.38	4,933.15	973.54	(3,025.02, 6,841.29)
3.98	15,565.83	1,116.94	(13,376.63, 17,755.03)
7.71	21,001.93	1,239.37	(18,572.76, 23,431.10)
10.44	22,898.22	1,303.43	(20,343.50, 25,452.95)

6 Discussion

In this article, we proposed a regression analysis approach to recurring medical cost with incomplete follow-up data. This conceptually simple regression model is semiparametric in the sense that the underlying intensity model for the recurrent times are completely unspecified. We further focused on more robust quantile process estimation rather than the usual mean process. An advantage is that one can easily use the proposed median process estimation to estimate the higher or lower order quantiles (example, 5th percentile, 95th percentile etc). This allows us to assess the effects of covariates on the higher or lower quantile values (example, high expenders or low expenders) rather than on the mean process where the covariate effects may be averaged over. Additionally, we have proposed an approach that tracks expenses with random recurrent times rather than at fixed ad hoc scheduled time.

Our proposed method is computationally easy to implement and exhibited good finite sample properties. In our real data analysis, we focused on the incident part of the data collected by SEER-Medicare program. However, it would be interesting to also include the prevalent part of the data, which leads to the terminal time being left truncated and right censored, causing the cost process to be length biased and informatively censored. Finally, we have used a parametric model to estimate the costs. We did find gamma distribution to be a reasonable choice but it would be interesting to fit a semi-continuous distribution for the costs. In this paper, we have focused on the cost associated with hospitalizations. However, these methods can be easily adapted to handle other processes like tumor sizes associated with recurrent events.

Our proposed method is based on the Huang and Wang [7] approach, which assumes that given the covariates and the frailty, the recurrent event process and the survival term are independent. However, one can also adapt the approach of Ye et al. [12] where this conditional independence is relaxed but assumes that the frailty term is parametrically distributed. This is something we are interested in investigating, along with incorporation of correlated frailty term which allows for different dependence structure between the recurrent event process and the survival term as well as between the cost and the recurrent event process.

Finally, to fully capture the dependencies between cost of hospitalization and the recurrent hospitalization, a nonparametric approach to jointly modeling the repeated hospitalization and the cost associated with it is interesting and worth pursuing in the future.

Funding statement: This work was supported by the Intramural Research Program of the U.S., National Institutes of Health, Eunice Kennedy Shriver National Institute of Child Health and Human Development.

Acknowledgements:

The author acknowledges that this study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Maryland (http://biowulf.nih.gov).

1. Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. Ann Stat 1982;10:1100–20.10.1214/aos/1176345976Search in Google Scholar

2. Prentice RL, Williams BJ, Peterson AV. On the regression analysis of multivariate failure time data. Biometrika 1981;68:373–9.10.1093/biomet/68.2.373Search in Google Scholar

3. Pepe MS, Cai J. Some graphical displays and marginal regression analyses for recurrent failure times and time dependent covariates. J Am Stat Assoc 1993;88:811–20.10.1080/01621459.1993.10476346Search in Google Scholar

4. Lin DY, Wei LJ, Yang I, Ying Z. Semiparametric regression for the mean and rate functions of recurrent events. J R Stat Soc Ser B Stat Methodol 2000;62:711–30.10.1111/1467-9868.00259Search in Google Scholar

5. Wang MC, Qin J, Chiang CT. Analyzing recurrent event data with informative censoring. J Am Stat Assoc 2001;96:1057–65.10.1198/016214501753209031Search in Google Scholar PubMed PubMed Central

6. Ghosh D, Lin DY. Semiparametric analysis of recurrent events data in the presence of dependent censoring. Biometrics 2003;59:877–85.10.1111/j.0006-341X.2003.00102.xSearch in Google Scholar

7. Huang CY, Wang MC. Joint modeling and estimation for recurrent event processes and failure time data. J Am Stat Assoc 2004;99:1153–65.10.1198/016214504000001033Search in Google Scholar PubMed PubMed Central

8. Sun J, Tong X, He X. Regression analysis of panel count data with dependent observation times. Biometrics 2007;63:1053–9.10.1111/j.1541-0420.2007.00808.xSearch in Google Scholar PubMed

9. Huang CY, Qin J, Wang MC. Semiparametric analysis for recurrent event data with time-dependent covariates and informative censoring. Biometrics 2010;66:39–49.10.1111/j.1541-0420.2009.01266.xSearch in Google Scholar PubMed PubMed Central

10. Lancaster T, Intrator O. Panel data with survival: hospitalization of hiv-positive patients. J Am Stat Assoc 1998;93:46–53.10.1080/01621459.1998.10474086Search in Google Scholar

11. Liu L, Wolfe RA, Huang X. Shared frailty models for recurrent events and a terminal event. Biometrics 2004;60:747–56.10.1111/j.0006-341X.2004.00225.xSearch in Google Scholar PubMed

12. Ye Y, Kalbfleisch JD, Schaubel DE. Semiparametric analysis of correlated recurrent and terminal events. Biometrics 2007;63:78–87.10.1111/j.1541-0420.2006.00677.xSearch in Google Scholar PubMed

13. Rondeau V, Mathoulin-Pelissier S, Jacqmin-Gadda H, Brouste V, Soubeyran P. Joint frailty models for recurring events and death using maximum penalized likelihood estimation: application on cancer events. Biostatistics 2007;8:708–21.10.1093/biostatistics/kxl043Search in Google Scholar PubMed

14. Strawderman RL. Estimating the mean of an increasing stochastic process at a censored stopping time. J Am Stat Assoc 2000;95:1192–208.10.1080/01621459.2000.10474320Search in Google Scholar

15. Lin DY, Wei LJ, Ying Z. Semiparametric transformation models for point processes. J Am Stat Assoc 2001;96:620–8.10.1198/016214501753168299Search in Google Scholar

16. Sundaram R. Robust estimation for analyzing recurrent-event data in the presence of terminal events. In: Datta, Segal, Fine, Biswas, editors. Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics. New Jersey: John Wiley and Sons, 2007;245–64.Search in Google Scholar

17. Lin DY. Linear regression analysis of censored medical costs. Biostatistics 2000;1:35–47.10.1093/biostatistics/1.1.35Search in Google Scholar PubMed

18. Lin DY. Proportional means regression for censored medical costs. Biometrics 2000;56:775–8.10.1111/j.0006-341X.2000.00775.xSearch in Google Scholar

19. Liu L, Huang X, O’Quigley J. Analysis of longitudinal data in the presence of informative observational times and a dependent terminal event, with application to medical cost data. Biometrics 2008;64:950–8.10.1111/j.1541-0420.2007.00954.xSearch in Google Scholar PubMed

20. Huang Y. Calibration regression of censored lifetime medical cost. J Am Stat Assoc 2002;97:318–27.10.1198/016214502753479446Search in Google Scholar

21. Bang H, Tsiatis AA. Median regression with censored cost data. Biometrics 2002;58:643–9.10.1111/j.0006-341X.2002.00643.xSearch in Google Scholar

22. Luo X, Huang CY, Wang L. Quantile regression for recurrent gap time data. Biometrics 2013;69:375–85.10.1111/biom.12010Search in Google Scholar PubMed PubMed Central

23. Sheather SJ, Jones MC. A reliable data-based bandwidth selection method for kernel density estimation. J R Stat Soc Ser B (Methodol) 1991;53:683–90.10.1111/j.2517-6161.1991.tb01857.xSearch in Google Scholar

24. Wand MP, Jones MC. Kernel smoothing. London: Chapman and Hall/CRC, 1994.10.1201/b14876Search in Google Scholar

25. Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the seer-medicare data: content, research applications, and generalizability to the united states elderly population. Med Care 2002;40:IV-3–18.10.1097/00005650-200208001-00002Search in Google Scholar

Published Online: 2017-4-28

Median Analysis of Repeated Measures Associated with Recurrent Events in Presence of Terminal Event

Abstract

1 Introduction

2 Notation, semiparametric model

2.1 Notation

2.2 Modeling in presence of independent right censored data

2.3 Modeling in presence of terminal events

3 Asymptotic results

4 Simulation studies

5 Application to the SEER-Medicare data

6 Discussion

Acknowledgements:

Journal and Issue

Articles in the same Issue