Abstract
Recurrent events are often encountered in medical follow up studies. In addition, such recurrences have other quantities associated with them that are of considerable interest, for instance medical costs of the repeated hospitalizations and tumor size in cancer recurrences. These processes can be viewed as point processes, i.e. processes with arbitrary positive jump at each recurrence. An analysis of the mean function for such point processes have been proposed in the literature. However, such point processes are often skewed, leading to median as a more appropriate measure than the mean. Furthermore, the analysis of recurrent event data is often complicated by the presence of death. We propose a semiparametric model for assessing the effect of covariates on the quantiles of the point processes. We investigate both the finite sample as well as the large sample properties of the proposed estimators. We conclude with a real data analysis of the medical cost associated with the treatment of ovarian cancer.
1 Introduction
Recurrent event data frequently arise in medical and epidemiological studies, where each subject may experience a number of “failures” over the course of follow up. Examples of recurrent failure events include tumor recurrences, recurrent infections, and repeated hospitalizations. In most of these examples, an additional process denoting the size of the tumor or cost associated with treating infection or hospitalization, is also of interest. These processes are no more simple recurrent events with jumps of
Motivated by medical costs data to assess the cost over survival time, methods for assessing mean cost using a linear regression was proposed by Lin [17] as well as proportional means model by Lin [18]. Modeling of cost-accumulation process has also been studied by many in the literature using a joint modeling approach accounting for informative observation times, for example, Liu et al. [19] and references therein.
Medical costs are usually skewed to the right and may have excess of near-zero costs. Consequently, quantiles may be more appropriate for assessing costs. Motivated by these issues, Huang [20] proposed a calibration regression model for censored lifetime medical costs. Quantile regression for censored medical cost has been studied by Bang and Tsiatis [21]. Recently, quantile regression for the gap times associated with recurrent events has been studied by Luo et al. [22]. However, assessing cost associated with recurrent hospitalizations have not been studied using quantile regression. We propose a joint modeling approach for assessing the quantiles of the cost process associated with a recurrent event process with a frailty that captures the dependence between the recurrent event process and the survival times. We further model the cost associated with each recurrent event using a gamma distribution.
The outline of the paper is as follows. In Section 2, we present our notation and the proposed modeling approach for complete data as well as for data in presence of a terminal event. We also discuss the large sample properties in Section 3. The finite sample properties are investigated through simulations in Section 4 and an analysis of the cost associated with ovarian cancer patients from the SEER-Medicare medical costs is presented in Section 5. In this paper, we considered the incident part of the database.
2 Notation, semiparametric model
2.1 Notation
Let
A medical cost
Thus, the total cost
2.2 Modeling in presence of independent right censored data
Note that
Using the convention that for
one can write
This implies that
We are interested in inference concerning the median of
where the equation to be solved (for the variable
Consequently,
where
where the equation to be solved (for
Let
Note that
Consequently, eq. (2) has unique solution in
2.3 Modeling in presence of terminal events
Let us fix notation first. As before, let
The total observation period is denoted by
Moreover, we also assume
and the model for the death process is given by
We also denote
We adapt the Huang and Wang [7] approach for estimation for
where
Adapting Huang and Wang [7] approach, estimates for
where
Estimate of
and
We will now present our estimate for the median cost in the presence of death,
where only
where
Note now that due to the assumption of independence of
Similarly,
So, the median at time
Focusing on
Taking expected value we have the equation
Finally, we replace all quantities by their respective estimates to obtain
Consequently, solving the above equation for
Remark: In practice, direct plugin of
3 Asymptotic results
We first focus on the independent right censoring case discussed in Section 2.2. Let
In order to find the asymptotic distribution of
Thus, we would like to compute
Now we can easily compute
where
and
Plugging in, and recalling
Let as in Wang et al. [5],
and
Let
Consequently, the asymptotic dispersion matrix for
If
In our case, for
Let
Thus, the asymptotic dispersion matrix of
where
The above can be used to test hypothesis and construct confidence interval about
Remark 1: The asymptotic distribution in presence of terminal events follows similar steps to the independent censoring case. Using the asymptotic properties of the underlying parameters
Remark 2: In practice, the covariance structure established through asymptotic theory has been shown to perform poorly due to the complexity of the structure, which requires various numerical approximations reducing the efficiency, see Wang et al. [5] and Huang and Wang [7]. Consequently, we used bootstrapping to estimate the variance in our inference procedure.
4 Simulation studies
We conducted simulations to investigate the finite sample properties of the proposed estimates for the median cost process. We generated data from a recurrent event process with the following intensity process
with the covariate
t | Bias | SSD | BSD | CP | Bias | SSD | BSD | CP | |
0.01 | 0 | 0 | 0 | 0.30 | 1 | 0 | 0 | 0.01 | 1 |
0.04 | 45.84 | 1.54 | 1.60 | 0.971 | 1.09 | 1.11 | 0.954 | ||
0.25 | 112.83 | 0.50 | 5.16 | 5.15 | 0.943 | 0.23 | 3.69 | 3.66 | 0.949 |
1.00 | 241.35 | 7.23 | 7.05 | 0.945 | 0.04 | 4.95 | 5.00 | 0.953 | |
2.25 | 365.89 | 8.93 | 8.97 | 0.948 | 0.06 | 6.22 | 6.37 | 0.958 | |
3.24 | 440.98 | 10.01 | 10.01 | 0.954 | 0.04 | 6.91 | 7.09 | 0.959 | |
3.61 | 465.95 | 10.29 | 10.34 | 0.952 | 0.08 | 7.14 | 7.32 | 0.954 |
We do find the biases to be small for all sample sizes investigated. The coverage probability based on Wald type confidence intervals appears closer to the nominal level with increasing sample size. In Table 2, we present the results for simulation in presence of terminal events. We compared the performance based on two differing censoring percentages of death,
t | Bias | SSD | BSD | CP | Bias | SSD | BSD | CP | |
0.002 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
0.010 | 0 | 0 | 0 | 0.00 | 1 | 0 | 0 | 0 | 1 |
0.064 | 47.17 | 2.08 | 2.18 | 0.983 | 1.46 | 1.46 | 0.957 | ||
0.300 | 99.19 | 5.36 | 5.85 | 0.997 | 3.66 | 3.76 | 0.979 | ||
0.930 | 156.35 | 9.69 | 9.84 | 0.969 | 6.71 | 6.63 | 0.942 | ||
1.960 | 195.90 | 12.34 | 12.05 | 0.938 | 8.69 | 8.48 | 0.929 | ||
2.680 | 208.19 | 12.69 | 12.49 | 0.950 | 8.80 | 8.65 | 0.925 | ||
t | Bias | SSD | BSD | CP | Bias | SSD | BSD | CP | |
0.005 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
0.019 | 0 | 0.43 | 3.75 | 4.78 | 0.996 | 0.03 | 1.00 | 1.48 | 1 |
0.120 | 57.50 | 1.10 | 4.43 | 4.78 | 0.960 | 0.24 | 2.57 | 2.88 | 0.959 |
0.533 | 138.69 | 0.58 | 9.13 | 9.35 | 0.952 | 6.46 | 6.69 | 0.957 | |
1.478 | 214.01 | 1.83 | 13.33 | 13.24 | 0.949 | 9.06 | 9.34 | 0.955 | |
2.637 | 264.11 | 0.89 | 16.01 | 15.79 | 0.949 | 10.89 | 11.16 | 0.955 | |
3.228 | 281.48 | 0.28 | 17.08 | 16.64 | 0.943 | 11.65 | 11.83 | 0.942 |
We also conducted simulation studies to assess the robustness of the proposed method on the specification of parametric model for cost. Specifically, we conducted analysis using Gamma distribution while the underlying true model is a mixture of Gamma and Half-Normal distributions, with proportion of half-normal equalling
In running our simulations, we also investigated the computational time for calculating the proposed estimate. They appear reasonable, with the running time of one-sample estimation of Q(t) in presence of terminal event with 30% censoring and n=250 takes around 2 min in Windows 10, Intel i7-6700 CPU @3.40GHz, 64.0 GB RAM using R x64 3.3.1.
t | Bias | SSD | BSD | CP | Bias | SSD | BSD | CP | |
0.002 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
0.010 | 0 | 0 | 0 | 0.00 | 1 | 0 | 0 | 0 | 1 |
0.064 | 47.50 | 2.11 | 2.21 | 0.980 | 1.48 | 1.47 | 0.926 | ||
0.300 | 98.94 | 0.05 | 5.40 | 5.88 | 0.995 | 3.68 | 3.79 | 0.978 | |
0.930 | 153.51 | 2.07 | 9.60 | 9.86 | 0.966 | 0.45 | 6.69 | 6.65 | 0.963 |
1.960 | 191.43 | 1.42 | 12.25 | 12.05 | 0.951 | 8.66 | 8.50 | 0.940 | |
2.680 | 204.15 | 1.20 | 12.61 | 12.50 | 0.956 | 8.76 | 8.68 | 0.954 | |
t | Bias | SSD | BSD | CP | Bias | SSD | BSD | CP | |
0.005 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
0.019 | 0 | 0.38 | 3.37 | 4.67 | 0.997 | 0 | 0 | 1.51 | 1 |
0.120 | 58.09 | 0.51 | 4.24 | 4.79 | 0.951 | 2.55 | 2.88 | 0.941 | |
0.533 | 138.34 | 0.84 | 9.31 | 9.28 | 0.944 | 6.61 | 6.69 | 0.949 | |
1.478 | 211.98 | 3.73 | 13.36 | 13.20 | 0.945 | 1.65 | 9.17 | 9.35 | 0.954 |
2.637 | 260.94 | 4.03 | 16.15 | 15.77 | 0.937 | 1.61 | 10.97 | 11.17 | 0.954 |
3.228 | 278.24 | 3.42 | 17.16 | 16.62 | 0.940 | 0.97 | 11.77 | 11.84 | 0.953 |
5 Application to the SEER-Medicare data
The Surveillance, Epidemiology and End-Results (SEER) -Medicare linked data are population-based data for studying cancer epidemiology and quality of cancer-related health services. The SEER-Medicare linked data consist of a linkage of two large population-based databases, SEER and Medicare. We considered data of cancer incidence diagnosed between 1991 and 2010. The Medicare data contain information on medical costs between 1986 and 2004. The linked data consist of cancer patients in the SEER data who were enrolled in Medicare during the study period of the Medicare data. Details of each data and linkage are discussed in Warren et al. [25]. Although the linkage criterion sounds simple, it creates a left truncated and right censored sample because the two data sets have different starting times. In the SEER-Medicare linked data, patients diagnosed with cancer before 1991 form a prevalent cohort, because only those patients who survived through 1991 were included. Patients diagnosed with cancer after January 1, 1991 form an incident cohort, because those patients were recruited at the onset of disease. Patients survived through 2010 were considered censored. Our sample consisted of
We present our analysis of repeated hospitalization and accumulated cost over these hospitalizations. We focused on the association between the stage at diagnosis, as determined by AJCC and age at diagnosis on the repeated hospitalization, as well as the accumulated cost over these repeated hospitalizations. We corrected the time from diagnosis for the length of stay in the hospital for each hospitalization as the patient was not at risk for hospitalization during that time. The cost was modeled using Gamma distribution such that the mean of the distribution was modeled as
where LOS refers to the length of stay in a hospital, measured in days. The repeated hospitalization was modeled using non-homogenous Poisson process with the following covariates
Par | Est | SD | 95% C.I. |
Shape: | 1.186 | 0.009 | (1.169, 1.204) |
Age: | 0.001 | ( | |
Stage 1: | 10.286 | 0.046 | (10.195, 10.377) |
Stage 2: | 10.315 | 0.049 | (10.220, 10.410) |
Stage 3: | 10.367 | 0.047 | (10.275, 10.458) |
Stage 4: | 10.247 | 0.048 | (10.153, 10.340) |
LOS: | 1.377 | 0.024 | (1.330, 1.425) |
(a) Event process | |||
Par | Est | SD | 95% C.I. |
Age: | 0.032 | 0.001 | (0.030, 0.034) |
Stage 2: | 0.401 | 0.046 | (0.310, 0.491) |
Stage 3: | 0.901 | 0.034 | (0.835, 0.968) |
Stage 4: | 1.172 | 0.034 | (1.105, 1.239) |
(b) Death process | |||
Par | Est | SD | 95% C.I. |
Age: | 0.054 | 0.001 | (0.052, 0.057) |
Stage 2: | 0.745 | 0.060 | (0.628, 0.861) |
Stage 3: | 1.546 | 0.044 | (1.461, 1.632) |
Stage 4: | 2.083 | 0.045 | (1.994, 2.172) |
The estimated effects of the covariates in Table 4 indicates that the stage of diagnosis significantly increased the mean cost of hospitalization. Furthermore, the length of stay in hospital has significant positive association with the cost of hospitalization as expected. We also found that age had a small negative impact on cost of hospitalization after adjusting for the stage and length of stay.
In our analysis of the repeated hospitalization, we find each of the covariates age, as well as Stage at diagnosis to have significantly positively association with intensity of repeat hospitalization indicating that older women were at higher risk for more number of hospitalization, as well as that Stages 2 through 4 had significantly more number of hospitalizations than Stage 1. Our model for the time to death indicates that the higher stages were associated with increased risk for death, so also was age of diagnosis significantly associated with increased risk of death.
In Figure 1, we present the histogram of the cost associated with each stage of illness. As can be seen, the cost data has highly right skewed, thus indicating a need for looking at more robust measures of centrality than the mean. We made comparisons of the estimated cumulative median cost with the approach of Lin [17], which models the cumulative mean cost under proportional means model. We applied their method assuming each recurrent event as a terminal event for the subject.
Next, in Figure 2 we compared the Lin estimates for the mean cumulative cost with the median cumulative cost proposed here for a woman diagnosed at average age of 70 years. We also provide the survival distribution for each stage of cancer. Observe that the mean cumulative cost estimate for each stage is much higher than the cumulative median cost estimate, thus confirming the observation that the costs are sensitive to the high values observed in Figure 1. We further observe that the cumulative median cost for the Stages 3 and 4 are much higher than Stages 1 and 2 earlier on. We observe that the cost for Stage 3 crosses over that for Stage 4 at early time, whereas the cumulative median cost for Stages 1 and 2 crosses over that for Stages 3 and 4 at a later time point. This may be a consequence of the difference in survival times seen in Figure 2(c) with the probability of survival longer is much higher among individuals diagnosed with Stages 1 and 2 as compared to individuals diagnosed with Stages 3 and 4. The estimates of Lin [17] does not capture this cross over, as an underlying assumption of their method is the proportionality across means with respect to the covariates. Finally, we present in Table 6 the median cumulative cost with
t | BSD | Wald C.I. | |
Diagnosed at stage 1 | |||
0.01 | 0 | 0 | (0, 0) |
0.03 | 0 | 0 | (0, 0) |
0.20 | 0 | 0 | (0, 0) |
1.38 | 0 | 0 | (0, 0) |
3.98 | 5,467.89 | 1,140.37 | (3,232.76, 7,703.21) |
7.71 | 21,783.21 | 1,435.39 | (18,969.85, 24,596.58) |
10.44 | 29,883.30 | 1,651.10 | (26,647.15, 33,119.45) |
Diagnosed at stage 2 | |||
0.01 | 0 | 0 | (0, 0) |
0.03 | 0 | 0 | (0, 0) |
0.20 | 0 | 0 | (0, 0) |
1.38 | 0 | 0 | (0, 0) |
3.98 | 11,185.42 | 1,305.14 | (8,627.36, 13,743.48) |
7.71 | 24,189.58 | 1,612.19 | (21,029.69, 27,349.48) |
10.44 | 29,577.06 | 1,781.50 | (26,085.32, 33,068.81) |
Diagnosed at stage 3 | |||
0.01 | 0 | 0 | (0, 0) |
0.03 | 0 | 0 | (0, 0) |
0.20 | 0 | 0 | (0, 0) |
1.38 | 2,216.994 | 1,131.396 | ( |
3.98 | 17,631.72 | 1,264.64 | (15,153.02, 20,110.42) |
7.71 | 26,798.57 | 1,461.89 | (23,933.27, 29,663.86) |
10.44 | 30,063.22 | 1,555.36 | (27,014.71, 33,111.73) |
Diagnosed at stage 4 | |||
0.01 | 0 | 0 | (0, 0) |
0.03 | 0 | 0 | (0, 0) |
0.20 | 0 | 0 | (0, 0) |
1.38 | 4,933.15 | 973.54 | (3,025.02, 6,841.29) |
3.98 | 15,565.83 | 1,116.94 | (13,376.63, 17,755.03) |
7.71 | 21,001.93 | 1,239.37 | (18,572.76, 23,431.10) |
10.44 | 22,898.22 | 1,303.43 | (20,343.50, 25,452.95) |
6 Discussion
In this article, we proposed a regression analysis approach to recurring medical cost with incomplete follow-up data. This conceptually simple regression model is semiparametric in the sense that the underlying intensity model for the recurrent times are completely unspecified. We further focused on more robust quantile process estimation rather than the usual mean process. An advantage is that one can easily use the proposed median process estimation to estimate the higher or lower order quantiles (example, 5th percentile, 95th percentile etc). This allows us to assess the effects of covariates on the higher or lower quantile values (example, high expenders or low expenders) rather than on the mean process where the covariate effects may be averaged over. Additionally, we have proposed an approach that tracks expenses with random recurrent times rather than at fixed ad hoc scheduled time.
Our proposed method is computationally easy to implement and exhibited good finite sample properties. In our real data analysis, we focused on the incident part of the data collected by SEER-Medicare program. However, it would be interesting to also include the prevalent part of the data, which leads to the terminal time being left truncated and right censored, causing the cost process to be length biased and informatively censored. Finally, we have used a parametric model to estimate the costs. We did find gamma distribution to be a reasonable choice but it would be interesting to fit a semi-continuous distribution for the costs. In this paper, we have focused on the cost associated with hospitalizations. However, these methods can be easily adapted to handle other processes like tumor sizes associated with recurrent events.
Our proposed method is based on the Huang and Wang [7] approach, which assumes that given the covariates and the frailty, the recurrent event process and the survival term are independent. However, one can also adapt the approach of Ye et al. [12] where this conditional independence is relaxed but assumes that the frailty term is parametrically distributed. This is something we are interested in investigating, along with incorporation of correlated frailty term which allows for different dependence structure between the recurrent event process and the survival term as well as between the cost and the recurrent event process.
Finally, to fully capture the dependencies between cost of hospitalization and the recurrent hospitalization, a nonparametric approach to jointly modeling the repeated hospitalization and the cost associated with it is interesting and worth pursuing in the future.
Funding statement: This work was supported by the Intramural Research Program of the U.S., National Institutes of Health, Eunice Kennedy Shriver National Institute of Child Health and Human Development.
Acknowledgements:
The author acknowledges that this study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Maryland (http://biowulf.nih.gov).
1. Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. Ann Stat 1982;10:1100–20.10.1214/aos/1176345976Search in Google Scholar
2. Prentice RL, Williams BJ, Peterson AV. On the regression analysis of multivariate failure time data. Biometrika 1981;68:373–9.10.1093/biomet/68.2.373Search in Google Scholar
3. Pepe MS, Cai J. Some graphical displays and marginal regression analyses for recurrent failure times and time dependent covariates. J Am Stat Assoc 1993;88:811–20.10.1080/01621459.1993.10476346Search in Google Scholar
4. Lin DY, Wei LJ, Yang I, Ying Z. Semiparametric regression for the mean and rate functions of recurrent events. J R Stat Soc Ser B Stat Methodol 2000;62:711–30.10.1111/1467-9868.00259Search in Google Scholar
5. Wang MC, Qin J, Chiang CT. Analyzing recurrent event data with informative censoring. J Am Stat Assoc 2001;96:1057–65.10.1198/016214501753209031Search in Google Scholar PubMed PubMed Central
6. Ghosh D, Lin DY. Semiparametric analysis of recurrent events data in the presence of dependent censoring. Biometrics 2003;59:877–85.10.1111/j.0006-341X.2003.00102.xSearch in Google Scholar
7. Huang CY, Wang MC. Joint modeling and estimation for recurrent event processes and failure time data. J Am Stat Assoc 2004;99:1153–65.10.1198/016214504000001033Search in Google Scholar PubMed PubMed Central
8. Sun J, Tong X, He X. Regression analysis of panel count data with dependent observation times. Biometrics 2007;63:1053–9.10.1111/j.1541-0420.2007.00808.xSearch in Google Scholar PubMed
9. Huang CY, Qin J, Wang MC. Semiparametric analysis for recurrent event data with time-dependent covariates and informative censoring. Biometrics 2010;66:39–49.10.1111/j.1541-0420.2009.01266.xSearch in Google Scholar PubMed PubMed Central
10. Lancaster T, Intrator O. Panel data with survival: hospitalization of hiv-positive patients. J Am Stat Assoc 1998;93:46–53.10.1080/01621459.1998.10474086Search in Google Scholar
11. Liu L, Wolfe RA, Huang X. Shared frailty models for recurrent events and a terminal event. Biometrics 2004;60:747–56.10.1111/j.0006-341X.2004.00225.xSearch in Google Scholar PubMed
12. Ye Y, Kalbfleisch JD, Schaubel DE. Semiparametric analysis of correlated recurrent and terminal events. Biometrics 2007;63:78–87.10.1111/j.1541-0420.2006.00677.xSearch in Google Scholar PubMed
13. Rondeau V, Mathoulin-Pelissier S, Jacqmin-Gadda H, Brouste V, Soubeyran P. Joint frailty models for recurring events and death using maximum penalized likelihood estimation: application on cancer events. Biostatistics 2007;8:708–21.10.1093/biostatistics/kxl043Search in Google Scholar PubMed
14. Strawderman RL. Estimating the mean of an increasing stochastic process at a censored stopping time. J Am Stat Assoc 2000;95:1192–208.10.1080/01621459.2000.10474320Search in Google Scholar
15. Lin DY, Wei LJ, Ying Z. Semiparametric transformation models for point processes. J Am Stat Assoc 2001;96:620–8.10.1198/016214501753168299Search in Google Scholar
16. Sundaram R. Robust estimation for analyzing recurrent-event data in the presence of terminal events. In: Datta, Segal, Fine, Biswas, editors. Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics. New Jersey: John Wiley and Sons, 2007;245–64.Search in Google Scholar
17. Lin DY. Linear regression analysis of censored medical costs. Biostatistics 2000;1:35–47.10.1093/biostatistics/1.1.35Search in Google Scholar PubMed
18. Lin DY. Proportional means regression for censored medical costs. Biometrics 2000;56:775–8.10.1111/j.0006-341X.2000.00775.xSearch in Google Scholar
19. Liu L, Huang X, O’Quigley J. Analysis of longitudinal data in the presence of informative observational times and a dependent terminal event, with application to medical cost data. Biometrics 2008;64:950–8.10.1111/j.1541-0420.2007.00954.xSearch in Google Scholar PubMed
20. Huang Y. Calibration regression of censored lifetime medical cost. J Am Stat Assoc 2002;97:318–27.10.1198/016214502753479446Search in Google Scholar
21. Bang H, Tsiatis AA. Median regression with censored cost data. Biometrics 2002;58:643–9.10.1111/j.0006-341X.2002.00643.xSearch in Google Scholar
22. Luo X, Huang CY, Wang L. Quantile regression for recurrent gap time data. Biometrics 2013;69:375–85.10.1111/biom.12010Search in Google Scholar PubMed PubMed Central
23. Sheather SJ, Jones MC. A reliable data-based bandwidth selection method for kernel density estimation. J R Stat Soc Ser B (Methodol) 1991;53:683–90.10.1111/j.2517-6161.1991.tb01857.xSearch in Google Scholar
24. Wand MP, Jones MC. Kernel smoothing. London: Chapman and Hall/CRC, 1994.10.1201/b14876Search in Google Scholar
25. Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the seer-medicare data: content, research applications, and generalizability to the united states elderly population. Med Care 2002;40:IV-3–18.10.1097/00005650-200208001-00002Search in Google Scholar
© 2017 Walter de Gruyter GmbH, Berlin/Boston