Abstract
Length-biased data are frequently encountered in prevalent cohort studies. Many statistical methods have been developed to estimate covariate effects on survival outcomes arising from such data while properly adjusting for length-biased sampling. Among them, regression methods based on the proportional hazards model have been widely adopted. However, little work has focused on checking the assumptions of the proportional hazards model with length-biased data, which is essential to ensure the validity of inference. In this article, we propose a statistical tool for testing the assumed functional form of covariates and the proportional hazards assumption, both graphically and analytically, under length-biased sampling, through a general class of multiparameter stochastic processes. The finite sample performance is examined through simulation studies, and the proposed methods are illustrated with data from a cohort study of dementia in Canada.
References
Asgharian M, M’Lan CE, Wolfson DB (2002) Length-biased sampling with right censoring: an unconditional approach. J Am Stat Assoc 97:201–209
Asgharian M, Wolfson DB, Zhang X (2006) Checking stationarity of the incidence rate using prevalent cohort survival data. Stat Med 25:1751–1767
Borgan O, Zhang Y (2015) Using cumulative sums of martingale residuals for model checking in nested case-control studies. Biometrics 71:696–703
Chan KCG, Chen YQ, Di CZ (2012) Proportional mean residual life model for right-censored length-biased data. Biometrika 99:995–1000
Cox DR (1972) Regression models and life-tables (with discussion). J R Stat Soc Ser B (Methodol) 34:187–220
de Una-Alvarez J, Otero-Giraldez MS, Alvarez-Llorente G (2003) Estimation under length-bias and right-censoring: an application to unemployment duration analysis for married women. J Appl Stat 30:283–291
Huang CY, Qin J (2012) Composite partial likelihood estimation under length-biased sampling, with application to a prevalent cohort study of dementia. J Am Stat Assoc 107:946–957
Huang CY, Luo X, Follmann DA (2011) A model checking method for the proportional hazards model with recurrent gap time data. Biostatistics 12:535–547
Kosorok MR (2008) Introduction to empirical processes and semiparametric inference. Springer, New York
Lancaster T (1979) Econometric methods for the duration of unemployment. Econometrica 47:939–956
Lin DY, Wei LJ, Ying ZL (1993) Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80:557–572
Lu W, Liu M, Chen YH (2014) Testing goodness-of-fit for the proportional hazards model based on nested case-control data. Biometrics 70:845–851
Pepe MS, Fleming TR (1991) Weighted Kaplan-Meier statistics: large sample and optimality considerations. J R Stat Soc Ser B 53:341–352
Qin J, Shen Y (2010) Statistical methods for analyzing right-censored length-biased data under Cox model. Biometrics 66:382–392
Qin J, Ning J, Liu H, Shen Y (2011) Maximum likelihood estimations and EM algorithms with length-biased data. J Am Stat Assoc 106:1434–1449
Shen Y, Ning J, Qin J (2009) Analyzing length-biased data with semiparametric transformation and accelerated failure time models. J Am Stat Assoc 104:1192–1202
Spiekerman CF, Lin DY (1996) Checking the marginal Cox model for correlated failure time data. Biometrika 83:143–156
Tsai WY (2009) Pseudo-partial likelihood for proportional hazards models with biased-sampling data. Biometrika 96:601–615
Wang MC (1996) Hazards regression analysis for length-biased data. Biometrika 83:343–354
Wang HJ, Wang L (2014) Quantile regression analysis of length-biased survival data. Stat 3:31–47
Wolfson C, Wolfson DB, Asgharian M, M’Lan CE, Ostbye T, Rockwood K, Hogan DB (2001) A reevaluation of the duration of survival after the onset of dementia. N Engl J Med 344:1111–1116
Zelen M, Feinleib M (1969) On the theory of screening for chronic diseases. Biometrika 56:601–614
Acknowledgements
This work was partially supported by the U.S. National Institutes of Health, Grants CA193878 and CA016672. The authors thank Professor Asgharian and the investigators from the Canadian Study of Health and Aging for generously sharing the dementia data. The data reported in this article were collected as part of the Canadian Study of Health and Aging. The core study was funded by the Seniors Independence Research Program, through the National Health Research and Development Program (NHRDP) of Health Canada Project 6606-3954-MC(S). Additional funding was provided by Pfizer Canada, Incorporated, through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP Project 6603-1417-302(R), Bayer Incorporated, and the British Columbia Health Research Foundation Projects 38 (93-2) and 34 (96-1). The study was coordinated through the University of Ottawa and the Division of Aging and Seniors, Health Canada. The authors also acknowledge the Texas Advanced Computing Center at The University of Texas at Austin for providing HPC resources that contributed to the research results reported within this paper.
Appendices
Appendix A: regularity conditions
We assume the following regularity conditions for the large sample properties:
(1) \((Y_i,A_i,\delta _i,\varvec{Z}_i)\) are independent and identically distributed for \(i=1,\ldots ,n\).
(2) The parameter vector \(\varvec{\beta }_0\) lies in the interior of a known compact set.
(3) The covariates \(\varvec{Z}\) are bounded, and \(\pmb {\alpha }=0\) almost surely if \(\pmb {\alpha }^\top \varvec{Z}=0\) with probability one.
(4) The baseline cumulative hazard function \(\varLambda _0(\cdot )\) is differentiable with \(\varLambda _0(\tau )<\infty \), where \(\tau \) satisfies \(\Pr (Y\ge \tau )>0\).
(5) \(\varGamma (\varvec{\beta })\) is positive definite.
(6) \(0<w_C(\tau )<\infty \) and \(\int _0^\tau \left[ \left\{ \int _t^\tau S_C(u)\mathrm {d}u\right\} ^2/\left\{ S_C^2(t)S_V(t)\right\} \right] \mathrm {d}S_C(t)<\infty \), where \(S_V(\cdot )\) is the survival function of the residual survival time.
Appendix B: Proof of Theorem 1
By applying the Taylor series expansions, we have
and
The stochastic process \(\varvec{G}(t,\varvec{z})\) can be decomposed into two terms, \(\varvec{G}_1(t,\varvec{z})\) and \(\varvec{G}_2(t,\varvec{z})\), as follows.
We exploit the Taylor expansion and empirical process approximation techniques. It is straightforward to show that the first term can be approximated by
Then, we re-express the second term based on Eqs. (5) and (6).
where
and
The second equation can be derived by plugging in Eq. (5). The third equation naturally follows by replacing \(\sqrt{n}(\widehat{\varvec{\beta }}-\varvec{\beta }_0)\) with Eq. (6) after some algebra (Qin and Shen 2010). Note that the leading term in the last equation can be rewritten as
where
in which \(\varLambda _C(t)\) is the cumulative hazard function of the residual censoring time and \(S_V(t)\) is the survival function of the residual survival time. The last equation can be obtained by expressing \(\{\widehat{w}_C(y)-w_C(y)\}\) as an i.i.d. sum of martingales (Pepe and Fleming 1991). Finally, the general class of stochastic processes \(\varvec{G}(t,\varvec{z})\) can be asymptotically represented by
Under the regularity conditions, for any given \(\varvec{z}\), \(\varvec{G}_i^*(t,\varvec{z})\) is a mean zero process bounded on \([0,\tau ]\). The collection of such processes forms a Donsker class (Kosorok 2008). Thus, as \(n\rightarrow \infty \), the summation of \(\varvec{G}_i^*(t,\varvec{z})\) in (8) converges weakly to a mean zero Gaussian process with asymptotic covariance function \(\text{ E }\left\{ \varvec{G}_i^*(t_1,\varvec{z}_1)\varvec{G}_i^*(t_2,\varvec{z}_2)^\top \right\} \).
Appendix C: Proof of Theorem 2
We note that \(\varGamma _Z(\varvec{\beta }_0,t,\varvec{z}) \{\varGamma (\varvec{\beta }_0)\}^{-1}\) in the third term of (7) converges in probability to a non-random function. Conditional on the observed data, the process (7) is a linear combination of independent normally distributed processes with mean zero. Thus, given that \(\widehat{\varvec{\beta }}\) is a consistent estimator for \(\varvec{\beta }_0\), we can show that \(n^{-1}\sum _{i=1}^n\widehat{\varvec{G}}_i^*(t_1,\varvec{z}_1)\widehat{\varvec{G}}_i^*(t_2,\varvec{z}_2)^\top \) converges in probability to the asymptotic covariance function \(\text{ E }\left\{ \varvec{G}_i^*(t_1,\varvec{z}_1)\varvec{G}_i^*(t_2,\varvec{z}_2)^\top \right\} \) as \(n\rightarrow \infty \). By applying the multiplier central limit theorem (Kosorok 2008), it follows that conditional on the observed data and \(\varvec{z}\), \(n^{-1/2}\widetilde{\varvec{G}}_m(t,\varvec{z})\) and \(n^{-1/2}\sum _{i=1}^n{\varvec{G}}_i^*(t,\varvec{z})\) converge to the same mean zero Gaussian process.
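The resampling argument above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a scalar component of the process evaluated on a finite grid of \((t,\varvec{z})\) points, and it takes the estimated per-subject contributions as a precomputed array. The names `g_obs`, `g_hat`, `multiplier_sup_test` and `empirical_covariance` are hypothetical, and the supremum statistic is one common choice for such tests rather than something specified in this appendix.

```python
import numpy as np


def empirical_covariance(g_hat):
    """Plug-in covariance n^{-1} * sum_i g_hat[i] g_hat[i]^T over the grid.

    g_hat : (n, K) array of estimated per-subject contributions
            (hypothetical input, assumed to be computed elsewhere).
    """
    g_hat = np.asarray(g_hat, dtype=float)
    return g_hat.T @ g_hat / g_hat.shape[0]


def multiplier_sup_test(g_obs, g_hat, n_draws=1000, seed=0):
    """Approximate the null distribution of sup|G| by Gaussian multipliers.

    g_obs : (K,) observed process on the grid (hypothetical input).
    g_hat : (n, K) estimated per-subject contributions.
    Returns the observed supremum, the simulated suprema, and a p-value.
    """
    g_obs = np.asarray(g_obs, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    n = g_hat.shape[0]
    rng = np.random.default_rng(seed)
    # One independent standard normal multiplier per subject and per draw.
    xi = rng.standard_normal((n_draws, n))
    # Row m holds n^{-1/2} * sum_i g_hat[i] * xi[m, i] on the grid.
    g_tilde = xi @ g_hat / np.sqrt(n)
    sup_obs = float(np.max(np.abs(g_obs)))
    sup_sim = np.max(np.abs(g_tilde), axis=1)
    # Proportion of simulated suprema at least as large as the observed one.
    p_value = float(np.mean(sup_sim >= sup_obs))
    return sup_obs, sup_sim, p_value


# Synthetic inputs, for illustration only (not data from the study).
demo_rng = np.random.default_rng(42)
demo_g_hat = demo_rng.standard_normal((200, 30))
demo_g_obs = demo_g_hat.sum(axis=0) / np.sqrt(200)
demo_sup, demo_sims, demo_p = multiplier_sup_test(demo_g_obs, demo_g_hat)
```

Holding the observed data fixed and redrawing only the multipliers is what makes each simulated path conditionally Gaussian with covariance given by `empirical_covariance(g_hat)`; plotting a few rows of the simulated paths against the observed process would give the graphical version of the check.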
Cite this article
Lee, C.H., Ning, J. & Shen, Y. Model diagnostics for the proportional hazards model with length-biased data. Lifetime Data Anal 25, 79–96 (2019). https://doi.org/10.1007/s10985-018-9422-y
DOI: https://doi.org/10.1007/s10985-018-9422-y