
Model diagnostics for the proportional hazards model with length-biased data

Lifetime Data Analysis

Abstract

Length-biased data are frequently encountered in prevalent cohort studies. Many statistical methods have been developed to estimate the covariate effects on the survival outcomes arising from such data while properly adjusting for length-biased sampling. Among them, regression methods based on the proportional hazards model have been widely adopted. However, little work has focused on checking the proportional hazards model assumptions with length-biased data, which is essential to ensure the validity of inference. In this article, we propose a statistical tool for testing the assumed functional form of covariates and the proportional hazards assumption graphically and analytically under the setting of length-biased sampling, through a general class of multiparameter stochastic processes. The finite sample performance is examined through simulation studies, and the proposed methods are illustrated with the data from a cohort study of dementia in Canada.


References

  • Asgharian M, M’Lan CE, Wolfson DB (2002) Length-biased sampling with right censoring: an unconditional approach. J Am Stat Assoc 97:201–209

  • Asgharian M, Wolfson DB, Zhang X (2006) Checking stationarity of the incidence rate using prevalent cohort survival data. Stat Med 25:1751–1767

  • Borgan O, Zhang Y (2015) Using cumulative sums of martingale residuals for model checking in nested case-control studies. Biometrics 71:696–703

  • Chan KCG, Chen YQ, Di CZ (2012) Proportional mean residual life model for right-censored length-biased data. Biometrika 99:995–1000

  • Cox DR (1972) Regression models and life-tables (with discussion). J R Stat Soc Ser B (Methodol) 34:187–220

  • de Una-Alvarez J, Otero-Giraldez MS, Alvarez-Llorente G (2003) Estimation under length-bias and right-censoring: an application to unemployment duration analysis for married women. J Appl Stat 30:283–291

  • Huang CY, Qin J (2012) Composite partial likelihood estimation under length-biased sampling, with application to a prevalent cohort study of dementia. J Am Stat Assoc 107:946–957

  • Huang CY, Luo X, Follmann DA (2011) A model checking method for the proportional hazards model with recurrent gap time data. Biostatistics 12:535–547

  • Kosorok MR (2008) Introduction to empirical processes and semiparametric inference. Springer, New York

  • Lancaster T (1979) Econometric methods for the duration of unemployment. Econometrica 47:939–956

  • Lin DY, Wei LJ, Ying ZL (1993) Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80:557–572

  • Lu W, Liu M, Chen YH (2014) Testing goodness-of-fit for the proportional hazards model based on nested case-control data. Biometrics 70:845–851

  • Pepe MS, Fleming TR (1991) Weighted Kaplan-Meier statistics: large sample and optimality considerations. J R Stat Soc Ser B 53:341–352

  • Qin J, Shen Y (2010) Statistical methods for analyzing right-censored length-biased data under Cox model. Biometrics 66:382–392

  • Qin J, Ning J, Liu H, Shen Y (2011) Maximum likelihood estimations and EM algorithms with length-biased data. J Am Stat Assoc 106:1434–1449

  • Shen Y, Ning J, Qin J (2009) Analyzing length-biased data with semiparametric transformation and accelerated failure time models. J Am Stat Assoc 104:1192–1202

  • Spiekerman CF, Lin DY (1996) Checking the marginal Cox model for correlated failure time data. Biometrika 83:143–156

  • Tsai WY (2009) Pseudo-partial likelihood for proportional hazards models with biased-sampling data. Biometrika 96:601–615

  • Wang MC (1996) Hazards regression analysis for length-biased data. Biometrika 83:343–354

  • Wang HJ, Wang L (2014) Quantile regression analysis of length-biased survival data. Stat 3:31–47

  • Wolfson C, Wolfson DB, Asgharian M, M’Lan CE, Ostbye T, Rockwood K, Hogan DB (2001) A reevaluation of the duration of survival after the onset of dementia. N Engl J Med 344:1111–1116

  • Zelen M, Feinleib M (1969) On the theory of screening for chronic diseases. Biometrika 56:601–614


Acknowledgements

This work was partially supported by the U.S. National Institutes of Health, Grants CA193878 and CA016672. The authors thank Professor Asgharian and the investigators from the Canadian Study of Health and Aging for generously sharing the dementia data. The data reported in this article were collected as part of the Canadian Study of Health and Aging. The core study was funded by the Seniors Independence Research Program, through the National Health Research and Development Program (NHRDP) of Health Canada Project 6606-3954-MC(S). Additional funding was provided by Pfizer Canada, Incorporated, through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP Project 6603-1417-302(R), Bayer Incorporated, and the British Columbia Health Research Foundation Projects 38 (93-2) and 34 (96-1). The study was coordinated through the University of Ottawa and the Division of Aging and Seniors, Health Canada. The authors also acknowledge the Texas Advanced Computing Center at The University of Texas at Austin for providing HPC resources that contributed to the research results reported within this paper.

Author information

Corresponding author

Correspondence to Chi Hyun Lee.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 256 KB)

Appendices

Appendix A: Regularity conditions

We assume the following regularity conditions for the large sample properties:

  1. \((Y_i,A_i,\delta _i,\varvec{Z}_i)\) are independent and identically distributed for \(i=1,\ldots ,n\).

  2. The parameter vector \(\varvec{\beta }_0\) lies in the interior of a known compact set.

  3. The covariates \(\varvec{Z}\) are bounded, and \(\pmb {\alpha }^\top \varvec{Z}=0\) with probability one implies \(\pmb {\alpha }=0\).

  4. The baseline cumulative hazard function \(\varLambda _0(\cdot )\) is differentiable and \(\varLambda _0(\tau )<\infty \), where \(\tau \) satisfies \(\Pr (Y\ge \tau )>0\).

  5. \(\varGamma (\varvec{\beta })\) is positive definite.

  6. \(0<w_C(\tau )<\infty \) and \(\int _0^\tau \left[ \left\{ \int _t^\tau S_C(u)\mathrm {d}u\right\} ^2/\left\{ S_C^2(t)S_V(t)\right\} \right] \mathrm {d}S_C(t)<\infty \), where \(S_V(\cdot )\) is the survival function of the residual survival time.

Appendix B: Proof of Theorem 1

Applying Taylor series expansions around \(\varvec{\beta }_0\), we have

$$\begin{aligned}&n^{-1/2}\sum _{i=1}^n\int _0^t\widehat{E}_{\varvec{Z}}(\varvec{\beta },u,\varvec{z})\mathrm {d}N_i(u) \nonumber \\&\quad =n^{-1/2}\sum _{i=1}^n\int _0^t\widehat{E}_{\varvec{Z}} (\varvec{\beta }_0,u,\varvec{z})\mathrm {d}N_i(u)\nonumber \\&\qquad +n^{-1} \sum _{i=1}^n\frac{\partial }{\partial \varvec{\beta }}\int _0^t\widehat{E}_{\varvec{Z}}(\varvec{\beta }_0,u,\varvec{z})\mathrm {d}N_i(u)\sqrt{n}(\varvec{\beta }-\varvec{\beta }_0)+o_p(1), \nonumber \\&\quad =n^{-1/2}\sum _{i=1}^n\int _0^t\widehat{E}_{\varvec{Z}} (\varvec{\beta }_0,u,\varvec{z})\mathrm {d}N_i(u) \nonumber \\&\qquad +\,n^{-1}\sum _{i=1}^n\int _0^t\left[ \frac{\widehat{S}_{\varvec{Z}}^{(1)}(\varvec{\beta }_0,u,\varvec{z})}{\widehat{S}^{(0)}(\varvec{\beta }_0,u)}-\frac{\widehat{S}_{\varvec{Z}}^{(0)}(\varvec{\beta }_0,u,\varvec{z})\widehat{S}^{(1)}(\varvec{\beta }_0,u)}{\{\widehat{S}^{(0)}(\varvec{\beta }_0,u)\}^2}\right] \mathrm {d}N_i(u)\sqrt{n}(\varvec{\beta }-\varvec{\beta }_0)\nonumber \\&\qquad +o_p(1), \end{aligned}$$
(5)

and

$$\begin{aligned}&n^{-1/2}\sum _{i=1}^n\int _0^\tau \left\{ \varvec{Z}_i-\widehat{E}(\varvec{\beta },u)\right\} \mathrm {d}N_i(u) \nonumber \\&\quad =n^{-1/2}\sum _{i=1}^n\int _0^\tau \left\{ \varvec{Z}_i-\widehat{E}(\varvec{\beta }_0,u)\right\} \mathrm {d}N_i(u)+n^{-1} \nonumber \\&\quad \qquad \sum _{i=1}^n\frac{\partial }{\partial \varvec{\beta }}\int _0^\tau \left\{ \varvec{Z}_i-\widehat{E}(\varvec{\beta }_0,u)\right\} \mathrm {d}N_i(u) \sqrt{n}(\varvec{\beta }-\varvec{\beta }_0)+o_p(1) \nonumber \\&\quad =n^{-1/2}\sum _{i=1}^n\int _0^\tau \left\{ \varvec{Z}_i-\widehat{E}(\varvec{\beta }_0,u)\right\} \mathrm {d}N_i(u) \nonumber \\&\qquad -\,n^{-1}\sum _{i=1}^n\int _0^\tau \left[ \frac{\widehat{S}^{(2)}(\varvec{\beta }_0,u)}{\widehat{S}^{(0)}(\varvec{\beta }_0,u)}-\left\{ \frac{\widehat{S}^{(1)}(\varvec{\beta }_0,u)}{\widehat{S}^{(0)}(\varvec{\beta }_0,u)}\right\} ^2\right] \mathrm {d}N_i(u)\sqrt{n}(\varvec{\beta }-\varvec{\beta }_0)+o_p(1). \end{aligned}$$
(6)

The stochastic process \(\varvec{G}(t,\varvec{z})\) can be decomposed into two terms, \(\varvec{G}_1(t,\varvec{z})\) and \(\varvec{G}_2(t,\varvec{z})\), as follows:

$$\begin{aligned}&\varvec{G}(t,\varvec{z}) \\&=\sum _{i=1}^n f(\varvec{Z}_i) I(\varvec{Z}_i\le \varvec{z})\widehat{M}_i(t) \\&=\sum _{i=1}^n \Big[ f(\varvec{Z}_i)I(\varvec{Z}_i\le \varvec{z})N_i(t) \\&\qquad -f(\varvec{Z}_i)I(\varvec{Z}_i\le \varvec{z})\int _0^t \widehat{w}_C(u)R_i(u)\{\widehat{w}_C(Y_i)\}^{-1}\exp (\widehat{\varvec{\beta }}^\top \varvec{Z}_i)\mathrm {d}\widehat{\varLambda }_0(u)\Big] \\&=\sum _{i=1}^n \left[ f(\varvec{Z}_i)I(\varvec{Z}_i\le \varvec{z})N_i(t)-\int _0^t\frac{\widehat{S}_{\varvec{Z}}^{(0)}(\widehat{\varvec{\beta }},u,\varvec{z})}{\widehat{S}^{(0)}(\widehat{\varvec{\beta }},u)}\mathrm {d}N_i(u)\right] \\&=\sum _{i=1}^n \int _0^t \left\{ f(\varvec{Z}_i)I(\varvec{Z}_i\le \varvec{z})- E_Z(\varvec{\beta }_0,u,\varvec{z})\right\} \mathrm {d}N_i(u) \\&\quad +\sum _{i=1}^n \int _0^t \left\{ E_Z(\varvec{\beta }_0,u,\varvec{z}) - \widehat{E}_Z(\widehat{\varvec{\beta }},u,\varvec{z})\right\} \mathrm {d}N_i(u) \\&=\varvec{G}_1(t,\varvec{z})+\varvec{G}_2(t,\varvec{z}). \end{aligned}$$
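As a rough illustration of the first line of this display, the sketch below evaluates \(\sum _i f(\varvec{Z}_i)I(\varvec{Z}_i\le \varvec{z})\widehat{M}_i(t)\) on a grid for a single scalar covariate. It assumes the residual processes \(\widehat{M}_i(t)\), including the estimated weights \(\widehat{w}_C\), have already been computed and are supplied as an array. The function name `cumulative_residual_process` and the default \(f\equiv 1\) are illustrative choices, not taken from the paper.

```python
import numpy as np

def cumulative_residual_process(Z, M_hat, z_grid, f=None):
    """Evaluate G(t, z) = sum_i f(Z_i) * I(Z_i <= z) * M_hat_i(t) on a grid.

    Z      : (n,) array, a single scalar covariate (simplifying assumption).
    M_hat  : (n, T) array; M_hat[i, j] is the estimated martingale residual
             of subject i at the j-th time grid point (assumed precomputed).
    z_grid : (K,) array of covariate thresholds z.
    f      : optional function applied to Z; defaults to f(Z) = 1.
    Returns a (T, K) array whose [j, k] entry is G(t_j, z_k).
    """
    Z = np.asarray(Z, dtype=float)
    M_hat = np.asarray(M_hat, dtype=float)
    z_grid = np.asarray(z_grid, dtype=float)
    fz = np.ones_like(Z) if f is None else np.asarray(f(Z), dtype=float)
    # indicator[i, k] = 1 if Z_i <= z_k, else 0
    indicator = (Z[:, None] <= z_grid[None, :]).astype(float)
    # Sum over subjects i of M_hat[i, j] * f(Z_i) * indicator[i, k]
    return M_hat.T @ (fz[:, None] * indicator)
```

Plotting one column of the result against the time grid, or one row against `z_grid`, gives the kind of graphical check described in the abstract, under the stated assumptions.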

We use Taylor expansions and empirical process approximation techniques. The first term can be approximated by

$$\begin{aligned} n^{-1/2}\varvec{G}_1(t,\varvec{z})&=n^{-1/2}\sum _{i=1}^n \int _0^t \left\{ f(\varvec{Z}_i)I(\varvec{Z}_i\le \varvec{z})- E_Z(\varvec{\beta }_0,u,\varvec{z})\right\} \mathrm {d}N_i(u) \\&=n^{-1/2}\sum _{i=1}^n \int _0^t\left\{ f(\varvec{Z}_i)I(\varvec{Z}_i\le \varvec{z})-e_Z(\varvec{\beta }_0,u,\varvec{z})\right\} \mathrm {d}M_i(u)+o_p(1). \end{aligned}$$

Then, we re-express the second term based on equations (5) and (6).

$$\begin{aligned} n^{-1/2}\varvec{G}_2(t,\varvec{z})&=n^{-1/2}\sum _{i=1}^n \int _0^t E_Z(\varvec{\beta }_0,u,\varvec{z}) \mathrm {d}N_i(u) \\&\quad -n^{-1/2}\sum _{i=1}^n \int _0^t \widehat{E}_Z(\widehat{\varvec{\beta }},u,\varvec{z}) \mathrm {d}N_i(u) \\&=n^{-1/2}\sum _{i=1}^n\int _0^t \left\{ E_Z(\varvec{\beta }_0,u,\varvec{z}) - \widehat{E}_Z(\varvec{\beta }_0,u,\varvec{z})\right\} \mathrm {d}N_i(u) \\&\quad -\,\varGamma _Z(\varvec{\beta }_0,t,\varvec{z})\sqrt{n}(\widehat{\varvec{\beta }}-\varvec{\beta }_0) +o_p(1) \\&=n^{-1/2}\sum _{i=1}^n \int _0^t \left\{ E_Z(\varvec{\beta }_0,u,\varvec{z})-\widehat{E}_Z(\varvec{\beta }_0,u,\varvec{z})\right\} \mathrm {d}N_i(u) \\&\quad +\,\varGamma _Z(\varvec{\beta }_0,t,\varvec{z}) \{\varGamma (\varvec{\beta }_0)\}^{-1}n^{-1/2}\widehat{U}(\varvec{\beta }_0)+o_p(1) \end{aligned}$$

where

$$\begin{aligned} \varGamma _Z(\varvec{\beta },t,\varvec{z})=\text{ E }\left\{ \int _0^t\left[ \frac{{S}_{\varvec{Z}}^{(1)}(\varvec{\beta },u,\varvec{z})}{{S}^{(0)}(\varvec{\beta },u)}-\frac{{S}_{\varvec{Z}}^{(0)}(\varvec{\beta },u,\varvec{z}){S}^{(1)}(\varvec{\beta },u)}{\{{S}^{(0)}(\varvec{\beta },u)\}^2}\right] \mathrm {d}N_i(u)\right\} \end{aligned}$$

and

$$\begin{aligned} \varGamma (\varvec{\beta })=-\text{ E }\left\{ \int _0^\tau \left[ \frac{{S}^{(2)}(\varvec{\beta },u)}{{S}^{(0)}(\varvec{\beta },u)}-\left\{ \frac{{S}^{(1)}(\varvec{\beta },u)}{{S}^{(0)}(\varvec{\beta },u)}\right\} ^2\right] \mathrm {d}N_i(u)\right\} . \end{aligned}$$
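The sign change between the second and third equalities for \(n^{-1/2}\varvec{G}_2(t,\varvec{z})\) can be checked directly. The following is a sketch under two assumptions that are not restated in this appendix: that \(\widehat{U}(\varvec{\beta })=\sum _{i=1}^n\int _0^\tau \{\varvec{Z}_i-\widehat{E}(\varvec{\beta },u)\}\mathrm {d}N_i(u)\) is the estimating function with \(\widehat{U}(\widehat{\varvec{\beta }})=0\), and that the sample average in the last line of Eq. (6) converges in probability to \(-\varGamma (\varvec{\beta }_0)\). Evaluating Eq. (6) at \(\widehat{\varvec{\beta }}\) then gives

$$\begin{aligned} 0=n^{-1/2}\widehat{U}(\widehat{\varvec{\beta }})&=n^{-1/2}\widehat{U}(\varvec{\beta }_0)+\varGamma (\varvec{\beta }_0)\sqrt{n}(\widehat{\varvec{\beta }}-\varvec{\beta }_0)+o_p(1), \\ \sqrt{n}(\widehat{\varvec{\beta }}-\varvec{\beta }_0)&=-\{\varGamma (\varvec{\beta }_0)\}^{-1}n^{-1/2}\widehat{U}(\varvec{\beta }_0)+o_p(1), \end{aligned}$$

so that \(-\varGamma _Z(\varvec{\beta }_0,t,\varvec{z})\sqrt{n}(\widehat{\varvec{\beta }}-\varvec{\beta }_0)=\varGamma _Z(\varvec{\beta }_0,t,\varvec{z})\{\varGamma (\varvec{\beta }_0)\}^{-1}n^{-1/2}\widehat{U}(\varvec{\beta }_0)+o_p(1)\).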

The second equality in the expansion of \(n^{-1/2}\varvec{G}_2(t,\varvec{z})\) follows from Eq. (5). The third equality follows by using Eq. (6) to substitute for \(\sqrt{n}(\widehat{\varvec{\beta }}-\varvec{\beta }_0)\), after some algebra (Qin and Shen 2010). Note that the leading term in the last expression can be rewritten as

$$\begin{aligned}&n^{-1/2}\sum _{i=1}^n \int _0^t \left\{ E_Z(\varvec{\beta }_0,u,\varvec{z})-\widehat{E}_Z(\varvec{\beta }_0,u,\varvec{z})\right\} \mathrm {d}N_i(u) \\&=n^{-1/2}\sum _{i=1}^n \int _0^t \left\{ \frac{S_Z^{(0)}(\varvec{\beta }_0,u,\varvec{z})-\widehat{S}_Z^{(0)}(\varvec{\beta }_0,u,\varvec{z})}{S^{(0)}(\varvec{\beta }_0,u)}\right\} \mathrm {d}N_i(u)+o_p(1) \\&=n^{-1/2}\sum _{i=1}^n\int _0^t \frac{\sum _{k=1}^nf(\varvec{Z}_k)I(\varvec{Z}_k\le \varvec{z})R_k(u)\exp (\varvec{\beta }_0^\top \varvec{Z}_k)\left\{ \frac{1}{w_C(Y_k)}-\frac{1}{\widehat{w}_C(Y_k)}\right\} }{\sum _{k=1}^nR_k(u)\{w_C(Y_k)\}^{-1}\exp (\varvec{\beta }_0^\top \varvec{Z}_k)}\,\mathrm {d}N_i(u)\!+\!o_p(1)\\&=n^{-1/2}\sum _{i=1}^n\int _0^t \frac{\sum _{k=1}^n f(\varvec{Z}_k)I(\varvec{Z}_k\le \varvec{z})w_C(u) R_k(u) \exp (\varvec{\beta }_0^\top \varvec{Z}_k)}{nS^{(0)}(\varvec{\beta }_0,u)} \frac{\left\{ \widehat{w}_C(Y_k)\!-\!w_C(Y_k)\right\} }{\{\widehat{w}_C(Y_k)\}^{2}}\mathrm {d}N_i(u)\! +\!o_p(1) \\&=n^{-1/2}\sum _{i=1}^n \int _0^t H(\varvec{\beta }_0,u)\frac{\mathrm {d}M_{C_i}(u)}{\pi (u)}+o_p(1), \end{aligned}$$

where

$$\begin{aligned} H(\varvec{\beta },t)&=\lim _{n\rightarrow \infty } \frac{1}{n^2}\sum _{i=1}^n\sum _{k=1}^n \\&\quad \frac{f(\varvec{Z}_k)I(\varvec{Z}_k\le \varvec{z})w_C(Y_i) R_k(Y_i) \exp (\varvec{\beta }^\top \varvec{Z}_k)\{{w}_C(Y_k)\}^{-2}h_k(t)}{S^{(0)}(\varvec{\beta },Y_i)} \\ M_{C_i}(t)&=I(V_i\le t,\delta _i=0)-\int _0^tI(V_i\ge u)\mathrm {d}\varLambda _C(u) \\ h_k(t)&=I(Y_k\ge t)\int _t^{Y_k}S_C(u)\mathrm {d}u \\ \pi (t)&=S_C(t)S_V(t), \end{aligned}$$

in which \(\varLambda _C(t)\) is the cumulative hazard function of the residual censoring time and \(S_V(t)\) is the survival function of the residual survival time. The last equality can be obtained by expressing \(\{\widehat{w}_C(y)-w_C(y)\}\) as a sum of i.i.d. martingale terms (Pepe and Fleming 1991). Finally, the general class of stochastic processes \(\varvec{G}(t,\varvec{z})\) can be asymptotically represented by

$$\begin{aligned} n^{-1/2}\varvec{G}(t,\varvec{z})&=n^{-1/2}\sum _{i=1}^n \int _0^t\left\{ f(\varvec{Z}_i)I(\varvec{Z}_i\le \varvec{z})-e_Z(\varvec{\beta }_0,u,\varvec{z})\right\} \mathrm {d}M_i(u) \nonumber \\&\quad +\,n^{-1/2}\sum _{i=1}^n \int _0^t H(\varvec{\beta }_0,u)\frac{\mathrm {d}M_{C_i}(u)}{\pi (u)} \nonumber \\&\quad +\,\varGamma _Z(\varvec{\beta }_0,t,\varvec{z}) \{\varGamma (\varvec{\beta }_0)\}^{-1}n^{-1/2} \sum _{i=1}^n\int _0^\infty \left\{ \varvec{Z}_i-e(\varvec{\beta }_0,u)\right\} \mathrm {d}M_i(u)+o_p(1) \end{aligned}$$
(7)
$$\begin{aligned}&=n^{-1/2}\sum _{i=1}^n\varvec{G}_i^*(t,\varvec{z})+o_p(1). \end{aligned}$$
(8)

Under the regularity conditions, for any given \(\varvec{z}\), \(\varvec{G}_i^*(t,\varvec{z})\) is a mean zero process that is bounded on \([0,\tau ]\) and belongs to a Donsker class (Kosorok 2008). Thus, as \(n\rightarrow \infty \), the normalized sum \(n^{-1/2}\sum _{i=1}^n\varvec{G}_i^*(t,\varvec{z})\) in (8) converges weakly to a mean zero Gaussian process with asymptotic covariance function \(\text{ E }\left\{ \varvec{G}_i^*(t_1,\varvec{z}_1)\varvec{G}_i^*(t_2,\varvec{z}_2)^\top \right\} \).
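The covariance function above has a natural sample analogue. The sketch below is a minimal illustration that assumes a scalar-valued process and that estimated influence terms \(\widehat{\varvec{G}}_i^*\) have been evaluated on a finite grid of \((t,\varvec{z})\) points elsewhere. The function name `empirical_covariance` and the array layout are hypothetical.

```python
import numpy as np

def empirical_covariance(G_star):
    """Plug-in covariance n^{-1} sum_i G*_i(a) G*_i(b) over grid points a, b.

    G_star : (n, P) array; row i holds subject i's estimated influence
             process at P grid points (t, z), flattened into one axis.
             These values are assumed to be computed elsewhere.
    Returns a (P, P) symmetric matrix. No centering is applied because the
    influence terms have mean zero in the limit.
    """
    G_star = np.asarray(G_star, dtype=float)
    n = G_star.shape[0]
    return G_star.T @ G_star / n
```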

Appendix C: Proof of Theorem 2

We note that \(\varGamma _Z(\varvec{\beta }_0,t,\varvec{z}) \{\varGamma (\varvec{\beta }_0)\}^{-1}\) in the third term of (7) converges in probability to a non-random function. Conditional on the observed data, the process (7) is a linear combination of independent normally distributed processes with mean zero. Thus, given that \(\widehat{\varvec{\beta }}\) is a consistent estimator for \(\varvec{\beta }_0\), we can show that \(n^{-1}\sum _{i=1}^n\widehat{\varvec{G}}_i^*(t_1,\varvec{z}_1)\widehat{\varvec{G}}_i^*(t_2,\varvec{z}_2)^\top \) converges in probability to the asymptotic covariance function \(\text{ E }\left\{ \varvec{G}_i^*(t_1,\varvec{z}_1)\varvec{G}_i^*(t_2,\varvec{z}_2)^\top \right\} \) as \(n\rightarrow \infty \). By applying the multiplier central limit theorem (Kosorok 2008), it follows that conditional on the observed data and \(\varvec{z}\), \(n^{-1/2}\widetilde{\varvec{G}}_m(t,\varvec{z})\) and \(n^{-1/2}\sum _{i=1}^n{\varvec{G}}_i^*(t,\varvec{z})\) converge to the same mean zero Gaussian process.
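The resampling idea behind this result can be sketched in a few lines. The code assumes that \(\widetilde{\varvec{G}}_m(t,\varvec{z})=\sum _{i=1}^n\widehat{\varvec{G}}_i^*(t,\varvec{z})\xi _i\) with independent standard normal multipliers \(\xi _i\) drawn independently of the data; the definition of \(\widetilde{\varvec{G}}_m\) is not restated in this appendix, so this form is an assumption. It also assumes a scalar-valued process evaluated on a finite grid and a supremum-type statistic. The function name `multiplier_sup_test` and its arguments are hypothetical.

```python
import numpy as np

def multiplier_sup_test(G_star, observed_sup, n_resamples=1000, seed=0):
    """Approximate the null distribution of sup |n^{-1/2} G(t, z)|.

    G_star       : (n, P) array of estimated influence processes on a grid
                   (assumed precomputed; scalar-valued process).
    observed_sup : observed sup |n^{-1/2} G(t, z)| over the same grid.
    Returns (p_value, sup_draws), where sup_draws has length n_resamples.
    """
    G_star = np.asarray(G_star, dtype=float)
    n = G_star.shape[0]
    rng = np.random.default_rng(seed)
    # xi[m, i]: independent standard normal multipliers, one per subject
    xi = rng.standard_normal((n_resamples, n))
    # Row m equals n^{-1/2} * sum_i xi[m, i] * G_star[i, :]
    resampled = xi @ G_star / np.sqrt(n)
    # Supremum of the absolute resampled process over the grid
    sup_draws = np.abs(resampled).max(axis=1)
    # Proportion of resampled suprema at least as large as the observed one
    p_value = float(np.mean(sup_draws >= observed_sup))
    return p_value, sup_draws
```

A small p-value indicates that the observed process is more extreme than most resampled realizations, which points to possible misspecification of the feature being checked.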


Cite this article

Lee, C.H., Ning, J. & Shen, Y. Model diagnostics for the proportional hazards model with length-biased data. Lifetime Data Anal 25, 79–96 (2019). https://doi.org/10.1007/s10985-018-9422-y


  • DOI: https://doi.org/10.1007/s10985-018-9422-y
