
Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariate measurement error

  • Original paper
  • Published:
Computational Statistics

Abstract

Feature screening is an important method for reducing dimension and capturing informative variables in ultrahigh-dimensional data analysis. Its key idea is to select informative variables using correlations between the response and the covariates. Many methods have been developed for feature screening. These methods, however, are challenged by complex features of the data collection process as well as the nature of the data themselves. In particular, incomplete responses caused by right censoring and covariate measurement error often accompany survival analysis. Although many methods have been proposed for censored data, little work is available for settings where an incomplete response and measurement error occur simultaneously. In addition, conventional feature screening methods may fail to detect truly important covariates that are marginally independent of the response variable because of correlations among covariates. In this paper, we explore this important problem and propose a model-free feature screening method that accommodates a censored response and error-prone covariates. We also develop an iterative procedure to improve the accuracy of selecting all important covariates. Numerical studies are reported to assess the performance of the proposed method. Finally, we apply the proposed method to a real dataset.


References

  • Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) 2nd international symposium on information theory. Akademiai Kiado, Budapest, pp 267–281
  • Buckley J, James I (1979) Linear regression with censored data. Biometrika 66:429–436
  • Candes E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Ann Stat 35:2313–2404
  • Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2006) Measurement error in nonlinear models. CRC Press, New York
  • Chen L-P (2018) Semiparametric estimation for the accelerated failure time model with length-biased sampling and covariate measurement error. Stat 7:e209. https://doi.org/10.1002/sta4.209
  • Chen L-P (2019a) Pseudo likelihood estimation for the additive hazards model with data subject to left-truncation and right-censoring. Stat Interface 12:135–148
  • Chen L-P (2019b) Semiparametric estimation for cure survival model with left-truncated and right-censored data and covariate measurement error. Stat Probab Lett 154:108547. https://doi.org/10.1016/j.spl.2019.06.023
  • Chen L-P (2019c) Statistical analysis with measurement error or misclassification: strategy, method and application by Grace Y. Yi. Biometrics 75:1045–1046. https://doi.org/10.1111/biom.13130
  • Chen L-P (2020) Semiparametric estimation for the transformation model with length-biased data and covariate measurement error. J Stat Comput Simul 90:420–442. https://doi.org/10.1080/00949655.2019.1687700
  • Chen L-P, Yi GY (2020) Semiparametric methods for left-truncated and right-censored survival data with covariate measurement error. Ann Inst Stat Math (to appear). https://doi.org/10.1007/s10463-020-00755-2
  • Chen X, Chen X, Wang H (2018) Robust feature screening for ultra-high dimensional right censored data via distance correlation. Comput Stat Data Anal 119:118–138
  • Chen X, Zhang Y, Chen X, Liu Y (2019) A simple model-free survival conditional feature screening. Stat Probab Lett 146:156–160
  • Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:409–499
  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J R Stat Soc Ser B 70:849–911
  • Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:1829–1853
  • Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604
  • Fan J, Feng Y, Wu Y (2010) Ultrahigh dimensional variable selection for Cox’s proportional hazards model. IMS Collect 6:70–86
  • Hall P, Miller H (2009) Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graph Stat 18:533–550
  • Lawless JF (2003) Statistical models and methods for lifetime data. Wiley, New York
  • Li R, Zhong W, Zhu L (2012) Feature screening via distance correlation learning. J Am Stat Assoc 107:1129–1139
  • Miller RG (1981) Survival analysis. Wiley, New York
  • Rocke DM, Durbin B (2001) A model for measurement error for gene expression arrays. J Comput Biol 8:557–569
  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
  • Song R, Lu W, Ma S, Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101:799–814
  • Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35:2769–2794
  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
  • van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R (2002) A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347:1999–2009
  • Yan X, Tang N, Zhao X (2017) The Spearman rank correlation screening for ultrahigh dimensional censored data. arXiv:1702.02708v1
  • Zhong W, Zhu L (2015) An iterative approach to distance correlation-based sure independence screening. J Stat Comput Simul 85:2331–2345
  • Zhu L, Li L, Li R, Zhu L (2011) Model-free feature screening for ultrahigh-dimensional data. J Am Stat Assoc 106:1464–1475
  • Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1418–1429
  • Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67:301–320

Author information

Corresponding author

Correspondence to Li-Pang Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Proof of Theorem 3.1

We first consider \(\text {dcov} \left( Y^*, X_k \right) \) and \(\text {dcov}^*\left( Y^*, X_k^*\right) \) for the kth component of X with \(k=1,\ldots ,p\). Note that the former is based on the true covariate \(X_k\), while the latter is based on the surrogate covariate \(X_k^*\).

Since the error term \(\epsilon _k\), the kth component of \(\epsilon \), follows a normal distribution \(N(0,\sigma _{\epsilon ,kk})\), its characteristic function is given by

$$\begin{aligned} E\left\{ \exp \left( \mathbf {i} s \epsilon _k \right) \right\} = \exp \left( -\frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) . \end{aligned}$$
(A.1)

By direct derivation, we have

$$\begin{aligned} \phi ^*_{X_k^*}(s)= & {} E\left\{ \exp \left( \mathbf {i}s X_k^*\right) \right\} \exp \left( \frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) \nonumber \\= & {} E\left\{ \exp \left( \mathbf {i}s X_k \right) \right\} E\left\{ \exp \left( \mathbf {i}s \epsilon _k \right) \right\} \exp \left( \frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) \nonumber \\= & {} E\left\{ \exp \left( \mathbf {i}s X_k \right) \right\} , \end{aligned}$$
(A.2)

where the second equality is due to the independence of \(X_k\) and \(\epsilon _k\), and the last equality is due to (A.1).

In addition, we can derive

$$\begin{aligned} \phi ^*_{Y^*,X_k^*}(r,s)= & {} E\left\{ \exp \left( \mathbf {i}rY^*+ \mathbf {i}s X_k^*\right) \right\} \exp \left( \frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) \nonumber \\= & {} E\left\{ \exp \left( \mathbf {i}rY^*+ \mathbf {i}s X_k \right) \right\} E\left\{ \exp \left( \mathbf {i}s \epsilon _k \right) \right\} \exp \left( \frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) \nonumber \\= & {} E\left\{ \exp \left( \mathbf {i}rY^*+ \mathbf {i}s X_k \right) \right\} , \end{aligned}$$
(A.3)

where the second equality is due to the independence of \(\epsilon _k\) and \((X_k, Y^*)\), and the last equality again comes from (A.1). As a result, combining (A.2) and (A.3) with \(\text {dcov}^*\left( Y^*, X_k^*\right) \) gives the same expression as \(\text {dcov} \left( Y^*, X_k \right) \).

The equivalence of \(\text {dcov}^*\left( X_k^*, X_k^*\right) \) and \(\text {dcov} \left( X_k, X_k \right) \) holds by similar derivations. Therefore, we conclude that \(\text {dcorr} \left( Y^*, X_k \right) \) and \(\text {dcorr}^*\left( Y^*, X_k^*\right) \) are equivalent in the sense that \(\text {dcorr} \left( Y^*, X_k \right) > 0\) if and only if \(\text {dcorr}^*\left( Y^*, X_k^*\right) > 0\). Consequently, the same active features can be determined for \(X^*\) and X. \(\square \)
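As a sanity check (not part of the original proof), the correction underlying (A.1) and (A.2) can be verified by simulation: multiplying the empirical characteristic function of the surrogate \(X_k^*\) by \(\exp (\frac{1}{2} s^2 \sigma _{\epsilon ,kk})\) recovers, up to Monte Carlo error, the characteristic function of the true \(X_k\). The following minimal sketch illustrates this; the distribution of \(X_k\), the grid of s values, and the value of \(\sigma _{\epsilon ,kk}\) are illustrative choices.

```python
# Monte Carlo check of (A.2): exp(s^2 sigma/2) times the empirical characteristic
# function (ECF) of X_k^* matches the ECF of the true X_k, up to sampling error.
import numpy as np

rng = np.random.default_rng(0)
n, sigma_kk = 200_000, 0.5                               # sigma_kk plays sigma_{eps,kk}
x = rng.gamma(shape=2.0, scale=1.0, size=n)              # true covariate X_k
x_star = x + rng.normal(0.0, np.sqrt(sigma_kk), size=n)  # surrogate X_k^* = X_k + eps_k

for s in (0.2, 0.5, 1.0):
    ecf_true = np.mean(np.exp(1j * s * x))
    ecf_corr = np.mean(np.exp(1j * s * x_star)) * np.exp(0.5 * s**2 * sigma_kk)
    print(f"s={s}: |difference| = {abs(ecf_true - ecf_corr):.4f}")  # small differences
```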

Appendix B Overview of error correction in the Cox model

In this appendix, we outline the strategy for correcting covariate measurement error when fitting the Cox model. This approach comes from Chen and Yi (2020) and is used to fit the Cox model with covariate measurement error in Sect. 4.3.

For \(i=1,\ldots ,n\), let \(Y_i\) and \(\delta _i\) denote the survival time and the censoring indicator defined in Sect. 2.1. Let \(X_i\) denote the q-dimensional vector of unobserved covariates retained after the feature screening procedure, with \(q<n\), and let \(X_i^*\) be the surrogate version of \(X_i\). Based on the Cox model (17) and the unobserved covariates \(X_i\), the likelihood function is given by (e.g., Lawless 2003)

$$\begin{aligned} L(\gamma ) = \prod \limits _{i=1}^{n} \left\{ \lambda _0(Y_i) \exp \left( X_i^\top \gamma \right) \right\} ^{\delta _i} \exp \left\{ - \Lambda _0(Y_i) \exp \left( X_i^\top \gamma \right) \right\} , \end{aligned}$$
(B.1)

where \(\Lambda _0(t) = \int _0^t \lambda _0(u)du\) is the cumulative baseline hazard function.

Let \(\ell (\gamma ) = \log L(\gamma )\). Since \(\ell (\gamma )\) involves the \(X_i\), whose measurements are unavailable, we modify \(\ell (\gamma )\) into a new function, say \(\ell ^*(\gamma )\), of the observed measurements and the model parameters, so that its conditional expectation equals \(\ell (\gamma )\):

$$\begin{aligned} E\left\{ \ell ^*(\gamma )|\mathbb {X},\mathbb {C},\mathbb {T}\right\} = \ell (\gamma ), \end{aligned}$$
(B.2)

where the expectation is taken with respect to the conditional distribution of \(\mathbb {X}^*\) given \(\left\{ \mathbb {X}, \mathbb {C}, \mathbb {T} \right\} \), where \(\mathbb {X}^*= \{X_1^*,\ldots ,X_n^*\}\), \(\mathbb {X} = \{X_1,\ldots ,X_n\}\), \(\mathbb {C} = \{C_1,\ldots ,C_n\}\), and \(\mathbb {T} = \{T_1,\ldots ,T_n\}\). Such a strategy is useful in yielding an unbiased estimating function and is sometimes called the “corrected” likelihood method or the insertion correction approach (e.g., Carroll et al. 2006, Section 7.4).

Noting that the \(X_i\) appear in \(\ell (\gamma )\) in linear and exponential forms, we define

$$\begin{aligned} \ell ^*(\gamma )= & {} \sum _{i=1}^{n} \bigg [ \delta _i \log \lambda _0 (Y_i) + \delta _i (X_i^{*\top } \gamma ) - \Lambda _0 (Y_i) \exp (X_i^{*\top } \gamma ) \left\{ m(\gamma )\right\} ^{-1} \bigg ],\qquad \end{aligned}$$
(B.3)

where \(m(z) = \exp (\frac{1}{2} z^\top \Sigma _{\epsilon } z)\) and \(\Sigma _{\epsilon }\) is defined in Sect. 2.2. Since \(E\left\{ \exp \left( X_i^{*\top } \gamma \right) | X_i \right\} = \exp \left( X_i^{\top } \gamma \right) E\left\{ \exp \left( \epsilon _i^{\top } \gamma \right) \right\} = \exp \left( X_i^{\top } \gamma \right) m(\gamma )\) and \(E\left( X_i^{*\top } \gamma | X_i \right) = X_i^{\top } \gamma \), it is easily seen that \(\ell ^*(\gamma )\) satisfies (B.2).

To use (B.3) to derive an estimator of \(\gamma \), we need to deal with the baseline hazard function \(\lambda _0 \left( \cdot \right) \) and its cumulative function \(\Lambda _0 \left( \cdot \right) \). First, we discretize \(\Lambda _0 \left( \cdot \right) \) so that \(\lambda _0 (t)\) is nonzero only at the observed times \(t = Y_i\) for \(i = 1,\ldots ,n\), and \(\lambda _0 (t) =0\) otherwise. Let \(\lambda _i \) denote \(\lambda _0 (Y_i)\) for \(i = 1,\ldots ,n\). Then \(\Lambda _0 (t)\) is taken as \(\sum \nolimits _{i=1}^{n} \mathbb {I}(Y_i \leqslant t) \lambda _i\). Next, given \(\gamma \), we solve \(\frac{\partial \ell ^*(\gamma )}{\partial \lambda _i} = 0\) for \(i = 1,\ldots ,n\), which leads to an estimator of \(\lambda _i\), given by

$$\begin{aligned} \widehat{\lambda }_i = \frac{\delta _i}{\sum \limits _{k=1}^{n} \mathbb {I}(Y_i \le Y_k) \exp { \left( X_k^{*\top } \gamma \right) } \left\{ m(\gamma ) \right\} ^{-1}}\ \text{ for } i=1,...,n; \end{aligned}$$
(B.4)

and the corresponding estimator of the cumulative baseline hazard function:

$$\begin{aligned} \widehat{\Lambda }_0 (t) = \sum _{i=1}^{n} \mathbb {I}(Y_i \le t) \widehat{\lambda }_i \ . \end{aligned}$$
(B.5)
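For concreteness, the following is a minimal sketch of (B.4) and (B.5) in Python. It is our own illustration rather than the authors' code: the function names and data interface are hypothetical, and \(\Sigma _{\epsilon }\) is treated as known.

```python
# Sketch of (B.4)-(B.5): given gamma, profile out the discretized baseline hazards.
# Names (m, profile_hazard, cum_hazard) are illustrative; Sigma_eps is assumed known.
import numpy as np

def m(z, sigma_eps):
    """m(z) = exp(z' Sigma_eps z / 2), as defined after (B.3)."""
    return np.exp(0.5 * z @ sigma_eps @ z)

def profile_hazard(gamma, y, delta, x_star, sigma_eps):
    """hat{lambda}_i in (B.4): delta_i over the corrected risk-set sum."""
    w = np.exp(x_star @ gamma) / m(gamma, sigma_eps)   # exp(X_k^{*T} gamma) / m(gamma)
    risk = np.array([w[y >= t].sum() for t in y])      # sum over k with Y_k >= Y_i
    return delta / risk

def cum_hazard(t, lam, y):
    """hat{Lambda}_0(t) in (B.5): sum of hat{lambda}_i over observed times Y_i <= t."""
    return lam[y <= t].sum()
```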

Finally, plugging (B.4) and (B.5) into (B.3) gives the function

$$\begin{aligned} \widehat{\ell }^*(\gamma )= & {} \sum \limits _{i=1}^{n} \left[ \delta _i \log \widehat{\lambda }_i + \delta _i (X_i^{*\top } \gamma ) - \widehat{\Lambda }_0 (Y_i) \exp (X_i^{*\top } \gamma ) \left\{ m(\gamma )\right\} ^{-1} \right] . \end{aligned}$$

An estimator of \(\gamma \) is then obtained by maximizing \(\widehat{\ell }^*(\gamma )\):

$$\begin{aligned} \widehat{\gamma } = {\mathop {\mathrm{argmax}}\limits _{\gamma }} \widehat{\ell }^*(\gamma ). \end{aligned}$$
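Putting the pieces together, the following self-contained sketch (again our own illustration, with simulated data, hypothetical names, and \(\Sigma _{\epsilon }\) treated as known) profiles out the \(\lambda _i\) via (B.4) and (B.5) inside \(\widehat{\ell }^*(\gamma )\) and then maximizes \(\widehat{\ell }^*(\gamma )\) numerically. In practice the covariate matrix would be the screened, error-prone covariates of Sect. 4.3 rather than simulated data.

```python
# End-to-end sketch of the corrected Cox estimator of Appendix B (illustrative only):
# profile the discretized baseline hazards via (B.4)-(B.5), then maximize hat{ell}*(gamma).
import numpy as np
from scipy.optimize import minimize

def m(z, sigma_eps):
    return np.exp(0.5 * z @ sigma_eps @ z)

def neg_profile_loglik(gamma, y, delta, x_star, sigma_eps):
    w = np.exp(x_star @ gamma) / m(gamma, sigma_eps)
    lam = delta / np.array([w[y >= t].sum() for t in y])   # hat{lambda}_i, (B.4)
    Lambda0 = np.array([lam[y <= t].sum() for t in y])     # hat{Lambda}_0(Y_i), (B.5)
    eta = x_star @ gamma                                   # X_i^{*T} gamma
    log_lam = np.log(np.where(delta > 0, lam, 1.0))        # censored terms contribute 0
    return -np.sum(delta * (log_lam + eta)
                   - Lambda0 * np.exp(eta) / m(gamma, sigma_eps))

# Simulated illustration: q = 2 covariates retained after screening, additive error.
rng = np.random.default_rng(1)
n, q = 300, 2
sigma_eps = 0.2 * np.eye(q)
x = rng.normal(size=(n, q))                                          # true covariates
x_star = x + rng.multivariate_normal(np.zeros(q), sigma_eps, size=n) # surrogates
t_event = rng.exponential(1.0 / np.exp(x @ np.array([0.8, -0.5])))   # Cox event times
c = rng.exponential(2.0, size=n)                                     # censoring times
y, delta = np.minimum(t_event, c), (t_event <= c).astype(float)

fit = minimize(neg_profile_loglik, x0=np.zeros(q),
               args=(y, delta, x_star, sigma_eps), method="BFGS")
print("corrected estimate of gamma:", fit.x)
```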


About this article


Cite this article

Chen, LP. Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariate measurement error. Comput Stat 36, 857–884 (2021). https://doi.org/10.1007/s00180-020-01039-2


  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00180-020-01039-2
