RETRACTED ARTICLE: Robust Model Selection and Estimation for Censored Survival Data with High Dimensional Genomic Covariates

  • Regular Article
Acta Biotheoretica

This article was retracted on 27 February 2020

This article has been updated

Abstract

When relating genomic data to survival outcomes, three main challenges arise: censored survival outcomes, the high dimensionality of the genomic data, and non-normality of the data. We propose a method that tackles these challenges simultaneously and robustly detects genes significantly related to survival outcomes, based on the accelerated failure time (AFT) model. Specifically, we incorporate a general loss function into the AFT model, adopt model regularization and shrinkage techniques, address parameter tuning and model selection, and develop an algorithm based on a unified Expectation–Maximization approach for easy implementation. Simulation results demonstrate the advantages of the proposed method over existing methods when the data have heavy-tailed errors and correlated covariates. Two real case studies on patients illustrate the application of the proposed method.

Change history

  • 27 February 2020

    The authors have retracted this article [1] because they found a fundamental mistake in the methodology that is not correctable at this time. This mistake is found in the methodology and the derivation of the model with Tukey and Huber's losses. Because of the error, the findings in the article are not reliable. All authors agree to this retraction.

Acknowledgements

The first author's research was supported by the Fundamental Research Funds for the Central Universities (No. BLX201609).

Author information

Corresponding author

Correspondence to Huanxue Pan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1. The E–M algorithm iteration process:

$$\begin{aligned} \begin{aligned} Q^{(m)}(\theta )&= \sum _{i \in C} E_{T_i}((T_i-\alpha -X'_i\beta )^2|\theta ^{(m-1)}, T_i> Y_i) + \sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \\&=\sum _{i \in C} E_{\epsilon _i}((\alpha ^{(m-1)}+ X'_i\beta ^{(m-1)}+\epsilon _i-\alpha -X'_i\beta )^2|\theta ^{(m-1)}, \epsilon _i> e_i) \\&\quad +\sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \\&=\sum _{i \in C} \frac{\int _{e_i}^\infty (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+\epsilon _i-\alpha -X'_i\beta )^2 f(\epsilon _i)d\epsilon _i}{1-{\hat{F}}(e_i)} \\&\quad +\sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \\&=\sum _{i \in C} \frac{\sum _{j>i} m_j (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )^2 }{\sum _{j>i} m_j }\\&\quad + \sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \\&=\sum _{i \in C} \sum _{j>i} w_{ij} (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )^2 \\&\quad + \sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \end{aligned} \end{aligned}$$
(36)

where \(\epsilon _i=T_i-\alpha ^{(m-1)}-X'_i\beta ^{(m-1)}\), \(m_i=\frac{\delta _i}{n}\prod _{j<i}(\frac{n-j+1}{n-j})^{1-\delta _j}\) is the jump of the Kaplan–Meier type estimator of the CDF at the \(i\)-th sorted residual, \(e_i=Y_i-\alpha ^{(m-1)}-X'_i\beta ^{(m-1)}\), and \(w_{ij}=\frac{m_j}{\sum _{k>i}m_k}\) for \(j>i\), so that \(\sum _{j>i}w_{ij}=1\). We update \(\theta ^{(m)}=\arg \min _\theta Q^{(m)}(\theta )\). Setting the derivatives of \(Q^{(m)}(\theta )\) with respect to \(\alpha\) and \(\beta\) to zero gives

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}\sum _{j>i}w_{ij} \frac{\partial (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )^2}{\partial \alpha }+\sum _{i \in D}\frac{\partial (T_i-\alpha -X'_i\beta )^2}{\partial \alpha }=0\\ \sum _{i \in C}\sum _{j>i}w_{ij} \frac{\partial (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )^2}{\partial \beta }+\sum _{i \in D}\frac{\partial (T_i-\alpha -X'_i\beta )^2}{\partial \beta }=0 \end{array}\right. } \end{aligned}$$
(37)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}\sum _{j>i}w_{ij} (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta ) \\ \quad \quad \quad \quad +\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}\sum _{j>i}w_{ij} (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )X_i \\ \quad \quad \quad \quad +\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$
(38)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}(\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+\sum _{j>i}w_{ij}e_j-\alpha -X'_i\beta )\\ \quad \quad \quad \quad +\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}(\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+\sum _{j>i}w_{ij}e_j-\alpha -X'_i\beta )X_i\\ \quad \quad \quad \quad +\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$
(39)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}(Y^{**}_i-\alpha -X'_i\beta )+\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}(X_i Y^{**}_i-X_i\alpha -X^2_i\beta )+\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$
(40)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i=1}^n(Y^*_i-\alpha -X'_i\beta )=0\\ \sum _{i=1}^n(X_i Y^*_i-X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$
(41)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \alpha =\overline{Y}^*-\beta \overline{X}\\ \beta =\frac{\sum _{i=1}^n(X_i-\overline{X})(Y^*_i-\overline{Y}^*)}{\sum _{i=1}^n(X_i-\overline{X})^2} \end{array}\right. } \end{aligned}$$
(42)

where \(Y^*_i=\delta _i T_i+(1-\delta _i)Y^{**}_i\), \(Y^{**}_i=\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+\sum _{j>i}w_{ij}e_j\), for \(i \in C\), and \(\overline{Y}^*=\frac{\sum _{i \in C}Y^{**}_i+\sum _{i \in D}T_i}{n}\).
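The iteration in Eqs. (36)–(42) can be sketched in a few lines of Python. This is only a minimal illustration of the squared-loss, single-covariate case written out above, not the authors' implementation. The function names `km_jumps` and `em_step` are hypothetical, ties among residuals are assumed absent, and a censored residual with no Kaplan–Meier mass to its right is left at its censored value (an assumption the appendix does not address).

```python
import numpy as np

def km_jumps(delta):
    """Jump masses m_i for residuals already sorted in increasing order.

    delta[i] is 1 for an observed event and 0 for a censored observation.
    """
    n = len(delta)
    m = np.zeros(n)
    factor = 1.0  # running product over j < i of ((n-j+1)/(n-j))^(1-delta_j)
    for i in range(n):  # i is 0-based; the appendix index is i + 1
        m[i] = delta[i] / n * factor
        j = i + 1  # 1-based index used in the appendix formula
        if j < n:  # the factor for j = n is never needed
            factor *= ((n - j + 1) / (n - j)) ** (1 - delta[i])
    return m

def em_step(y, x, delta, alpha, beta):
    """One update (alpha, beta) -> (alpha_new, beta_new) for scalar x."""
    y, x, delta = (np.asarray(a, dtype=float) for a in (y, x, delta))
    e = y - alpha - beta * x                # residuals e_i at the observed times
    order = np.argsort(e, kind="stable")    # sort residuals (no ties assumed)
    e_s, d_s = e[order], delta[order]
    m = km_jumps(d_s)
    # Sums over j > i of m_j and of m_j * e_j, via reversed cumulative sums.
    tail_mass = np.cumsum(m[::-1])[::-1] - m
    tail_sum = np.cumsum((m * e_s)[::-1])[::-1] - m * e_s
    has_mass = tail_mass > 1e-12
    # sum_j w_ij e_j; fall back to e_i when no mass lies to the right.
    cond_s = np.where(has_mass, tail_sum / np.where(has_mass, tail_mass, 1.0), e_s)
    cond = np.empty_like(e)
    cond[order] = cond_s                    # back to the original ordering
    # Y*_i: observed T_i if uncensored, imputed Y**_i if censored.
    y_star = np.where(delta == 1, y, alpha + beta * x + cond)
    xc = x - x.mean()
    beta_new = np.sum(xc * (y_star - y_star.mean())) / np.sum(xc ** 2)
    alpha_new = y_star.mean() - beta_new * x.mean()
    return alpha_new, beta_new
```

Calling `em_step` repeatedly, feeding each output back in as the next `(alpha, beta)`, mirrors the update \(\theta ^{(m-1)}\rightarrow \theta ^{(m)}\). With no censoring the step reduces to ordinary least squares.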

2. Proof of Proposition 1:

$$\begin{aligned} \begin{aligned} Q(\theta )&=\sum _{i \in C} E_{T_i}(L(T_i, \theta )|\theta , T_i> Y_i) + \sum _{i \in D} L(T_i, \theta ) \\&= \sum _{i \in C} E_{T_i}((T_i-\alpha -X'_i\beta )^2|\theta , T_i> Y_i) + \sum _{i \in D} L(T_i, \theta ) \\&=\sum _{i \in C} E_{\epsilon _i}(({\hat{\alpha }}+X'_i{\hat{\beta }}+\epsilon _i-\alpha -X'_i\beta )^2|\theta , \epsilon _i> e_i) + \sum _{i \in D} L(T_i, \theta ) \\&=\sum _{i \in C} \frac{\int _{e_i}^\infty ({\hat{\alpha }}+X'_i{\hat{\beta }}+\epsilon _i-\alpha -X'_i\beta )^2 f(\epsilon _i)d\epsilon _i}{1-{\hat{F}}(e_i)} + \sum _{i \in D} L(T_i, \theta ) \\&=\sum _{i \in C} \frac{\sum _{j>i} m_j ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )^2 }{\sum _{j>i} m_j } + \sum _{i \in D} L(T_i, \theta ) \\&=\sum _{i \in C} \sum _{j>i} w_{ij} ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )^2 + \sum _{i \in D} L(T_i, \theta ) \end{aligned} \end{aligned}$$
(43)

where \(\epsilon _i=T_i-{\hat{\alpha }}-X'_i{\hat{\beta }}\), \(m_i=\frac{\delta _i}{n}\prod _{j<i}(\frac{n-j+1}{n-j})^{1-\delta _j}\) is the jump of the Kaplan–Meier type estimator of the CDF at the \(i\)-th sorted residual, \(e_i=Y_i-{\hat{\alpha }}-X'_i{\hat{\beta }}\), and \(w_{ij}=\frac{m_j}{\sum _{k>i}m_k}\) for \(j>i\), so that \(\sum _{j>i}w_{ij}=1\). The estimate of \(\theta\) is \(\arg \min _\theta Q(\theta )\) subject to a penalty. When there is no penalty, \(\theta =(\alpha , \beta )'\) satisfies

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}\sum _{j>i}w_{ij} \frac{\partial ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )^2}{\partial \alpha }+\sum _{i \in D}\frac{\partial (T_i-\alpha -X'_i\beta )^2}{\partial \alpha }=0\\ \sum _{i \in C}\sum _{j>i}w_{ij} \frac{\partial ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )^2}{\partial \beta }+\sum _{i \in D}\frac{\partial (T_i-\alpha -X'_i\beta )^2}{\partial \beta }=0 \end{array}\right. } \end{aligned}$$
(44)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}\sum _{j>i}w_{ij} ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )+\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}\sum _{j>i}w_{ij} ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )X_i+\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$
(45)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}({\hat{\alpha }}+X'_i{\hat{\beta }}+\sum _{j>i}w_{ij}e_j-\alpha -X'_i\beta )+\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}({\hat{\alpha }}+X'_i{\hat{\beta }}+\sum _{j>i}w_{ij}e_j-\alpha -X'_i\beta )X_i+\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$
(46)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}(Y^{**}_i-\alpha -X'_i\beta )+\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}(X_i Y^{**}_i-X_i\alpha -X^2_i\beta )+\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$
(47)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i=1}^n(Y^*_i-\alpha -X'_i\beta )=0\\ \sum _{i=1}^n(X_i Y^*_i-X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$
(48)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \alpha =\overline{Y}^*-\beta \overline{X}\\ \beta =\frac{\sum _{i=1}^n(X_i-\overline{X})(Y^*_i-\overline{Y}^*)}{\sum _{i=1}^n(X_i-\overline{X})^2} \end{array}\right. } \end{aligned}$$
(49)

where \(Y^*_i=\delta _i T_i+(1-\delta _i)Y^{**}_i\), \(Y^{**}_i={\hat{\alpha }}+X'_i{\hat{\beta }}+\sum _{j>i}w_{ij}e_j\), for \(i \in C\), and \(\overline{Y}^*=\frac{\sum _{i \in C}Y^{**}_i+\sum _{i \in D}T_i}{n}\).

End of proof.
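
As a small worked illustration of the weights used in the proof, consider a hypothetical data set (not taken from the article) of \(n=3\) sorted residuals with censoring indicators \(\delta =(1,0,1)\), so that only the second observation is censored. The formula for \(m_i\) then gives

$$\begin{aligned} m_1&=\tfrac{1}{3}, \qquad m_2=0, \\ m_3&=\tfrac{1}{3}\left( \tfrac{3}{2}\right) ^{1-\delta _1}\left( \tfrac{2}{1}\right) ^{1-\delta _2}=\tfrac{1}{3}\cdot 1\cdot 2=\tfrac{2}{3}, \\ w_{23}&=\frac{m_3}{m_3}=1, \qquad Y^{**}_2={\hat{\alpha }}+X'_2{\hat{\beta }}+e_3 . \end{aligned}$$

The masses sum to one. The mass \(1/3\) that the censored second residual would otherwise carry is moved to the third, and the censored response is imputed from the only residual to its right.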

3. Estimated coefficients for the ovarian carcinoma data (see Table 8).

4. Estimated coefficients for the cervical squamous cell carcinoma data (see Table 9).

Table 8 Estimated coefficients with four types of loss function for the 100 screened genes, based on 134 training samples
Table 9 Estimated coefficients with four types of loss function for the 20 screened genes, based on 59 training samples

Cite this article

Chen, G., Wang, S., Sun, G. et al. RETRACTED ARTICLE: Robust Model Selection and Estimation for Censored Survival Data with High Dimensional Genomic Covariates. Acta Biotheor 67, 225–251 (2019). https://doi.org/10.1007/s10441-019-09349-9

  • DOI: https://doi.org/10.1007/s10441-019-09349-9
