Sure independence screening in the presence of missing data

Zambom, Adriano Zanin; Matthews, Gregory J.

doi:10.1007/s00362-019-01115-w

Regular Article
Published: 29 May 2019

Volume 62, pages 817–845, (2021)

Statistical Papers

Adriano Zanin Zambom¹ &
Gregory J. Matthews²

Abstract

Variable selection in ultra-high dimensional data sets is an increasingly prevalent issue with the readily available data arising from, for example, genome-wide association studies or gene expression data. When the dimension of the feature space is exponentially larger than the sample size, it is desirable to screen out unimportant predictors in order to bring the dimension down to a moderate scale. In this paper we consider the case when observations of the predictors are missing at random. We propose performing screening using the marginal linear correlation coefficient between each predictor and the response variable, accounting for the missing data using maximum likelihood estimation. This method is shown to have the sure screening property. Moreover, a novel method of screening that uses additional predictors when estimating the correlation coefficient is proposed. Simulations show that simply performing screening using pairwise complete observations is outperformed by both of the proposed methods and is not recommended. Finally, the proposed methods are applied to a gene expression study on prostate cancer.

References

Abdulghani J, Gu L, Dagvadorj A, Lutz J, Leiby B, Bonuccelli G et al (2008) Stat3 promotes metastatic progression of prostate cancer. Am J Pathol 172(6):1717–1728
Anderson T (1957) Maximum-likelihood estimation for the multivariate normal distribution when some observations are missing. J Am Stat Assoc 52:200–203
Anderson TW (1984) An introduction to multivariate statistical analysis. Wiley, Hoboken
Attouch M, Laksaci A, Messabihi N (2017) Nonparametric relative error regression for spatial random variables. Stat Pap 58(4):987–1008
Barnett GC, Thompson D, Fachal L, Kerns S, Talbot C, Elliott RM et al (2014) A genome wide association study (GWAS) providing evidence of an association between common genetic variants and late radiotherapy toxicity. Radiother Oncol 111(2):178–185
Beebe-Dimmer J, Hathcock M, Yee C, Okoth L, Isaacs W, Cooney K et al (2015) The HOXB13 G84E mutation is associated with an increased risk for prostate cancer and other malignancies. Cancer Epidemiol Biomarkers Prev 24(9):1366–1372
Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4):1165–1188
Browning SR (2008) Missing data imputation and haplotype phase inference for genome-wide association studies. Hum Genet 124(5):439–450
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
Castro E, Eeles R (2012) The role of BRCA1 and BRCA2 in prostate cancer. Asian J Androl 14(3):409–414
Cheema J (2014) A review of missing data handling methods in education research. Rev Educ Res 84(4):487–508
Chen Q, Wang S (2013) Variable selection for multiply imputed data with application to dioxin exposure study. Stat Med 32(21):3646–3659
Chen X, Chen X, Liu Y (2017) A note on quantile feature screening via distance correlation. Stat Pap. https://doi.org/10.1007/s00362-017-0894-8
Claeskens G, Consentino F (2008) Variable selection with incomplete covariate data. Biometrics 64:1062–1069
Dai J, Ruczinski I, LeBlanc M, Kooperberg C (2006) Imputation methods to improve inference in SNP association studies. Genet Epidemiol 30(8):690–702
Dang Y, Chang C, Ido M, Long Q (2016) Multiple imputation for general missing data patterns in the presence of high-dimensional data. J R Stat Soc Ser B (Methodological) 39(1):1–38
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodological) 39(1):1–38
Deters KD, Nho K, Risacher SL, Kim S, Ramanan VK, Crane PK et al (2017) Genome-wide association study of language performance in Alzheimer’s disease. Brain Lang 172:22–29
Easton DF, Pooley KA, Dunning AM, Pharoah PDP, Thompson D, Ballinger DG et al (2007) Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447(7148):1087–1093
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Elkashef A, Allison S, Sadiq M, Basheer H, Morais G, Loadman P et al (2016) Polysialic acid sustains cancer cell survival and migratory capacity in a hypoxic environment. Sci Rep 6:33026
Ewing CM, Ray AM, Lange EM, Zuhlke KA, Robbins CM, Tembe WD et al (2012) Germline mutations in HOXB13 and prostate-cancer risk. N Engl J Med 366(2):141–149 PMID: 22236224
Faisal S, Tutz G (2017) Missing value imputation for gene expression data by tailored nearest neighbors. Stat Appl Genet Mol Biol 16(2):95–106
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol 70:849–911
Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20:101–148
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:1829–1853
Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106(494):544–557
Faria R, Gomes M, Epstein D, White I (2014) A guide to handling missing data in cost-effectiveness analysis conducted within randomised controlled trials. Pharmacoeconomics 32(12):1157–1170
Fletcher O, Johnson N, Orr N, Hosking FJ, Gibson LJ, Walker K et al (2011) Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. J Natl Cancer Inst 103(5):425–435
Garcia RI, Ibrahim JG, Zhu H (2010a) Variable selection in the Cox regression model with covariates missing at random. Biometrics 66:97–104
Garcia RI, Ibrahim JG, Zhu H (2010b) Variable selection for regression models with missing data. Stat Sin 20:149–165
Greenshtein E, Ritov Y (2004) Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10:971–988
Haffmann E, Sorenson B, Sauter D, Lambert I (2015) Role of volume-regulated and calcium-activated anion channels in cell volume homeostasis, cancer and drug resistance. Channels (Austin) 9(6):380–396
Harel O, Zhou X (2007) Multiple imputation: review of theory, implementation, and software. Stat Med 26(16):3057–3077
Harel O, Pellowski J, Kalichman S (2012) Are we missing the importance of missing values in HIV prevention randomized clinical trials? Reviews and recommendations. AIDS Behav 16(6):1382–1393
Hernandez-Caballero M, Sierra-Ramirez J (2015) Single nucleotide polymorphisms of the fto gene and cancer risk: an overview. Mol Biol Rep 42(3):699–704
Horowitz JL (2015) Variable selection and estimation in high-dimensional models. Can J Econ 48(2):389–407
Horton N, Kleinman K (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90
Ibrahim JG, Lipsitz SR, Chen MH (2001) Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika 88:551–564
Ibrahim JG, Zhu H, Tang N (2008) Model selection criteria for missing-data problems using the EM algorithm. J Am Stat Assoc 103:1648–1658
Karimi O, Mohammadzadeh M (2012) Bayesian spatial regression models with closed skew normal correlated errors and missing observations. Stat Pap 53(1):205–218
Komatsu J, Ichikawa D, Hirajima S, Nagata H, Nishimura Y, Kawaguchi T et al (2015) Overexpression of SMYD2 contributes to malignant outcome in gastric cancer. Br J Cancer 112:357–364
Kowalski J, Tu XM (2007) Modern applied U statistics. Wiley, New York
Lai P, Liu Y, Liu Z, Wan Y (2017) Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput Stat Data Anal 105(C):201–216
Lansangan JRG, Barrios EB (2017) Simultaneous dimension reduction and variable selection in modeling high dimensional data. Comput Stat Data Anal 112:242–256
Law MH, Bishop DT, Lee JE, Brossard M, Martin NG, Moses EK et al (2015) Genome-wide meta-analysis identifies five new susceptibility loci for cutaneous malignant melanoma. Nat Genet 47(9):987–995
Li R, Zhong W, Zhu L (2012a) Feature screening via distance correlation learning. J Am Stat Assoc 107(499):1129–1139 PMID: 25249709
Li Z, Gopal V, Li X, Davis J, Casella G (2012b) Simultaneous snp identification in association studies with missing data. Ann Appl Stat 6(2):432–456
Liew A, Law N, Yan H (2011) Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Brief Bioinform 12(5):498–513
Little R, Rubin D (2002) Statistical analysis with missing data. Wiley series in probability and statistics. Wiley, Chichester
Liu J, Li R, Wu R (2014) Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J Am Stat Assoc 109(505):266–274
Liu Y, Wang Y, Feng Y, Wall M (2016) Variable selection and prediction with incomplete high-dimensional data. Ann Appl Stat 10(1):418–450
Long Q, Johnson B (2015) Variable selection in the presence of missing data: resampling and imputation. Biostatistics 16(3):596–610
Lu J, Lin L (2017) Model-free conditional screening via conditional distance correlation. Stat Pap. https://doi.org/10.1007/s00362-017-0931-7
Luo M, Gong C, Chen C, Hu H, Huang P, Zheng M et al (2015) The Rab2A GTPase promotes breast cancer stem cells and tumorigenesis via Erk signaling activation. Cell Rep 11(1):111–124
Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39:906–913
Mills I (2014) HOXB13, RFX6 and prostate cancer risk. Nat Genet 46:94–95
Nagy R, Boutin TS, Marten J, Human JE, Kerr SM, Campbell A et al (2017) Exploration of haplotype research consortium imputation for genome-wide association studies in 20,032 generation scotland participants. Hum Genet 9(1):23
Neykov NM, Filzmoser P, Neytchev PN (2014) Ultrahigh dimensional variable selection through the penalized maximum trimmed likelihood estimator. Stat Pap 55(1):187–207
Paik MC, Tsai W (1997) On using Cox proportional hazard model with missing covariates. Biometrika 84:579–593
Pencik J, Schlederer M, Gruber W, Unger C, Walker SM, Chalaris A et al (2015) Stat3 regulated ARF expression suppresses prostate cancer metastasis. Nat Commun 6:7736
Pilie P, Giri V, Cooney K (2016) Hoxb13 and other high penetrant genes for prostate cancer. Asian J Androl 18(4):530–532
Pritchard CC, Mateo J, Walsh MF, De Sarkar N, Abida W, Beltran H et al (2016) Inherited DNA-repair gene mutations in men with metastatic prostate cancer. N Engl J Med 375(5):443–453 PMID: 27433846
Rabier C-E, Azaïs J-M, Elsen J-M, Delmas C (2016) Chi-square processes for gene mapping in a population with family structure. Stat Pap 60(1):239–271
Rahaman M, Kumarasiri M, Mekonnen L, Yu M, Diab S, Albrecht H et al (2016) Targeting CDK9: a promising therapeutic opportunity in prostate cancer. Endocr Relat Cancer 23(12):T211–T226
Rubin D (1987) Multiple imputation for nonresponse in surveys. Wiley series in probability and mathematical statistics. Wiley, New York
Serfling RJ (1980) Approximation theorems of mathematical statistics. Wiley series in probability and statistics. Wiley, New York
Shen C-W, Chen Y-H (2012) Model selection for generalized estimating equations accommodating dropout missingness. Biometrics 68:1046–1054
Suhre K, Arnold M, Bhagwat AM, Cotton RJ, Engelke R, Raffler J et al (2017) Connecting genetic risk to disease end points through the human blood plasma proteome. Nat Commun 8:14357
Tang N, Xia L, Yan X (2018) Feature screening in ultrahigh-dimensional partially linear models with missing responses at random. Comput Stat Data Anal 133:208–227
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288
Tomlins SA, Laxman B, Dhanasekaran SM, Helgeson BE, Cao X, Morris DS et al (2007) Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature 448:595–599
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525
Trust W (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature 447:661–678
Wang Q, Li Y (2018) How to make model-free feature screening approaches for full data applicable to the case of missing response? Scand J Stat 45(2):324–346
Wang S, Nan B, Rosset S, Zhu J (2011) Random lasso. Ann Appl Stat 5:468–485
Wang X, Inzunza H, Chang H, Qi Z, Hu B, Malone D et al (2013) Mutations in the hedgehog pathway genes SMO and PTCH1 in human gastric tumors. PLoS ONE 8(1):e54415
Wasserman L, Roeder K (2009) High-dimensional variable selection. Ann Stat 37(5A):2178–2201
Yan Q, Brehm J, Pino-Yanes M, Forno E, Lin J, Oh SS et al (2017) A meta-analysis of genome-wide association studies of asthma in Puerto Ricans. Eur Respir J 49(5):1601505
Yang H, Liu H (2016) Penalized weighted composite quantile estimators with missing covariates. Stat Pap 57(1):69–88
Yang X, Belin TR, Boscardin WJ (2005) Imputation and variable selection in linear regression models with missing covariates. Biometrics 61:498–506
Yang H, Guo C, Lv J (2016) Variable selection for generalized varying coefficient models with longitudinal data. Stat Pap 57(1):115–132
Yoon D, Lee E, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(Suppl 2):S6
Zambom AZ, Akritas MG (2018) Hypothesis testing sure independence screening for nonparametric regression. Electron J Stat 12(1):767–792
Zhao Y, Long Q (2016) Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res 25(5):2021–2035
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1418–1429

Author information

Authors and Affiliations

California State University, Northridge, Northridge, USA
Adriano Zanin Zambom
Loyola University Chicago, Chicago, USA
Gregory J. Matthews

Corresponding author

Correspondence to Adriano Zanin Zambom.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of Theorem 2.1

Proof

Recall that the log-likelihood of $\boldsymbol{\phi}_j$ is

$$\begin{aligned}&\ell (\boldsymbol{\phi}_j \mid \{(X_{ij}, Y_i)\}_{i = 1}^{n_j}, \{Y_k\}_{k = n_j+1}^{n})\\&\quad = -\frac{1}{2\sigma _{j \cdot y}^2}\sum _{i=1}^{n_j}(X_{ij} - \mu _{j \cdot y} - \beta _{j \cdot y}Y_i)^2 - \frac{n_j\log (\sigma _{j \cdot y}^2)}{2}\\&\qquad -\frac{1}{2\sigma _{y}^2}\sum _{i=1}^n(Y_i - \mu _{y})^2 - \frac{n\log (\sigma _{y}^2)}{2}. \end{aligned}$$

The inverse Hessian of the negative log-likelihood, evaluated at the estimated parameters, is

$$\begin{aligned} H^{-1}_{\boldsymbol{\phi}_j}|_{\hat{\boldsymbol{\phi}}_j} = \begin{bmatrix} \hat{\sigma}_{y}^2/n & 0 & 0 & 0 & 0 \\ 0 & 2\hat{\sigma}_{y}^4/n & 0 & 0 & 0 \\ 0 & 0 & \hat{\sigma}_{j \cdot y}^2(1 + \bar{y}^2/s_{y}^2)/n_j & -\bar{y}\hat{\sigma}_{j \cdot y}^2/(n_js_{y}^2) & 0 \\ 0 & 0 & -\bar{y}\hat{\sigma}_{j \cdot y}^2/(n_js_{y}^2) & \hat{\sigma}_{j \cdot y}^2/(n_js_{y}^2) & 0 \\ 0 & 0 & 0 & 0 & 2\hat{\sigma}_{j \cdot y}^4/n_j \end{bmatrix}, \end{aligned}$$

so that the large-sample covariance matrix for $\boldsymbol{\theta}_j$ can be written as $D(\rho _j)H^{-1}_{\boldsymbol{\phi}_j}|_{\hat{\boldsymbol{\phi}}_j}D(\rho _j)^T$, where $D(\rho _j) = \left( \frac{\partial \rho _j}{\partial \mu _y}, \frac{\partial \rho _j}{\partial \sigma _y^2}, \frac{\partial \rho _j}{\partial \mu _{j \cdot y}}, \frac{\partial \rho _j}{\partial \beta _{j \cdot y}}, \frac{\partial \rho _j}{\partial \sigma _{j \cdot y}^2}\right) $. It can be shown that

$$\begin{aligned}&D(\rho _j) {=} \left( 0, \frac{\sigma _{j \cdot y}^2\beta _{j \cdot y}}{2\sqrt{\sigma _{y}^2}(\beta _{j \cdot y}^2\sigma _{y}^2 + \sigma _{j \cdot y}^2)^{3/2}}, 0, \frac{\sigma _{j \cdot y}^2\sqrt{\sigma _{y}^2}}{(\beta _{j \cdot y}^2\sigma _{y}^2 + \sigma _{j \cdot y}^2)^{3/2}}, -\frac{\beta _{j \cdot y}\sqrt{\sigma _{y}^2}}{2(\beta _{j \cdot y}^2\sigma _{y}^2 + \sigma _{j \cdot y}^2)^{3/2}}\right) , \end{aligned}$$

and hence, substituting $\hat{\beta}_{j \cdot y} = s_{jy}/s_y^2$ and $\hat{\sigma}_{j \cdot y}^2 = s_j^2 - s_{jy}^2/s_y^2$ and writing $\tilde{\rho}_j = s_{jy}/(s_js_y)$, one finds

$$\begin{aligned}&D(\rho _j)H^{-1}_{\boldsymbol{\phi}_j}|_{\hat{\boldsymbol{\phi}}_j}D(\rho _j)^T \\&\quad = \frac{\hat{\sigma}_{j \cdot y}^4\hat{\beta}_{j \cdot y}^2}{4\hat{\sigma}_{y}^2(\hat{\beta}_{j \cdot y}^2\hat{\sigma}_{y}^2 + \hat{\sigma}_{j \cdot y}^2)^{3}}\frac{2\hat{\sigma}_y^4}{n} + \frac{\hat{\sigma}_{j \cdot y}^4\hat{\sigma}_{y}^2}{(\hat{\beta}_{j \cdot y}^2\hat{\sigma}_{y}^2 + \hat{\sigma}_{j \cdot y}^2)^{3}}\frac{\hat{\sigma}_{j \cdot y}^2}{n_js_{y}^2} \\&\qquad + \frac{\hat{\beta}_{j \cdot y}^2\hat{\sigma}_{y}^2}{4(\hat{\beta}_{j \cdot y}^2\hat{\sigma}_{y}^2 + \hat{\sigma}_{j \cdot y}^2)^{3}}\frac{2\hat{\sigma}_{j \cdot y}^4}{n_j}\\&\quad = \Big [2\hat{\sigma}_y^4s_y^2n_j(s_j^2-s_{jy}^2/s_y^2)^2s_{jy}^2/s_y^4 + 4n\hat{\sigma}_y^4(s_j^2-s_{jy}^2/s_y^2)^3\\&\qquad + 2ns_y^2\hat{\sigma}_y^4(s_j^2-s_{jy}^2/s_y^2)^2s_{jy}^2/s_y^4\Big ]/\left[ 4nn_j\hat{\sigma}_y^2s_y^2(\hat{\sigma}_y^2s_{jy}^2/s_y^4+s_j^2-s_{jy}^2/s_y^2)^3\right] \\&\quad = \frac{(1-\tilde{\rho}_j^2)^2\hat{\sigma}_y^4\left[ 2s_y^2n_j\tilde{\rho}_j^2s_j^6/s_y^2 + 4ns_j^6(1-\tilde{\rho}_j^2) + 2ns_y^2s_j^6\tilde{\rho}_j^2/s_y^2\right] }{4nn_js_y^2\hat{\sigma}_y^2s_j^6\left( \tilde{\rho}_j^2(\hat{\sigma}_y^2/s_y^2 - 1)+1\right) ^3}\\&\quad = (1 - \tilde{\rho}_j^2)^2\left( \frac{\hat{\sigma}_y^2}{s_y^2}\right) \left( \frac{1}{nn_j}\right) \left( \frac{\tilde{\rho}_j^2(n_j-n)/2 + n}{\left( \tilde{\rho}_j^2\left( \frac{\hat{\sigma}_y^2}{s_y^2} - 1\right) + 1\right) ^3}\right) . \end{aligned}$$

Since $\hat{\rho}_j$ is the maximum likelihood estimator computed from a Normal distribution, it follows that $[D(\rho _j)H^{-1}_{\boldsymbol{\phi}_j}|_{\hat{\boldsymbol{\phi}}_j}D(\rho _j)^T]^{-1/2}(\hat{\rho}_j - \rho _j)$ converges to a standard Normal distribution. $\square $
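As a rough illustration, the closed form on the last line of the display above can be evaluated directly. The sketch below is not the authors' code; the function name and argument names are hypothetical. It takes $\tilde{\rho}_j$, $\hat{\sigma}_y^2$, $s_y^2$, $n$ and $n_j$ as inputs. As a sanity check, when nothing is missing ($n_j = n$ and $\hat{\sigma}_y^2 = s_y^2$) the expression reduces to the classical large-sample variance $(1-\rho^2)^2/n$ of a sample correlation under bivariate normality.

```python
def asymptotic_variance_rho(rho_tilde, sigma_y2_hat, s_y2, n, n_j):
    """Evaluate the large-sample variance of the ML correlation estimate.

    rho_tilde    : correlation computed from the n_j complete pairs
    sigma_y2_hat : variance estimate of Y from all n responses
    s_y2         : variance estimate of Y from the n_j complete pairs
    (argument names are illustrative, not taken from the paper)
    """
    ratio = sigma_y2_hat / s_y2
    numerator = rho_tilde ** 2 * (n_j - n) / 2.0 + n
    denominator = (rho_tilde ** 2 * (ratio - 1.0) + 1.0) ** 3
    return (1.0 - rho_tilde ** 2) ** 2 * ratio * numerator / (n * n_j * denominator)


# No missing data: reduces to (1 - rho^2)^2 / n.
var_full = asymptotic_variance_rho(0.5, 1.0, 1.0, n=100, n_j=100)

# 40 of the 100 predictor values missing: the variance grows.
var_partial = asymptotic_variance_rho(0.5, 1.0, 1.0, n=100, n_j=60)
```

Dividing $\hat{\rho}_j - \rho_j$ by the square root of this quantity gives the standardized statistic whose limiting distribution is stated in the theorem.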

Proof of Theorem 2.2

Proof

First note that the estimated covariance $s_{jy}$ based on the completely observed pairs is, except for a scale factor of $(n_j-1)/n_j$, a U-statistic (Kowalski and Tu 2007):

$$\begin{aligned} s_{jy} &= \frac{1}{n_j}\sum _{i=1}^{n_{j}}(Y_{i} - \bar{Y})(X_{ij} - \bar{X}_j) = \frac{n_j -1}{n_j}\binom{n_j}{2}^{-1}\sum _{i < k}^{n_j}\frac{1}{2}(Y_i - Y_k)(X_{ij} - X_{kj})\\ &= \frac{n_j -1}{n_j} \frac{1}{n_j(n_j-1)}\sum _{i\ne k}^{n_j} h_j(Y_i, Y_k, X_{ij}, X_{kj}) := \frac{n_j -1}{n_j}s_{jy}^*, \end{aligned}$$

where $\bar{X}_j = n_j^{-1}\sum _{i=1}^{n_j}X_{ij}$, with $\bar{Y}$ likewise averaged over the $n_j$ complete pairs, and $h_j(Y_i, Y_k, X_{ij}, X_{kj}) = \frac{1}{2}(Y_i - Y_k)(X_{ij} - X_{kj})$ is the kernel of the U-statistic $s_{jy}^*$. Note that $E(s_{jy}^*) = \sigma _{jy} := \sigma _j\sigma _y\rho _j$.
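The U-statistic representation can be checked numerically. The sketch below is an illustration on simulated data, not part of the paper. It assumes the convention of summing over unordered pairs $i < k$ with the factor $1/2$ inside the summand, and the function name is hypothetical. Under that convention the U-statistic equals the unbiased sample covariance, and multiplying by $(n_j-1)/n_j$ recovers $s_{jy}$.

```python
import itertools

import numpy as np


def covariance_u_statistic(x, y):
    """Average of 0.5 * (y_i - y_k) * (x_i - x_k) over unordered pairs i < k."""
    n = len(x)
    total = 0.0
    for i, k in itertools.combinations(range(n), 2):
        total += 0.5 * (y[i] - y[k]) * (x[i] - x[k])
    return total / (n * (n - 1) / 2)


# Simulated complete pairs (illustrative data only).
rng = np.random.default_rng(0)
n_j = 30
x = rng.normal(size=n_j)
y = 0.6 * x + rng.normal(size=n_j)

u_stat = covariance_u_statistic(x, y)                # plays the role of s*_jy
s_jy = np.mean((y - y.mean()) * (x - x.mean()))      # covariance with divisor n_j
unbiased_cov = np.cov(x, y)[0, 1]                    # covariance with divisor n_j - 1
```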

We follow steps similar to those in Li et al. (2012a). First write

$$\begin{aligned}&s_{jy}^* = s_{jy,1}^{*} + s_{jy,2}^{*}\\&\quad := \frac{1}{n_j(n_j-1)}\sum _{i\ne k}^{n_j} h_j(Y_i, Y_k, X_{ij}, X_{kj})I(h_j(Y_i, Y_k, X_{ij}, X_{kj}) \le M) \\&\qquad + \frac{1}{n_j(n_j-1)}\sum _{i\ne k}^{n_j} h_j(Y_i, Y_k, X_{ij}, X_{kj})I(h_j(Y_i, Y_k, X_{ij}, X_{kj}) > M), \end{aligned}$$

and define

$$\begin{aligned} \sigma _{jy,1}:= & {} E(s_{jy,1}^{*}) = E[ h_j(Y_i, Y_k, X_{ij}, X_{kj})I(h_j(Y_i, Y_k, X_{ij}, X_{kj}) \le M)], \\ \sigma _{jy,2}:= & {} E(s_{jy,2}^{*}) = E[h_j(Y_i, Y_k, X_{ij}, X_{kj})I(h_j(Y_i, Y_k, X_{ij}, X_{kj}) > M)]. \end{aligned}$$

Because $s_{jy,1}^{*}$ can be written as an average of averages of i.i.d. random variables (Serfling 1980, Sect. 5.1.6), for any $t > 0$ and $\epsilon > 0$ we have

$$\begin{aligned}&P(s_{jy,1}^{*} - \sigma _{jy,1} \ge \epsilon ) \\&\quad \le \exp (-t\epsilon )\exp (-t\sigma _{jy,1})E(\exp (ts_{jy,1}^{*}))\\&\quad = \exp (-t\epsilon )\exp (-t\sigma _{jy,1}) E\left( \exp \left( t \frac{1}{n_j!}\sum _{n_j!}\frac{1}{m}\sum _{m}h_j^{(m)}I(h_j^{(m)} \le M)\right) \right) \\&\quad \le \exp (-t\epsilon )\exp (-t\sigma _{jy,1})E^m\left( \exp \left( \frac{1}{m}th_j^{(m)}I(h_j^{(m)} \le M)\right) \right) \\&\quad = \exp (-t\epsilon )E^m\left( \exp \left( \frac{1}{m}t\left( h_j^{(m)}I(h_j^{(m)} \le M) - \sigma _{jy,1}\right) \right) \right) , \end{aligned}$$

where $m = [n_j/2]$ and the last inequality follows from Theorem 5.6.1A in Serfling (1980). Choose $t = 4\epsilon m/M^2$ so that $P(s_{jy,1}^{*} - \sigma _{jy,1} \ge \epsilon ) \le \exp (-2\epsilon ^2 m/M^2)$ and by symmetry of the U-statistics

$$\begin{aligned} P(|s_{jy,1}^{*} - \sigma _{jy,1}| \ge \epsilon ) \le 2\exp (-2\epsilon ^2 m/M^2). \end{aligned} \tag{3}$$
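To get a feel for bound (3), the right-hand side can simply be evaluated for given $\epsilon$, $n_j$ and truncation level $M$. The helper below is a hypothetical illustration and not part of the paper. It takes $m = [n_j/2]$ as the integer part of $n_j/2$, as in the text.

```python
import math


def truncated_tail_bound(eps, n_j, M):
    """Right-hand side of bound (3): 2 * exp(-2 * eps^2 * m / M^2), m = floor(n_j / 2)."""
    m = n_j // 2
    return 2.0 * math.exp(-2.0 * eps ** 2 * m / M ** 2)


# The bound tightens as the number of complete pairs grows
# and loosens as the truncation level M increases.
bound_small_n = truncated_tail_bound(eps=0.5, n_j=50, M=1.0)
bound_large_n = truncated_tail_bound(eps=0.5, n_j=200, M=1.0)
bound_large_M = truncated_tail_bound(eps=0.5, n_j=200, M=3.0)
```

The trade-off visible here is why the truncation level has to grow slowly with $n_j$: a larger $M$ makes the truncated part easier to control through (3) only if $m/M^2$ still diverges.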

Now we deal with $s_{jy,2}^*$. Note that using Cauchy–Schwarz and Markov inequalities we have

$$\begin{aligned} \sigma _{jy, 2}^2\le & {} E[(Y_i - Y_k)^2(X_{ij} - X_{kj})^2]P[(Y_i - Y_k)(X_{ij} - X_{kj}) \ge M]\\\le & {} E[(Y_i - Y_k)^2(X_{ij} - X_{kj})^2]E[\exp (s(Y_i - Y_k)(X_{ij} - X_{kj}))]\exp (-sM) \end{aligned}$$

for any $s > 0$. Using assumption C1, if we choose $M = cn_j^\gamma $ for $0< \gamma < 1/2 - \kappa $, then $\sigma _{jy,2} \le \epsilon /2$ when $n_j$ is sufficiently large. Consequently,

$$\begin{aligned}&P(|s_{jy,2}^{*} - \sigma _{jy,2}|> \epsilon )\\&\quad \le P(|s_{jy,2}^{*}|> \epsilon /2) \le P\left( \bigcup \{(Y_i - Y_k)(X_{ij} - X_{kj})> M\}\right) \\&\quad \le n_jP((Y_i - Y_k)(X_{ij} - X_{kj})> M)\\&\quad = n_jP[\exp (s(Y_i - Y_k)(X_{ij} - X_{kj})) > \exp (sM)]\\&\quad \le n_j\exp (-sM)E(\exp \{s(Y_i - Y_k)(X_{ij} - X_{kj})\}) = n_jC\exp (-sM), \end{aligned}$$

for any $s > 0$. Hence

$$\begin{aligned}&P(|s_{jy}^{*} - \sigma _{jy}|> 2\epsilon )\\&\quad = P(|s_{jy,1}^{*} + s_{jy,2}^{*} - \sigma _{jy,1} - \sigma _{jy,2}| > 2\epsilon )\\&\quad \le P(|s_{jy,1}^{*} - \sigma _{jy,1}|> \epsilon ) + P(|s_{jy,2}^{*} - \sigma _{jy,2}| > \epsilon )\\&\quad \le O(\exp (-c_1\epsilon ^2n_j^{1-2\gamma }) + n_j\exp (-c_2n_j^\gamma )). \end{aligned} \tag{4}$$

Recall that $\hat{\rho}_j = \frac{s_{jy}}{s_{j}s_y}\frac{\hat{\sigma}_y}{s_{y}}\frac{s_j}{\hat{\sigma}_j}$. Using similar arguments, one can show that the convergence rates of $s_{y}, s_j, \hat{\sigma}_y$ and $\hat{\sigma}_j$ have the same form as (4), and hence, by Lemma S4 in Liu et al. (2014), so does that of $\hat{\rho}_j$, so that we have
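The estimator recalled above can be sketched in a few lines. This is an illustrative reading of the formulas in this appendix, not the authors' implementation. It assumes $\hat{\sigma}_y^2$ is the divisor-$n$ variance of all responses and that $s_j$, $s_y$, $s_{jy}$ are divisor-$n_j$ moments of the complete pairs. It takes $\hat{\sigma}_j^2 = \hat{\sigma}_{j \cdot y}^2 + \hat{\beta}_{j \cdot y}^2\hat{\sigma}_y^2$ with $\hat{\beta}_{j \cdot y} = s_{jy}/s_y^2$ and $\hat{\sigma}_{j \cdot y}^2 = s_j^2 - s_{jy}^2/s_y^2$, as in the substitutions used in the proof of Theorem 2.1. Missing predictor values are encoded as NaN, and the function name is hypothetical.

```python
import numpy as np


def ml_correlation(x, y):
    """ML-type correlation between a partially observed x (NaN = missing) and a fully observed y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(x)
    xo, yo = x[observed], y[observed]

    # Moments from the n_j complete pairs (divisor n_j).
    s_y2 = np.mean((yo - yo.mean()) ** 2)
    s_j2 = np.mean((xo - xo.mean()) ** 2)
    s_jy = np.mean((xo - xo.mean()) * (yo - yo.mean()))

    # Variance of Y from all n responses (divisor n); assumed form of sigma_hat_y^2.
    sigma_y2 = np.mean((y - y.mean()) ** 2)

    beta = s_jy / s_y2                       # slope of X_j on Y
    resid_var = s_j2 - s_jy ** 2 / s_y2      # residual variance of X_j given Y
    sigma_j2 = resid_var + beta ** 2 * sigma_y2

    # Equals (s_jy / (s_j s_y)) * (sigma_y / s_y) * (s_j / sigma_j).
    return beta * np.sqrt(sigma_y2 / sigma_j2)


# Simulated example with about 30% of the predictor values removed completely at random.
rng = np.random.default_rng(1)
y = rng.normal(size=200)
x_full = 0.7 * y + rng.normal(size=200)
x_missing = x_full.copy()
x_missing[rng.random(200) < 0.3] = np.nan

rho_full = ml_correlation(x_full, y)
rho_missing = ml_correlation(x_missing, y)
```

With no missing values the function returns the ordinary Pearson correlation, which gives a quick way to validate an implementation.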

$$\begin{aligned}&P(|\hat{\rho}_j - \rho _{j} | \ge cn_j^{-\kappa })\\&\quad = O(\exp (-c_1n_j^{1-2(\gamma +\kappa )}) + n_j\exp (-c_2n_j^\gamma )), \\&P(|\hat{\rho}_j - \rho _{j} | \ge cn_j^{-\kappa } \text { for some } j)\\&\quad \le \sum _{j = 1}^d P(|\hat{\rho}_j - \rho _{j}| \ge c n_j^{-\kappa })\\&\quad = \sum _{j = 1}^dO(\exp (-c_1n_j^{1-2(\gamma +\kappa )}) + n_j\exp (-c_2n_j^\gamma )). \end{aligned}$$

Letting $\epsilon = cn_j^{-\kappa }$ we have

$$\begin{aligned}&P(\max _{j = 1, \ldots , d}|\hat{\rho}_j - \rho _{j} | \ge c n_j^{-\kappa })\\&\quad \le d \max _{j = 1, \ldots , d} P(|\hat{\rho}_j - \rho _{j}| \ge cn_j^{-\kappa })\\&\quad = d\max _{j = 1, \ldots , d}O(\exp (-c_1 n_j^{-2\kappa }n_j^{1-2\gamma }) + n_j\exp (-c_2n_j^\gamma ))\\&\quad \le O\left( d\left[ \exp (-c_1\min _jn_j^{1-2(\gamma +\kappa )}) + \max _j\{n_j\exp (-c_2n_j^\gamma )\}\right] \right) . \end{aligned}$$

If ${\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}$, then there exists a $j \in {\mathcal {A}}$ such that $|\hat{\rho}_j| < cn_j^{-\kappa }$. From condition C2 it follows that $|\hat{\rho}_j - \rho _j| > cn_j^{-\kappa }$ for some $j \in {\mathcal {A}}$. This implies that $\{{\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\} \subseteq \{|\hat{\rho}_j - \rho _j| > cn_j^{-\kappa } \text { for some } j \in {\mathcal {A}}\}$. Then

$$\begin{aligned}&P({\mathcal {A}} \subseteq \hat{{\mathcal {A}}})\\&\quad \ge P(|{\hat{\rho }}_j - \rho _{j}| \le cn_j^{-\kappa }, \text { for all } j \in {\mathcal {A}}) \\&\quad = 1 - P(|{\hat{\rho }}_j - \rho _{j}|> cn_j^{-\kappa }, \text { for some } j \in {\mathcal {A}})\\&\quad \ge 1 - \sum _{j \in {\mathcal {A}}}P(|{\hat{\rho }}_j - \rho _{j}| > cn_j^{-\kappa })\\&\quad = 1 - \sum _{j \in {\mathcal {A}}}O(\exp (-c_1\min _jn_j^{1-2(\gamma +\kappa )}) + \max _j\{n_j\exp (-c_2n_j^\gamma )\}). \end{aligned}$$

$\square $
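The thresholding step that the proof reasons about can be written out as follows. This is a minimal sketch under the assumption that the selected set is $\hat{{\mathcal {A}}} = \{j : |\hat{\rho}_j| \ge cn_j^{-\kappa }\}$, which is how the argument above uses it. The function name and the values of $c$ and $\kappa$ are purely illustrative and are not recommendations from the paper.

```python
import numpy as np


def screen_predictors(rho_hat, n_obs, c=1.0, kappa=0.25):
    """Return indices j with |rho_hat_j| >= c * n_j^(-kappa).

    rho_hat : estimated marginal correlations, one per predictor
    n_obs   : number of complete pairs n_j for each predictor
    c, kappa: illustrative tuning constants (assumed values)
    """
    rho_hat = np.asarray(rho_hat, dtype=float)
    n_obs = np.asarray(n_obs, dtype=float)
    threshold = c * n_obs ** (-kappa)
    return np.flatnonzero(np.abs(rho_hat) >= threshold)


# Four hypothetical predictors with different amounts of observed data.
selected = screen_predictors([0.9, 0.05, -0.6, 0.2], [100, 80, 100, 50])
```

Because the threshold depends on $n_j$, a predictor with many missing values has to clear a higher bar than one that is almost fully observed.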

About this article

Cite this article

Zambom, A.Z., Matthews, G.J. Sure independence screening in the presence of missing data. Stat Papers 62, 817–845 (2021). https://doi.org/10.1007/s00362-019-01115-w

Received: 19 September 2018
Revised: 17 May 2019
Published: 29 May 2019
Issue Date: April 2021
DOI: https://doi.org/10.1007/s00362-019-01115-w
