Sure independence screening in the presence of missing data

  • Regular Article
  • Published in Statistical Papers

Abstract

Variable selection in ultra-high dimensional data sets is an increasingly prevalent issue, driven by readily available data from, for example, genome-wide association studies and gene expression studies. When the dimension of the feature space is exponentially larger than the sample size, it is desirable to screen out unimportant predictors in order to bring the dimension down to a moderate scale. In this paper we consider the case in which observations of the predictors are missing at random. We propose screening with the marginal linear correlation coefficient between each predictor and the response variable, estimated by maximum likelihood so as to account for the missing data. This method is shown to have the sure screening property. Moreover, we propose a novel screening method that uses additional predictors when estimating the correlation coefficient. Simulations show that screening based simply on pairwise complete observations is outperformed by both proposed methods and is not recommended. Finally, the proposed methods are applied to a gene expression study on prostate cancer.

References

  • Abdulghani J, Gu L, Dagvadorj A, Lutz J, Leiby B, Bonuccelli G et al (2008) Stat3 promotes metastatic progression of prostate cancer. Am J Pathol 172(6):1717–1728

  • Anderson T (1957) Maximum-likelihood estimation for the multivariate normal distribution when some observations are missing. J Am Stat Assoc 52:200–203

  • Anderson TW (1984) An introduction to multivariate statistical analysis. Wiley, Hoboken

  • Attouch M, Laksaci A, Messabihi N (2017) Nonparametric relative error regression for spatial random variables. Stat Pap 58(4):987–1008

  • Barnett GC, Thompson D, Fachal L, Kerns S, Talbot C, Elliott RM et al (2014) A genome wide association study (GWAS) providing evidence of an association between common genetic variants and late radiotherapy toxicity. Radiother Oncol 111(2):178–185

  • Beebe-Dimmer J, Hathcock M, Yee C, Okoth L, Isaacs W, Cooney K et al (2015) The HOXB13 G84E mutation is associated with an increased risk for prostate cancer and other malignancies. Cancer Epidemiol Biomarkers Prev 24(9):1366–1372

  • Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4):1165–1188

  • Browning SR (2008) Missing data imputation and haplotype phase inference for genome-wide association studies. Hum Genet 124(5):439–450

  • Candes E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351

  • Castro E, Eeles R (2012) The role of BRCA1 and BRCA2 in prostate cancer. Asian J Androl 14(3):409–414

  • Cheema J (2014) A review of missing data handling methods in education research. Rev Educ Res 84(4):487–508

  • Chen Q, Wang S (2013) Variable selection for multiply imputed data with application to dioxin exposure study. Stat Med 32(21):3646–3659

  • Chen X, Chen X, Liu Y (2017) A note on quantile feature screening via distance correlation. Stat Pap. https://doi.org/10.1007/s00362-017-0894-8

  • Claeskens G, Consentino F (2008) Variable selection with incomplete covariate data. Biometrics 64:1062–1069

  • Dai J, Ruczinski I, LeBlanc M, Kooperberg C (2006) Imputation methods to improve inference in SNP association studies. Genet Epidemiol 30(8):690–702

  • Dang Y, Chang C, Ido M, Long Q (2016) Multiple imputation for general missing data patterns in the presence of high-dimensional data. J R Stat Soc Ser B (Methodological) 39(1):1–38

  • Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodological) 39(1):1–38

  • Deters KD, Nho K, Risacher SL, Kim S, Ramanan VK, Crane PK et al (2017) Genome-wide association study of language performance in Alzheimer’s disease. Brain Lang 172:22–29

  • Easton DF, Pooley KA, Dunning AM, Pharoah PDP, Thompson D, Ballinger DG et al (2007) Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447(7148):1087–1093

  • Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499

  • Elkashef A, Allison S, Sadiq M, Basheer H, Morais G, Loadman P et al (2016) Polysialic acid sustains cancer cell survival and migratory capacity in a hypoxic environment. Sci Rep 6:33026

  • Ewing CM, Ray AM, Lange EM, Zuhlke KA, Robbins CM, Tembe WD et al (2012) Germline mutations in HOXB13 and prostate-cancer risk. N Engl J Med 366(2):141–149 PMID: 22236224

  • Faisal S, Tutz G (2017) Missing value imputation for gene expression data by tailored nearest neighbors. Stat Appl Genet Mol Biol 16(2):95–106

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360

  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol 70:849–911

  • Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20:101–148

  • Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:1829–1853

  • Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106(494):544–557

  • Faria R, Gomes M, Epstein D, White I (2014) A guide to handling missing data in cost-effectiveness analysis conducted within randomised controlled trials. Pharmacoeconomics 32(12):1157–1170

  • Fletcher O, Johnson N, Orr N, Hosking FJ, Gibson LJ, Walker K et al (2011) Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. J Natl Cancer Inst 103(5):425–435

  • Garcia RI, Ibrahim JG, Zhu H (2010a) Variable selection in the Cox regression model with covariates missing at random. Biometrics 66:97–104

  • Garcia RI, Ibrahim JG, Zhu H (2010b) Variable selection for regression models with missing data. Stat Sin 20:149–165

  • Greenshtein E, Ritov Y (2004) Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10:971–988

  • Haffmann E, Sorenson B, Sauter D, Lambert I (2015) Role of volume-regulated and calcium-activated anion channels in cell volume homeostasis, cancer and drug resistance. Channels (Austin) 9(6):380–396

  • Harel O, Zhou X (2007) Multiple imputation: review of theory, implementation, and software. Stat Med 26(16):3057–3077

  • Harel O, Pellowski J, Kalichman S (2012) Are we missing the importance of missing values in HIV prevention randomized clinical trials? Reviews and recommendations. AIDS Behav 16(6):1382–1393

  • Hernandez-Caballero M, Sierra-Ramirez J (2015) Single nucleotide polymorphisms of the FTO gene and cancer risk: an overview. Mol Biol Rep 42(3):699–704

  • Horowitz JL (2015) Variable selection and estimation in high-dimensional models. Can J Econ 48(2):389–407

  • Horton N, Kleinman K (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90

  • Ibrahim JG, Lipsitz SR, Chen MH (2001) Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika 88:551–564

  • Ibrahim JG, Zhu H, Tang N (2008) Model selection criteria for missing-data problems using the EM algorithm. J Am Stat Assoc 103:1648–1658

  • Karimi O, Mohammadzadeh M (2012) Bayesian spatial regression models with closed skew normal correlated errors and missing observations. Stat Pap 53(1):205–218

  • Komatsu J, Ichikawa D, Hirajima S, Nagata H, Nishimura Y, Kawaguchi T et al (2015) Overexpression of SMYD2 contributes to malignant outcome in gastric cancer. Br J Cancer 112:357–364

  • Kowalski J, Tu XM (2007) Modern applied U statistics. Wiley, New York

  • Lai P, Liu Y, Liu Z, Wan Y (2017) Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput Stat Data Anal 105(C):201–216

  • Lansangan JRG, Barrios EB (2017) Simultaneous dimension reduction and variable selection in modeling high dimensional data. Comput Stat Data Anal 112:242–256

  • Law MH, Bishop DT, Lee JE, Brossard M, Martin NG, Moses EK et al (2015) Genome-wide meta-analysis identifies five new susceptibility loci for cutaneous malignant melanoma. Nat Genet 47(9):987–995

  • Li R, Zhong W, Zhu L (2012a) Feature screening via distance correlation learning. J Am Stat Assoc 107(499):1129–1139 PMID: 25249709

  • Li Z, Gopal V, Li X, Davis J, Casella G (2012b) Simultaneous SNP identification in association studies with missing data. Ann Appl Stat 6(2):432–456

  • Liew A, Law N, Yan H (2011) Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Brief Bioinform 12(5):498–513

  • Little R, Rubin D (2002) Statistical analysis with missing data. Wiley series in probability and statistics. Wiley, Chichester

  • Liu J, Li R, Wu R (2014) Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J Am Stat Assoc 109(505):266–274

  • Liu Y, Wang Y, Feng Y, Wall M (2016) Variable selection and prediction with incomplete high-dimensional data. Ann Appl Stat 10(1):418–450

  • Long Q, Johnson B (2015) Variable selection in the presence of missing data: resampling and imputation. Biostatistics 16(3):596–610

  • Lu J, Lin L (2017) Model-free conditional screening via conditional distance correlation. Stat Pap. https://doi.org/10.1007/s00362-017-0931-7

  • Luo M, Gong C, Chen C, Hu H, Huang P, Zheng M et al (2015) The Rab2A GTPase promotes breast cancer stem cells and tumorigenesis via Erk signaling activation. Cell Rep 11(1):111–124

  • Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39:906–913

  • Mills I (2014) HOXB13, RFX6 and prostate cancer risk. Nat Genet 46:94–95

  • Nagy R, Boutin TS, Marten J, Human JE, Kerr SM, Campbell A et al (2017) Exploration of haplotype research consortium imputation for genome-wide association studies in 20,032 generation scotland participants. Hum Genet 9(1):23

  • Neykov NM, Filzmoser P, Neytchev PN (2014) Ultrahigh dimensional variable selection through the penalized maximum trimmed likelihood estimator. Stat Pap 55(1):187–207

  • Paik MC, Tsai W (1997) On using Cox proportional hazard model with missing covariates. Biometrika 84:579–593

  • Pencik J, Schlederer M, Gruber W, Unger C, Walker SM, Chalaris A et al (2015) Stat3 regulated ARF expression suppresses prostate cancer metastasis. Nat Commun 6:7736

  • Pilie P, Giri V, Cooney K (2016) HOXB13 and other high penetrant genes for prostate cancer. Asian J Androl 18(4):530–532

  • Pritchard CC, Mateo J, Walsh MF, De Sarkar N, Abida W, Beltran H et al (2016) Inherited DNA-repair gene mutations in men with metastatic prostate cancer. N Engl J Med 375(5):443–453 PMID: 27433846

  • Rabier C-E, Azaïs J-M, Elsen J-M, Delmas C (2016) Chi-square processes for gene mapping in a population with family structure. Stat Pap 60(1):239–271

  • Rahaman M, Kumarasiri M, Mekonnen L, Yu M, Diab S, Albrecht H et al (2016) Targeting CDK9: a promising therapeutic opportunity in prostate cancer. Endocr Relat Cancer 23(12):T211–T226

  • Rubin D (1987) Multiple imputation for nonresponse in surveys. Wiley series in probability and mathematical statistics. Wiley, New York

  • Serfling RJ (1980) Approximation theorems of mathematical statistics. Wiley series in probability and statistics. Wiley, New York

  • Shen C-W, Chen Y-H (2012) Model selection for generalized estimating equations accommodating dropout missingness. Biometrics 68:1046–1054

  • Suhre K, Arnold M, Bhagwat AM, Cotton RJ, Engelke R, Raffler J et al (2017) Connecting genetic risk to disease end points through the human blood plasma proteome. Nat Commun 8:14357

  • Tang N, Xia L, Yan X (2018) Feature screening in ultrahigh-dimensional partially linear models with missing responses at random. Comput Stat Data Anal 133:208–227

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288

  • Tomlins SA, Laxman B, Dhanasekaran SM, Helgeson BE, Cao X, Morris DS et al (2007) Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature 448:595–599

  • Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525

  • Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678

  • Wang Q, Li Y (2018) How to make model-free feature screening approaches for full data applicable to the case of missing response? Scand J Stat 45(2):324–346

  • Wang S, Nan B, Rosset S, Zhu J (2011) Random lasso. Ann Appl Stat 5:468–485

  • Wang X, Inzunza H, Chang H, Qi Z, Hu B, Malone D et al (2013) Mutations in the hedgehog pathway genes SMO and PTCH1 in human gastric tumors. PLoS ONE 8(1):e54415

  • Wasserman L, Roeder K (2009) High-dimensional variable selection. Ann Stat 37(5A):2178–2201

  • Yan Q, Brehm J, Pino-Yanes M, Forno E, Lin J, Oh SS et al (2017) A meta-analysis of genome-wide association studies of asthma in Puerto Ricans. Eur Respir J 49(5):1601505

  • Yang H, Liu H (2016) Penalized weighted composite quantile estimators with missing covariates. Stat Pap 57(1):69–88

  • Yang X, Belin TR, Boscardin WJ (2005) Imputation and variable selection in linear regression models with missing covariates. Biometrics 61:498–506

  • Yang H, Guo C, Lv J (2016) Variable selection for generalized varying coefficient models with longitudinal data. Stat Pap 57(1):115–132

  • Yoon D, Lee E, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(Suppl 2):S6

  • Zambom AZ, Akritas MG (2018) Hypothesis testing sure independence screening for nonparametric regression. Electron J Stat 12(1):767–792

  • Zhao Y, Long Q (2016) Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res 25(5):2021–2035

  • Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1418–1429

Author information

Corresponding author

Correspondence to Adriano Zanin Zambom.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of Theorem 2.1

Proof

Recall that the log-likelihood of \({\varvec{\phi }}_j\) is

$$\begin{aligned}&\ell ({\varvec{\phi }}_j| \{(X_{ij}, Y_i)\}_{i = 1}^{n_j}, \{Y_k\}_{k = n_j+1}^{n})\\&\quad = -\frac{1}{2\sigma _{j \cdot y}^2}\sum _{i=1}^{n_j}(X_{ij} - \mu _{j \cdot y} - \beta _{j \cdot y}Y_i)^2 - \frac{n_j\log (\sigma _{j \cdot y}^2)}{2}\\&\qquad -\frac{1}{2\sigma _{y}^2}\sum _{i=1}^n(Y_i - \mu _{y})^2 - \frac{n\log (\sigma _{y}^2)}{2}. \end{aligned}$$
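The two halves of this log-likelihood can be maximised separately: the first is a least-squares regression of \(X_{ij}\) on \(Y_i\) over the \(n_j\) complete pairs, and the second is the Normal likelihood of all \(n\) responses. The following Python sketch illustrates the resulting closed-form estimates and the implied correlation. The function name `mle_correlation` and the use of `NaN` to mark missing predictor values are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def mle_correlation(x, y):
    """Sketch: factored-likelihood estimates for one predictor.

    x : predictor values, with NaN where the value is missing (assumption)
    y : fully observed response
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(x)               # the n_j complete pairs
    xo, yo = x[obs], y[obs]

    # Marginal part: uses all n responses (divisor n, i.e. the MLE).
    mu_y = y.mean()
    sigma2_y = y.var()

    # Conditional part: regression of X_j on Y over the complete pairs.
    s2_y = yo.var()
    s_jy = np.mean((xo - xo.mean()) * (yo - yo.mean()))
    beta = s_jy / s2_y
    mu_cond = xo.mean() - beta * yo.mean()
    sigma2_cond = xo.var() - s_jy**2 / s2_y

    # Implied marginal variance of X_j and correlation with Y.
    sigma2_j = beta**2 * sigma2_y + sigma2_cond
    rho = beta * np.sqrt(sigma2_y / sigma2_j)
    return {"mu_y": mu_y, "sigma2_y": sigma2_y, "mu_cond": mu_cond,
            "beta": beta, "sigma2_cond": sigma2_cond, "rho": rho}

# Small demonstration on simulated data (values are arbitrary).
rng = np.random.default_rng(1)
y_demo = rng.standard_normal(50)
x_demo = 0.5 * y_demo + rng.standard_normal(50)
x_demo[:15] = np.nan                 # pretend the first 15 values are missing
print(mle_correlation(x_demo, y_demo)["rho"])
```

When no predictor values are missing, the two variance estimates of \(Y\) coincide and the returned correlation reduces to the ordinary Pearson correlation.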

The inverse of the (negative) Hessian of the log-likelihood, evaluated at the maximum likelihood estimates, is

$$\begin{aligned} H^{-1}_{{\varvec{\phi }}_j}|_{{\hat{{\varvec{\phi }}}}_j} = \begin{bmatrix} {\hat{\sigma }}_{y}^2/n&0&0&0&0 \\ 0&2{\hat{\sigma }}_{y}^4/n&0&0&0 \\ 0&0&{\hat{\sigma }}_{j \cdot y}^2(1 + {\bar{y}}^2/s_{y}^2)/n_j&-{\bar{y}}{\hat{\sigma }}_{j \cdot y}^2/(n_js_{y}^2)&0 \\ 0&0&-{\bar{y}}{\hat{\sigma }}_{j \cdot y}^2/(n_js_{y}^2)&{\hat{\sigma }}_{j \cdot y}^2/(n_js_{y}^2)&0\\ 0&0&0&0&2{\hat{\sigma }}_{j \cdot y}^4/n_j \end{bmatrix}, \end{aligned}$$

so that the large sample covariance matrix for \({\varvec{\theta }}_j\) can be written as \(D(\rho _j)H^{-1}_{{\varvec{\phi }}_j}|_{{\hat{{\varvec{\phi }}}}_j}D(\rho _j)^T\), where \(D(\rho _j) = \left( \frac{\partial \rho _j}{\partial \mu _y}, \frac{\partial \rho _j}{\partial \sigma _y^2}, \frac{\partial \rho _j}{\partial \mu _{j \cdot y}}, \frac{\partial \rho _j}{\partial \beta _{j \cdot y}}, \frac{\partial \rho _j}{\partial \sigma _{j \cdot y}^2}\right) \). It can be shown that

$$\begin{aligned}&D(\rho _j) {=} \left( 0, \frac{\sigma _{j \cdot y}^2\beta _{j \cdot y}}{2\sqrt{\sigma _{y}^2}(\beta _{j \cdot y}^2\sigma _{y}^2 + \sigma _{j \cdot y}^2)^{3/2}}, 0, \frac{\sigma _{j \cdot y}^2\sqrt{\sigma _{y}^2}}{(\beta _{j \cdot y}^2\sigma _{y}^2 + \sigma _{j \cdot y}^2)^{3/2}}, -\frac{\beta _{j \cdot y}\sqrt{\sigma _{y}^2}}{2(\beta _{j \cdot y}^2\sigma _{y}^2 + \sigma _{j \cdot y}^2)^{3/2}}\right) , \end{aligned}$$

and hence one finds

$$\begin{aligned}&D(\rho _j)H^{-1}_{{\varvec{\phi }}_j}|_{{\hat{{\varvec{\phi }}}}_j}D(\rho _j)^T \\&\quad = \frac{{\hat{\sigma }}_{j \cdot y}^4{\hat{\beta }}_{j \cdot y}^2}{4{\hat{\sigma }}_{y}^2({\hat{\beta }}_{j \cdot y}^2{\hat{\sigma }}_{y}^2 + {\hat{\sigma }}_{j \cdot y}^2)^{3}}\frac{2{\hat{\sigma }}_y^4}{n} + \frac{{\hat{\sigma }}_{j \cdot y}^4{\hat{\sigma }}_{y}^2}{({\hat{\beta }}_{j \cdot y}^2{\hat{\sigma }}_{y}^2 + {\hat{\sigma }}_{j \cdot y}^2)^{3}}\frac{{\hat{\sigma }}_{j \cdot y}^2}{n_js_{y}^2} \\&\qquad + \frac{{\hat{\beta }}_{j \cdot y}^2{\hat{\sigma }}_{y}^2}{4({\hat{\beta }}_{j \cdot y}^2{\hat{\sigma }}_{y}^2 + {\hat{\sigma }}_{j \cdot y}^2)^{3}}\frac{2{\hat{\sigma }}_{j \cdot y}^4}{n_j}\\&\quad = \Big [2{\hat{\sigma }}_y^4s_y^2n_j(s_j^2-s_{jy}^2/s_y^2)^2s_{jy}^2/s_y^4 + 4n{\hat{\sigma }}_y^4(s_j^2-s_{jy}^2/s_y^2)^3\\&\qquad + 2ns_y^2{\hat{\sigma }}_y^4(s_j^2-s_{jy}^2/s_y^2)^2s_{jy}^2/s_y^4\Big ]/\left[ 4nn_j{\hat{\sigma }}_y^2s_y^2({\hat{\sigma }}_y^2s_{jy}^2/s_y^4+s_j^2-s_{jy}^2/s_y^2)^3\right] \\&\quad = \frac{(1-{\tilde{\rho }}_j^2)^2{\hat{\sigma }}_y^4\left[ 2n_j{\tilde{\rho }}_j^2s_j^6 + 4ns_j^6(1-{\tilde{\rho }}_j^2) + 2ns_j^6{\tilde{\rho }}_j^2\right] }{4nn_js_y^2{\hat{\sigma }}_y^2s_j^6\left( {\tilde{\rho }}_j^2\left( \frac{{\hat{\sigma }}_y^2}{s_y^2} - 1\right) +1\right) ^3}\\&\quad = (1 - {\tilde{\rho }}_j^2)^2\left( \frac{{\hat{\sigma }}_y^2}{s_y^2}\right) \left( \frac{1}{nn_j}\right) \left( \frac{{\tilde{\rho }}_j^2(n_j-n)/2 + n}{\left( {\tilde{\rho }}_j^2(\frac{{\hat{\sigma }}_y^2}{s_y^2} - 1) + 1\right) ^3}\right) . \end{aligned}$$

Since \({\hat{\rho }}_j\) is the maximum likelihood estimator computed from a Normal distribution, it follows that \([D(\rho _j)H^{-1}_{{\varvec{\phi }}_j}|_{{\hat{{\varvec{\phi }}}}_j}D(\rho _j)^T]^{-1/2}({\hat{\rho }}_j - \rho _j)\) converges in distribution to a standard Normal distribution. \(\square \)
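As a numerical check of the algebra in this proof, the sketch below evaluates the quadratic form \(D(\rho _j)H^{-1}D(\rho _j)^T\) directly from the gradient and the inverse Hessian, and compares it with the closed-form expression on the final line. It assumes the plug-in relations \({\hat{\beta }}_{j \cdot y} = s_{jy}/s_y^2\), \({\hat{\sigma }}_{j \cdot y}^2 = s_j^2 - s_{jy}^2/s_y^2\) and \({\tilde{\rho }}_j = s_{jy}/(s_js_y)\); the function names and the numerical inputs are hypothetical.

```python
import numpy as np

def avar_delta_method(s2_j, s2_y, s_jy, sigma2_y_hat, ybar, n, n_j):
    """Quadratic form D H^{-1} D^T built entry by entry."""
    beta = s_jy / s2_y                      # assumed plug-in for beta_{j.y}
    sig2c = s2_j - s_jy**2 / s2_y           # assumed plug-in for sigma^2_{j.y}
    denom = (beta**2 * sigma2_y_hat + sig2c) ** 1.5
    sd_y = np.sqrt(sigma2_y_hat)
    # Gradient of rho with respect to (mu_y, sigma_y^2, mu_{j.y}, beta, sigma^2_{j.y}).
    D = np.array([0.0,
                  sig2c * beta / (2.0 * sd_y * denom),
                  0.0,
                  sig2c * sd_y / denom,
                  -beta * sd_y / (2.0 * denom)])
    H_inv = np.zeros((5, 5))
    H_inv[0, 0] = sigma2_y_hat / n
    H_inv[1, 1] = 2.0 * sigma2_y_hat**2 / n
    H_inv[2, 2] = sig2c * (1.0 + ybar**2 / s2_y) / n_j
    H_inv[2, 3] = H_inv[3, 2] = -ybar * sig2c / (n_j * s2_y)
    H_inv[3, 3] = sig2c / (n_j * s2_y)
    H_inv[4, 4] = 2.0 * sig2c**2 / n_j
    return D @ H_inv @ D

def avar_closed_form(rho_tilde, sigma2_y_hat, s2_y, n, n_j):
    """Final line of the derivation."""
    r2 = rho_tilde**2
    ratio = sigma2_y_hat / s2_y
    return ((1.0 - r2) ** 2 * ratio / (n * n_j)
            * (r2 * (n_j - n) / 2.0 + n) / (r2 * (ratio - 1.0) + 1.0) ** 3)

# Arbitrary illustrative inputs.
s2_j, s2_y, s_jy, sigma2_y_hat, ybar, n, n_j = 2.0, 1.5, 0.9, 1.3, 0.4, 200, 120
v_quad = avar_delta_method(s2_j, s2_y, s_jy, sigma2_y_hat, ybar, n, n_j)
v_closed = avar_closed_form(s_jy / np.sqrt(s2_j * s2_y), sigma2_y_hat, s2_y, n, n_j)
print(v_quad, v_closed)
```

Because the gradient entries for \(\mu _y\) and \(\mu _{j \cdot y}\) are zero, the value of \({\bar{y}}\) does not affect the result.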

Proof of Theorem 2.2

Proof

First note that the estimated covariance \(s_{jy}\) based on the completely observed pairs is, up to a scale factor of \((n_j-1)/n_j\), a U-statistic (Kowalski and Tu 2007)

$$\begin{aligned} s_{jy}= & {} \frac{1}{n_j}\sum _{i=1}^{n_{j}}(Y_{i} - {\bar{Y}})(X_{ij} - {\bar{X}}_j) = \frac{n_j -1}{n_j}\left( {\begin{array}{c}n_j\\ 2\end{array}}\right) ^{-1}\sum _{i< k}^{n_j}\frac{1}{2}(Y_i - Y_k)(X_{ij} - X_{kj})\\= & {} \frac{n_j -1}{n_j} \frac{1}{n_j(n_j-1)}\sum _{i\ne k}^{n_j} h_j(Y_i, Y_k, X_{ij}, X_{kj}) := \frac{n_j -1}{n_j}s_{jy}^*, \end{aligned}$$

where \({\bar{X}}_j = n_j^{-1}\sum _{i=1}^{n_j}X_{ij}\) and \({\bar{Y}} = n_j^{-1}\sum _{i=1}^{n_j}Y_{i}\) are the means over the \(n_j\) complete pairs, and \(h_j(Y_i, Y_k, X_{ij}, X_{kj}) = \frac{1}{2}(Y_i - Y_k)(X_{ij} - X_{kj})\) is the kernel of the U-statistic \(s_{jy}^*\). Note that \(E(s_{jy}^*) = \sigma _{jy} := \sigma _j\sigma _y\rho _j\).
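The U-statistic representation can be checked numerically. The sketch below takes the kernel to be half the product of pairwise differences, so that the U-statistic is unbiased for the covariance, and compares it with the usual sample covariance computed on complete pairs. The function name `cov_kernel_ustat` and the simulated data are assumptions for illustration only.

```python
import numpy as np

def cov_kernel_ustat(x, y):
    """Average of 0.5 * (y_i - y_k) * (x_i - x_k) over ordered pairs i != k."""
    n = len(x)
    dx = x[:, None] - x[None, :]      # all pairwise differences of x
    dy = y[:, None] - y[None, :]      # all pairwise differences of y
    # The diagonal (i == k) contributes zero, so summing the full matrix is safe.
    return (0.5 * dx * dy).sum() / (n * (n - 1))

rng = np.random.default_rng(0)
y_pairs = rng.standard_normal(30)
x_pairs = 0.7 * y_pairs + rng.standard_normal(30)
n_j = len(x_pairs)

s_star = cov_kernel_ustat(x_pairs, y_pairs)
s_jy = np.mean((x_pairs - x_pairs.mean()) * (y_pairs - y_pairs.mean()))

print(s_star, np.cov(x_pairs, y_pairs)[0, 1])   # U-statistic vs unbiased covariance
print(s_jy, (n_j - 1) / n_j * s_star)           # divisor-n covariance vs rescaled U-statistic
```

Both printed pairs should agree up to floating-point error, which is the identity used at the start of the proof.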

We follow steps similar to those in Li et al. (2012a). First write

$$\begin{aligned}&s_{jy}^* = s_{jy,1}^{*} + s_{jy,2}^{*}\\&\quad := \frac{1}{n_j(n_j-1)}\sum _{i\ne k}^{n_j} h_j(Y_i, Y_k, X_{ij}, X_{kj})I(h_j(Y_i, Y_k, X_{ij}, X_{kj}) \le M) \\&\qquad + \frac{1}{n_j(n_j-1)}\sum _{i\ne k}^{n_j} h_j(Y_i, Y_k, X_{ij}, X_{kj})I(h_j(Y_i, Y_k, X_{ij}, X_{kj}) > M), \end{aligned}$$

and define

$$\begin{aligned} \sigma _{jy,1}:= & {} E(s_{jy,1}^{*}) = E[ h_j(Y_i, Y_k, X_{ij}, X_{kj})I(h_j(Y_i, Y_k, X_{ij}, X_{kj}) \le M)], \\ \sigma _{jy,2}:= & {} E(s_{jy,2}^{*}) = E[h_j(Y_i, Y_k, X_{ij}, X_{kj})I(h_j(Y_i, Y_k, X_{ij}, X_{kj}) > M)]. \end{aligned}$$

Because \(s_{jy,1}^{*}\) can be written as an average of averages of i.i.d. random variables (Serfling 1980, Sect. 5.1.6), for any \(t > 0\) and \(\epsilon > 0\) we have

$$\begin{aligned}&P(s_{jy,1}^{*} - \sigma _{jy,1} \ge \epsilon ) \\&\quad \le \exp (-t\epsilon )\exp (-t\sigma _{jy,1})E(\exp (ts_{jy,1}^{*}))\\&\quad = \exp (-t\epsilon )\exp (-t\sigma _{jy,1}) E\left( \exp \left( t \frac{1}{n_j!}\sum _{n_j!}\frac{1}{m}\sum _{m}h_j^{(m)}I(h_j^{(m)} \le M)\right) \right) \\&\quad \le \exp (-t\epsilon )\exp (-t\sigma _{jy,1})E^m\left( \exp \left( \frac{1}{m}th_j^{(m)}I(h_j^{(m)} \le M)\right) \right) \\&\quad = \exp (-t\epsilon )E^m\left( \exp \left( \frac{1}{m}t\left( h_j^{(m)}I(h_j^{(m)} \le M) - \sigma _{jy,1}\right) \right) \right) , \end{aligned}$$

where \(m = [n_j/2]\) and the last inequality follows from Theorem 5.6.1A in Serfling (1980). Choose \(t = 4\epsilon m/M^2\) so that \(P(s_{jy,1}^{*} - \sigma _{jy,1} \ge \epsilon ) \le \exp (-2\epsilon ^2 m/M^2)\) and by symmetry of the U-statistics

$$\begin{aligned} P(|s_{jy,1}^{*} - \sigma _{jy,1}| \ge \epsilon ) \le 2\exp (-2\epsilon ^2 m/M^2). \end{aligned} \qquad (3)$$
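The choice \(t = 4\epsilon m/M^2\) can be motivated by the following worked step. It is a sketch under an assumption not stated explicitly above, namely that the centred, truncated kernel \(Z = h_j^{(m)}I(h_j^{(m)} \le M) - \sigma _{jy,1}\) takes values in an interval of length at most \(M\), so that Hoeffding's lemma gives \(E(\exp (uZ)) \le \exp (u^2M^2/8)\) for any real \(u\). Taking \(u = t/m\),

$$\begin{aligned}&\exp (-t\epsilon )E^m\left( \exp \left( \frac{t}{m}Z\right) \right) \le \exp \left( -t\epsilon + \frac{t^2M^2}{8m}\right) ,\\&\frac{d}{dt}\left( -t\epsilon + \frac{t^2M^2}{8m}\right) = 0 \iff t = \frac{4\epsilon m}{M^2},\\&-\frac{4\epsilon ^2m}{M^2} + \frac{16\epsilon ^2m^2}{M^4}\cdot \frac{M^2}{8m} = -\frac{2\epsilon ^2m}{M^2}. \end{aligned}$$

Substituting the minimising \(t\) therefore yields the exponent \(-2\epsilon ^2m/M^2\) appearing in the one-sided bound.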

Now we deal with \(s_{jy,2}^*\). Note that using Cauchy–Schwarz and Markov inequalities we have

$$\begin{aligned} \sigma _{jy, 2}^2\le & {} E[(Y_i - Y_k)^2(X_{ij} - X_{kj})^2]P[(Y_i - Y_k)(X_{ij} - X_{kj}) \ge M]\\\le & {} E[(Y_i - Y_k)^2(X_{ij} - X_{kj})^2]E[\exp (s(Y_i - Y_k)(X_{ij} - X_{kj}))]\exp (-sM) \end{aligned}$$

for any \(s > 0\). Using condition C1, if we choose \(M = cn_j^\gamma \) for \(0< \gamma < 1/2 - \kappa \), then \(\sigma _{jy,2} \le \epsilon /2\) when \(n_j\) is sufficiently large. Consequently,

$$\begin{aligned}&P(|s_{jy,2}^{*} - \sigma _{jy,2}|> \epsilon )\\&\quad \le P(|s_{jy,2}^{*}|> \epsilon /2) \le P(\cup \{(Y_i - Y_k)(X_{ij} - X_{kj})> M\})\\&\quad \le n_jP((Y_i - Y_k)(X_{ij} - X_{kj})> M)\\&\quad = n_jP[\exp (s(Y_i - Y_k)(X_{ij} - X_{kj})) > \exp (sM)]\\&\quad \le n_j\exp (-sM)E(\exp \{s(Y_i - Y_k)(X_{ij} - X_{kj})\}) = n_jC\exp (-sM), \end{aligned}$$

for any \(s > 0\). Hence

$$\begin{aligned}&P(|s_{jy}^{*} - \sigma _{jy}|> 2\epsilon )\\&\quad = P(|s_{jy,1}^{*} + s_{jy,2}^{*} - \sigma _{jy,1} - \sigma _{jy,2}| \ge 2\epsilon )\\&\quad \le P(|s_{jy,1}^{*} - \sigma _{jy,1}|> \epsilon ) + P(|s_{jy,2}^{*} - \sigma _{jy,2}| > \epsilon )\\&\quad \le O(\exp (-c_1\epsilon ^2n_j^{1-2\gamma }) + n_j\exp (-c_2n_j^\gamma )). \end{aligned} \qquad (4)$$

Recall \({{\hat{\rho }}}_j = \frac{s_{jy}}{s_{j}s_y}\frac{{\hat{\sigma }}_y}{s_{y}}\frac{s_j}{{{\hat{\sigma }}}_j}\). Using similar arguments, one can show that the convergence rates of \(s_{y}, s_j, {\hat{\sigma }}_y\) and \({\hat{\sigma }}_j\) have the same form as (4), and hence, by Lemma S4 in Liu et al. (2014), so does that of \({{\hat{\rho }}}_j\). Letting \(\epsilon = cn_j^{-\kappa }\), we have

$$\begin{aligned}&P(|{\hat{\rho }}_j - \rho _{j} | \ge cn_j^{-\kappa }) = O(\exp (-c_1n_j^{1-2(\gamma +\kappa )}) + n_j\exp (-c_2n_j^\gamma )), \\&P(|{\hat{\rho }}_j - \rho _{j} | \ge cn_j^{-\kappa } \text { for some } j)\\&\quad \le \sum _{j = 1}^d P(|{\hat{\rho }}_j - \rho _{j}| \ge c n_j^{-\kappa })\\&\quad = \sum _{j = 1}^dO(\exp (-c_1n_j^{1-2(\gamma +\kappa )}) + n_j\exp (-c_2n_j^\gamma )). \end{aligned}$$

Consequently,

$$\begin{aligned}&P(\max _{j = 1, \ldots , d}|{\hat{\rho }}_j - \rho _{j} | \ge c n_j^{-\kappa })\\&\quad \le d \max _{j = 1, \ldots , d} P(|{\hat{\rho }}_j - \rho _{j}| \ge cn_j^{-\kappa })\\&\quad = d\max _{j = 1, \ldots , d}O(\exp (-c_1 n_j^{-2\kappa }n_j^{1-2\gamma }) + n_j\exp (-c_2n_j^\gamma ))\\&\quad \le O\left( d\left[ \exp (-c_1\min _jn_j^{1-2(\gamma +\kappa )}) + \max _j\{n_j\exp (-c_2n_j^\gamma )\}\right] \right) . \end{aligned}$$

If \({\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\), then there exists a \(j \in {\mathcal {A}}\) such that \(|{\hat{\rho }}_j| < cn_j^{-\kappa }\). From condition C2 it then follows that \(|{\hat{\rho }}_j - \rho _j| > cn_j^{-\kappa }\) for some \(j \in {\mathcal {A}}\). This implies that \(\{{\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\} \subseteq \{|{\hat{\rho }}_j - \rho _j| > cn_j^{-\kappa } \text { for some } j \in {\mathcal {A}}\}\). Then

$$\begin{aligned}&P({\mathcal {A}} \subseteq \hat{{\mathcal {A}}})\\&\quad \ge P(|{\hat{\rho }}_j - \rho _{j}| \le cn_j^{-\kappa }, \text { for all } j \in {\mathcal {A}}) \\&\quad = 1 - P(|{\hat{\rho }}_j - \rho _{j}|> cn_j^{-\kappa }, \text { for some } j \in {\mathcal {A}})\\&\quad \ge 1 - \sum _{j \in {\mathcal {A}}}P(|{\hat{\rho }}_j - \rho _{j}| > cn_j^{-\kappa })\\&\quad = 1 - \sum _{j \in {\mathcal {A}}}O(\exp (-c_1\min _jn_j^{1-2(\gamma +\kappa )}) + \max _j\{n_j\exp (-c_2n_j^\gamma )\}). \end{aligned}$$

\(\square \)
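The sure screening property established above can be illustrated with a small simulation. The sketch below is not the authors' simulation design: it assumes a linear model with two active predictors, values missing completely at random (a special case of missing at random) coded as `NaN`, and a practical top-\(k\) rule in place of the threshold \(cn_j^{-\kappa }\). The names `mle_corr` and `screen_top_k` are hypothetical.

```python
import numpy as np

def mle_corr(x, y):
    """Maximum likelihood correlation for one predictor with missing values."""
    obs = ~np.isnan(x)
    xo, yo = x[obs], y[obs]
    s2_y = yo.var()                                   # variance of Y on complete pairs
    s_jy = np.mean((xo - xo.mean()) * (yo - yo.mean()))
    beta = s_jy / s2_y                                # slope of X_j on Y
    sigma2_cond = xo.var() - s_jy**2 / s2_y           # residual variance
    sigma2_y = y.var()                                # variance of Y on all n responses
    return beta * np.sqrt(sigma2_y / (beta**2 * sigma2_y + sigma2_cond))

def screen_top_k(X, y, k):
    """Rank predictors by |rho_hat_j| and keep the k largest (assumed rule)."""
    scores = np.array([abs(mle_corr(X[:, j], y)) for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:k]
    return keep, scores

rng = np.random.default_rng(0)
n, d = 200, 500
X = rng.standard_normal((n, d))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)   # active set {0, 1}

X_obs = X.copy()
X_obs[rng.random((n, d)) < 0.3] = np.nan                     # about 30% missing per column

selected, scores = screen_top_k(X_obs, y, k=10)
print(sorted(selected.tolist()))
```

With a signal this strong, the two active predictors should appear among the retained indices, while the remaining slots are filled by noise predictors that a subsequent variable selection step would be expected to discard.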

Cite this article

Zambom, A.Z., Matthews, G.J. Sure independence screening in the presence of missing data. Stat Papers 62, 817–845 (2021). https://doi.org/10.1007/s00362-019-01115-w

  • DOI: https://doi.org/10.1007/s00362-019-01115-w
