Abstract
Variable selection in ultra-high dimensional data sets is an increasingly prevalent problem, given the readily available data arising from, for example, genome-wide association studies or gene expression studies. When the dimension of the feature space is exponentially larger than the sample size, it is desirable to screen out unimportant predictors in order to bring the dimension down to a moderate scale. In this paper we consider the case when observations of the predictors are missing at random. We propose performing screening using the marginal linear correlation coefficient between each predictor and the response variable, accounting for the missing data using maximum likelihood estimation. This method is shown to have the sure screening property. Moreover, a novel method of screening that uses additional predictors when estimating the correlation coefficient is proposed. Simulations show that screening based only on pairwise complete observations is outperformed by both proposed methods and is not recommended. Finally, the proposed methods are applied to a gene expression study on prostate cancer.
References
Abdulghani J, Gu L, Dagvadorj A, Lutz J, Leiby B, Bonuccelli G et al (2008) Stat3 promotes metastatic progression of prostate cancer. Am J Pathol 172(6):1717–1728
Anderson T (1957) Maximum-likelihood estimation for the multivariate normal distribution when some observations are missing. J Am Stat Assoc 52:200–203
Anderson TW (1984) An introduction to multivariate statistical analysis. Wiley, Hoboken
Attouch M, Laksaci A, Messabihi N (2017) Nonparametric relative error regression for spatial random variables. Stat Pap 58(4):987–1008
Barnett GC, Thompson D, Fachal L, Kerns S, Talbot C, Elliott RM et al (2014) A genome wide association study (GWAS) providing evidence of an association between common genetic variants and late radiotherapy toxicity. Radiother Oncol 111(2):178–185
Beebe-Dimmer J, Hathcock M, Yee C, Okoth L, Isaacs W, Cooney K et al (2015) The HOXB13 G84E mutation is associated with an increased risk for prostate cancer and other malignancies. Cancer Epidemiol Biomarkers Prev 24(9):1366–1372
Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4):1165–1188
Browning SR (2008) Missing data imputation and haplotype phase inference for genome-wide association studies. Hum Genet 124(5):439–450
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
Castro E, Eeles R (2012) The role of BRCA1 and BRCA2 in prostate cancer. Asian J Androl 14(3):409–414
Cheema J (2014) A review of missing data handling methods in education research. Rev Educ Res 84(4):487–508
Chen Q, Wang S (2013) Variable selection for multiply imputed data with application to dioxin exposure study. Stat Med 32(21):3646–3659
Chen X, Chen X, Liu Y (2017) A note on quantile feature screening via distance correlation. Stat Pap. https://doi.org/10.1007/s00362-017-0894-8
Claeskens G, Consentino F (2008) Variable selection with incomplete covariate data. Biometrics 64:1062–1069
Dai J, Ruczinski I, LeBlanc M, Kooperberg C (2006) Imputation methods to improve inference in SNP association studies. Genet Epidemiol 30(8):690–702
Dang Y, Chang C, Ido M, Long Q (2016) Multiple imputation for general missing data patterns in the presence of high-dimensional data. Sci Rep 6:21689
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodological) 39(1):1–38
Deters KD, Nho K, Risacher SL, Kim S, Ramanan VK, Crane PK et al (2017) Genome-wide association study of language performance in Alzheimer’s disease. Brain Lang 172:22–29
Easton DF, Pooley KA, Dunning AM, Pharoah PDP, Thompson D, Ballinger DG et al (2007) Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447(7148):1087–1093
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Elkashef A, Allison S, Sadiq M, Basheer H, Morais G, Loadman P et al (2016) Polysialic acid sustains cancer cell survival and migratory capacity in a hypoxic environment. Sci Rep 6:33026
Ewing CM, Ray AM, Lange EM, Zuhlke KA, Robbins CM, Tembe WD et al (2012) Germline mutations in HOXB13 and prostate-cancer risk. N Engl J Med 366(2):141–149 PMID: 22236224
Faisal S, Tutz G (2017) Missing value imputation for gene expression data by tailored nearest neighbors. Stat Appl Genet Mol Biol 16(2):95–106
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol 70:849–911
Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20:101–148
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:1829–1853
Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106(494):544–557
Faria R, Gomes M, Epstein D, White I (2014) A guide to handling missing data in cost-effectiveness analysis conducted within randomised controlled trials. Pharmacoeconomics 32(12):1157–1170
Fletcher O, Johnson N, Orr N, Hosking FJ, Gibson LJ, Walker K et al (2011) Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. J Natl Cancer Inst 103(5):425–435
Garcia RI, Ibrahim JG, Zhu H (2010a) Variable selection in the Cox regression model with covariates missing at random. Biometrics 66:97–104
Garcia RI, Ibrahim JG, Zhu H (2010b) Variable selection for regression models with missing data. Stat Sin 20:149–165
Greenshtein E, Ritov Y (2004) Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10:971–988
Haffmann E, Sorenson B, Sauter D, Lambert I (2015) Role of volume-regulated and calcium-activated anion channels in cell volume homeostasis, cancer and drug resistance. Channels (Austin) 9(6):380–396
Harel O, Zhou X (2007) Multiple imputation: review of theory, implementation, and software. Stat Med 26(16):3057–3077
Harel O, Pellowski J, Kalichman S (2012) Are we missing the importance of missing values in HIV prevention randomized clinical trials? Reviews and recommendations. AIDS Behav 16(6):1382–1393
Hernandez-Caballero M, Sierra-Ramirez J (2015) Single nucleotide polymorphisms of the FTO gene and cancer risk: an overview. Mol Biol Rep 42(3):699–704
Horowitz JL (2015) Variable selection and estimation in high-dimensional models. Can J Econ 48(2):389–407
Horton N, Kleinman K (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90
Ibrahim JG, Lipsitz SR, Chen MH (2001) Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika 88:551–564
Ibrahim JG, Zhu H, Tang N (2008) Model selection criteria for missing-data problems using the EM algorithm. J Am Stat Assoc 103:1648–1658
Karimi O, Mohammadzadeh M (2012) Bayesian spatial regression models with closed skew normal correlated errors and missing observations. Stat Pap 53(1):205–218
Komatsu J, Ichikawa D, Hirajima S, Nagata H, Nishimura Y, Kawaguchi T et al (2015) Overexpression of SMYD2 contributes to malignant outcome in gastric cancer. Br J Cancer 112:357–364
Kowalski J, Tu XM (2007) Modern applied U statistics. Wiley, New York
Lai P, Liu Y, Liu Z, Wan Y (2017) Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput Stat Data Anal 105(C):201–216
Lansangan JRG, Barrios EB (2017) Simultaneous dimension reduction and variable selection in modeling high dimensional data. Comput Stat Data Anal 112:242–256
Law MH, Bishop DT, Lee JE, Brossard M, Martin NG, Moses EK et al (2015) Genome-wide meta-analysis identifies five new susceptibility loci for cutaneous malignant melanoma. Nat Genet 47(9):987–995
Li R, Zhong W, Zhu L (2012a) Feature screening via distance correlation learning. J Am Stat Assoc 107(499):1129–1139 PMID: 25249709
Li Z, Gopal V, Li X, Davis J, Casella G (2012b) Simultaneous SNP identification in association studies with missing data. Ann Appl Stat 6(2):432–456
Liew A, Law N, Yan H (2011) Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Brief Bioinform 12(5):498–513
Little R, Rubin D (2002) Statistical analysis with missing data. Wiley series in probability and statistics. Wiley, Chichester
Liu J, Li R, Wu R (2014) Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J Am Stat Assoc 109(505):266–274
Liu Y, Wang Y, Feng Y, Wall M (2016) Variable selection and prediction with incomplete high-dimensional data. Ann Appl Stat 10(1):418–450
Long Q, Johnson B (2015) Variable selection in the presence of missing data: resampling and imputation. Biostatistics 16(3):596–610
Lu J, Lin L (2017) Model-free conditional screening via conditional distance correlation. Stat Pap. https://doi.org/10.1007/s00362-017-0931-7
Luo M, Gong C, Chen C, Hu H, Huang P, Zheng M et al (2015) The Rab2A GTPase promotes breast cancer stem cells and tumorigenesis via Erk signaling activation. Cell Rep 11(1):111–124
Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39:906–913
Mills I (2014) HOXB13, RFX6 and prostate cancer risk. Nat Genet 46:94–95
Nagy R, Boutin TS, Marten J, Huffman JE, Kerr SM, Campbell A et al (2017) Exploration of haplotype research consortium imputation for genome-wide association studies in 20,032 Generation Scotland participants. Genome Med 9(1):23
Neykov NM, Filzmoser P, Neytchev PN (2014) Ultrahigh dimensional variable selection through the penalized maximum trimmed likelihood estimator. Stat Pap 55(1):187–207
Paik MC, Tsai W (1997) On using Cox proportional hazard model with missing covariates. Biometrika 84:579–593
Pencik J, Schlederer M, Gruber W, Unger C, Walker SM, Chalaris A et al (2015) Stat3 regulated ARF expression suppresses prostate cancer metastasis. Nat Commun 6:7736
Pilie P, Giri V, Cooney K (2016) HOXB13 and other high penetrant genes for prostate cancer. Asian J Androl 18(4):530–532
Pritchard CC, Mateo J, Walsh MF, De Sarkar N, Abida W, Beltran H et al (2016) Inherited DNA-repair gene mutations in men with metastatic prostate cancer. N Engl J Med 375(5):443–453 PMID: 27433846
Rabier C-E, Azaïs J-M, Elsen J-M, Delmas C (2016) Chi-square processes for gene mapping in a population with family structure. Stat Pap 60(1):239–271
Rahaman M, Kumarasiri M, Mekonnen L, Yu M, Diab S, Albrecht H et al (2016) Targeting CDK9: a promising therapeutic opportunity in prostate cancer. Endocr Relat Cancer 23(12):T211–T226
Rubin D (1987) Multiple imputation for nonresponse in surveys. Wiley series in probability and mathematical statistics. Wiley, New York
Serfling RJ (1980) Approximation theorems of mathematical statistics. Wiley series in probability and statistics. Wiley, New York
Shen C-W, Chen Y-H (2012) Model selection for generalized estimating equations accommodating dropout missingness. Biometrics 68:1046–1054
Suhre K, Arnold M, Bhagwat AM, Cotton RJ, Engelke R, Raffler J et al (2017) Connecting genetic risk to disease end points through the human blood plasma proteome. Nat Commun 8:14357
Tang N, Xia L, Yan X (2018) Feature screening in ultrahigh-dimensional partially linear models with missing responses at random. Comput Stat Data Anal 133:208–227
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288
Tomlins SA, Laxman B, Dhanasekaran SM, Helgeson BE, Cao X, Morris DS et al (2007) Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature 448:595–599
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525
Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678
Wang Q, Li Y (2018) How to make model-free feature screening approaches for full data applicable to the case of missing response? Scand J Stat 45(2):324–346
Wang S, Nan B, Rosset S, Zhu J (2011) Random lasso. Ann Appl Stat 5:468–485
Wang X, Inzunza H, Chang H, Qi Z, Hu B, Malone D et al (2013) Mutations in the hedgehog pathway genes SMO and PTCH1 in human gastric tumors. PLoS ONE 8(1):e54415
Wasserman L, Roeder K (2009) High-dimensional variable selection. Ann Stat 37(5A):2178–2201
Yan Q, Brehm J, Pino-Yanes M, Forno E, Lin J, Oh SS et al (2017) A meta-analysis of genome-wide association studies of asthma in Puerto Ricans. Eur Respir J 49(5):1601505
Yang H, Liu H (2016) Penalized weighted composite quantile estimators with missing covariates. Stat Pap 57(1):69–88
Yang X, Belin TR, Boscardin WJ (2005) Imputation and variable selection in linear regression models with missing covariates. Biometrics 61:498–506
Yang H, Guo C, Lv J (2016) Variable selection for generalized varying coefficient models with longitudinal data. Stat Pap 57(1):115–132
Yoon D, Lee E, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(Suppl 2):S6
Zambom AZ, Akritas MG (2018) Hypothesis testing sure independence screening for nonparametric regression. Electron J Stat 12(1):767–792
Zhao Y, Long Q (2016) Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res 25(5):2021–2035
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1418–1429
Appendix
Proof of Theorem 2.1
Proof
Recall that the log-likelihood of \({\varvec{\phi }}_j\) is
The inverted hessian of the log-likelihood evaluated at the estimated parameters is
so that the large sample covariance matrix for \({\varvec{\theta }}_j\) can be written as \(D(\rho _j)H^{-1}_{{\varvec{\phi }}_j}|_{{\hat{{\varvec{\phi }}}}_j}D(\rho _j)^T\), where \(D(\rho _j) = \left( \frac{\partial \rho _j}{\partial \mu _y}, \frac{\partial \rho _j}{\partial \sigma _y^2}, \frac{\partial \rho _j}{\partial \mu _{j \cdot y}}, \frac{\partial \rho _j}{\partial \beta _{j \cdot y}}, \frac{\partial \rho _j}{\partial \sigma _{j \cdot y}^2}\right) \). It can be shown that
and hence one finds
Since \({\hat{\rho }}_j\) is the maximum likelihood estimator computed from a Normal distribution, it follows that \([D(\rho _j)H^{-1}_{{\varvec{\phi }}_j}|_{{\hat{{\varvec{\phi }}}}_j}D(\rho _j)^T]^{-1/2}({\hat{\rho }}_j - \rho _j)\) converges in distribution to a standard Normal distribution. \(\square \)
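As a worked companion to the proof above, the entries of the gradient \(D(\rho _j)\) can be written out explicitly. This is a sketch under the assumption that \({\varvec{\phi }}_j = (\mu _y, \sigma _y^2, \mu _{j \cdot y}, \beta _{j \cdot y}, \sigma _{j \cdot y}^2)\) parametrizes the factorization \(Y \sim N(\mu _y, \sigma _y^2)\), \(X_j \mid Y \sim N(\mu _{j \cdot y} + \beta _{j \cdot y}Y, \sigma _{j \cdot y}^2)\), as in Anderson (1957); the shorthand \(\sigma _j^2\) for the marginal variance of \(X_j\) is introduced here for illustration only.

```latex
\begin{align*}
\sigma_j^2 &= \sigma_{j\cdot y}^2 + \beta_{j\cdot y}^2\,\sigma_y^2,
\qquad
\rho_j = \frac{\beta_{j\cdot y}\,\sigma_y}{\sigma_j}, \\[4pt]
\frac{\partial \rho_j}{\partial \mu_y}
  &= \frac{\partial \rho_j}{\partial \mu_{j\cdot y}} = 0, \\[4pt]
\frac{\partial \rho_j}{\partial \sigma_y^2}
  &= \frac{\beta_{j\cdot y}}{2\,\sigma_y\,\sigma_j}
     - \frac{\beta_{j\cdot y}^3\,\sigma_y}{2\,\sigma_j^3}
   = \frac{\beta_{j\cdot y}\,\sigma_{j\cdot y}^2}{2\,\sigma_y\,\sigma_j^3}, \\[4pt]
\frac{\partial \rho_j}{\partial \beta_{j\cdot y}}
  &= \frac{\sigma_y}{\sigma_j}
     - \frac{\beta_{j\cdot y}^2\,\sigma_y^3}{\sigma_j^3}
   = \frac{\sigma_y\,\sigma_{j\cdot y}^2}{\sigma_j^3}, \\[4pt]
\frac{\partial \rho_j}{\partial \sigma_{j\cdot y}^2}
  &= -\frac{\beta_{j\cdot y}\,\sigma_y}{2\,\sigma_j^3}.
\end{align*}
```

Under this parametrization only three entries of \(D(\rho _j)\) are nonzero, so the asymptotic variance of \({\hat{\rho }}_j\) involves the inverse Hessian only through the rows and columns for \(\sigma _y^2\), \(\beta _{j \cdot y}\) and \(\sigma _{j \cdot y}^2\).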
Proof of Theorem 2.2
Proof
First note that the estimated covariance \(s_{jy}\) based on the completely observed pairs is, except for a scale of \((n_j-1)/(n_j)\), a U-statistic (Kowalski and Tu 2007)
where \({\bar{X}}_j = \sum _{i=1}^nX_{ij}\) and \(h_j(Y_i, Y_k, X_{ij}, X_{kj}) = (Y_i - Y_k)(X_{ij} - X_{kj})\) is the kernel of the U-statistic \(s_{jy}^*\). Note that \(E(s_{jy}^*) = \sigma _{jy} := \sigma _j\sigma _y\rho _j\).
We follow steps similar to those in Li et al. (2012a). First write
and define
Because \(s_{jy,1}^{*}\) can be written as an average of averages of i.i.d. random variables (Serfling 1980, Sect. 5.1.6), for any \(t > 0\) and \(\epsilon > 0\) we have
where \(m = [n_j/2]\) and the last inequality follows from Theorem 5.6.1A in Serfling (1980). Choose \(t = 4\epsilon m/M^2\) so that \(P(s_{jy,1}^{*} - \sigma _{jy,1} \ge \epsilon ) \le \exp (-2\epsilon ^2 m/M^2)\) and by symmetry of the U-statistics
Now we deal with \(s_{jy,2}^*\). Note that using Cauchy–Schwarz and Markov inequalities we have
for any \(s > 0\). Using assumption C1, if we choose \(M = cn_j^\gamma \) for \(0< \gamma < 1/2 - k\), then \(\sigma _{jy,2} \le \epsilon /2\) when \(n_j\) is sufficiently large. Consequently,
for any \(s > 0\). Hence
Recall \({{\hat{\rho }}}_j = \frac{s_{jy}}{s_{j}s_y}\frac{{\hat{\sigma }}_y}{s_{y}}\frac{s_j}{{{\hat{\sigma }}}_j}\). Using similar arguments, one can show that the convergence rates of \(s_{y}, s_j, {\hat{\sigma }}_y\) and \({\hat{\sigma }}_j\) have the same form as (4) and hence, by Lemma S4 in Liu et al. (2014), so does that of \({{\hat{\rho }}}_j\), so that we have
Letting \(\epsilon = cn_j^{-\kappa }\) we have
If \({\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\), then there exists a \(j \in {\mathcal {A}}\) such that \({\hat{\rho }}_j < cn_j^{-\kappa }\). From condition C2 it follows that \(|{\hat{\rho }}_j - \rho _j| > cn_j^{-\kappa }\) for some \(j \in {\mathcal {A}}\). This implies that \(\{{\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\} \subseteq \{|{\hat{\rho }}_j - \rho _j| > cn_j^{-\kappa } \text { for some } j \in {\mathcal {A}}\}\). Then
\(\square \)
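The estimator recalled in the proof above, \({{\hat{\rho }}}_j = \frac{s_{jy}}{s_{j}s_y}\frac{{\hat{\sigma }}_y}{s_{y}}\frac{s_j}{{{\hat{\sigma }}}_j}\), and the thresholding rule that defines \(\hat{{\mathcal {A}}}\) can be sketched in Python. This is a minimal illustration and not the authors' implementation: it assumes the response is fully observed, each predictor is missing at random given the response, and the closed-form factored-likelihood estimates for a bivariate normal pair (Anderson 1957) apply. The function names `ml_correlation` and `screen`, the `threshold` argument (standing in for \(cn_j^{-\kappa }\)), and the simulated data are all hypothetical.

```python
import numpy as np


def ml_correlation(x, y):
    """ML estimate of corr(x, y); y fully observed, NaN marks missing x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(x)
    xc, yc = x[obs], y[obs]

    # Moments from the completely observed pairs (ML divisor n_j).
    s_y2 = yc.var()
    s_j2 = xc.var()
    s_jy = np.mean((xc - xc.mean()) * (yc - yc.mean()))

    # Regression slope of x on y, estimated from the complete pairs.
    beta = s_jy / s_y2

    # ML variance of y uses every observation, not just complete pairs.
    sig_y2 = y.var()
    # ML marginal variance of x: complete-pair variance corrected by
    # the extra information about the spread of y.
    sig_j2 = s_j2 + beta**2 * (sig_y2 - s_y2)

    # Equals (s_jy / (s_j s_y)) * (sig_y / s_y) * (s_j / sig_j).
    return beta * np.sqrt(sig_y2 / sig_j2)


def screen(X, y, threshold):
    """Keep predictors whose |ML correlation| with y reaches the threshold."""
    rho = np.array([ml_correlation(X[:, j], y) for j in range(X.shape[1])])
    return np.flatnonzero(np.abs(rho) >= threshold), rho


# Simulated example (assumed setup): two active predictors out of 50.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(n)

# Missing at random: an entry of X can be missing only when y is large.
p_miss = (0.6 * (y > np.median(y)))[:, None]
X_obs = np.where(rng.random((n, p)) < p_miss, np.nan, X)

selected, rho = screen(X_obs, y, threshold=0.3)
```

When no values are missing, \({\hat{\sigma }}_y = s_y\) and \({\hat{\sigma }}_j = s_j\), so `ml_correlation` reduces to the ordinary sample correlation. The fixed `threshold=0.3` is chosen only for this toy example; in practice the cutoff would follow the \(cn_j^{-\kappa }\) form used in the proof.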
Cite this article
Zambom, A.Z., Matthews, G.J. Sure independence screening in the presence of missing data. Stat Papers 62, 817–845 (2021). https://doi.org/10.1007/s00362-019-01115-w
DOI: https://doi.org/10.1007/s00362-019-01115-w