Sure independence screening in the presence of missing data
Statistical Papers (IF 1.3), Pub Date: 2019-05-29, DOI: 10.1007/s00362-019-01115-w
Adriano Zanin Zambom , Gregory J. Matthews

Variable selection in ultra-high dimensional data sets is an increasingly prevalent issue given the readily available data arising from, for example, genome-wide association studies or gene expression studies. When the dimension of the feature space is exponentially larger than the sample size, it is desirable to screen out unimportant predictors in order to bring the dimension down to a moderate scale. In this paper we consider the case in which observations of the predictors are missing at random. We propose screening with the marginal linear correlation coefficient between each predictor and the response variable, accounting for the missing data through maximum likelihood estimation. This method is shown to have the sure screening property. Moreover, we propose a novel screening method that uses additional predictors when estimating the correlation coefficient. Simulations show that screening based simply on pairwise complete observations is outperformed by both proposed methods and is not recommended. Finally, the proposed methods are applied to a gene expression study on prostate cancer.
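The contrast between screening on pairwise complete observations and screening on a likelihood-based correlation can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a fully observed response, bivariate normality of each (predictor, response) pair, and the textbook factored-likelihood estimator for that setting, which may differ from the estimator in the paper. The function names `pairwise_complete_corr`, `factored_ml_corr`, and `screen` are hypothetical.

```python
import numpy as np


def pairwise_complete_corr(x, y):
    """Pearson correlation using only the rows where x is observed."""
    obs = ~np.isnan(x)
    return np.corrcoef(x[obs], y[obs])[0, 1]


def factored_ml_corr(x, y):
    """ML correlation under bivariate normality (assumption of this sketch).

    y is fully observed and x is missing at random. The likelihood factors
    as f(y) * f(x | y): the variance of y is estimated from all rows, and
    the regression of x on y from the rows where x is observed.
    """
    obs = ~np.isnan(x)
    var_y = np.var(y)                        # ML variance of y, all n rows
    xo, yo = x[obs], y[obs]
    xc, yc = xo - xo.mean(), yo - yo.mean()
    beta = np.sum(yc * xc) / np.sum(yc ** 2)  # slope of x regressed on y
    res_var = np.mean((xc - beta * yc) ** 2)  # ML residual variance
    var_x = res_var + beta ** 2 * var_y       # implied marginal variance of x
    return beta * var_y / np.sqrt(var_x * var_y)


def screen(X, y, d, corr_fn):
    """Return the indices of the d predictors with the largest |correlation|."""
    scores = np.array([abs(corr_fn(X[:, j], y)) for j in range(X.shape[1])])
    return np.argsort(-scores)[:d]


if __name__ == "__main__":
    # Synthetic data: only predictors 0 and 1 are truly active.
    rng = np.random.default_rng(0)
    n, p = 300, 40
    X = rng.standard_normal((n, p))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)
    # Missingness probability depends only on the observed y (missing at random).
    X[rng.random((n, p)) < 1 / (1 + np.exp(-y))[:, None]] = np.nan
    print(screen(X, y, 10, pairwise_complete_corr))
    print(screen(X, y, 10, factored_ml_corr))
```

With no missing values the two estimators coincide. They differ when missingness depends on the response, because the factored estimator rescales the complete-case slope by the variance of the response computed from all rows.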

Updated: 2019-05-29