Variable selection techniques after multiple imputation in high-dimensional data
Statistical Methods & Applications ( IF 1.1 ) Pub Date : 2019-11-03 , DOI: 10.1007/s10260-019-00493-7
Faisal Maqbool Zahid , Shahla Faisal , Christian Heumann

High-dimensional data arise in diverse fields of scientific research, and missing values are frequently encountered in such data. Variable selection plays a key role in high-dimensional data analysis, and, like many other statistical techniques, it requires complete cases with no missing values. A variety of variable selection techniques is available for complete data, but comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach for handling missing values and obtaining completed datasets. If a particular variable selection technique is applied independently to each of the multiply imputed datasets, a different model may result for each dataset, and it remains unclear how variable selection should be carried out on multiply imputed data. In this paper, we propose using the magnitude of the parameter estimates of each candidate predictor across all imputed datasets for its selection: a constraint is imposed on the sum of the absolute values of these estimates to select the predictor for, or remove it from, the model. The proposed method for identifying informative predictors is compared with other approaches in an extensive simulation study. Performance is compared on the basis of hit rates (the proportion of informative predictors correctly identified) and false alarm rates (the proportion of non-informative predictors declared informative) for different numbers of imputed datasets. The proposed technique is simple and easy to implement, performs as well in high-dimensional settings as in low-dimensional ones, and is a good competitor to existing approaches across the simulation settings considered. The performance of the different variable selection techniques is also examined on a real dataset with missing values.
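The general idea of pooling coefficient magnitudes across imputed datasets can be sketched in a toy example. This is an illustrative simplification, not the authors' exact estimator: it uses crude stochastic mean/variance imputation in place of proper multiple imputation (e.g. MICE), ordinary least squares fits in place of a penalized constraint on the summed absolute coefficients, and an arbitrary threshold `tau`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data with 5 informative and 5 non-informative predictors.
n, p, m = 200, 10, 5  # m = number of imputed datasets
beta_true = np.array([2.0, -1.5, 1.0, 0.8, -1.2] + [0.0] * 5)
X_full = rng.normal(size=(n, p))
y = X_full @ beta_true + rng.normal(scale=0.5, size=n)

# Introduce ~10% missingness completely at random.
X_miss = X_full.copy()
X_miss[rng.random((n, p)) < 0.10] = np.nan

def impute_once(X, rng):
    """Crude stochastic imputation: draw each missing entry from a normal
    distribution with the observed column mean and standard deviation."""
    Xi = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        obs = col[~np.isnan(col)]
        miss = np.isnan(col)
        Xi[miss, j] = rng.normal(obs.mean(), obs.std(), size=miss.sum())
    return Xi

imputed = [impute_once(X_miss, rng) for _ in range(m)]

# Pool evidence across the m imputed datasets: accumulate the absolute
# parameter estimates of each candidate predictor over all m fits.
abs_sum = np.zeros(p)
for Xi in imputed:
    beta_hat, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    abs_sum += np.abs(beta_hat)

# Keep predictors whose pooled magnitude exceeds a threshold (hypothetical
# choice here; the paper instead imposes a constraint on this sum).
tau = 0.5 * m
selected = np.where(abs_sum > tau)[0].tolist()
print(selected)
```

In this setup the pooled magnitudes of the five informative predictors are well above the threshold while the noise predictors stay near zero, so the toy rule recovers the informative set.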




Updated: 2019-11-03