High-dimensional regression with potential prior information on variable importance
Statistics and Computing (IF 1.6) Pub Date: 2022-06-14, DOI: 10.1007/s11222-022-10110-5
Benjamin G. Stokell , Rajen D. Shah

There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression. Examples include the ordering of the variables given by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of the variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that when ridge regression is used, the computational cost of fitting all models is no more than that of a single ridge regression fit, and we describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation, and provide a general result on the quality, measured on a test set, of the best-performing estimator selected from among a number M of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a \(\log M\) price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and in time series settings. An R package is available on GitHub.
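The scheme described above can be sketched in a few lines of R. The following is a minimal illustration, not the authors' package: it fits a Lasso (via glmnet) to nested sets of variables taken in a hypothetical prior ordering (here, by empirical variance, one of the examples mentioned in the abstract) and selects the model size by cross-validation. Unlike the strategy described in the paper, this naive version refits each model from scratch rather than reusing previous fits.

library(glmnet)

set.seed(1)
n <- 200; p <- 400
# Simulated data in which the first 10 variables matter and also have
# larger scale, so the empirical-variance ordering is weakly informative
scales <- c(rep(3, 10), rep(1, p - 10))
x <- sweep(matrix(rnorm(n * p), n, p), 2, scales, `*`)
beta <- c(rnorm(10), rep(0, p - 10))
y <- as.numeric(x %*% beta) + rnorm(n)

# Hypothetical prior ordering on variable importance:
# rank the columns by their empirical variances
ordering <- order(apply(x, 2, var), decreasing = TRUE)

# Candidate model sizes k along the ordering
ks <- unique(round(seq(5, p, length.out = 15)))

# For each k, fit the Lasso to the first k variables in the ordering
# and record the best cross-validated error
cv_err <- sapply(ks, function(k) {
  vars <- ordering[seq_len(k)]
  fit <- cv.glmnet(x[, vars, drop = FALSE], y, alpha = 1)
  min(fit$cvm)
})

# Final model size chosen by cross-validation
best_k <- ks[which.min(cv_err)]

With ridge regression in place of the Lasso (alpha = 0 in glmnet), the paper shows the whole sequence of fits can be obtained at the cost of a single one; the sketch above ignores this and simply refits for each k.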

Updated: 2022-06-15