Automatically Identifying Relevant Variables for Linear Regression with the Lasso Method: A Methodological Primer for its Application with R and a Performance Contrast Simulation with Alternative Selection Strategies
Communication Methods and Measures (IF 11.4), Pub Date: 2019-10-28, DOI: 10.1080/19312458.2019.1677882
Sebastian Scherr, Jing Zhou

ABSTRACT The abundance of available digital big data has created new challenges in identifying relevant variables for regression models. One statistical problem that has gained relevance in the era of big data is high-dimensional statistical inference, in which the number of variables greatly exceeds the number of observations. Typically, prediction errors in linear regression skyrocket when the number of included variables approaches the number of observations, and ordinary least squares (OLS) regression no longer works in a high-dimensional scenario. Regularized estimators offer a feasible solution; among them is the Least Absolute Shrinkage and Selection Operator (Lasso), which we introduce to communication scholars here. We cover the statistical background of this technique, which performs estimation and variable selection simultaneously and helps identify relevant variables for regression models in high-dimensional scenarios. We contrast the Lasso with two alternative strategies for selecting variables for regression models: a theory-based "subset selection" of variables and a nonselective "all in" strategy. The simulation shows that the Lasso produces lower and relatively more stable prediction errors than the two alternative variable selection strategies, and we therefore recommend its use, especially in the high-dimensional settings typical of big data analysis.
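The scenario described in the abstract can be illustrated with a minimal simulation sketch. The paper applies the Lasso in R; the sketch below uses Python with scikit-learn instead, and all data, dimensions, and coefficient values are illustrative assumptions, not taken from the authors' simulation. It contrasts a cross-validated Lasso with a nonselective "all in" OLS fit when predictors outnumber observations:

```python
# Illustrative sketch (Python/scikit-learn, not the paper's R code):
# Lasso vs. "all in" OLS in a high-dimensional setting (p >> n).
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 50, 200                        # 50 observations, 200 candidate predictors
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only 5 truly relevant variables (assumed)
y = X @ beta + rng.standard_normal(n)

# Held-out data to measure prediction error
X_new = rng.standard_normal((n, p))
y_new = X_new @ beta + rng.standard_normal(n)

# Lasso with cross-validated penalty: estimation and variable
# selection happen in one step (coefficients shrunk exactly to zero drop out)
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
lasso_mse = mean_squared_error(y_new, lasso.predict(X_new))
print(f"Lasso kept {len(selected)} of {p} variables, test MSE = {lasso_mse:.2f}")

# "All in" OLS: with p > n the least-squares fit interpolates the
# training data and its out-of-sample prediction error explodes
ols = LinearRegression().fit(X, y)
ols_mse = mean_squared_error(y_new, ols.predict(X_new))
print(f"OLS test MSE = {ols_mse:.2f}")
```

With a sparse true model, the Lasso recovers a small subset of the candidate predictors and yields a far lower out-of-sample error than the degenerate OLS fit, which is the qualitative pattern the abstract reports.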
