The Price of Competition: Effect Size Heterogeneity Matters in High Dimensions
arXiv - CS - Information Theory Pub Date : 2020-07-01 , DOI: arxiv-2007.00566
Hua Wang, Yachong Yang, Weijie J. Su

In high-dimensional linear regression, would increasing effect sizes always improve model selection while all other conditions are held unchanged (in particular, with the sparsity of the regression coefficients fixed)? In this paper, we answer this question in the negative in the regime of linear sparsity for the Lasso method, by introducing a new notion that we term effect size heterogeneity. Roughly speaking, a regression coefficient vector has high effect size heterogeneity if its nonzero entries have significantly different magnitudes. From the viewpoint of this new measure, we prove that the false and true positive rates achieve the optimal trade-off uniformly along the Lasso path when this measure is maximal in a certain sense, and that the worst trade-off is achieved when it is minimal, in the sense that all nonzero effect sizes are roughly equal. Moreover, we demonstrate that the first false selection occurs much earlier when effect size heterogeneity is minimal than when it is maximal. The underlying cause of these two phenomena is, metaphorically speaking, the "competition" among variables with effect sizes of the same magnitude in entering the model. Taken together, our findings suggest that effect size heterogeneity should serve as an important complementary measure to the sparsity of regression coefficients in the analysis of high-dimensional regression problems. Our proofs use techniques from approximate message passing theory as well as a novel technique for estimating the rank of the first false variable.
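The quantities named in the abstract can be made concrete with a small sketch. The definitions below are the conventional ones for a single point on a selection path and are an assumption here, since the abstract does not state them; the paper's exact definitions may differ. The function names `tpp_fdp` and `rank_of_first_false` and the example indices are hypothetical.

```python
def tpp_fdp(selected, true_support):
    """Return (true positive proportion, false discovery proportion).

    Assumed conventional definitions, not taken from the abstract:
      TPP = (# selected true variables) / max(# true variables, 1)
      FDP = (# selected null variables) / max(# selected variables, 1)
    """
    selected = set(selected)
    true_support = set(true_support)
    true_pos = len(selected & true_support)
    false_pos = len(selected - true_support)
    tpp = true_pos / max(len(true_support), 1)
    fdp = false_pos / max(len(selected), 1)
    return tpp, fdp


def rank_of_first_false(entry_order, true_support):
    """Return the 1-based position of the first null variable in the
    order in which variables enter the model, or None if none enters."""
    true_support = set(true_support)
    for rank, index in enumerate(entry_order, start=1):
        if index not in true_support:
            return rank
    return None


# Hypothetical example: variables 0..3 are the true signals.
support = {0, 1, 2, 3}
tpp, fdp = tpp_fdp(selected=[0, 1, 5], true_support=support)
first_false = rank_of_first_false(entry_order=[2, 0, 7, 1], true_support=support)
```

In this toy example two of the four signals are selected alongside one null variable, and the first null variable enters third. An earlier first false selection corresponds to a smaller value of `rank_of_first_false`.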

Updated: 2020-07-06