Understanding the Under-Coverage Bias in Uncertainty Estimation, arXiv - CS - Machine Learning



arXiv - CS - Machine Learning, Pub Date: 2021-06-10, DOI: arxiv-2106.05515
Yu Bai, Song Mei, Huan Wang, Caiming Xiong

Estimating the data uncertainty in regression tasks is often done by learning a quantile function or a prediction interval of the true label conditioned on the input. It is frequently observed that quantile regression (a vanilla algorithm for learning quantiles with asymptotic guarantees) tends to under-cover in practice, that is, to achieve coverage below the desired level. While various fixes have been proposed, a more fundamental understanding of why this under-coverage bias happens in the first place remains elusive. In this paper, we present a rigorous theoretical study of the coverage of uncertainty estimation algorithms in learning quantiles. We prove that quantile regression suffers from an inherent under-coverage bias in a vanilla setting where we learn a realizable linear quantile function and there is more data than parameters. More quantitatively, for $\alpha>0.5$ and small $d/n$, the $\alpha$-quantile learned by quantile regression roughly achieves coverage $\alpha - (\alpha-1/2)\cdot d/n$ regardless of the noise distribution, where $d$ is the input dimension and $n$ is the number of training examples. Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error that is not implied by existing theories on quantile regression. Experiments on simulated and real data verify our theory and further illustrate the effect of various factors, such as sample size and model capacity, on the under-coverage bias in more practical setups.
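To make the size of the bias concrete, the approximate coverage formula from the abstract, $\alpha - (\alpha-1/2)\cdot d/n$, can be evaluated directly. The following is only a minimal sketch of that arithmetic, not the authors' code: the function names `predicted_coverage` and `coverage_gap` are hypothetical, and the input checks simply encode the abstract's stated regime ($\alpha>0.5$, more data than parameters). The formula itself is described as a rough approximation for small $d/n$.

```python
def predicted_coverage(alpha: float, d: int, n: int) -> float:
    """Approximate coverage of the learned alpha-quantile, per the abstract:
    alpha - (alpha - 1/2) * d / n.

    Hypothetical helper: the name and the input checks are assumptions
    for illustration, not part of the paper.
    """
    if not 0.5 < alpha < 1.0:
        raise ValueError("the abstract states the formula for alpha > 0.5")
    if d <= 0 or n <= d:
        raise ValueError("the setting assumes more data than parameters (n > d > 0)")
    return alpha - (alpha - 0.5) * d / n


def coverage_gap(alpha: float, d: int, n: int) -> float:
    """Shortfall of the approximate coverage relative to the nominal level alpha."""
    return alpha - predicted_coverage(alpha, d, n)


if __name__ == "__main__":
    # Illustrative values only; these are not results reported in the paper.
    for d, n in [(10, 100), (10, 1000), (50, 1000)]:
        cov = predicted_coverage(0.9, d, n)
        print(f"alpha=0.9, d={d}, n={n}: approx. coverage {cov:.4f}")
```

Under this approximation the shortfall grows linearly in $d/n$ and in $\alpha - 1/2$, so the more extreme the target quantile and the fewer the samples per dimension, the larger the predicted under-coverage.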


Updated: 2021-06-11
