Bias Reduction for Sum Estimation
arXiv - MATH - Statistics Theory. Pub Date: 2022-08-02, DOI: arxiv-2208.01197
Talya Eden, Jakob Bæk Tejs Houen, Shyam Narayanan, Will Rosenbaum, Jakub Tětek

In classical statistics and distribution testing, it is often assumed that elements can be sampled from some distribution $P$, and that when an element $x$ is sampled, the probability $P(x)$ of sampling $x$ is also known. Recent work in distribution testing has shown that many algorithms are robust in the sense that they still produce correct output if the elements are drawn from any distribution $Q$ that is sufficiently close to $P$. This phenomenon raises interesting questions: under what conditions is a "noisy" distribution $Q$ sufficient, and what is the algorithmic cost of coping with this noise? We investigate these questions for the problem of estimating the sum of a multiset of $N$ real values $x_1, \ldots, x_N$. This problem is well-studied in the statistical literature in the case $P = Q$, where the Hansen-Hurwitz estimator is frequently used. We assume that for some known distribution $P$, values are sampled from a distribution $Q$ that is pointwise $\gamma$-close to $P$; that is, $(1-\gamma)P(i) \le Q(i) \le (1+\gamma)P(i)$ for all $i$. For every positive integer $k$ we define an estimator $\zeta_k$ for $\mu = \sum_i x_i$ whose bias is proportional to $\gamma^k$ (our $\zeta_1$ reduces to the classical Hansen-Hurwitz estimator). As a special case, we show that if $Q$ is pointwise $\gamma$-close to uniform and all $x_i \in \{0, 1\}$, then for any $\epsilon > 0$ we can estimate $\mu$ to within additive error $\epsilon N$ using $m = \Theta(N^{1-\frac{1}{k}} / \epsilon^{2/k})$ samples, where $k = \left\lceil (\log \epsilon)/(\log \gamma)\right\rceil$. We show that this sample complexity is essentially optimal. Our bounds show that the sample complexity need not vary uniformly with the desired error parameter $\epsilon$: for some values of $\epsilon$, perturbations in its value have no asymptotic effect on the sample complexity, while for other values, any decrease in its value results in an asymptotically larger sample complexity.
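As a concrete illustration of the $k = 1$ case, the following Python sketch implements the classical Hansen-Hurwitz estimator $\zeta_1 = \frac{1}{m}\sum_{j=1}^{m} x_{i_j}/P(i_j)$ when samples are actually drawn from a perturbed distribution $Q$. The function name, the uniform choice of $P$, and the particular perturbation used to build $Q$ are illustrative assumptions, not taken from the paper.

```python
import random

def hansen_hurwitz(values, P, Q, m):
    """Hansen-Hurwitz estimate of mu = sum(values) from m samples.

    Indices are drawn with the actual (possibly noisy) probabilities Q,
    but each sample is weighted by the assumed probability P[i]. With
    Q == P the estimator is unbiased; if Q is only pointwise
    gamma-close to P, the bias is proportional to gamma (this is the
    k = 1 estimator zeta_1 from the abstract).
    """
    samples = random.choices(range(len(values)), weights=Q, k=m)
    return sum(values[i] / P[i] for i in samples) / m

# Toy instance: N binary values, P uniform, Q a gamma-perturbation of P.
# random.choices treats weights as relative, so Q need not be renormalized.
N, gamma, m = 1000, 0.05, 500
values = [random.randint(0, 1) for _ in range(N)]
P = [1 / N] * N
Q = [p * (1 + random.uniform(-gamma, gamma)) for p in P]

print("true sum:", sum(values))
print("estimate:", round(hansen_hurwitz(values, P, Q, m), 1))
```

With $Q = P$ the estimate is unbiased; with $Q$ merely pointwise $\gamma$-close to $P$, the bias scales with $\gamma$, which is the effect the higher-order estimators $\zeta_k$ suppress to order $\gamma^k$. Plugging numbers into the stated sample bound: for $\gamma = 0.1$ and $\epsilon = 0.01$, $k = \lceil \log(0.01)/\log(0.1) \rceil = 2$, giving $m = \Theta(N^{1/2}/\epsilon)$ samples.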

Last updated: 2022-08-03