Implementing Monte Carlo Tests with P‐value Buckets, Scandinavian Journal of Statistics


Implementing Monte Carlo Tests with P‐value Buckets
Scandinavian Journal of Statistics (IF 1), Pub Date: 2019-12-17, DOI: 10.1111/sjos.12434
Axel Gandy, Georg Hahn, Dong Ding

Software packages usually report the results of statistical tests using p-values. Users often interpret these by comparing them to standard thresholds, e.g. 0.1%, 1% and 5%, which is sometimes reinforced by a star rating (***, **, *). We consider an arbitrary statistical test whose p-value p is not available explicitly, but can be approximated by Monte Carlo samples, e.g. by bootstrap or permutation tests. The standard implementation of such tests usually draws a fixed number of samples to approximate p. However, the probability that the exact and the approximated p-value lie on different sides of a threshold (the resampling risk) can be high, particularly for p-values close to a threshold. We present a method to overcome this. We consider a finite set of user-specified intervals which cover [0,1] and which can be overlapping. We call these p-value buckets. We present algorithms that, with arbitrarily high probability, return a p-value bucket containing p. We prove that for both a bounded resampling risk and a finite runtime, overlapping buckets need to be employed, and that our methods both bound the resampling risk and guarantee a finite runtime for such overlapping buckets. To interpret decisions with overlapping buckets, we propose an extension of the star rating system. We demonstrate that our methods are suitable for use in standard software, including for low p-value thresholds occurring in multiple testing settings, and that they can be computationally more efficient than standard implementations.
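The idea of sampling until a confidence interval for p fits inside a single bucket can be illustrated with a minimal sketch. This is not the authors' algorithm: it substitutes a simple Hoeffding confidence interval with a union bound over the sample count, and the bucket endpoints, the function names (`confidence_interval`, `monte_carlo_bucket`), the error level `eps`, and the sample cap `max_samples` are all assumptions made for illustration. Buckets are treated as closed intervals for simplicity.

```python
import math
import random

# Hypothetical overlapping buckets covering [0, 1] (illustrative values only).
BUCKETS = [
    (0.0, 0.001), (0.0005, 0.002), (0.001, 0.01), (0.008, 0.012),
    (0.01, 0.05), (0.045, 0.055), (0.05, 1.0),
]


def confidence_interval(exceedances, n, eps):
    """Hoeffding interval for p after n samples.

    The per-step error eps_n = 6 * eps / (pi^2 * n^2) sums to eps over all n,
    so by a union bound all intervals cover p simultaneously with
    probability at least 1 - eps.
    """
    eps_n = 6.0 * eps / (math.pi ** 2 * n ** 2)
    half_width = math.sqrt(math.log(2.0 / eps_n) / (2.0 * n))
    p_hat = exceedances / n
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def monte_carlo_bucket(draw_exceedance, buckets=BUCKETS, eps=1e-3,
                       max_samples=100_000):
    """Sample until the interval lies inside one bucket.

    draw_exceedance() returns True when a resampled test statistic is at
    least as extreme as the observed one. Returns (bucket, samples_used);
    bucket is None if the sample cap is reached first.
    """
    exceedances = 0
    for n in range(1, max_samples + 1):
        exceedances += 1 if draw_exceedance() else 0
        lo, hi = confidence_interval(exceedances, n, eps)
        for b_lo, b_hi in buckets:
            if b_lo <= lo and hi <= b_hi:
                return (b_lo, b_hi), n
    return None, max_samples


if __name__ == "__main__":
    rng = random.Random(0)
    p_true = 0.3  # assumed p-value, used here only to simulate exceedances
    bucket, used = monte_carlo_bucket(lambda: rng.random() < p_true)
    print(bucket, used)
```

Because neighbouring buckets overlap, a p-value sitting exactly on a threshold such as 0.05 can still be captured by the bucket that straddles it, which is why a stopping rule of this kind can terminate. The Hoeffding bound is loose for very small p, so this sketch needs far more samples than a tighter interval would, and the cap `max_samples` is only a safeguard, not the finite-runtime guarantee proved in the paper.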


Updated: 2019-12-17
