Seconder of the vote of thanks to Glenn Shafer and contribution to the Discussion of ‘Testing by betting: A strategy for statistical and scientific communication’
The Journal of the Royal Statistical Society, Series A (Statistics in Society) (IF 2), Pub Date: 2021-05-05, DOI: 10.1111/rssa.12649
Frank P. A. Coolen

Professor Shafer’s paper, like so much of his work, has taught me interesting new aspects of statistical inference, set in substantial historical context. Testing a probability distribution by betting is simple and powerful, and the betting interpretation is natural for sequential testing of hypotheses. The fact that the betting score is a likelihood ratio, and the implied targets, are very interesting, as is the further justification for the Neyman–Pearson theory in the case of a one‐off hypothesis test.

My main query is about application. In practice, there is often no clear random phenomenon Y of interest but rather, for example, a wish to show that one method is better than another. Translating the practical question into a statistical scenario is important, because the choice of Y can greatly influence the apparent conclusion. One typically designs an experiment resulting in a sample, and summaries of the sample observations may be of interest. Suppose one observes $Y_i$, $i = 1, \ldots, n$, and claims that these are observations of independent random quantities from distribution P. Let an alternative distribution Q be proposed, with the same mean as P but larger variance. One could use a test on the $Y_i$’s or on the mean $\bar{Y}$. If the data mean is close to the common mean of P and Q, but the data variance is large, then the bet on $\bar{Y}$ is likely to support P while the bet on the $Y_i$’s is likely to support Q. An experienced statistician will understand that these two tests are different (similar to Example 4 in the paper), but others may be confused. This raises the question of who decides on the choice of Y. Is it possible, perhaps by using the implied targets, to find a statistic Y such that P and Q can be distinguished in some optimal manner?
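The contrast between the two bets can be illustrated with a small numerical sketch. Everything specific in it is an assumption chosen for illustration and does not come from the paper or this discussion: P is taken to be N(0, 1), Q to be N(0, 2²), the seven data values are made up so that their mean is exactly 0 while their spread is large, and the betting score is computed as the likelihood ratio of Q to P, so a score above 1 counts against P.

```python
import math


def normal_logpdf(y, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (y - mean) ** 2 / (2 * sd ** 2)


def log_score_observations(ys, mean, sd_p, sd_q):
    """Log likelihood ratio Q/P when betting on each individual observation."""
    return sum(
        normal_logpdf(y, mean, sd_q) - normal_logpdf(y, mean, sd_p) for y in ys
    )


def log_score_mean(ys, mean, sd_p, sd_q):
    """Log likelihood ratio Q/P when betting only on the sample mean."""
    n = len(ys)
    ybar = sum(ys) / n
    # The mean of n independent normals has standard deviation sd / sqrt(n).
    return normal_logpdf(ybar, mean, sd_q / math.sqrt(n)) - normal_logpdf(
        ybar, mean, sd_p / math.sqrt(n)
    )


# Hypothetical data: mean exactly 0, but widely spread.
ys = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

score_obs = math.exp(log_score_observations(ys, 0.0, 1.0, 2.0))
score_mean = math.exp(log_score_mean(ys, 0.0, 1.0, 2.0))

print(f"betting score on the individual observations: {score_obs:.1f}")
print(f"betting score on the sample mean:             {score_mean:.2f}")
```

With these assumed inputs, the score for the bet on the individual observations is far above 1, whereas the score for the bet on the mean is 0.5, so the same data discredit P under one choice of Y and not under the other.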

I note that Y must be fully known in order to apply the theory, which means that the design of the experiment or the way data are collected must be known; how else can one assign a meaningful P and S (or Q)? This is important not only for small‐scale experiments, but also in applications with very large amounts of data, and I fear it is easily overlooked if statistical methods are applied without care. Elicitation of probability distributions based on expert judgements is difficult, but the fact that Y needs to be observable (or a function of observables) will be helpful. It will be important to see whether S can be elicited, or whether a focus on Q works better to formulate the alternative to P.

Elicitation reminds me of my first statistics application, which concerned the reliability of heat exchangers (Coolen et al., 1992). Inspection planning required expert judgements as inputs, and the available experts’ opinions varied substantially. I used linear opinion pooling in a Bayesian setting, learning about the levels of expertise as data became available. The experts started off with equal weights, which were then updated to become proportional to the probabilities of the observed data according to the experts (the denominator in Bayes’ theorem). This has similarities with the probabilities of the data being used to distinguish between the claimed hypothesis (P) and the alternative (Q) in testing by betting. These weights had the property that it did not matter for the overall inference, based on the linearly pooled expert opinions, whether the opinions were first pooled and then updated or vice versa. The decision maker could interpret these weights as proportions of an overall budget to assign to the experts. The work did not consider betting scores or hypothesis testing, but I believe there are links to the use of expert judgement in decision problems that could be explored.
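The invariance property described above can be checked in a minimal sketch. It is not the 1992 heat exchanger model: the three-point grid of failure probabilities, the three expert priors, the binomial likelihood and the observed counts are all hypothetical, chosen only to show that pooling then updating agrees with updating then pooling once the weights are rescaled by each expert's marginal probability of the data.

```python
from math import comb

# Hypothetical setting: theta is a failure probability on a small grid,
# and each expert supplies a prior distribution over that grid.
thetas = [0.1, 0.3, 0.5]
experts = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
weights = [1 / 3, 1 / 3, 1 / 3]  # equal starting weights


def likelihood(theta, failures, n):
    """Binomial probability of the observed number of failures."""
    return comb(n, failures) * theta ** failures * (1 - theta) ** (n - failures)


def update(prior, lik):
    """Bayes' theorem on the grid; returns the posterior and the marginal
    probability of the data (the denominator in Bayes' theorem)."""
    joint = [p * l for p, l in zip(prior, lik)]
    marginal = sum(joint)
    return [j / marginal for j in joint], marginal


def pool(dists, wts):
    """Linear opinion pool: weighted average of the distributions."""
    return [sum(w * d[i] for w, d in zip(wts, dists)) for i in range(len(dists[0]))]


# Hypothetical data: 4 failures in 10 inspections.
lik = [likelihood(t, 4, 10) for t in thetas]

# Route 1: pool the priors first, then update the pooled prior.
pooled_then_updated, _ = update(pool(experts, weights), lik)

# Route 2: update each expert, reweight by marginal data probability, then pool.
posteriors, marginals = zip(*(update(e, lik) for e in experts))
total = sum(w * m for w, m in zip(weights, marginals))
new_weights = [w * m / total for w, m in zip(weights, marginals)]
updated_then_pooled = pool(posteriors, new_weights)

print("updated weights:", [round(w, 3) for w in new_weights])
print("routes agree:", all(
    abs(a - b) < 1e-12 for a, b in zip(pooled_then_updated, updated_then_pooled)
))
```

The ratio of two experts' updated weights is the ratio of the probabilities they assigned to the observed data, which is the same kind of quantity as the betting score of one distribution against another.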

Professor Shafer begins the paper with the statement that the p‐value is too complicated for effective communication to a wide audience. This has received increased attention in recent years as part of the discussion about repeatability and reproducibility of experiments (Atmanspacher & Maasen, 2016; Goodman, 1992; Senn, 2002). In recent work, we have considered reproducibility of hypothesis tests from a non‐parametric predictive inference perspective (Augustin & Coolen, 2004; Coolen & Bin Himd, 2014; Coolen & Marques, 2020), which shows that the implicit statistical reproducibility is often poor, in particular for multi‐group tests with one‐sided alternatives. This issue will not be resolved using testing by betting, as it is a direct consequence of the dichotomous nature of such tests. Of course, one would like to overcome this problem by gathering more data when needed, that is, if the test criterion is close to the decision borderline; the newly proposed methods may be useful here.

I am not sure whether testing by betting will make statistics easier. I believe that statistics requires expertise, as it is a very challenging topic bridging pure mathematics and real‐world applications in many fields, as nicely discussed by Hampel (1998), who also set out to develop an objective theory of ‘successful betting’ (Hampel, 2001). In this paper, Professor Shafer has presented a useful new tool for expert mathematical statisticians, which requires further development for practical implementation. It gives me great pleasure to second the vote of thanks.




Updated: 2021-05-05