Seconder of the vote of thanks to Glenn Shafer and contribution to the Discussion of ‘Testing by betting: A strategy for statistical and scientific communication’
Journal of the Royal Statistical Society, Series A (Statistics in Society) (IF 2). Pub Date: 2021-05-05. DOI: 10.1111/rssa.12649
Frank P. A. Coolen

Professor Shafer’s paper, like so much of his work, has taught me interesting new aspects of statistical inference with substantial historical context. Testing a probability distribution by betting is simple and powerful, and the betting interpretation is natural for sequential testing of hypotheses. The fact that the betting score is a likelihood ratio is very interesting, as are the implied targets and the further justification for the Neyman–Pearson theory in the case of a one‐off hypothesis test.

My main query is about application. In practice, there is often no clear random phenomenon Y of interest but rather, for example, a wish to show that one method is better than another. Translating the practical question into a statistical scenario is important, because the choice of Y can greatly influence the apparent conclusion. One typically designs an experiment resulting in a sample, and summaries of the sample observations may be of interest. Suppose one observes $Y_i$, $i = 1, \ldots, n$, and claims that these are observations of independent random quantities from distribution P. Let an alternative distribution Q be proposed, with the same mean as P but larger variance. One could use a test on the $Y_i$’s or on the mean $\bar{Y}$. If the data mean is close to the common mean of P and Q, but the data variance is large, then the bet on $\bar{Y}$ is likely to support P while the bet on the $Y_i$’s is likely to support Q. An experienced statistician will understand that these two tests are different (similar to Example 4 in the paper), but others may be confused. This raises the question of who decides on the choice of Y. Is it possible, perhaps by using the implied targets, to find a statistic Y such that P and Q can be distinguished in some optimal manner?
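The contrast between the two bets can be illustrated numerically. The sketch below is only an illustration under assumptions that are not in the discussion itself: P is taken to be N(0, 1), Q to be N(0, 4), and the ten data values are made up so that the sample mean is exactly 0 while the spread is large. The betting score is computed as the likelihood ratio Q/P, once for the individual observations and once for the sample mean; the function names are hypothetical.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def score_on_observations(ys, sd_p, sd_q, mean=0.0):
    """Betting score Q/P for the full sample, assuming independent observations."""
    log_score = sum(
        normal_logpdf(y, mean, sd_q) - normal_logpdf(y, mean, sd_p) for y in ys
    )
    return math.exp(log_score)


def score_on_mean(ys, sd_p, sd_q, mean=0.0):
    """Betting score Q/P when only the sample mean is used as the statistic Y."""
    n = len(ys)
    ybar = sum(ys) / n
    scale = math.sqrt(n)  # the sample mean has standard deviation sd / sqrt(n)
    log_score = normal_logpdf(ybar, mean, sd_q / scale) - normal_logpdf(
        ybar, mean, sd_p / scale
    )
    return math.exp(log_score)


# Made-up data: mean exactly 0, but values far more spread out than N(0, 1) suggests.
ys = [-2.0, 2.0] * 5

obs_score = score_on_observations(ys, sd_p=1.0, sd_q=2.0)
mean_score = score_on_mean(ys, sd_p=1.0, sd_q=2.0)

print(f"score from the individual observations: {obs_score:.1f}")
print(f"score from the sample mean:             {mean_score:.3f}")
```

With these assumed values, the bet on the sample mean returns a score below 1 (the bettor against P loses money, so P is not discredited), while the bet on the individual observations multiplies the stake by a large factor. The same data thus appear to support P or Q depending on which Y was chosen.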

I note that Y must be fully known in order to apply the theory, which means that the design of the experiment, or the way the data are collected, must be known; how otherwise can one assign a meaningful P and S (or Q)? This is important not only for small‐scale experiments but also in applications with very large amounts of data, and I fear it is easily overlooked if statistical methods are applied without care. Elicitation of probability distributions based on expert judgements is difficult, but the fact that Y needs to be observable (or a function of observables) will be helpful. It will be important to see whether S can be elicited, or whether a focus on Q works better to formulate the alternative to P.

Elicitation reminds me of my first statistics application, which considered the reliability of heat exchangers (Coolen et al., 1992). Inspection planning required expert judgements as inputs, and the available experts’ opinions varied substantially. I used linear opinion pooling in a Bayesian setting, learning about the levels of expertise when data became available. The experts started off with equal weights, which were then updated to become proportional to the probabilities of the observed data according to the experts (the denominator in Bayes’ theorem). This has similarities with the way the probabilities of the data are used to distinguish between the claimed hypothesis (P) and the alternative (Q) in testing by betting. These weights had the property that it did not matter for the overall inference, based on the linearly pooled expert opinions, whether the opinions were first pooled and then updated or vice versa. The decision maker could interpret these weights as proportions of the overall budget to assign to the experts. The work did not consider betting scores or hypothesis testing, but I believe there are links to the use of expert judgement in decision problems that could be explored.
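The invariance property of these weights can be checked on a toy case. The following sketch is not the model of Coolen et al. (1992); it assumes two hypothetical experts who give discrete prior distributions for a failure probability on a five-point grid, binomial data (2 failures in 10 inspections), and equal starting weights. All numbers are invented for illustration.

```python
import math


def binomial_likelihood(theta, k, n):
    """Probability of k failures in n trials when the failure probability is theta."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)


# Hypothetical grid of failure probabilities and two invented expert priors on it.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
priors = [
    [0.40, 0.30, 0.20, 0.05, 0.05],  # expert 1 expects few failures
    [0.05, 0.05, 0.20, 0.30, 0.40],  # expert 2 expects many failures
]
weights = [0.5, 0.5]  # equal starting weights
k, n = 2, 10  # assumed data: 2 failures in 10 inspections
lik = [binomial_likelihood(t, k, n) for t in thetas]

# Route 1: pool the priors linearly, then apply Bayes' theorem once.
pooled_prior = [
    sum(w * p[i] for w, p in zip(weights, priors)) for i in range(len(thetas))
]
unnormalised = [pp * l for pp, l in zip(pooled_prior, lik)]
total = sum(unnormalised)
pool_then_update = [u / total for u in unnormalised]

# Route 2: update each expert separately, then pool with updated weights.
# Each expert's marginal probability of the data is the denominator in Bayes' theorem.
marginals = [sum(p_i * l for p_i, l in zip(p, lik)) for p in priors]
posteriors = [
    [p_i * l / m for p_i, l in zip(p, lik)] for p, m in zip(priors, marginals)
]
raw_weights = [w * m for w, m in zip(weights, marginals)]
new_weights = [r / sum(raw_weights) for r in raw_weights]
update_then_pool = [
    sum(w * post[i] for w, post in zip(new_weights, posteriors))
    for i in range(len(thetas))
]

print("updated weights:", [round(w, 3) for w in new_weights])
print("routes agree:", all(
    math.isclose(a, b) for a, b in zip(pool_then_update, update_then_pool)
))
```

In this toy case the expert whose prior gave the observed data higher probability gains weight, and both routes give the same pooled posterior. The ratio of the two experts' marginal probabilities of the data is a likelihood ratio, which is one way to read the similarity with betting scores mentioned above.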

Professor Shafer begins the paper with the statement that the p‐value is too complicated for effective communication to a wide audience. This has received increased attention in recent years as part of the discussion about repeatability and reproducibility of experiments (Atmanspacher & Maasen, 2016; Goodman, 1992; Senn, 2002). In recent work, we have considered reproducibility of hypothesis tests from a non‐parametric predictive inference perspective (Augustin & Coolen, 2004; Coolen & Bin Himd, 2014; Coolen & Marques, 2020), which shows that the implicit statistical reproducibility is often poor, in particular for multi‐group tests with one‐sided alternatives. This issue will not be resolved by testing by betting, as it is a direct consequence of the dichotomous nature of such tests. Of course, one would like to overcome this problem by gathering more data when needed, that is, when the test criterion is close to the decision borderline; the newly proposed methods may be useful here.

I am not sure whether testing by betting will make statistics easier. I believe that statistics requires expertise, as it is a very challenging topic bridging pure mathematics and real‐world applications in many fields, as nicely discussed by Hampel (1998), who also set out to develop an objective theory of ‘successful betting’ (Hampel, 2001). In this paper, Professor Shafer has presented a useful new tool for expert mathematical statisticians, which requires further development for practical implementation. It gives me great pleasure to second the vote of thanks.



