Abstract
In this paper we combine the theory of probability aggregation with results from machine learning theory concerning the optimality of predictions under expert advice. In probability aggregation theory several characterization results for linear aggregation exist. In linear aggregation, however, the weights are not fixed but remain free parameters. We show how fixing such weights via success-based scores, a generalization of Brier scoring, allows the mentioned optimality results to be transferred to the case of probability aggregation.
References
Arrow, K.J.: Social Choice and Individual Values, 2nd edn. Yale University Press, New Haven (1963)
Brier, G.W.: Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78(1), 1–3 (1950)
Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)
Dietrich, F., Endriss, U., Grossi, D., Pigozzi, G., Slavkovik, M.: JA4AI – judgment aggregation for artificial intelligence (Dagstuhl Seminar 14202). Dagstuhl Reports 4(5), 27–39 (2014). https://doi.org/10.4230/DagRep.4.5.27. http://drops.dagstuhl.de/opus/volltexte/2014/4679
Feldbacher-Escamilla, C.J.: An optimality-argument for equal weighting. Synthese (2018). https://doi.org/10.1007/s11229-018-02028-1
Genest, C., McConway, K.J.: Allocating the weights in the linear opinion pool. J. Forecast. 9(1), 53–73 (1990). https://doi.org/10.1002/for.3980090106
Genest, C., McConway, K.J., Schervish, M.J.: Characterization of externally Bayesian pooling operators. Ann. Stat. 14(2), 487–501 (1986). https://doi.org/10.1214/aos/1176349934
Genest, C., Zidek, J. V.: Combining probability distributions: a critique and an annotated bibliography. Stat. Sci. 1(1), 114–135 (1986)
Grossi, D., Pigozzi, G.: Judgment Aggregation: a Primer. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, Williston (2014)
Kornhauser, L.A., Sager, L.G.: Unpacking the court. Yale Law J. 96(1), 82–117 (1986). http://www.jstor.org/stable/796436
Lehrer, K., Wagner, C.: Rational Consensus in Science and Society. A Philosophical and Mathematical Study. Reidel Publishing Company, Dordrecht (1981)
List, C., Pettit, P.: Aggregating sets of judgments: an impossibility result. Econ. Philos. 18(01), 89–110 (2002)
Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. The MIT Press, Cambridge (2012)
Rossi, F., Venable, K. B., Walsh, T.: A Short Introduction to Preferences. Between Artificial Intelligence and Social Choice. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, Williston (2011)
Schurz, G.: The meta-inductivist’s winning strategy in the prediction game: a new approach to Hume’s problem. Philos. Sci. 75(3), 278–305 (2008)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning. From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Appendix
Here we provide a proof of Theorem 2, which is a slight expansion of the proof provided in [5], itself loosely based on a proof provided in [16, p.253]. The main strategy of the proof is to apply inequalities such that the difference of the success rates is tightly bounded. As we demonstrate now, success-based weighting allows for an optimal bound in the sense that, in the limit, such weighting cannot be outperformed by any other inference method in terms of the success rate.
Proof
In order to prove the no-regret property of the aggregating method Paggr, we bound the difference between the success rates of the competing predictors and that of the aggregating predictor with the help of a learning parameter η, chosen as a function of the number of rounds in such a way that the cumulative difference grows only sublinearly in t. If such a characterisation succeeds, then the difference of the average success rates vanishes in the limit; this means that the aggregating predictor is shown not to be outperformed by any other predictor in the limit. As it turns out, one can establish such a bound by choosing \(\eta =\sqrt {\frac {2\cdot \ln (n)}{T}}\). Here T is an arbitrary round, sometimes also called the prediction horizon, up to which the bound is proven [3, p.15]. In order to generalise this bound to any round t, one needs, in a second step, to get rid of the exact choice of T by employing the so-called doubling trick: whenever the current prediction horizon T is reached, the forecaster restarts with the horizon doubled. This increases the bound by a constant factor, but does not change anything regarding the limiting case, and hence allows for proving a general optimality result too. In the following proof we demonstrate the first part (for an arbitrary but fixed T); the second part, applying the doubling trick, can be recapitulated with the help of [13, p.158].
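The choice of η and the doubling trick described above can be illustrated with a small sketch (the function names are our own, hypothetical, and not part of the paper's formalism): whenever the current prediction horizon is exhausted, the forecaster restarts with the horizon, and hence η, recomputed for twice as many rounds.

```python
import math

def eta_for_horizon(n, T):
    """Learning rate eta = sqrt(2 * ln(n) / T) for n predictors and horizon T."""
    return math.sqrt(2 * math.log(n) / T)

def doubling_schedule(t):
    """Doubling trick: run with horizons 1, 2, 4, ... until round t is covered."""
    schedule, T, covered = [], 1, 0
    while covered < t:
        schedule.append(T)
        covered += T
        T *= 2
    return schedule

# Covering, e.g., t = 100 rounds means restarting with horizons 1, 2, ..., 64,
# each restart using the eta recomputed for its own horizon.
```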
-
i.
Recall from Sections 3 and 4 that the probabilistic aggregation method we are aiming at is defined as the weighted (wi,t) average of the individual predictions (Pi,t), where the weights are a function of the per-round successes si,t, and the latter are defined as the “inverse” (within the unit interval) of the losses l(Pi,t(v),valt(v)).
-
ii.
Let \(\eta =\sqrt {\frac {2\cdot \ln (n)}{T}}\). Furthermore, let l be convex in its first argument. Let us also restate the weights \(w^{av}_{i,t}\) recursively via defining coefficients c: Let ci,1 (for 1 ≤ i ≤ n) be 1. Then define recursively \(c_{i,t+1}=c_{i,t}\cdot e^{-\eta \cdot {\sum }_{m=1}^{k}l^{m}_{i,t}/k}\), where \(l^{m}_{i,t}=l(P_{i,t}(v_{m}),val_{t}(v_{m}))\) is the loss of i at round t with respect to the prediction of value vm.
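This recursive weight definition can be sketched in a few lines (a minimal illustration with hypothetical names; `avg_losses` stands for the per-round averaged losses Σli,t):

```python
import math

def update_coefficients(c, avg_losses, eta):
    """One round of the recursion: c_{i,t+1} = c_{i,t} * exp(-eta * avg_loss_i)."""
    return [ci * math.exp(-eta * li) for ci, li in zip(c, avg_losses)]

def normalized_weights(c):
    """Weights w^av_{i,t}: the coefficients normalised to sum to 1."""
    total = sum(c)
    return [ci / total for ci in c]

# Hypothetical example: n = 3 predictors, horizon T = 100.
n, T = 3, 100
eta = math.sqrt(2 * math.log(n) / T)
c = [1.0] * n                                 # c_{i,1} = 1 for all i
c = update_coefficients(c, [0.2, 0.5, 0.9], eta)
w = normalized_weights(c)                     # lower loss -> higher weight
```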
-
iii.
By the definition of c we get the following equalities for the ratio of the denominators used to normalise the weights (the normalising denominator for t + 1 over that for t):
$$ \begin{array}{@{}rcl@{}} \frac{\sum\limits_{i=1}^{n} c_{i,t+1}}{\sum\limits_{j=1}^{n} c_{j,t}}&=&\sum\limits_{i=1}^{n}\frac{c_{i,t+1}}{\sum\limits_{j=1}^{n} c_{j,t}}=\sum\limits_{i=1}^{n}\frac{c_{i,t} \cdot e^{-\eta\cdot\sum\limits_{m=1}^{k} l^{m}_{i,t}/k}}{\sum\limits_{j=1}^{n} c_{j,t}}\\ &=&\sum\limits_{i=1}^{n} w^{av}_{i,t} \cdot e^{-\eta\cdot\sum\limits_{m=1}^{k} l^{m}_{i,t}/k} \end{array} $$In what follows we abbreviate \(\sum \limits _{m=1}^{k} l^{m}_{i,t}/k\) simply by Σli,t.
-
iv.
By the inequality \(e^{-x}\leq 1-x+\frac {x^{2}}{2}\) (valid for all x ≥ 0) we get the instance:
$$ e^{-\eta\cdot{\Sigma} l_{i,t}}~~\leq~~1-\eta\cdot{\Sigma} l_{i,t}+\frac{\eta^{2}\cdot\left( {\Sigma} l_{i,t}\right)^{2}}{2} $$Note that, by the choice of η in ii., η ≥ 0 (and 0 ≤ η < 1 as soon as T > 2 ⋅ln (n)), and, due to the boundedness of the loss l by [0, 1], η ⋅Σli,t ∈ [0, 1].
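Both elementary inequalities used in this proof (the upper bound above and the lower bound e−x ≥ 1 − x of step vi.) can be spot-checked numerically:

```python
import math

# e^{-x} <= 1 - x + x^2/2 holds for all x >= 0 (used in step iv.);
# e^{-x} >= 1 - x holds for all x (used in step vi.).
xs = [i / 100.0 for i in range(0, 301)]                       # x in [0, 3]
upper_ok = all(math.exp(-x) <= 1 - x + x * x / 2 + 1e-12 for x in xs)
lower_ok = all(math.exp(-x) >= 1 - x - 1e-12 for x in [x - 1 for x in xs])  # x in [-1, 2]
```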
-
v.
By substituting the right term in the inequality of iv. for the e-term in iii. we get:
$$ \begin{array}{@{}rcl@{}} \frac{\sum\limits_{i=1}^{n} c_{i,t+1}}{\sum\limits_{j=1}^{n} c_{j,t}}&\leq&\sum\limits_{i=1}^{n} w^{av}_{i,t}\cdot \left( 1-\eta\cdot{\Sigma} l_{i,t}+\frac{\eta^{2}\cdot\left( {\Sigma} l_{i,t}\right)^{2}}{2}\right)\\ && \text{and by arithmetic transformation:}\\ &\leq& \sum\limits_{i=1}^{n} w^{av}_{i,t} - \left( \eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)-\frac{\eta^{2}}{2}\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot\left( {\Sigma} l_{i,t}\right)^{2}\right)\right)\\ && \text{By the normalisation of \textit{w}:~}\sum\limits_{i=1}^{n} w^{av}_{i,t}=1\text{, so:}\\ &\leq& 1 -\left( \eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)-\frac{\eta^{2}}{2}\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot\left( {\Sigma} l_{i,t}\right)^{2}\right)\right)\\ && \text{By taking the} \ln \text{on both sides of the inequality:}\\ \ln\left( \frac{\sum\limits_{i=1}^{n} c_{i,t+1}}{\sum\limits_{j=1}^{n} c_{j,t}}\right)&\leq& \ln\left( 1-\left( \eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)-\frac{\eta^{2}}{2}\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot\left( {\Sigma} l_{i,t}\right)^{2}\right)\right)\right)\\ \end{array} $$ -
vi.
By the inequality e−x ≥ 1 − x (valid for all x) we get, for x < 1, \(\ln (e^{-x})\geq \ln (1-x)\) and hence \(-x\geq \ln (1-x)\). So, as an instance:
$$ \begin{array}{@{}rcl@{}} &&-\left( \eta\!\cdot\!\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right){\kern1.7pt}-{\kern1.7pt}\frac{\eta^{2}}{2}\!\cdot\!\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot\left( {\Sigma} l_{i,t}\right)^{2}\right)\right) \\ &&\geq\ln\left( 1-\left( \eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)-\frac{\eta^{2}}{2}\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot\left( {\Sigma} l_{i,t}\right)^{2}\right)\right)\right) \end{array} $$Verify that, due to the assumption 0 ≤ η < 1 from ii., the boundedness of the loss l by [0, 1], and the normalisation of w, our instance of x lies within [0, 1).
-
vii.
By substituting the left (upper) term in the inequality of vi. for the right term in the inequality in v. we get:
$$ \begin{array}{ll} \ln\left( \frac{\sum\limits_{i=1}^{n} c_{i,t+1}}{\sum\limits_{j=1}^{n} c_{j,t}}\right)\leq-\left( \eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)-\frac{\eta^{2}}{2}\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot\left( {\Sigma} l_{i,t}\right)^{2}\right)\right)\\ \text{and by arithmetic transformation:}\\ \leq\frac{\eta^{2}}{2}\cdot\underbrace{\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot\left( {\Sigma} l_{i,t}\right)^{2}\right)}_{\leq1}-\eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)\\ \text{\dots{} due to }\sum\limits_{i=1}^{n} w^{av}_{i,t}=1\text{ and }l\in [0,1]\text{, so:}\\ \leq\frac{\eta^{2}}{2}\cdot 1 -\eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right) \end{array} $$ -
viii.
So, we arrived at the inequality (from vii.):
$$ \ln\left( \sum\limits_{i=1}^{n} c_{i,t+1}\right)-\ln\left( \sum\limits_{i=1}^{n} c_{i,t}\right)~~\leq~~\frac{\eta^{2}}{2}-\eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right) $$Now we can sum up each side of the inequality from 1 to T:
$$ \underbrace{\sum\limits_{t=1}^{T}\left( \underbrace{\ln\left( \sum\limits_{i=1}^{n} c_{i,t+1}\right)}_{=_{def}C_{t+1}}-\underbrace{\ln\left( \sum\limits_{i=1}^{n} c_{i,t}\right)}_{=_{def}C_{t}}\right)}_{\underset{=C_{T+1}-C_{1}}{=~(C_{T+1}-C_{T})+\cdots+(C_{3}-C_{2})+(C_{2}-C_{1})}}~~\leq~~\underbrace{\sum\limits_{t=1}^{T}\left( \frac{\eta^{2}}{2}-\eta\cdot\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)\right)}_{=\frac{T\cdot\eta^{2}}{2}-\eta\cdot\sum\limits_{t=1}^{T}\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)} $$So, we arrive at:
$$ \ln\left( \sum\limits_{i=1}^{n} c_{i,T+1}\right)-\ln\underbrace{\left( \sum\limits_{i=1}^{n} c_{i,1}\right)}_{=n\ }~~\leq~~\frac{T\cdot\eta^{2}}{2}-\eta\cdot\sum\limits_{t=1}^{T}\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right) $$Hence:
$$ \ln\left( \sum\limits_{i=1}^{n} c_{i,T+1}\right)-\ln(n)~~\leq~~\frac{T\cdot\eta^{2}}{2}-\eta\cdot\sum\limits_{t=1}^{T}\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right) $$Recall that ci,t carries the cumulative loss up to t in its exponent, and that we are after a bound on the regret with respect to the best predictor; hence we concentrate on the predictor with minimal cumulative loss up to T. Let us denote this predictor by b (\(b=(\iota i)({\sum }_{t=1}^{T}{\Sigma } l_{i,t}=min({\sum }_{t=1}^{T}{\Sigma } l_{1,t},\dots ,{\sum }_{t=1}^{T}{\Sigma } l_{n,t}))\)). If there are several such predictors, we can pick one arbitrarily. Now:
$$ \ln(c_{b,T+1})~~\leq~~\ln\left( \sum\limits_{i=1}^{n} c_{i,T+1}\right) $$Hence:
$$ \ln(c_{b,T+1})-\ln(n)~~\leq~~\frac{T\cdot\eta^{2}}{2}-\eta\cdot\sum\limits_{t=1}^{T}\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right) $$ -
ix.
By definition of c:
$$ c_{b,T+1}=\underbrace{c_{b,1}\cdot\prod\limits_{t=1}^{T}e^{-\eta\cdot{\Sigma} l_{b,t}}}_{\underset{=\exp\left( -\eta\cdot\sum\limits_{t=1}^{T}{\Sigma} l_{b,t}\right)}{=e^{-\eta\cdot({\Sigma} l_{b,1}+{\Sigma} l_{b,2}+\cdots+{\Sigma} l_{b,T})}}} $$So:
$$ \ln(c_{b,T+1})=\ln\left( e^{-\eta\cdot\sum\limits_{t=1}^{T}{\Sigma} l_{b,t}}\right)=-\eta\cdot\sum\limits_{t=1}^{T}{\Sigma} l_{b,t} $$By substituting this term into the last inequality in viii. we get:
$$ -\eta\cdot\sum\limits_{t=1}^{T}{\Sigma} l_{b,t}-\ln(n)~~\leq~~\frac{T\cdot\eta^{2}}{2}-\eta\cdot\sum\limits_{t=1}^{T}\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right) $$And by arithmetical transformation:
$$ \sum\limits_{t=1}^{T}\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)-\sum\limits_{t=1}^{T}{\Sigma} l_{b,t}~~\leq~~\frac{T\cdot\eta}{2}+\frac{\ln(n)}{\eta} $$If we substitute for η in accordance with ii: \(\eta =\sqrt {\frac {2\cdot \ln (n)}{T}}\), we get:
$$ \sum\limits_{t=1}^{T}\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot{\Sigma} l_{i,t}\right)-\sum\limits_{t=1}^{T}{\Sigma} l_{b,t}~~\leq~~\sqrt{2\cdot\ln(n)\cdot T} $$Now, what is left is to employ the left term of the difference in the inequality above for proving a bound for the meta-inductive method’s regret.
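The bound just derived can be illustrated numerically. The following sketch (hypothetical names, a single forecast value per round, and the Brier/squared loss, which is bounded by [0, 1] and convex) runs the exponential weight update from ii. and checks that the cumulative weighted loss never exceeds the best predictor's cumulative loss by more than \(\sqrt{2\cdot\ln(n)\cdot T}\):

```python
import math
import random

random.seed(0)
n, T = 5, 2000
eta = math.sqrt(2 * math.log(n) / T)
forecasts = [random.random() for _ in range(n)]   # each predictor's fixed forecast
c = [1.0] * n                                     # c_{i,1} = 1
weighted_loss, cum_loss = 0.0, [0.0] * n
for t in range(T):
    outcome = 1.0 if random.random() < 0.7 else 0.0
    losses = [(p - outcome) ** 2 for p in forecasts]       # Brier losses in [0, 1]
    s = sum(c)
    w = [ci / s for ci in c]                               # normalised weights
    weighted_loss += sum(wi * li for wi, li in zip(w, losses))
    cum_loss = [a + b for a, b in zip(cum_loss, losses)]
    c = [ci * math.exp(-eta * li) for ci, li in zip(c, losses)]

regret = weighted_loss - min(cum_loss)            # weighted loss vs. best predictor b
bound = math.sqrt(2 * math.log(n) * T)            # the bound proved in ix.
```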
-
x.
According to (AGGR), Paggr predicts as follows: \(P_{aggr,t}(v_{m})=\sum \limits _{i=1}^{n} w^{av}_{i,t}\cdot P_{i,t}(v_{m})\). Hence its loss for value vm is: \(l\left (\sum \limits _{i=1}^{n}(w^{av}_{i,t}\cdot P_{i,t}(v_{m})),val_{t}(v_{m})\right )\). And hence its average cumulative loss is:
$$\sum\limits_{t=1}^{T}\sum\limits_{m=1}^{k} l\left( \sum\limits_{i=1}^{n}(w^{av}_{i,t}\cdot P_{i,t}(v_{m})),val_{t}(v_{m})\right)/k$$Since l is convex (according to ii.), we get:
$$\sum\limits_{m=1}^{k} l\left( \sum\limits_{i=1}^{n}(w^{av}_{i,t}\cdot P_{i,t}(v_{m})),val_{t}(v_{m})\right)/k~~\leq~~\sum\limits_{m=1}^{k}\sum\limits_{i=1}^{n}\left( w^{av}_{i,t}\cdot l(P_{i,t}(v_{m}),val_{t}(v_{m}))\right)/k$$(I.e.: The loss of a weighted average of predictions is smaller than or equal to the weighted average of the losses of the predictions.) Hence, from the last inequality in ix. and the convexity of l we get:
$$ \begin{array}{l} \underbrace{\sum\limits_{t=1}^{T}\sum\limits_{m=1}^{k}\left( l\left( \sum\limits_{i=1}^{n}(w^{av}_{i,t}\cdot P_{i,t}(v_{m})),val_{t}(v_{m})\right)\right)\!/k-\!\!\sum\limits_{t=1}^{T}\sum\limits_{m=1}^{k} l(P_{b,t}(v_{m}),val_{t}(v_{m}))/k}_{=l^{av}_{aggr,T}\cdot T-l^{av}_{b,T}\cdot T}\\ ~~\leq~~\sqrt{2\cdot\ln(n)\cdot T} \end{array} $$ -
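The convexity step used above can be spot-checked for the (convex) squared/Brier loss, again with hypothetical names: the loss of a weighted average of predictions never exceeds the weighted average of the losses of the predictions.

```python
import random

def brier(p, y):
    """Squared (Brier) loss; convex in its first argument p."""
    return (p - y) ** 2

random.seed(1)
ok = True
for _ in range(1000):
    n = 4
    w = [random.random() for _ in range(n)]
    s = sum(w)
    w = [wi / s for wi in w]                              # normalised weights
    preds = [random.random() for _ in range(n)]           # probability forecasts
    y = random.choice([0.0, 1.0])                         # realised value
    loss_of_avg = brier(sum(wi * pi for wi, pi in zip(w, preds)), y)
    avg_of_loss = sum(wi * brier(pi, y) for wi, pi in zip(w, preds))
    ok = ok and loss_of_avg <= avg_of_loss + 1e-12
```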
xi.
Now, since \(s^{av}_{i,T}=1 - l^{av}_{i,T}\), this means that:
$$s^{av}_{b,T}-s^{av}_{aggr,T}~~\leq~~\frac{const}{\sqrt{T}}$$with \(const=\sqrt{2\cdot\ln(n)}\). By applying the above-mentioned doubling trick, a bound of this form (with a slightly larger constant) holds for all rounds, hence:
$$\lim\limits_{t\rightarrow\infty}s^{av}_{b,t}-s^{av}_{aggr,t}\leq0$$Since Pb was the method with the least cumulative loss up to t (we defined b this way in viii.), this bound also holds with respect to all other predictors (for all 1 ≤ i ≤ n).
□
Feldbacher-Escamilla, C.J., Schurz, G.: Optimal probability aggregation based on generalized Brier scoring. Ann. Math. Artif. Intell. 88, 717–734 (2020). https://doi.org/10.1007/s10472-019-09648-4