Imagine that you turn down a 50–50 gamble of losing $10 or gaining $11, and you maximize expected utility. Then you will find yourself (absurdly) turning down any 50–50 gamble where you may lose $100, no matter how large the amount you could win. This is Rabin’s (2000) thought-provoking paradox. It shows how an innocuous preference has surprising implications that challenge the empirical validity of expected utility. For normative applications, the paradox suggests that preferences should be close to risk neutral for small stakes. For prescriptive and descriptive applications, the paradox raises the question of which assumptions of expected utility are violated (if any). This paper investigates the latter question.

Rabin’s paradox (RP) at first led to theoretical discussions about whether it violates expected utility at all. Rabin suggested that his paradox provides an argument not only against expected utility but, more generally, against reference independence and, consequently, against all traditional decision models. Several authors (referenced later) tried to rescue reference independence by suggesting other theoretical explanations, such as probability weighting, disappointment aversion, or background risks. Sometimes, utility of income was proposed for the same purpose. This paper will resolve RP empirically. We show that Rabin is right and that reference dependence indeed explains his paradox. Moreover, we find that other deviations from expected utility, while useful in many contexts, do not contribute to the explanation of RP in our experiment. Thus, we provide support for the reference-dependent generalizations of decision theories, which are still contested (references given later). In particular, we argue that RP provides the strongest and most clear-cut argument supporting reference dependence together with loss aversion, which is also still debated. This is especially useful in the face of serious arguments against loss aversion (Yechiam 2019).

The theoretical debates about RP in the literature suffered from differences in terminology: (a) Rubinstein (2006) suggested that the term “expected utility” incorporates reference dependence (Footnote 1); (b) utility of income was used as an alternative term for reference dependence (see Fig. 1). Wakker (2010 pp. 244-245) reviewed early debates. Our §8 gives recent references and further details. The abundance of theoretical debates and semantic confusions has been a barrier to the resolution of RP. Now, 20 years after its appearance, RP has turned into a classic and its meaning should be settled, both theoretically and empirically. This is the aim of our paper. We present a theoretical analysis of RP that can disentangle its potential causes, and then the experimental stimuli that allow us to identify its real causes.

Fig. 1 The preferences in Rabin’s paradox

Cox et al. (2013), Csvd hereafter, were the first to provide empirical evidence of the assumed preference patterns in RP. They also showed theoretically when Rabin’s calibration paradox refutes various reference-independent theories, including expected utility. Thus, they were the first to conclusively show that RP is a genuine violation of expected utility. However, they did not identify the causes of RP. Our study will do so. We now discuss possible causes.

Rabin (2000) already showed that utility curvature alone cannot completely explain RP. We, more strongly, do not find any empirical role for utility curvature in explaining RP. Several authors showed that other deviations from expected utility, primarily probability weighting, may explain RP theoretically (Footnote 2). Csvd’s data did not provide conclusive evidence on probability weighting, and their formal and empirical analyses of RP did not involve reference dependence. In our experiment, probability weighting, like utility curvature, plays no role in explaining RP. Neither do other reference-independent deviations from expected utility. Rabin (2000) conjectured that loss aversion, necessarily involving reference dependence, is the main cause of his paradox:

Indeed, what is empirically the most firmly established feature of risk preferences, loss aversion, is a departure from expected-utility theory that provides a direct explanation for modest-scale risk aversion. Loss aversion says that people are significantly more averse to losses relative to the status quo than they are attracted by gains, and more generally that people’s utilities are determined by changes in wealth rather than absolute levels. (p. 1288)

We find that loss aversion is the only cause of RP. Several other authors suggested loss aversion as an explanation (Csvd p. 307; Lindsay 2013; Park 2016; Wakker 2010 p. 244), but they did not formalize or test this conjecture (Footnote 3). We do so by incorporating reference dependence in our theoretical analysis and by carrying out empirical tests (Footnote 4). Thus, we resolve the 20-year-old RP and show that it is a genuine deviation from basic normative classical decision principles, providing a strong argument for the modern behavioral approach.

The paper is organized as follows. Following notation and definitions in §1, §2 presents a theoretical analysis of RP with reference dependence incorporated. Sections 3–5 analyze RP under expected utility, rank-dependent utility, and reference dependence, respectively, stating predictions to falsify the various theories. Section 6 presents the experiment and its results, with a summary in Table 1. Then follow a discussion of the experiment (§7) and of related literature (§8), a discussion of reference dependence (§9), and a conclusion (§10).

Table 1 Summary of our findings

1 Notation and definitions

We consider only two-outcome prospects. By α_p β we denote a prospect yielding outcome α with probability p and outcome β with probability 1 − p. Outcomes are money amounts. In reference-independent models, outcomes refer to final wealth and are denoted in bold by Greek letters or real numbers. The initial wealth, which is the final wealth level when subjects enter the laboratory in our experiment, is denoted 0 (“zero”), as has been customary in classical reference-independent models. It is fixed throughout the analysis and experiment.

By ≽ we denote a preference relation over prospects. Throughout this paper, a utility function U maps outcomes to the reals. We assume that U is strictly increasing and continuous. The expected utility (EU) of a prospect α_p β then is

$$ pU\left(\boldsymbol{\alpha} \right)+\left(1-p\right)U\left(\boldsymbol{\beta} \right). $$
(1)

Expected utility holds if there exists a utility function U such that preferences maximize EU.

We next define the most general theory considered in this paper, prospect theory (Tversky and Kahneman 1992), and then specify other theories as special cases. Prospect theory assumes that for every choice situation subjects perceive a particular final wealth level as their reference point, which we denote θ. Commonly, the reference point is the status quo, but it can change within the analysis, for instance due to different framings. This is the crucial difference between the reference point and initial wealth, which is fixed throughout the analysis. Under prospect theory, outcomes describe changes with respect to this variable reference point and are denoted by Greek letters or real numbers in normal typeface. For example, outcome α_θ designates final wealth α + θ with θ the reference point and α the change. The two different notations (bold and nonbold) for different kinds of outcomes serve to clarify the ambiguities that can arise but should be avoided in RP.

A weighting function w maps the probability interval [0, 1] to [0, 1] with w(0) = 0, w(1) = 1, and w strictly increasing. It does not have to be continuous. A loss aversion parameter λ is a positive number. Prospect theory (PT) holds if there exist a utility function u with u(0) = 0, two probability weighting functions w+ and w−, and a loss aversion parameter λ such that preferences maximize the prospect theory value (PT) of prospects:

$$ PT\left({\alpha_{\theta}}_{p}{\beta}_{\theta}\right)={w}^{+}(p)u\left(\alpha \right)+\left(1-{w}^{+}(p)\right)u\left(\beta \right)\;\mathrm{if}\;\alpha \ge \beta \ge 0; $$
(2)
$$ {w}^{+}(p)u\left(\alpha \right)+{w}^{-}\left(1-p\right)\lambda u\left(\beta \right)\ \mathrm{if}\;\alpha \ge 0\ge \beta; $$
(3)
$$ {w}^{-}(p)\lambda u\left(\alpha \right)+\left(1-{w}^{-}(p)\right)\lambda u\left(\beta \right)\;\mathrm{if}\;0\ge \beta \ge \alpha . $$
(4)

The parameters u, w+, w−, and λ can in principle depend on the reference point θ. However, they will be stable under small changes of θ, as in our experiment, and we therefore assume that they are independent of θ (Footnote 5). The loss aversion parameter can be incorporated into utility by writing

$$ U\left(\alpha \right)=u\left(\alpha \right)\kern0.33em \mathrm{for}\kern0.33em \alpha \kern0.33em \ge \kern0.33em 0\kern0.33em \mathrm{and}\kern0.33em U\left(\beta \right)=\lambda u\left(\beta \right)\kern0.33em \mathrm{for}\kern0.33em \beta \le 0. $$
(5)

U will typically have a kink at 0. We usually denote the reference point as a subscript of the preference symbol rather than of the outcomes. If the reference point θ has been specified, we may therefore write α instead of α_θ. Utility of income is the special case where there is no probability weighting, i.e., w+(p) = w−(p) = p. It generalizes expected utility by incorporating reference dependence, and has expected utility as the special case where the reference point is fixed.
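As an illustration of Eqs. 2–4, the following minimal Python sketch evaluates the prospect theory value of a two-outcome prospect. The function name `pt_value`, the linear utility, and the identity weighting functions are illustrative assumptions of this sketch and not part of the formal analysis; λ = 2 serves only as an example value.

```python
def pt_value(alpha, beta, p, u, w_plus, w_minus, lam):
    """Prospect theory value of the prospect alpha_p beta (Eqs. 2-4).

    Outcomes are changes relative to the reference point. The outcomes
    must be ordered as in the corresponding equation.
    """
    if alpha >= beta >= 0:   # Eq. 2: both outcomes are gains
        return w_plus(p) * u(alpha) + (1 - w_plus(p)) * u(beta)
    if alpha >= 0 >= beta:   # Eq. 3: mixed prospect
        return w_plus(p) * u(alpha) + w_minus(1 - p) * lam * u(beta)
    if 0 >= beta >= alpha:   # Eq. 4: both outcomes are losses
        return w_minus(p) * lam * u(alpha) + (1 - w_minus(p)) * lam * u(beta)
    raise ValueError("order the outcomes as in Eqs. 2-4")


# Illustrative assumptions: linear utility and no probability weighting
# (utility of income), so that only loss aversion matters.
linear = lambda x: x
identity = lambda q: q

# Rabin's basic prospect 11_0.5(-10) with and without loss aversion.
value_loss_averse = pt_value(11, -10, 0.5, linear, identity, identity, lam=2.0)
value_neutral = pt_value(11, -10, 0.5, linear, identity, identity, lam=1.0)
```

Under these assumed inputs the value is negative for λ = 2 and positive for λ = 1, so that only the loss-averse evaluation rejects the prospect in favor of the sure outcome 0.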

We now turn to reference-independent special cases of PT. The first special case we consider is rank-dependent utility (RDU). It assumes w+(p) = 1 − w−(1 − p) = w(p) and λ = 1 (so that u = U). The main restriction is that RDU assumes reference independence: outcomes are described in terms of final wealth. This can be formalized by assuming that the reference point θ is fixed at 0 (Footnote 6). We get

$$ RDU\left({\boldsymbol{\upalpha}}_{\mathbf{p}}\boldsymbol{\upbeta} \right)=w(p)U\left(\boldsymbol{\upalpha} \right)+\left(1-w(p)\right)U\left(\boldsymbol{\upbeta} \right)\;\mathrm{if}\ \boldsymbol{\upalpha} \ge \boldsymbol{\upbeta} $$
(6)

Probability weighting under RDU is sign-independent. For gains we have w(p) = w+(p) but for losses we have a dual w−(p) = 1 − w(1 − p). EU is the special case of RDU where w+(p) = w−(p) = w(p) = p.

For two-outcome prospects as used in our experiment, nearly all existing reference- and sign-independent nonexpected utility theories are special cases of RDU and, consequently, of PT (Wakker 2010 §7.11). Such theories include the reference-independent version of original prospect theory (Kahneman and Tversky 1979) and disappointment aversion theory (Gul 1991). Hence, the analysis of this paper concerns all transitive risk theories that are popular today.

2 The preferences in Rabin’s paradox: Reference-dependent versus reference-independent modeling

The formalization of reference dependence defined in the previous section has been used in many contexts, but not yet to analyze RP. This section shows how this formalization allows us to identify and isolate potential causes of the paradox. The subtle distinctions between reference points, initial wealth, final wealth, and changes of wealth call for careful notation, but once this is settled the RP can readily be resolved. Fig. 1 gives a comprehensive account. Given the subtle distinctions, with five choices playing a role, the figure cannot be very simple. In return, it gives a complete picture of all relevant issues. The various panels will be explained next.

Rabin assumed that people reject a 50–50 prospect of winning 11 or losing 10 (Fig. 1a: basic (final-wealth) preference). With the natural status quo of 0, this assumption is empirically plausible for different subjects at different wealth levels; that is, in a “between”-subject sense. It then is also plausible in a “within”-subject sense, i.e., for one subject at different wealth levels. For instance, if for a given subject in our experiment, the basic preference holds for most subjects €11 richer than her, then it will probably also hold for her if she were €11 richer. We call this argument the between-within argument. It makes Rabin’s claims plausible for various wealth levels within one individual while avoiding the experimental problem of implementing large wealth changes. Under expected utility, the argument implies the wealth-change preferences in Fig. 1b for a range of wealth levels ω. Csvd’s experiment covered the range ω ∈ [−100, 100000].

Figures 1a and 1b, above the bold dashed line, contain reference-independent presentations. Reference-dependent presentations are below the bold dashed line, in Figs. 1c, 1d1, and 1d2, with reference points specified as subscripts of preferences. Fig. 1c presents the basic reference-dependent preference, with reference point 0. The reference-change preference of Fig. 1d1 is then plausible for the various reference points ω concerned, say all ω ∈ [−100, 100000], the wealth levels considered by Csvd. We will discuss later whether the outcome-change preference (Fig. 1d2) is plausible.

EU, like all other reference-independent theories, does not distinguish between reference-change preference and outcome-change preference (Figs. 1d1 and 1d2), equating them also with the wealth-change preference in Fig. 1b. The brace below these three figures indicates this equivalence. It explains EU’s “between-within” move from the basic preference to the wealth-change preference. This move leads, via the equivalence between Fig.1d1 and 1d2, to highly risk averse preferences that cannot be accommodated by EU. This is the RP for EU.

Many theoretical explanations of RP have been suggested in the literature. Maintaining reference independence, one potential cause of the paradox is that both utility curvature and probability weighting contribute to risk aversion (for instance under RDU). Other theories, such as prospect theory, allow for reference dependence as a potential cause. Then the empirically plausible Fig. 1d1 does not imply Fig. 1d2, and, consequently, the empirically plausible Fig. 1a and the implausible Fig.1b are no longer linked.

To identify the causes of RP, it is crucial to model the reference-change preference (Fig. 1d1) and the outcome-change preference (Fig. 1d2) separately, and to compare the degree of risk aversion in these two decision situations. For example, if the risk aversion of Fig. 1a mainly shows up in Fig. 1d1 and less so in Fig. 1d2, then reference dependence and loss aversion are the main causes of RP. In other cases, reference-independent deviations from expected utility, primarily probability weighting, are the main causes. If there is no significant risk aversion in Fig. 1d2, then probability weighting and other reference-independent causes are unimportant and utility of income suffices to explain RP. As emphasized by Buchak (2014 footnote 6), even though Rabin (2000) did not formally distinguish Figs. 1d1 and 1d2, he was careful to always choose framings consistent with Fig. 1d1 and never with Fig. 1d2. We will indeed find that the problem with expected utility is the transition from Fig. 1d1 to 1d2.

We used a brace below Figs. 1a and 1c to indicate that reference-independent theories do not distinguish between these two figures, just as they do not distinguish between Figs. 1b, 1d1, and 1d2. In particular, background risks will not play a significant role if they are incorporated into the reference point ω as in Fig. 1d1 rather than in the outcomes as in Fig. 1d2. The impossibility of distinguishing between figures above one brace has hampered the debates in the literature using reference-independent theories, as for instance in Harrison et al. (2017).

3 Rabin’s paradox as a violation of expected utility

Because framing is central to the resolution of RP, we discuss the different frames that constitute our experimental stimuli jointly with our theoretical analyses. We use the framing in Fig. 2 to test Rabin’s basic preference (11_{0.5}(−10) ≼ 0 in Fig. 1a and 11_{0.5}(−10) ≼_0 0 in Fig. 1c). We use an accept-reject (“Yes-No”) formulation because empirical evidence suggests that this induces the most reference dependence and loss aversion (Ert and Erev 2013), and, therefore, gives the strongest possible test of classical theories. Our prediction, in agreement with common views on risk attitudes (Tversky and Kahneman 1992) and Csvd’s findings, is:

Fig. 2 Presentation of basic preference (Fig. 1a) to subjects

Prediction 1

A large majority of the subjects will reject (choose “no”) in Fig. 2.

Implication

Expected utility with concave utility is falsified.

Explanation

As explained in §2, if the prediction holds true, then the preferences in Fig. 1d1 (11_{0.5}(−10) ≼_ω 0) are also plausible. Under expected utility, the preferences in Fig. 1b ((ω + 11)_{0.5}(ω − 10) ≼ ω) then hold. They imply U(ω + 11) − U(ω) ≤ U(ω) − U(ω − 10). Hence, the average marginal utility U′ over [ω, ω + 11] is at most 10/11 times that over [ω − 10, ω]. For concave utility, this implies that U′ falls to at most 10/11 of its value over every interval [ω − 10, ω + 11] of length 21. This is too fast to be reasonable. For example, for every α, no matter how big, it would imply rejection of the prospect α_{0.5}(−100) if the wealth-change preferences (Fig. 1b) hold for all ω ∈ [−100, α] (Rabin 2000 p. 1282). This is absurd and, hence, entails a violation of expected utility. We need factors other than utility curvature to explain the rejection in Fig. 2. □.
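The calibration step can be made concrete with a small numerical sketch. It is a simplified bound constructed for illustration, not Rabin’s (2000) exact computation, and the function name `calibration_bounds` is hypothetical. Write D(ω) = U(ω) − U(ω − 10) and G(ω) = U(ω + 11) − U(ω). Rejection at ω gives G(ω) ≤ D(ω), and concavity gives D(ω + 11) ≤ (10/11)G(ω). Hence successive blocks of 11 above initial wealth carry at most (10/11)^k D(0) in utility, whereas blocks of 10 below it carry at least (11/10)^k D(0) once they lie 11k or more below 0. All bounds are in units of D(0).

```python
def calibration_bounds(small_loss=10, small_gain=11, big_loss=100):
    """Bounds in units of D0 = U(0) - U(-small_loss).

    Assumptions of this sketch: U is concave and the 50-50 prospect
    (gain small_gain, lose small_loss) is rejected at every wealth
    level involved, with small_loss < small_gain.
    Returns (upper bound on U(alpha) - U(0) for any alpha,
             lower bound on U(0) - U(-big_loss)).
    """
    # Gains: the k-th block of length small_gain above 0 adds at most
    # (small_loss / small_gain)**k * D0; the geometric series sums to
    # small_gain / (small_gain - small_loss).
    gain_upper = small_gain / (small_gain - small_loss)

    # Losses: the j-th block of length small_loss below 0 lies at least
    # k = floor(small_loss * j / small_gain) steps of small_gain below 0,
    # so it adds at least (small_gain / small_loss)**k * D0.
    n_blocks = big_loss // small_loss
    loss_lower = sum(
        (small_gain / small_loss) ** ((small_loss * j) // small_gain)
        for j in range(n_blocks)
    )
    return gain_upper, loss_lower


gain_upper, loss_lower = calibration_bounds()
# A 50-50 prospect is rejected if the utility loss exceeds the utility gain.
rejects_any_gain = loss_lower > gain_upper
```

Under these assumptions the utility gain from winning any amount is bounded by 11 D(0), whereas losing 100 costs more than 14 D(0), so the 50–50 prospect α_{0.5}(−100) is rejected for every α.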

4 Nonexpected utility theories as failed attempts to preserve reference independence

The main attempts to save reference independence from RP are based on probability weighting, the other deviation from expected utility modelled by prospect theory. That is, RDU was used to explain RP. RDU, like EU, does not distinguish between reference-change (11_{0.5}(−10) ≼_ω 0; Fig. 1d1) and outcome-change ((11 + ω)_{0.5}(−10 + ω) ≼_0 ω; Fig. 1d2) preferences. Consequently, the basic preference (11_{0.5}(−10) ≼ 0; Fig. 1a) implies the wealth-change preferences ((ω + 11)_{0.5}(ω − 10) ≼ ω; Fig. 1b) as it does under EU. Barberis et al. (2006), Barseghyan et al. (2013), Csvd (their §4.1), Neilson (2001), and Wakker (2010 p. 244 5th paragraph) pointed out that RDU can, in theory, accommodate the final-wealth preferences (Figs. 1a and 1b) (Footnote 7). For example, a moderate underweighting of p = 0.5, with

$$ w(0.5)<\frac{10}{21}\approx 0.476, $$

suffices to accommodate these preferences even when utility is linear. Concave utility reinforces the preferences. Empirical studies have typically found an average of w(0.5) < 0.476 (Tversky and Kahneman 1992; Fox et al. 2015), supporting this explanation (Footnote 8). To explore it in more detail, we will test RDU by measuring probability weighting.
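The threshold 10/21 follows from Eq. 6 with linear utility: the prospect 11_{0.5}(−10) is worth 11w(0.5) − 10(1 − w(0.5)), which is negative exactly when w(0.5) < 10/21. The sketch below checks this numerically. As an illustrative assumption it uses the one-parameter weighting family of Tversky and Kahneman (1992) with their estimate γ = 0.61 for gains; the function names are hypothetical.

```python
def tk_weight(p, gamma=0.61):
    """Tversky-Kahneman (1992) one-parameter probability weighting function."""
    return p**gamma / (p**gamma + (1 - p) ** gamma) ** (1 / gamma)


def rdu_value(alpha, beta, p, w, U=lambda x: x):
    """Eq. 6 for a two-outcome prospect alpha_p beta with alpha >= beta.

    The default U is linear utility (an assumption of this sketch).
    """
    return w(p) * U(alpha) + (1 - w(p)) * U(beta)


THRESHOLD = 10 / 21

w_half = tk_weight(0.5)                                  # weight of p = 0.5
value_tk = rdu_value(11, -10, 0.5, tk_weight)            # pessimistic weighting
value_eu = rdu_value(11, -10, 0.5, lambda q: q)          # no weighting (EU)
value_at_threshold = rdu_value(11, -10, 0.5, lambda q: THRESHOLD)
```

With the assumed weighting function, w(0.5) lies below 10/21 and the RDU value is negative, whereas without probability weighting it is positive.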

Safra and Segal (2008) give a theoretical treatment of RP using RDU, but with assumptions that are not empirically plausible. They assume independence of background risks (implicitly identifying Figs. 1b and 1d1) to argue that probability weighting cannot explain RP. Wakker (2010) argued that Safra and Segal’s independence assumption is restrictive and empirically implausible. His criticism is supported by Barberis et al. (2006), who showed that the independence assumption by itself already rules out probability weighting.

In a theoretical contribution, Neilson (2001) suggested the following extension of RP that would falsify RDU. We test this falsification empirically. Crucial for Rabin’s calibration in §3 is that the weight of the gain 11 is the same as the weight of the loss 10. To achieve these equal weights under RDU, for each subject we measured the probability r such that

$$ w(r)=0.5. $$
(7)

Details are in the Appendix. Based on existing empirical evidence (Fehr-Duda and Epper 2012; Tversky and Kahneman 1992; Wakker 2010 §9.5), we predict:

Prediction 2

The average r in Eq. 7 will exceed 0.5 considerably, entailing considerable risk aversion. □.

We then offered the prospect 11_r(−10) to each subject, where r was their individual value measured in Eq. 7. This gives the desired equal weighting of outcomes under RDU (Footnote 9). The offered prospect was more favorable than Rabin’s prospect if r > 0.5, which was the typical case. Fig. 3 displays the framing used for a subject with r = 0.63. The crucial point here is to use a framing that induces the right reference point and loss aversion. For this purpose, we again use the accept-reject framing. Hence, we have:

Fig. 3 Basic preference with r = 0.63 instead of 0.50

Prediction 3

A majority of subjects will reject (choose “No”) in Fig. 3.

Implication

RDU cannot explain Rabin’s Paradox.

Explanation

Under RDU with linear or slightly concave utility, subjects should accept the prospect offered, contrary to Prediction 3. This shows that RDU’s correction for probability weighting does not remove all risk aversion. Neilson (2001) showed that utility curvature cannot explain the remaining risk aversion by deriving utility calibration paradoxes for RDU (Footnote 10). There must be factors beyond RDU that explain RP. □
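To give a feeling for how little curvature “slightly concave” allows, the sketch below evaluates the offered prospect under RDU with w(r) = 0.5 and finds the curvature at which its value drops to that of the sure outcome 0. The exponential (CARA) utility family and all function names are assumptions of this sketch; they were not used in the experiment.

```python
import math


def cara_utility(x, a):
    """Exponential utility with U(0) = 0; a = 0 gives linear utility."""
    if a == 0:
        return float(x)
    return (1.0 - math.exp(-a * x)) / a


def rdu_equal_weights(a, gain=11.0, loss=-10.0):
    """RDU value of gain_r(loss) when w(r) = 0.5 (Eq. 7)."""
    return 0.5 * cara_utility(gain, a) + 0.5 * cara_utility(loss, a)


def break_even_curvature(lo=1e-6, hi=0.05, tol=1e-10):
    """Bisection for the curvature a at which the RDU value equals 0.

    Assumes the value is positive at lo and negative at hi.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rdu_equal_weights(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


value_linear = rdu_equal_weights(0)   # linear utility
a_star = break_even_curvature()       # rejection requires a > a_star
```

Within this assumed family, the prospect is accepted for linear utility and for every curvature parameter below `a_star`, so rejection under RDU would require curvature beyond that level over stakes of about €10.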

On our domain of two-outcome prospects, nearly all reference-independent nonexpected utility theories agree with RDU (see end of §1). Hence, none of those other theories can explain RP either. We therefore turn to reference-dependent theories in the next section, where we will also allow probability weighting to be different for gains and losses, as in prospect theory. Our experiment will later find that probability weighting plays no empirical role in RP.

To avoid misunderstandings, we emphasize that our study does not claim that probability weighting is unimportant. Many studies have demonstrated its importance (Barseghyan et al. 2013; Fehr-Duda and Epper 2012; Qiu and Steiger 2011; Tversky and Kahneman 1992; Viscusi 1995 p. 107; Wakker 2010). In particular, it can accommodate strong risk aversion for small stakes through first-order risk aversion. We claim only that probability weighting played no role in the particular choices in RP. To further illustrate our point, consider an alternative paradox, similar to RP and with similar calibration implications for utility. It could be constructed if subjects had preferences 21_{0.5}0 ≼ 10 at all or many wealth levels, while perceiving all outcomes as gains. Then loss aversion could play no role and probability weighting would drive the paradox. We will in fact test this preference later (Fig. 5b) and find that it may exist, but is considerably weaker than the classical RP. Our findings regarding probability weighting only serve as an intermediate step in what is our main and positive purpose: to show the importance of reference dependence.

5 Reference-dependent theories can explain Rabin’s paradox

Many studies have confirmed reference and sign dependence, entailing violations of RDU (Huber et al. 2008; Wakker 2010 §9.5), although it continues to be debated (Isoni et al. 2011; Plott and Zeiler 2005; Yechiam 2019). We will analyze these concepts in their most basic and clearest form, using rank-dependent prospect theory where the reference point describes a deterministic wealth level (Tversky and Kahneman 1992). Other models are discussed in §8.

Sign dependence means that risk attitudes differ for losses and gains. Whereas probability weighting is mostly pessimistic for gains, with prevailing underweighting of favorable outcomes, for losses the opposite holds, with prevailing optimism and underweighting of unfavorable outcomes (Chesson and Viscusi 2003 Table III; Wakker 2010 §9.5). This reflection falsifies RDU. It also implies that the correction for probability weighting under RDU in Fig. 3 is not correct. To obtain Rabin’s calibration argument for utility, which involves the same decision weights for the two outcomes, we should, according to prospect theory (Eq. 3), measure for each subject the probability p such that

$$ {w}^{+}(p)={w}^{-}\left(1-p\right). $$
(8)

Our measurement of p is similar to Abdellaoui et al. (2016). Details are in the Appendix. Because RDU is a special case of prospect theory, it predicts p = r. Under RDU, Eq. 8 provides an alternative way to find the probability r (=p) of Eq. 7. However, based on the common findings of reflection we predict:
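The relation between Eqs. 7 and 8 can be illustrated numerically. The sketch below solves both equations by bisection. As illustrative assumptions it uses the one-parameter weighting family of Tversky and Kahneman (1992) with their estimates 0.61 for gains and 0.69 for losses; the function names are hypothetical and the experiment itself used the nonparametric measurements described in the Appendix.

```python
def tk_weight(p, c):
    """Tversky-Kahneman (1992) one-parameter probability weighting function."""
    return p**c / (p**c + (1 - p) ** c) ** (1 / c)


def bisect_root(f, lo=0.01, hi=0.99, tol=1e-10):
    """Root of an increasing function f with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


w_plus = lambda q: tk_weight(q, 0.61)    # assumed weighting for gains
w_minus = lambda q: tk_weight(q, 0.69)   # assumed weighting for losses

# Eq. 7: w(r) = 0.5, with w = w_plus.
r = bisect_root(lambda q: w_plus(q) - 0.5)

# Eq. 8 with sign-dependent weighting: w_plus(p) = w_minus(1 - p).
p_pt = bisect_root(lambda q: w_plus(q) - w_minus(1 - q))

# Eq. 8 under the RDU restriction w_minus(q) = 1 - w_plus(1 - q).
w_minus_dual = lambda q: 1 - w_plus(1 - q)
p_rdu = bisect_root(lambda q: w_plus(q) - w_minus_dual(1 - q))
```

Under the RDU restriction the solution of Eq. 8 coincides with r, whereas with the assumed sign-dependent weighting functions the solution lies between 0.5 and r.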

Prediction 4

0.5 ≈ p < r. □

We offered the prospect 11_p(−10) to subjects. Fig. 4 displays this offer for a subject with p = 0.52. It is natural to assume that the reference point is the status quo for this choice.

Fig. 4 Basic preference with p = 0.52 instead of 0.50

Given Prediction 1 concerning the same choice but with probability 0.5 and given Prediction 4, we have:

Prediction 5

A large majority of subjects, as in Fig. 2 (Prediction 1), will reject (choose “No”) in Fig. 4.

Implication

Probability weighting does not contribute to the explanation of RP. Because p ≈ 0.5, probability weighting does not capture any risk aversion in RP. After properly correcting for probability weighting (Fig. 4), the same unexplained risk aversion remains as before (Fig. 2). □.

Under prospect theory, the above prediction gives indirect support for reference dependence, because it is the only explanation left for RP, given that utility curvature and probability weighting have been ruled out (and also other nonexpected utilities; see the end of §1). Loss aversion λ is commonly found to be about 2, although there is much variation (Ert and Erev 2013; Abdellaoui and Kemel 2014, §5.1; Wakker 2010 §9.5). Loss aversion thus leads to strong risk aversion and can readily explain the preference in Fig. 3 and the strong preferences in Figs. 2 and 4 for any plausible probability weighting and utility curvature. Apart from prospect theory, transitive deviations from expected utility proposed in the literature usually have not considered sign dependence. For our stimuli they mostly agree with RDU and are, therefore, also contradicted by Prediction 3.

To obtain direct support for reference dependence, we tested the reference-change and outcome-change preferences. In Fig. 5b, the outcome-change preference cannot be formulated as an accept-reject decision and was formulated as a binary choice. To have a clean test of reference dependence, we therefore also framed the reference-change question in Fig. 5a as a binary choice. This change in framing will probably reduce loss aversion and, consequently, risk aversion somewhat. To make the framings and procedures as similar as possible, we also added the prior endowment of €1 in Fig. 5b, which by normative standards should be negligible. Finally, we used the probabilities p of Eq. 8 instead of 0.5 to control for probability weighting and focus on reference dependence. By Prediction 4, these probabilities p will not have a systematic effect on risk aversion and Figs. 5a and 5b also test Figs. 1d1 (11_{0.5}(−10) ≼_ω 0) versus 1d2 ((11 + ω)_{0.5}(−10 + ω) ≼_0 ω).

Fig. 5 A direct test of reference dependence (with p = 0.52 instead of 0.50)

Figures 5a and 5b differ only in the way that final outcomes are split into reference point and change with respect to the reference point. Our analysis is based on the assumption that: (a) the reference point in Fig. 5a has the additional payment incorporated; (b) accordingly, the outcome −€10 in Fig. 5a is perceived as a loss; (c) in Fig. 5b, the status quo of €0 is the reference point so that no losses are perceived. This is the most common assumption for reference points and for their implementation in experiments (Bateman et al. 2005; de Martino et al. 2006; Fehr-Duda et al. 2010; Kühberger 1998; Tversky and Kahneman 1992). It is crucial for the common incentivization of losses with prior endowments, and for endowment effects underlying WTP-WTA discrepancies (Sayman and Öncüler 2005; Viscusi and Huber 2012).

In the CPE model of Köszegi and Rabin (2006), future expectations serve as reference points if choices have been anticipated sufficiently far ahead in time, but not if they come as a surprise. In our experiment, subjects did not know beforehand what the choices would be, or the dynamic development of reference points. Our assumption will, probably, not hold for all subjects, and several subjects may perceive various other reference points, such as the sure outcome €10 depicted in Fig. 5b. It suffices that our assumption holds for most subjects.

In Fig. 5b, loss aversion does not play a role for most subjects and, therefore, risk aversion will be lower even though it will probably still exist owing to pessimistic probability weighting for gains (Footnote 11). Most subjects will take Fig. 5a as Fig. 1d1 (11_{0.5}(−10) ≼_ω 0), and they will be as strongly risk averse as in the basic preferences in Fig. 2. Some subjects will integrate payments and take Fig. 5a as Fig. 1d2 ((11 + ω)_{0.5}(−10 + ω) ≼_0 ω), which reduces risk aversion. We summarize our claims:

Prediction 6

A majority of subjects will reject (choose the sure Prospect B) in Figs. 5a and 5b, but fewer than in Fig. 2, and the fewest in Fig. 5b.

Implication

A difference in risk aversion between Figs. 5a and 5b falsifies reference independence. □.
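The logic of this test can be sketched under utility of income, that is, with loss aversion but without probability weighting or utility curvature. The sketch rests on several assumptions: it reads Fig. 5a as a prior payment of 10 followed by a choice between 11_p(−10) and 0, and Fig. 5b as a choice between 21_p 0 and 10 (the same final wealth, ignoring the €1 endowment); it takes λ = 2 and p = 0.52 as example values; and the function names are hypothetical.

```python
LAMBDA = 2.0   # illustrative loss aversion parameter
P = 0.52       # probability of the better outcome, as in the Fig. 5 example


def income_utility(change, lam=LAMBDA):
    """Linear utility of income with a kink at 0 (Eq. 5 with u(x) = x)."""
    return change if change >= 0 else lam * change


def value(prospect, lam=LAMBDA):
    """Expected utility of income; prospect is a list of (change, probability)."""
    return sum(prob * income_utility(change, lam) for change, prob in prospect)


# Assumed reading of Fig. 5a: the payment of 10 is part of the reference
# point, so outcomes are changes of +11 and -10 and the sure option is 0.
risky_5a = value([(11, P), (-10, 1 - P)])
sure_5a = value([(0, 1.0)])

# Assumed reading of Fig. 5b: the reference point is 0, so the same final
# wealth levels are perceived as the gains 21, 0, and 10.
risky_5b = value([(21, P), (0, 1 - P)])
sure_5b = value([(10, 1.0)])

rejects_5a = risky_5a < sure_5a
rejects_5b = risky_5b < sure_5b
```

In this stylized case the risky prospect is rejected in the Fig. 5a framing and accepted in the Fig. 5b framing although final wealth is identical, which no reference-independent theory can accommodate. Pessimistic probability weighting for gains, omitted here, is what can still produce rejections in Fig. 5b.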

6 Our experimental findings

Subjects

N = 77 students (29 female; average age 22 years) from Erasmus University Rotterdam participated, in four sessions. Most were finance bachelor students.

Incentives

Each subject received a €10 participation fee. In addition, we randomly (by a bingo machine) selected two subjects in each session who could play out one of their randomly selected choices (which includes those to elicit r and p; see the Appendix) for real. The selections were implemented in front of all subjects in a session by a volunteer. Baltussen et al. (2012) validated such incentive schemes. Subjects were paid immediately after the experiment. Losses could never exceed €10 and were paid from the show-up fee. This way of implementing losses is common in experimental economics. The experiment lasted about 45 minutes and the average payment per subject was €15.70.

Procedure

The experiment was computer-run. Subjects sat in cubicles to avoid interactions. They could ask questions at any time during the experiment. Training questions familiarized subjects with the stimuli. Subjects could only start after they had correctly answered two comprehension questions.

Stimuli

Probabilities were generated by throwing two 10-sided dice. Details are in the Online Appendix (Footnote 12). We first measured the probability r (Eq. 7). Then we asked the two accept-reject questions of Figs. 2 and 3, followed by the measurement of p (Eq. 8). Finally, we asked the accept-reject question of Fig. 4 and the two questions of Figs. 5a and 5b, with the order of these three questions counterbalanced.

Results

Statistical tests, all two-sided, confirmed our predictions. Table 1 summarizes some results.

Prediction 1 [basic preference]: 88% rejected (“No”) the prospect in Fig. 2 (p value <0.001; binomial test).

Prediction 2 [r > 0.5]: The mean and median r were 0.63 > 0.5 (p value <0.001; Wilcoxon test).

As a byproduct of the measurement of r, we also measured utility (see the Appendix). We found linear utility, which is plausible for the moderate amounts in our experiment. Thus, whereas Implication 1 shows that utility curvature cannot entirely explain RP, we do not find any contribution of utility to the explanation of RP.

Prediction 3 [basic preference with RDU probability weighting]: 74% rejected the prospect in Fig. 3 (p value <0.001; binomial test). This percentage is smaller than in Fig. 2 (p value = 0.015; McNemar test).

Prediction 4 [0.5 ≈ p < r]: The mean p was 0.52 and the median was 0.48. H0: p = 0.5 is not rejected (p value = 0.4; Wilcoxon test). p < r is confirmed (p value <0.001; Wilcoxon test).

When measuring p, as a byproduct we also measured loss aversion. It was approximately 2 (see the Appendix), in agreement with previous findings in the literature and well suited to explain RP.

Prediction 5 [basic preference with PT probability weighting]: 87% rejected the prospect shown in Fig. 4 (p value <0.001; binomial test). This was not significantly different from Fig. 2 (p value = 1; McNemar test).

Prediction 6 [reference- versus outcome-change preference]: 78% rejected the prospect in Fig. 5a (p value <0.001; binomial test), and 62% rejected the prospect in Fig. 5b (p value = 0.08; binomial test). The latter is smaller than the former (p value = 0.04; McNemar test).

The findings of Figs. 5a and 5b (Prediction 6) provide a direct within-subject falsification of reference independence.
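The within-subject comparisons above use McNemar tests, which depend only on the subjects who answered the two questions differently. The following Python sketch shows an exact (binomial) version of the test. The variant used for the reported p values is not specified in this section, the function name `mcnemar_exact` is our own, and the counts in the usage lines are hypothetical, not our data.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value.

    b: number of subjects rejecting the first prospect but not the second.
    c: number of subjects rejecting the second prospect but not the first.
    Under H0, each discordant subject falls in either cell with probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence against H0
    # Lower tail of Binomial(n, 0.5) up to the smaller discordant count.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(10, 0))
print(mcnemar_exact(5, 5))
```

Concordant subjects, who give the same answer to both questions, do not enter the statistic, which is why the test suits paired accept-reject data.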

7 Discussion of experimental details

Our experiment involved some adaptive (chained) stimuli, where answers given to some questions affected later stimuli, for instance regarding the probabilities r and p in Figs. 3 and 4. It was practically impossible for subjects to see through this procedure and to work out if and how manipulation could be beneficial. Hence, manipulation is, in the terminology of Bardsley et al. (2010 pp. 265, 285), a theoretical possibility but it is practically impossible. In practice, it does not affect incentive compatibility in our experiment.

Counterbalancing is commonly used to avoid order effects, but it can complicate the design for subjects and the subsequent analyses, and it can increase noise. It is, therefore, used only to avoid the major risks of order effects. In our study, this is particularly important because showing multiple different lotteries may affect the reference point that subjects select. We felt that Figs. 5a and 5b were most vulnerable to this problem and we, therefore, counterbalanced their presentation, combined with Fig. 4. For the other stimuli, we saw no reason to expect biases due to order effects, and we did not counterbalance them.

We could have avoided some order effects by using a between-subject design instead of a within-subject design. The pros and cons of these two designs are well known (Ballinger and Wilcox 1997; Camerer 1989 p. 85). A between-subject design avoids order effects, but a within-subject design gives more statistical power, can test more hypotheses, and gives cleaner tests. The latter points are particularly relevant for our corrections for probability weighting. Further, a between-subject design would have raised practical difficulties. Embedding it in sessions with other experiments might have led to spillover effects similar to the order effects that we sought to avoid. Implementing it in isolation would have led to very short experiments, and subjects’ implied payoff per hour would have substantially exceeded the upper bound imposed in our lab to avoid negative externalities for other experiments.

8 Preceding literature

Samuelson (1963) preceded Rabin in providing a paradox for expected utility where risk aversion in the small implies unreasonable risk aversion in the large. Samuelson assumed rejection of one prospect $200_{0.5}(-100)$, a 50–50 gamble of gaining 200 or losing 100. If this preference holds at all wealth levels (as implied by constant absolute risk aversion), then it follows that 100 independent repetitions of this prospect should also be rejected. However, by the law of large numbers the latter preference is questionable. Edwards (1954 p. 401) presented a similar paradox. RP is stronger because from weaker and more convincing assumptions it derives stronger and more absurd conclusions. Hansson (1988) presented the same basic phenomenon as Rabin; it was perfected by Rabin.
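The appeal to the law of large numbers can be made concrete. The following Python sketch reads Samuelson's prospect as a 50–50 gamble of gaining 200 or losing 100 and computes the expected total and the probability of a net loss over 100 independent repetitions. The function and constant names are our own and serve only this illustration.

```python
from math import comb

GAIN, LOSS, P_WIN, N = 200, 100, 0.5, 100


def prob_net_loss(n=N, gain=GAIN, loss=LOSS):
    """Probability that n independent 50-50 plays end with a negative total."""
    # After k wins the total is k * gain - (n - k) * loss.
    losing_ks = [k for k in range(n + 1) if k * gain - (n - k) * loss < 0]
    # Each sequence of n fair plays has probability 1 / 2**n.
    return sum(comb(n, k) for k in losing_ks) / 2 ** n


expected_total = N * (P_WIN * GAIN - (1 - P_WIN) * LOSS)
print(f"Expected total over {N} plays: {expected_total:.0f}")
print(f"Probability of a net loss: {prob_net_loss():.5f}")
```

A net loss requires at most 33 wins out of 100, an event with probability well below 0.1%, whereas the expected total is 5000. This is why rejecting the repeated prospect is hard to defend.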

Wakker (2010 pp. 244-245) surveyed early discussions of RP. Since then, Johansson-Stenman (2010) presented a theoretical analysis of RP for life-time consumption, Barseghyan et al. (2013 pp. 2526-2527) discussed an explanation based on probability weighting, Golman and Loewenstein (2016) suggested a cognitive model to explain it, and Sydnor (2010) provided field evidence, from insurance markets, supporting Rabin’s empirical claims. Schechter (2007) measured risk attitudes of farmers in Paraguay, where information about initial and final wealth was available. Her findings supported the reference-dependent evaluation of Fig. 1d1 ($11_{0.5}(-10) \preccurlyeq_{\omega} 0$) rather than the integrated evaluation of Fig. 1d2 ($(11+\omega)_{0.5}(-10+\omega) \preccurlyeq_{0} \omega$), which she referred to as evidence for narrow bracketing (Rabin and Thaler 2001). In our context, narrow bracketing is equivalent to reference dependence, and our study can be interpreted as supporting the importance of narrow bracketing.

Sarver (2018) presented a general class of all preference functionals satisfying a convexity condition w.r.t. probabilistic mixing. This class contains functionals that are nonkinked, but close enough to kinked, to accommodate RP. He points out that the RDU functional, a special case of his general class, may still be best suited to accommodate RP (p. 1367 first paragraph). He also discusses background risks, which we discussed at the end of §2.

The rest of this section discusses the closely related paper of Csvd. Csvd investigated RP systematically, following up on their theoretical analysis in Cox and Sadiraj (2006). Csvd were the first to confirm RP empirically and to establish it as another falsification of expected utility. They also provided a detailed theoretical analysis under RDU (their Eq. NL-1), with probability weighting as the deviation from EU. Outcomes were taken reference independent, in terms of final wealth; i.e., they were changes w.r.t. the wealth level upon entering the lab. Csvd pointed out that RDU is a special case of prospect theory (fixed reference point; sign-independent probability weighting), so that this special case of PT is also covered by their analysis.

Csvd provided theorems that exactly identify the utility functions and probability weighting functions that lead to Rabin’s calibration paradoxes under RDU for various potential empirical preferences. We followed up on their results. In particular, we measured and corrected for probability weighting in RDU to find out to what extent it accommodates RP empirically.

In their experiments, Csvd used large outcomes, incentivized through an arrangement with a casino with small but positive probabilities of actual implementation. For 41 German students they found majority preferences

$$ {\left(\boldsymbol{\omega} +\mathbf{110}\right)}_{0.5}\left(\boldsymbol{\omega} -\mathbf{100}\right)\preccurlyeq \boldsymbol{\omega} $$

for ω = 3K, 9K, 50K, 70K, 90K, and 110K with K = 1000 and Euro as unit. For 30 Indian students they found majority preferences

$$ {\left(\boldsymbol{\omega} +\mathbf{30}\right)}_{0.5}\left(\boldsymbol{\omega} -\mathbf{20}\right)\preccurlyeq \boldsymbol{\omega} $$

for ω = 100, 1K, 2K, 4K, 5K, and 6K with rupee as unit (50 rupees is a one-day salary for the students). Finally, for another group of 40 Indian students they found majority preferences

$$ {\left(\boldsymbol{\omega} +\mathbf{90}\right)}_{0.5}\left(\boldsymbol{\omega} -\mathbf{50}\right)\preccurlyeq \boldsymbol{\omega} $$

for ω = 50, 800, 1.7K, 2.7K, 3.8K, and 5K. Thus, they confirmed preferences as in Fig. 1d2 ($(11+\omega)_{0.5}(-10+\omega) \preccurlyeq_{0} \omega$) for a wide enough range of wealth levels to imply RP for expected utility and thus establish it as a genuine empirical violation.

The implications of Csvd’s findings for probability weighting are not entirely clear. Their Corollary 1.1 (footnote 13) shows that RDU with nonlinear probability weighting and linear utility can accommodate their findings, and does not lead to calibration paradoxes, if w(0.5) ≤ 10/21 for the German students, w(0.5) ≤ 2/5 for the first group of Indian students, and w(0.5) ≤ 5/14 for the second group of Indian students. To avoid misunderstandings, note that these upper bounds on w(0.5) can be relaxed somewhat under concave utility, offering extra protection against probability calibration paradoxes. Thus, theories that transform both probabilities and outcomes are less prone to calibration problems than theories that transform only one of these two.
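The bounds on w(0.5) can be checked with a short calculation. Under RDU with linear utility, a 50–50 prospect (ω + G) or (ω − L) is weakly rejected in favor of ω for sure if and only if w(0.5)G − (1 − w(0.5))L ≤ 0, that is, w(0.5) ≤ L/(G + L), independently of ω. The following Python sketch applies this to the three prospects reported above. The function name `max_w_half` is our own and is not part of Csvd's analysis.

```python
from fractions import Fraction


def max_w_half(gain, loss):
    """Largest w(0.5) for which RDU with linear utility rejects the prospect.

    The prospect yields wealth + gain or wealth - loss with probability 0.5 each.
    Rejection requires w * gain - (1 - w) * loss <= 0, i.e. w <= loss / (gain + loss).
    """
    return Fraction(loss, gain + loss)


bounds = {
    "German students (+110 / -100)": max_w_half(110, 100),
    "Indian students, group 1 (+30 / -20)": max_w_half(30, 20),
    "Indian students, group 2 (+90 / -50)": max_w_half(90, 50),
}
for group, bound in bounds.items():
    print(f"{group}: w(0.5) <= {bound} (about {float(bound):.3f})")
```

The second Indian group requires the lowest value of w(0.5), about 0.357, which is the strongest demand on probability weighting among the three groups.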

Probability weighting is least plausible for the second Indian group of Csvd (requiring w(0.5) ≤ 5/14). However, it cannot be ruled out without further information about this particular group of subjects, and actual measurement of w is desirable to settle the case. This is why we measured and fully corrected for probability weighting in our experiment. Csvd did not formalize or test reference dependence with loss aversion, but suggested it as an explanation of the problems of probability weighting (footnote 14).

Interestingly, Csvd also tested a dual version of RP, introduced by Sadiraj (2014), in which calibration paradoxes are the result of probability weighting rather than of utility (footnote 15). Masatlioglu and Raymond (2016) considered other variations revealing further problems. We do not review extensions of Rabin’s paradox, but focus on Rabin’s original paradox. As we showed, this already gives clear conclusions for decision theory, namely that reference dependence is desirable for descriptive theories. Many other findings further demonstrated the importance of reference- and sign dependence, factors beyond probability weighting (footnote 16).

Summarizing, Csvd were the first to conclusively demonstrate that RP falsifies expected utility. They suggested that probability weighting and reference dependence may accommodate these violations, but the evidence they provided was not conclusive. They strongly suggested that probability weighting alone cannot tell the whole story. In their introduction, they raised the general question: “Is there a plausible theory for decision under risk?” As we have shown, the main message from RP is that reference dependence is an important part of the answer to this general question. As regards normative implications, there is wide, though not universal, agreement that reference dependence—taken as a framing effect—is irrational, and that it is more irrational than probability weighting. Probability weighting only violates the von Neumann-Morgenstern independence axiom as in Allais’ paradox. Allais and many others saw such violations as rational. Consequently, RP provides a more serious deviation from classical rationality assumptions than previously thought.

9 Discussion of reference dependence

Andersen et al. (2018) examined the dependence of risk aversion (as in Fig. 1d1) on wealth, using individual data on wealth from Denmark. They assumed a homogeneous agent and then used between-subject comparisons. They found a weak relation between risk aversion and wealth. The authors interpreted wealth levels as reference points, and their finding as reference dependence. However, they could not distinguish between (analogs of) Figs. 1d1 and 1d2; i.e., between reference dependence and outcome dependence. They had no test analogous to our Fig. 5. Therefore, their finding can also be taken as a weak deviation from constant absolute risk aversion in a final wealth model.

Markowitz (1952) was among the first to propose reference dependence, but he did not incorporate probability weighting and made empirically invalid conjectures about utility curvature. Other early works include Shackle (1949 Ch. 2) on sign dependence and Edwards (1954 p. 395 & p. 405). Edwards later influenced the young Tversky. Arrow (1951 p. 432) discussed reference dependence, pointing out that it plays no role when outcomes refer to final wealth, and criticizing it for this reason. An early appearance of loss aversion is in Robertson (1915 p. 135).

Prospect theory was the first reference-dependent theory that could work empirically. It still is, today, the most widely tested and best confirmed theory of decision under risk. Many different and more advanced reference-dependent models (Apesteguia and Ballester 2009; Köszegi and Rabin 2006; Masatlioglu and Raymond 2016; Schmidt 2003; Schmidt et al. 2008) and definitions of loss aversion (Abdellaoui et al. 2007, Table 1; Peeters and Czapinski 1990; Köbberling and Wakker 2005; Peters 2012; Schmidt and Zank 2005) have been proposed since, that can also be applied in complex and more general situations. They are not needed for the relatively simple RP choices and our empirical tests. Further, in the alternative models, utility curvature, probability weighting, and loss aversion are often not clearly separated. In particular, the best-known alternative model (Köszegi and Rabin 2006) has not yet been extended to allow for probability weighting.

We leave detailed analyses of RP using alternative models to future studies. We have used prospect theory, the earliest, simplest, and most extensively tested reference-dependent model, and, as we have shown, it leads to clear conclusions. Whereas reference-dependent theoretical models, and experimental stimuli to detect reference dependence, have been used before, our novelty lies in combining the two in a way that resolves RP.

10 Conclusion

Rabin’s (2000) paradox is one of the most famous paradoxes in modern economics. It is commonly, although not universally, accepted as negative evidence against classical expected utility (Kahneman 2003 p. 164). Its cause had not yet been identified, so that no positive inference could be derived. We identify this cause and provide a positive inference: RP proves that we need reference-dependent generalizations of classical models, and it does so more strongly than any other paradox did before. Other deviations from expected utility do not contribute to explaining Rabin’s paradox.