1 Introduction

Most economic theories describe the decision-making process as a confluence of preferences, beliefs, and cognitive processes. Disentangling these primitives is a challenge because they are all unobservable in most empirical data. An important advantage of experiments is that auxiliary revelation mechanisms can be used to elicit participants’ beliefs. Accurate belief data can supplement choice data to facilitate stronger identification of the preferences and cognitive processes that guide choice.

It is well-known that heterogeneous preferences can make eliciting accurate beliefs difficult.Footnote 1 This is because heterogeneous preferences may also impact behavior in the revelation mechanism used to elicit beliefs. For example, participants may misreport in unincentivized Introspection mechanisms (Introspection) if they find it arduous to think carefully about their beliefs or if revealing their true belief causes them discomfort. Explicit incentives can mitigate these issues, but incentive-compatible mechanisms must use lotteries, and lotteries interact with risk preferences. This has led to the use of the sophisticated Stochastic Becker-DeGroot-Marschak mechanism (SBDM), which is predicted to induce truthful revelation for a wide variety of preferences.Footnote 2

Despite the impressive theoretical properties of the SBDM, there is little evidence that it outperforms Introspection in terms of belief accuracy (Hollard et al. 2016; Trautmann and van de Kuilen 2015).Footnote 3 Further, participants using incentive-compatible belief elicitation mechanisms often misreport their beliefs even when the probability of an event occurring is objectively known (Hao and Houser 2012; Burfurd and Wilkening 2018). These results suggest that there may be a second potential difficulty for belief elicitation: an interaction between belief elicitation mechanisms and cognition.

Heterogeneous responses to belief elicitation based on cognition have potentially important implications for interpreting belief data and for choosing a belief elicitation method. If decision-making and reporting behavior vary systematically with knowledge, modes of reasoning, cognitive effort, or fluid intelligence—fundamental components of cognition—complex mechanisms might yield reliable reporting data from participants who make better decisions, and less reliable data from participants who make sub-optimal ones. This could have serious implications for analysis, since belief errors would be correlated systematically with unobservable skills and abilities. It also suggests that researchers may face a tradeoff between catering to heterogeneous preferences and catering to heterogeneous cognition.

In this paper, we focus on a potential interaction between the SBDM mechanism and probabilistic reasoning. To study this interaction, we use a two-part design in which we first identify participants whose decision-making is consistent or inconsistent with probabilistic reasoning and then examine how participants of both types respond to different belief elicitation mechanisms.

To evaluate whether participants’ decisions are consistent with probabilistic reasoning, we use a variant of an urn task introduced in Charness and Levin (2005) and Charness et al. (2007), which we refer to as the Bucket Game. In each period, participants are individually assigned one of two buckets (A or B) with equal probability. Each bucket is divided into two sides, and each side contains 20 balls; each ball may be black or white. Participants draw and replace a ball from the left side of their bucket at the start of each period and are paid $4 if they observe a black ball and $0 if they observe a white ball. The color of the ball is informative, and presents an opportunity for a participant to update their belief that they have been given Bucket A. Participants then choose whether they would like to draw an additional ball from the left or the right side of their bucket. They are paid $4 if their second ball is black and $0 if their second ball is white.

The task is structured so that it is optimal for an individual who updates her belief in the direction predicted by Bayes’ rule to switch to the right side of her bucket if the first ball successfully earned $4, and to stay with the left side of her bucket if the first ball was unsuccessful. By contrast, if individuals use a simple reinforcement-learning heuristic or have an affective response to success, they will prefer to stay after a success and switch after a failure. This choice pattern directly violates stochastic dominance and reveals behavior that is not consistent with Bayes’ rule. Thus, by observing decisions in the Bucket Game we can identify individuals whose choices are “consistent” and “inconsistent” with probabilistic reasoning and stochastic dominance.

After 20 iterations of the Bucket Game, we begin part two of our experiment. Participants continue to play the Bucket Game, but belief elicitation is introduced. In our main treatments, half of the participants are exposed to an Introspection mechanism; after observing their first ball, participants are asked to report the probability that they have been given Bucket A. The other half are exposed to the SBDM mechanism. This belief elicitation method is incentive-compatible under minimal assumptions about risk preferences but is fairly complex and likely unfamiliar to participants.

To allow for variation in the probabilistic difficulty of forming correct beliefs, we vary the composition of balls in the bucket and the number of balls drawn. A feature of our design is that all participants encounter periods in which two balls are drawn and one black and one white ball are observed. In these periods the combined signal is uninformative, and thus it takes no effort to form a belief. Our design therefore allows us to observe belief errors in (i) problems in which signals are informative and beliefs are costly to compute and (ii) problems in which signals are uninformative and beliefs are easy to compute.

We interpret the Bucket Game as identifying individuals who have high and low crystallized intelligence related to probabilistic reasoning. Relative to fluid intelligence, which captures an individual’s capacity for abstract reasoning and induction, crystallized intelligence relates to the knowledge an individual has acquired through experience (Horn and Cattell 1966).Footnote 4 Ex ante, we predicted that probabilistic reasoning would be important for the SBDM mechanism because it requires that an individual’s choices are consistent with stochastic dominance and probabilistic sophistication in order for its incentive properties to hold. As our “inconsistent” group frequently violates stochastic dominance, we predicted that these individuals may not understand the incentive properties of the mechanism and may have additional belief errors as a result.

In the SBDM mechanism, errors may arise from two potential sources: (i) inaccurate underlying beliefs that are a result of incorrect Bayesian updating and (ii) misreported beliefs that are due to a misunderstanding of the incentive properties of the mechanism. As both types of errors are likely to differ with probabilistic reasoning, observing a difference in the belief errors between consistent and inconsistent participants in the SBDM mechanism does not necessarily imply that inconsistent participants are misunderstanding the incentive properties of the mechanism.

To isolate the mechanism-specific misreport channel, we employ a difference-in-difference approach in which we compare the difference in mean errors between consistent and inconsistent participants in the SBDM mechanism with the same difference in the Introspection mechanism. As the Introspection mechanism is not predicted to generate mechanism-specific misreports, we predict that the SBDM mechanism will have a larger difference in belief errors between consistent and inconsistent participants than the Introspection mechanism.Footnote 5

Pooling the data from our initial experiments and pre-registered follow-up experiments, the results are in line with these predictions: the mean error of a consistent participant is 37.6 percent smaller than that of an inconsistent participant in the SBDM mechanism, but only 22.4 percent smaller in the Introspection mechanism. The difference in sensitivities to probabilistic reasoning is significant in a permutation test that ensures independence between the main effects and the interaction effect. Further, the difference in sensitivities is strongest in easy decision problems in which Bayesian updating is not required. In these decision problems, identifying the correct belief is unlikely to be cognitively costly, and differences in belief errors are most likely driven by confusion stemming from the SBDM mechanism itself.

A caveat to these results is that the magnitude of the estimated interaction effect between the SBDM mechanism and probabilistic reasoning is large and significant in our initial lab-based experiments but small and not significant in our online follow-up experiments.Footnote 6 Further, while the magnitude of the interaction effect in the follow-up experiments increases when the data is restricted to easy decision problems or when the most obvious outliers are removed, a difference in the magnitude of the interaction effect between the two samples persists. Thus, we see value in future independent replications of our experiments and in understanding whether there are differences in how belief elicitation mechanisms and incentives interact with lab and online environments.

The main rationale for using incentive-compatible belief elicitation is to induce participants to carefully report their beliefs and to provide high-quality information even when calculating beliefs is costly. Thus, we would predict that belief errors in the SBDM mechanism will be less sensitive to the difficulty of the decision problem than errors under Introspection. Consistent with this second prediction, we find that in the SBDM mechanism, the mean error of participants in easy decision problems is 20.7 percent smaller than in hard decision problems. By contrast, in the Introspection mechanism, the mean error in easy decision problems is 53.8 percent smaller than the mean error in hard decision problems. The difference in sensitivities to task difficulty is significant in a permutation test that ensures independence between the main effects and the interaction effect.

In a follow-up experiment, we also identify individuals with high and low fluid intelligence (“ability”) with a Raven task and high and low proclivity for cognitive effort using an extended Cognitive Reflection Test.Footnote 7 Although these tasks strongly predict errors in both the SBDM mechanism and Introspection, there is no significant interaction effect between high and low-ability types and mechanisms, nor high and low-effort types and mechanisms.

Taken together, our results suggest that while the SBDM mechanism encourages participants to think carefully about their beliefs in difficult elicitation problems, some individuals struggle to understand the mechanism. This may lead to heterogeneity in belief errors across a population based on probabilistic reasoning skills. Our results help to clarify why earlier studies have found mixed evidence regarding the relative efficiency of the SBDM mechanism and the Introspection mechanism. They also highlight a potential confound in designs that rely on individual-level beliefs, since belief errors may be correlated systematically with probabilistic reasoning.

The rest of the paper is arranged as follows. In Sect. 2 we describe the Stochastic Becker-DeGroot-Marschak mechanism and discuss the existing literature on belief elicitation and cognitive processes. In Sect. 3 we discuss the experiment, hypotheses, and analysis plan. Results are presented in Sect. 4.

2 The stochastic Becker-DeGroot-Marschak mechanism

Consider a participant in an experiment who has a subjective belief about the distribution of a discrete random variable X, with range \(\mathcal {X}\). Her true belief \(P_X\) describes the probability that \(X = x\) for each \(x \in \mathcal {X}\), and the researcher wants to know the probability \(p = P(X = x)\) that she assigns to a particular event \(X = x\).

If participants have an aversion to lying, and if there are no cognitive costs from identifying or reporting p, unincentivized Introspection will be truth-telling. However, if the researcher is concerned that these conditions are not satisfied, she can use explicit incentives to induce truthful reporting. “Scoring rules” describe a payment schedule based on a participant’s reported belief \(r \in [0,1]\) and the realization of the random variable X. For a single realization of X, a scoring rule S is a mapping \(S: [0,1] \times \mathcal {X} \rightarrow \mathbb {R}\). This means that \(S(r,x)\) is paid when r is reported and outcome x is realized.

Consider a participant whose utility function u belongs to the class of von Neumann-Morgenstern Expected Utility functions. Faced with scoring rule S, a rational participant reports \(r \in [0,1]\) to maximize \(\mathbb {E}u(S(r,X))\) where, by the expected utility assumption,

$$\begin{aligned} \mathbb {E}u(S(r,X)) = \sum _{x \in \mathcal {X}} u(S(r,x))P(X=x). \end{aligned}$$

Using the terminology introduced by Winkler and Murphy (1968), a “proper” scoring rule renders it optimal for risk-neutral agents to report their beliefs truthfully. That is, given a utility function \(u(S(r,X)) = S(r,X)\), the scoring rule is “truth-telling” (or “incentive-compatible”) in the sense that, for all \(P_X \in \mathcal {P}_X\),

$$\begin{aligned} p \in \underset{r\in [0,1]}{\arg \max } \mathbb {E} u(S(r,X)). \end{aligned}$$
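As a concrete illustration (a textbook example, not the payment rule used in our experiment), consider the quadratic scoring rule for a binary event, \(S(r,x) = 1 - (r - \mathbb {1}\{x\})^2\), where \(\mathbb {1}\{x\}\) equals 1 if the event occurs and 0 otherwise. A risk-neutral participant with belief p chooses r to maximize

$$\begin{aligned} p\left[ 1 - (r-1)^2\right] + (1-p)\left[ 1 - r^2\right] , \end{aligned}$$

and the first-order condition \(2p(1-r) - 2(1-p)r = 0\) gives \(r = p\): truthful reporting is optimal.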

As the definition suggests, truth-telling may not occur in cases in which \(u(S(r,X)) \ne S(r,X)\). This may be problematic when participants have heterogeneous risk preferences that are unobservable to the researcher.Footnote 8

As noted as far back as Smith (1961) and Savage (1971), moving from a deterministic scoring rule to a stochastic one makes it possible to induce truth-telling for all von Neumann-Morgenstern Expected Utility maximizers.Footnote 9 Here, we discuss a stochastic scoring rule that has garnered significant interest in the literature: the Stochastic Becker-DeGroot-Marschak mechanism (SBDM).

In the SBDM mechanism, the experimenter presents a participant with a choice under risk, described by lottery \(H_{A}L\), which pays H if event A occurs and L if not. The participant forms a subjective belief that event A will occur. We denote this subjective probability p. The participant is then asked to issue report \(r \in [0,1]\) about her belief p before making a decision based on her beliefs. A second lottery is created in parallel. A number z is realized from the distribution of random variable Z, which has distribution \(P_Z\) on support [0, 1]. The participant does not know z, but does know that if z falls above her report r she will receive lottery \(H_{z}L\), which makes a high payoff H with probability z. If z falls below r she receives lottery \(H_AL\). The lotteries therefore offer identical payoffs with different probabilities. It is in the participant’s best interest to report \(r = p\), because a report of \(r \ne p\) might mean the participant receives the less desirable lottery.
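The incentive logic can be verified numerically. The following minimal sketch (our own illustration, assuming Z is uniform on [0, 1]; the function names are ours, and this is not the implementation used in the experiment) computes the probability of receiving the high payoff H as a function of the report and confirms that truthful reporting maximizes it:

```python
import numpy as np

def win_prob(r, p, grid=10_001):
    """Probability of the high payoff H when reporting r with belief p,
    assuming z ~ Uniform[0, 1]: if z <= r the participant keeps the
    subjective lottery (H with probability p); if z > r she receives
    the objective lottery (H with probability z)."""
    z = np.linspace(0.0, 1.0, grid)
    chance_of_h = np.where(z <= r, p, z)   # chance of H for each realization of z
    return chance_of_h.mean()              # average over z

p = 0.7                                    # true belief that event A occurs
reports = np.linspace(0.0, 1.0, 101)
best = max(reports, key=lambda r: win_prob(r, p))
print(best)                                # ~0.7: reporting r = p is optimal
```

Because both lotteries pay either H or L, a participant whose preferences respect stochastic dominance simply wants to maximize the probability of receiving H, which is why the argument does not depend on risk attitudes.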

By construction, the SBDM uses the same two payoffs for the subjective and objective lotteries, and thus the particular cardinal values assigned to the high and low payoffs are not predicted to influence reports. As a result, the SBDM induces truth-telling under minimal assumptions about preferences; namely, that preferences over lotteries are consistent with stochastic dominance and probabilistic sophistication (Karni 2009). As per Machina and Schmeidler (1992), probabilistic sophistication means that a participant will rank lotteries according to the implied probability distribution over outcomes. “Stochastic dominance” is the condition that a participant has preference relation \(\succeq\) over lotteries such that \(H_qL \succeq H_{q'}L\) for all \(H > L\) if and only if \(q\ge q'\).

2.1 Cognitive processes and belief elicitation

Although there is little research that empirically studies the interaction between cognitive processes and reporting behavior in belief elicitation mechanisms, a few papers suggest that cognition may influence behavior in the SBDM. Hao and Houser (2012) evaluate two implementations of the SBDM mechanism: the standard implementation in which a participant directly reports her beliefs, and an ascending clock mechanism.Footnote 10 While both mechanisms are incentive-compatible, the ascending clock mechanism is also obviously strategy-proof and more easily understood by cognitively limited agents (Li 2017). Thus, differences in the quality of reports across these two implementations suggest that cognition may influence reporting. Hao and Houser identify “naive” subjects, who report \(r \ne p\), and “sophisticated” subjects, who report \(r = p\). The clock mechanism reduces the sample of naive observations and improves the accuracy of reported beliefs.Footnote 11

Freeman and Mayraz (2019) study how individuals choose between safe and risky lotteries in environments in which (i) participants are shown exactly one lottery, (ii) they are given a choice list and one decision is randomly selected for payment, and (iii) they are given a choice list but informed about the decision that will be paid before making their choice. The paper finds more risk-taking in the individual choice problem relative to the other two formats and conjectures that the choice list provides scaffolding that helps decision makers identify their true preferences. If cognition is an issue in the SBDM mechanism, then we should also find that belief errors in the SBDM are reduced with choice lists. Holt and Smith (2016) compares behavior between a direct elicitation method and a choice list using an “induced value” urn task in which participants receive one or more signals from an urn. The probability that the balls are drawn from a particular urn can be calculated explicitly via Bayes’ rule. The paper does not find a significant difference in belief errors between the choice list approach and a direct elicitation implementation based on Holt and Smith (2009). However, it does find that the choice list reduces boundary reports. Burfurd and Wilkening (2018) also does not find differences in belief errors between a direct elicitation format based on Hao and Houser (2012) and a choice list format in urn problems with a single draw. However, Burfurd and Wilkening (2018) does find that there is significant heterogeneity in belief errors across individuals even when the probability of an event is objectively known.

In a concurrent project, Schlag and Tremewan (2020) studies a “frequency”-based belief elicitation mechanism that can be used when multiple realizations of an outcome are available. The paper compares this mechanism to an SBDM mechanism based on the instructions of Dal Bó et al. (2017). The authors find that the frequency method performs well against the SBDM and that the difference in performance is driven by a large number of participants who choose a focal report of 50% in the SBDM mechanism. These focal reports are correlated with poor performance in a Cognitive Reflection Test. We do not find the same large spike of focal reports at 50% in our data, though we use a different analogy-based instruction format and include a control quiz.Footnote 12

3 The experiment

We use a two-part design in which we first identify whether participants are consistent or inconsistent probabilistic reasoners, making use of a computerized “Bucket Game”. We then study how participants respond to different belief elicitation techniques. We describe the Bucket Game before introducing the treatments.

3.1 The bucket game

The Bucket Game is a variant of an urn task introduced in Charness and Levin (2005) and Charness et al. (2007). In each period, a participant is allocated one of two buckets (A or B) with equal probability. Each bucket is divided into a left and a right side and each side holds 20 balls. Subjects are not told which bucket they have been given, but are provided an illustration that shows the composition of balls in the two buckets. An example illustration is given in Fig. 1. As can be seen, the left-hand side of each bucket is composed of a mixture of black and white balls, and there are more black balls in the left-hand side of Bucket A than in that of Bucket B. The right-hand side of Bucket A is filled with only black balls and the right-hand side of Bucket B is filled with only white balls. The buckets used in all treatments share these features.

Fig. 1: Illustrations of Bucket A and Bucket B, as presented to participants

In each period, the participant observes the color of a ball that is drawn (with replacement) from the left-hand side of her bucket. If the participant observes a black ball, she receives a stage-one payment of $4. If the ball is white, she receives $0. Next, the participant must decide whether to draw a second ball from the same (left) side of her bucket, or to switch to the other (right) side. The participant receives a payment of $4 if she observes a black ball in this second stage and receives $0 if she observes a white ball.

There are more black balls on the left-hand side of Bucket A than on the left-hand side of Bucket B. Thus, the first draw from the bucket is informative about the bucket that has been allocated to the participant. Participants whose updating is directionally consistent with Bayes’ rule are predicted to use this information in their choice. If a consistent participant observes a black ball from the left-hand side of her bucket, her belief that she has been given Bucket A will exceed 0.5 and she should choose to switch to the right side of the bucket. If a consistent participant receives a white ball, her belief that she has been given Bucket A will be less than 0.5 and she should choose to continue to draw from the left side.

However, the game is designed so that the expected value maximizing choice is at odds with an intuitive reinforcement learning heuristic in which a decision maker repeats actions that are successful and changes actions when unsuccessful. When observing a black ball on the first draw, the participant is “successful” and receives $4. Thus, reinforcement learning predicts that the participant will continue to choose left. After observing a white ball, the participant receives $0 and reinforcement learning predicts that the participant switches to the right. We therefore predict that participants who use a reinforcement learning heuristic will always choose the side that is stochastically dominated.
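These predictions can be made concrete with a short computation. The sketch below (our own illustration, using the high-information bucket composition described in Sect. 3.2.1; the function names are ours) applies Bayes’ rule to the first draw and compares the expected value of a second draw from each side:

```python
from fractions import Fraction

def posterior_A(black_A, black_B, drew_black, n=20):
    """Posterior that the bucket is A after one left-side draw,
    starting from the 50/50 prior (Bayes' rule)."""
    like_A = Fraction(black_A if drew_black else n - black_A, n)
    like_B = Fraction(black_B if drew_black else n - black_B, n)
    return like_A / (like_A + like_B)

def best_side(black_A, black_B, drew_black, n=20):
    """Expected-value maximizing second draw. The right side is all black
    under Bucket A and all white under Bucket B, so P(black | right) = P(A)."""
    p_A = posterior_A(black_A, black_B, drew_black, n)
    ev_left = p_A * Fraction(black_A, n) + (1 - p_A) * Fraction(black_B, n)
    ev_right = p_A
    return ("right" if ev_right > ev_left else "left"), p_A

# High-information buckets: 14 black balls on the left of A, 6 on the left of B.
print(best_side(14, 6, drew_black=True))   # ('right', Fraction(7, 10))
print(best_side(14, 6, drew_black=False))  # ('left', Fraction(3, 10))
```

After a success the Bayesian choice is to switch, and after a failure it is to stay, which is exactly the opposite of the reinforcement-learning heuristic.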

3.2 Experimental design and treatments

We ran an initial experiment consisting of 239 participants and a follow-up experiment consisting of 244 participants. Our initial experiment was conducted in a traditional lab setting at the University of Melbourne’s Experimental Economics Lab. Participants were recruited using ORSEE (Greiner 2015) from the university’s experimental economics subject pool and sessions were conducted using z-Tree (Fischbacher 2007). The follow-up experiment recruited participants from the same database but excluded those who participated in the initial experiment. The follow-up experiment was pre-registered with the Centre for Open Science (https://osf.io/t57vq) and was conducted online using oTree (Chen et al. 2016).

In the initial experiment, we randomized individuals to computers in the lab using a set of bingo balls. Each terminal was assigned one of six potential treatments. These treatments are summarized in Table 1. The treatments differed in the number of black balls in the left-hand side of Bucket A, and in the belief elicitation method.

A session consisted of three blocks and each block consisted of 20 periods. In the first block, participants in all treatments received the same computerized instructions describing the Bucket Game and were required to successfully answer all questions in a computerized quiz before starting the experiment. Participants then played 20 periods of the Bucket Game. In each period, they were informed whether they successfully drew a black ball from their chosen side of the bucket.

Table 1 Summary of treatments

In the second block, we elicited beliefs with the SBDM mechanism in one-third of treatments and with an Introspection mechanism in one-third of treatments. The remaining treatments were not exposed to any belief elicitation mechanism and were used to test for an observer effect. We discuss the observer effect in Appendix B.

As with the first block, all participants received computerized instructions at the start of the second block and were required to take a quiz before continuing. The instructions in the Introspection and SBDM treatments explained the belief elicitation task and included additional control questions to ensure participant comprehension.

After reading the instructions for Block Two, participants played twenty more periods of the Bucket Game. We elicited beliefs after the participant had observed the draw from the left-hand side of their bucket but before they chose left or right. All beliefs were expressed as the “chance-in-100” the participant had been given Bucket A.

In the Introspection treatments, there were no payments associated with belief reports. However, the instructions asked participants to think carefully about their beliefs.

In the SBDM treatments, we used an adaptation of the direct elicitation method developed in Hao and Houser (2012). This set of instructions was shown in Burfurd and Wilkening (2018) to yield high quality data and to be quick to implement relative to alternatives.

Block Three of the experiment was identical to Block Two, except that a participant’s initial draw consisted of two balls from the bucket instead of one. These draws were done with replacement and the participant was informed of the color of both balls before reporting their belief and making their left/right choice. Subjects were paid for each black ball they received from the initial draws. As discussed in more detail in Sect. 3.2.1 below, this block was important because it created situations in which the signal was uninformative, which allows us to study how beliefs interact with task difficulty. Instructions for Block Three were short and discussed only the additional draw that the participant observed.

To avoid wealth effects and potential hedging strategies, participants were paid in cash for three randomly chosen periods announced at the end of the experiment—one chosen from each of the three blocks.Footnote 13 Participants were allowed to proceed at their own pace through the experiment and most participants completed the experiment in under 45 min. Including a show-up fee of $10, the average payment of a participant was $24.40 AUD. The experiments were run in November and December of 2015, when $1 AUD \(\approx\) $0.72 USD.

The follow-up experiment was similar to the initial experiment except that we dropped the No-Elicitation treatment and included two additional questionnaires. The first was an expanded version of Frederick’s Cognitive Reflection Test (Frederick 2005), which used three additional questions from Primi et al. (2016) and an additional set of placebo questions taken from Thomson and Oppenheimer (2016). The questions on the CRT were given in a fixed order, with the original and well-known “bat-ball” CRT question asked last. The full list and ordering of questions is included in Appendix H.

The second survey was a short form version of Raven’s Advanced Progressive Matrices test developed and validated in Bors and Stokes (1998). The short form consists of 12 questions extracted from the original 36, but does not include early questions in the test that most university students are able to answer correctly.

We randomly selected one question from each quiz and paid the participant $4 if they answered the question correctly. Thus, the incentives offered in these quizzes were similar in magnitude to the main experiment.

Participants in the follow-up experiment worked at their own pace and no time limits were imposed when answering the main questions or surveys. The show-up fee was increased to $15 to cover the time required to complete the two questionnaires. The average payment was $35.55 AUD, with most participants completing the experiment in 75 minutes or less. The experiments were completed in December 2020, when $1 AUD \(\approx\) $0.74 USD. Due to Covid-19 restrictions the follow-up experiments were conducted online using Zoom and oTree (Chen et al. 2016). All key protocols were preserved and participants were able to privately ask questions throughout. Participants’ names and decisions were not visible to other participants.

The total number of participants assigned to each of the four main treatments in the initial experiment and follow-up experiment is shown in Table 2. Following our pre-analysis plan, we compared the initial experiment to the follow-up experiment and did not find any statistically significant differences (see "Appendix A"). We therefore pooled the data from the two experiments when reporting averages and testing the main two hypotheses. We also show in Appendix F that our results are robust to outliers, which tended to be more frequent in the follow-up online experiment.

Table 2 Sample sizes

3.2.1 Informative and uninformative signals

An important feature of our design is that all participants were exposed to periods in which they drew one black ball and one white ball before reporting their beliefs in Block Three. In these periods, the signals were jointly uninformative and the decision problem required no Bayesian updating to report the true belief. We conjecture that reporting the correct belief was not cognitively challenging in these periods, and we compare errors from these periods to errors from periods with informative signals to test whether errors in the Introspection mechanism are influenced by task difficulty.

To generate additional variation in the difficulty of the belief updating task, we also used two different sets of buckets across the treatments and varied the number of balls drawn within a treatment. In our “high information” treatments, Bucket A contained 14 black balls and Bucket B contained 6 black balls. In Blocks One and Two of this treatment, receiving a single black signal results in a posterior of \(\rho ' = 0.7\) while receiving two black signals in Block Three results in a posterior of \(\rho ' = 0.84\). In the other half of the treatments, Bucket A contained 12 black balls and Bucket B contained 8 black balls. In these treatments, receiving a single black signal results in a posterior of \(\rho ' = 0.6\) and receiving two black signals results in a posterior of \(\rho ' = .69\).
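These posteriors follow directly from Bayes’ rule applied to the 50/50 prior. In the high-information treatment, after one black ball and after two black balls,

$$\begin{aligned} \rho ' = \frac{14/20}{14/20 + 6/20} = 0.7 \qquad \text {and} \qquad \rho ' = \frac{(14/20)^2}{(14/20)^2 + (6/20)^2} = \frac{196}{232} \approx 0.84. \end{aligned}$$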

All treatments were designed so that posteriors were an equal distance from the prior whether the participant observes a white or a black ball (i.e., the posteriors were 0.7 and 0.3 after receiving a black ball or a white ball in the high information treatments). This symmetry allows us to cleanly aggregate participants’ reported beliefs: for example, in Block Two of the high information treatment, a participant who reported \(r = 0.5\) has a belief error of 0.2 regardless of whether they observed a white or a black ball.

3.2.2 Measures of cognitive heterogeneity

We classify participants as consistent or inconsistent probabilistic reasoners based on their decisions in the last ten periods of Block One. We elected to use only the second half of the Block One sample to ensure that individuals were not being classified based on early experimentation.Footnote 14 A participant is classified as consistent if they made 7 or more correct left/right decisions in periods 11–20. Our type cutoff was set to achieve as close to a median split across consistent and inconsistent types as possible. Based on this classification there are 215 consistent participants and 188 inconsistent participants in our treatments with a belief elicitation mechanism. The proportion of consistent types is balanced across treatments, with 105 consistent participants in the Introspection treatments (53 percent of Introspection participants) and 110 consistent participants in the SBDM treatments (54 percent of SBDM participants).

Cognitive ability is often divided into crystallized intelligence, which relates to knowledge that an individual has acquired, and fluid intelligence, which relates to an individual’s capacity for abstract reasoning, using the model proposed in Cattell (1963). As noted in the introduction, we interpret our Bucket Game as identifying individuals who have high and low crystallized intelligence related to probabilistic reasoning. Individuals who are inconsistent are observed to frequently violate stochastic dominance, which requires only updating in the direction predicted by Bayes’ rule. We predict that such knowledge is important to the SBDM mechanism because stochastic dominance is one of the weak assumptions required for the mechanism to be incentive compatible.

In our follow-up experiment we use additional surveys to generate measures of fluid intelligence and cognitive effort. Following our analysis plan, we classify individuals as high-ability and low-ability using a median split of performance in the short-form Raven’s Advanced Progressive Matrices test. Individuals are classified as high-ability if they answered 9 or more of the 12 matrices questions correctly and low-ability otherwise. 134 individuals were classified as high-ability and 110 were classified as low-ability. 74 of the high-ability participants were in the Introspection treatment (representing 60 percent of Introspection participants) and 60 were in the SBDM treatment (51 percent of SBDM participants).

We classify individuals into high-effort and low-effort groups using a median split of our extended CRT.Footnote 15 137 participants who answered 4 or more CRT questions correctly are classified as high-effort, while 107 are classified as low-effort. 67 of the high-effort cohort belong to the Introspection treatment (representing 55 percent of Introspection participants) while 70 belong to the SBDM group (57 percent of SBDM participants).

3.3 Statistics and hypotheses

Both of our main hypotheses come from a \(2\times 2\) factorial design. We are primarily interested in the interaction effect between factors. The standard approach to testing this type of model would be to use a parametric ANOVA specification. However, our dependent variable in this analysis, Error, is the absolute error of a participant’s reports relative to the objective Bayesian posterior. Errors are not normally distributed, and thus the underlying assumption of parametric ANOVA is not satisfied. The permutation test represents an ideal alternative since it requires only minimal assumptions about the errors, is exact in some cases, and has high power relative to other approaches.

The main assumption of permutation tests is that the data is exchangeable under the null hypothesis. Data is exchangeable if the probability of the observed data is invariant with respect to random permutations of the indexes (Basso et al. 2009). In the \(2\times 2\) factor design, the observations are typically not exchangeable since units assigned to different treatments have different expectations. This implies that approaches that freely permute data may fail to separate main and interaction effects (Good 2000). Instead, we use a variant of the synchronized permutation test of Pesarin (2001) and Salmaso (2003), which restricts permutations to the same level of a factor to generate test statistics for main factors and interactions that are independent of each other (Basso et al. 2009).

A detailed explanation of the synchronized permutation test is included in Appendix D. We note that in some cases our data is not balanced, which can also confound main effects and interaction effects. To deal with this issue, we follow a suggestion in Montgomery (2017) of randomly dropping observations so that each cell has the same number of observations. Although we lose some power by reducing the size of the sample, the resulting data is a random sample of the original and the resulting test statistic is independent of the main effect. To ensure that our random subset of data is not driving our results, we use an outer loop in our testing procedure and perform our permutation test with 1,000 subsamples. We report the average p value over the 1,000 subsamples in the main text.
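To fix ideas, the following sketch illustrates the logic of the synchronized test for our interaction effect. It is a stylized illustration under a balanced \(2\times 2\) design, not our exact implementation; the precise procedure, including the test statistic and the distribution over the number of exchanged units, follows Basso et al. (2009) and is detailed in Appendix D. All function names are ours, and inputs are numpy arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

def did(y, mech, ptype):
    """Difference-in-difference of cell means: (consistent - inconsistent)
    in SBDM minus (consistent - inconsistent) in Introspection;
    proportional to the interaction estimate."""
    m = lambda i, j: y[(mech == i) & (ptype == j)].mean()
    return (m(1, 1) - m(1, 2)) - (m(2, 1) - m(2, 2))

def sync_perm_interaction(y, mech, ptype, n_perm=10_000):
    """One-sided synchronized permutation p value for a negative interaction
    in a balanced 2x2 design: the same number of units is exchanged between
    the two mechanisms within each probabilistic-reasoning stratum."""
    obs = did(y, mech, ptype)
    cells = {(i, j): np.where((mech == i) & (ptype == j))[0]
             for i in (1, 2) for j in (1, 2)}
    n = len(cells[(1, 1)])                 # common cell size (balanced design)
    hits = 0
    for _ in range(n_perm):
        y_p = y.copy()
        nu = rng.integers(0, n + 1)        # synchronized: same nu in both strata
        for j in (1, 2):
            a = rng.choice(cells[(1, j)], nu, replace=False)
            b = rng.choice(cells[(2, j)], nu, replace=False)
            y_p[a], y_p[b] = y[b], y[a]    # exchange units across mechanisms
        hits += did(y_p, mech, ptype) <= obs
    return hits / n_perm
```

The key feature is that the same number of units is exchanged within each stratum, so the permutation distribution of the interaction statistic is not contaminated by the main effects.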

A potential concern when using a permutation test is that it may be sensitive to heterogeneity in the dispersion of points across cells. This issue was raised in the context of the Mann-Whitney test by Fagerland and Sandvik (2009), who show that deviations in Type I error rates can be generated for a null of identical means or medians when the means and medians of two samples are the same but the skewness or kurtosis of the samples differ. To at least partially address this concern, we also tested a Wald-type permutation statistic (WTPS) developed by Pauly et al. (2015). This procedure uses a free permutation of the dependent variable and is asymptotically valid in the case of heteroscedasticity in the errors across cells. As seen in Appendix F, results using this test are similar to those in the main text if we control for outliers.

Finally, in our tables, we also report the results from pairwise permutation tests. For these tests, we regress error on the mechanism treatment dummy and randomize assignment to treatments using the “ritest” command in Stata (Heß 2017). These permutation tests are performed 10,000 times and the null hypothesis is that there are no differences between the test groups.
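In spirit, each of these pairwise tests re-randomizes the treatment labels and recomputes the difference in mean errors. A minimal Python analogue of the basic procedure (our own sketch; ritest itself runs in Stata) is:

```python
import numpy as np

rng = np.random.default_rng(1)

def pairwise_perm_test(y, treat, n_perm=10_000):
    """Permutation test of no difference in mean error between two groups
    (numpy arrays): shuffle treatment labels, recompute the mean difference."""
    def mean_diff(t):
        return y[t == 1].mean() - y[t == 0].mean()
    obs = mean_diff(treat)
    perm = np.array([mean_diff(rng.permutation(treat)) for _ in range(n_perm)])
    return np.mean(np.abs(perm) >= abs(obs))   # two-sided p value
```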

3.3.1 Hypotheses

Sensitivity to Probabilistic Reasoning: As shown by Karni (2009), the SBDM mechanism is incentive-compatible when individuals’ preferences over risk satisfy probabilistic sophistication and stochastic dominance. Thus, for consistent participants, we would predict lower errors in the SBDM regardless of the difficulty of the belief updating problem.

By contrast, a participant who makes an incorrect decision in the Bucket Game is actively choosing a bucket with a lower expected value over one with a higher expected value. Such actions violate stochastic dominance. Thus, inconsistent participants may have difficulty understanding and interacting with the SBDM mechanism.

Using behavior in the Introspection treatments to control for inherent differences in accuracies between the two groups, we predict:

Hypothesis 1

The SBDM mechanism is more sensitive to probabilistic reasoning than the Introspection mechanism.

If Hypothesis 1 is true, we should see a larger difference in errors between consistent and inconsistent participants in the SBDM mechanism than in the Introspection mechanism. Let \(i \in \{1, 2\}\) represent the assignment of an individual to the SBDM mechanism (\(i = 1\)) or the Introspection mechanism (\(i = 2\)). Likewise, let \(j \in \{1, 2\}\) represent whether an individual is classified as consistent (\(j = 1\)) or inconsistent (\(j = 2\)). Then, using a standard additive ANOVA specification, we assume that the mean absolute error of individual k assigned to mechanism i and classified as type j, \(E_{ijk}\), can be decomposed into an overall mean (\(\mu\)), two main effects (\(\alpha _i\) and \(\beta _{j}\)), an interaction effect \((\alpha \beta )_{ij}\), and an error term \(\epsilon _{ijk}\):

$$\begin{aligned} E_{ijk} = \mu + \alpha _{i} + \beta _{j} + (\alpha \beta )_{ij} + \epsilon _{ijk}. \end{aligned}$$
(1)

By including the additive constant \(\mu\), all main effects and interactions in the model can be defined to sum to zero. Thus, we assume that \(\alpha _{1} + \alpha _{2} = 0\), \(\beta _{1} + \beta _{2} = 0\), \((\alpha \beta )_{i1} + (\alpha \beta )_{i2} = 0\) for all i, and \((\alpha \beta )_{1j} + (\alpha \beta )_{2j} = 0\) for all j. In this construction, \(\alpha _{1} = -\alpha _{2}\) and thus, under the null of no effect of the mechanism on errors, each of the main effects \(\alpha _{1} = \alpha _{2} = 0\). Under the alternative, \(\alpha _{1}\) represents the difference from a zero average, and the interaction term \((\alpha \beta )_{ij}\) represents the deviation from the sum \(\alpha _{i} + \beta _{j}\).

Hypothesis 1 predicts that \((\alpha \beta )_{11} < 0\). This would imply that there is a greater difference in errors between consistent and inconsistent participants in the SBDM mechanism than in the Introspection mechanism. As seen in the Appendix, the estimate for \((\alpha \beta )_{11}\) is based on the difference between (i) the difference in mean errors between consistent and inconsistent types in the SBDM mechanism and (ii) the difference in mean errors between consistent and inconsistent types in the Introspection mechanism. Thus, when discussing our results, we will report the mean errors of each group and discuss the magnitude and one-sided significance of this difference-in-difference.
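Under the zero-sum constraints above, this difference-in-difference identifies the interaction exactly: writing \(\bar{E}_{ij}\) for the mean error in cell (i, j), in expectation

$$\begin{aligned} \left( \bar{E}_{11} - \bar{E}_{12}\right) - \left( \bar{E}_{21} - \bar{E}_{22}\right) = 4(\alpha \beta )_{11}, \end{aligned}$$

so a negative difference-in-difference corresponds directly to \((\alpha \beta )_{11} < 0\).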

As noted in the introduction, a belief error in the Introspection treatment is based on inaccurate underlying beliefs that are a result of incorrect Bayesian updating while a belief error in the SBDM mechanism may be a combination of (i) inaccurate underlying beliefs and (ii) misreported beliefs that are due to a misunderstanding of the incentive properties of the mechanism. In order for the interaction effect to be interpreted as a measurement of SBDM-specific misreports, the difference in errors between consistent and inconsistent participants that stem from inaccurate beliefs must be similar for the two mechanisms.

As discussed below, we hypothesize that the SBDM mechanism is likely to improve accuracy in difficult questions in which decision making is cognitively costly. Thus, there is a concern that accuracy improvements may not be uniform across consistent and inconsistent participants. To address this concern, we report the difference-in-difference estimate for Hypothesis 1 using only the decision problems with an uninformative signal in addition to reporting the estimate from the full sample. In this subset of decision problems, underlying beliefs require no updating and we have no reason to believe that belief accuracy should differ across mechanisms.

Sensitivity to Task Difficulty: While the Introspection mechanism may be easier for inconsistent participants to understand, a concern is that participants may not have an incentive to think carefully about their belief when updating is cognitively costly. This would imply that the quality of data in the Introspection mechanism may be strongly dependent on the difficulty of forming accurate beliefs.

In our design, participants are exposed to decision problems in which signals are informative and in which Bayesian updating is challenging. Participants are also exposed to simple problems in which signals are uninformative and no Bayesian updating is needed. Using behavior in the SBDM treatments to control for inherent differences in belief errors between these two types of problems, we would predict:

Hypothesis 2

The Introspection mechanism is more sensitive to task difficulty than the SBDM mechanism.

To test for Hypothesis 2, we again let \(i \in \{1, 2\}\) represent the assignment of an individual to the SBDM mechanism (\(i = 1\)) or the Introspection mechanism (\(i = 2\)), but divide our decision problems into hard problems in which the posterior is informative (\(j = 1\)) and easy problems in which the posterior is uninformative (\(j = 2\)). We predict that the difference is greater in the Introspection mechanism than in the SBDM mechanism. Thus, our test statistic is given by:

$$\begin{aligned} E_{ijk} = \mu + \alpha _{i} + \beta _{j} + (\alpha \beta )_{ij} + \epsilon _{ijk}, \end{aligned}$$
(2)

where \(E_{ijk}\) is the mean absolute error of participant k in mechanism i in decision problems of j difficulty. We predict that \((\alpha \beta )_{21} > 0\) as this would indicate that there is greater variation in belief errors under Introspection when participants encounter easy decision problems relative to difficult decision problems. We note that \((\alpha \beta )_{21}\) is based on the difference between (i) the difference in mean errors between informative and uninformative problems in the Introspection mechanism and (ii) the difference in mean errors between informative and uninformative problems in the SBDM mechanism. Thus, when discussing our results, we will again report the mean errors associated with each mechanism-difficulty combination, and discuss the magnitude and one-sided significance of this difference-in-difference.

Combining Hypotheses 1 and 2, we predict that the relative performance of the SBDM is likely to be best for consistent types in problems with informative signals and worst for inconsistent types in problems with uninformative signals. A priori, we cannot order the other two combinations of types and decision problems since the relative importance of mechanism complexity and task difficulty are unknown.

4 Results

4.1 Probabilistic reasoning

Result 1

Consistent with Hypothesis 1, the SBDM mechanism is more sensitive to probabilistic reasoning than the Introspection mechanism.

Table 3 reports mean errors of reports under the SBDM mechanism and the Introspection mechanism for (i) consistent participants, (ii) inconsistent participants, and (iii) both consistent and inconsistent participants combined. We report mean errors for each informative posterior pair, starting with the most informative and ending with the least informative. Thus, for instance, the \(\rho ' \in \{0.16, 0.84\}\) column corresponds to data from Block Three of the high-information treatments when a participant has drawn either two black balls or two white balls. We then show mean errors for all informative signals combined and for the case of an uninformative signal. Finally, mean errors over all decision problems are shown in the last column.

Table 3 Mean error of reports under the SBDM mechanism and the Introspection mechanism for (i) consistent participants, (ii) inconsistent participants, and (iii) all participants combined

In Sect. 3.3.1 we showed that the interaction effect is based on the difference between (i) the difference in mean errors of consistent and inconsistent types in the SBDM mechanism and (ii) the difference in mean errors of consistent and inconsistent types in the Introspection mechanism. As seen in the last column, the mean error for consistent participants in the SBDM mechanism is 10.37 while the mean error for inconsistent participants is 16.63. Thus, there is a \(-6.25\) percentage point difference in means in the SBDM mechanism. In percentage terms, the mean error of a consistent participant is 37.6 percent smaller than an inconsistent participant in the SBDM mechanism.

The mean error for consistent participants in the Introspection mechanism is 12.62 while the mean error for inconsistent participants is 16.26. Thus, there is a \(-3.64\) percentage point difference in means in the Introspection mechanism and the mean error of a consistent participant is only 22.4 percent smaller than an inconsistent participant. The difference-in-difference estimate of \(-2.61\) (\(-6.25 + 3.64\)) is significant using the one-sided synchronized test described in the last section (p value = .027). The effect is also large in magnitude given that the mean error in the sample is 13.79.

We note that the difference-in-difference estimate is particularly large in decision problems with uninformative signals. In these problems, the difference-in-difference estimate is \(-4.23\) and the effect is significant using the same one-sided synchronized test as above (p value \(= .017\)). In these questions, no Bayesian updating is necessary. Thus, identifying the correct belief is unlikely to be cognitively costly and the difference in belief errors is likely driven by inconsistent participants being confused by the SBDM mechanism itself. The difference-in-difference estimate is not significant when the sample is restricted to informative signals (p value \(= .075\)).Footnote 16

In Appendix F, we also report robustness results when we exclude outliers. If we remove individuals whose reports are almost always above or below 50, the difference-in-difference estimates become larger and the p values fall. Thus, our results in this section do not appear to be the result of an allocation of outliers to treatments.

Turning to our second hypothesis, regarding task difficulty, we find:

Result 2

Consistent with Hypothesis 2, the Introspection mechanism is more sensitive to task difficulty than the SBDM mechanism.

Recall from the last section that our parameter of interest for Hypothesis 2 is the difference between (i) the difference in mean errors between informative and uninformative questions in the Introspection mechanism and (ii) the same difference in the SBDM mechanism. Referring back to Table 3 and looking at the rows corresponding to the full sample, the mean error under Introspection is 7.54 when the signal is uninformative and 16.33 when the signal is informative. Mean errors under the SBDM mechanism are 11.02 in problems in which the signal is uninformative and 13.90 in problems in which the signal is informative. Thus, under Introspection, the difference in mean errors is 8.79 while it is 2.88 under SBDM. The difference-in-difference estimate of 5.91 is significant in the one-sided synchronized permutation test described in Sect. 3 (p value \(< .001\)).

4.2 Focal reports in the SBDM and introspection mechanisms

Having found evidence that the SBDM mechanism is more sensitive to heterogeneity in probabilistic reasoning, and that the Introspection mechanism is more sensitive to task difficulty, we now take a deeper look at the data to understand what is driving the differences in mechanism performance. We begin by comparing consistent participants’ responses to both mechanisms when signals are informative.

Result 3

In decision problems with an informative signal, consistent participants have significantly smaller belief errors in the SBDM mechanism than in the Introspection mechanism. The difference is due in part to the larger number of focal reports observed in the Introspection mechanism.

As seen by comparing the first two rows of Table 3, the SBDM is more accurate for consistent participants when we combine the data from all the informative signals (p value \(=0.008\)). Thus, reports in the SBDM mechanism have lower mean errors than reports in the Introspection mechanism for consistent participants when signals are informative.

Figure 2 shows the distribution of reports for consistent participants for each of the eight informative signals under the SBDM mechanism and Introspection. Introspection has more focal reports of 0, 50, and 100 than the SBDM mechanism. Aggregating over the eight informative signals, focal reports by consistent participants occur in 41 percent of cases in the Introspection mechanism and in only 18 percent of cases in the SBDM mechanism. This difference is significant when we compare the average proportion of focal reports made in the two mechanisms in a permutation test using data from periods with informative signals (p value \(<0.001\)).

Excluding the focal reports, the mean error of consistent participants in the Introspection mechanism is 7.63 in periods with an informative signal while the mean error in the SBDM mechanism is 8.81 in the same periods. Thus, the larger number of focal reports in the Introspection mechanism appears to be the main driver of differences between the two mechanisms for consistent participants.

Fig. 2: Distribution of reported beliefs by consistent participants

Result 4

In decision problems with an uninformative signal, there is no significant difference in mean errors between the SBDM mechanism and Introspection for consistent participants. However, consistent participants in the Introspection mechanism make significantly more correct and incorrect focal reports.

In periods with uninformative signals, the mean error for consistent participants is 8.29 in the SBDM mechanism and 6.62 in the Introspection mechanism, and there is no significant difference between the two mechanisms (p value \(= 0.283\)). In the Introspection mechanism, 72.97 percent of consistent participants report the correct belief of 50 while only 57.75 percent of consistent participants report the correct belief in the SBDM. This difference in correct focal reports is significant (p value = 0.009). However, incorrect focal reports are also common under Introspection when signals are uninformative: 6.63 percent of reports are extreme reports of 0 or 100 in the Introspection mechanism, while 2.13 percent of reports are extreme reports of 0 or 100 in the SBDM mechanism (p value = 0.041).

Result 5

In decision problems with informative signals, there is no significant difference in mean errors between the SBDM mechanism and Introspection for inconsistent participants. However, in decision problems with an uninformative signal, inconsistent individuals have significantly smaller belief errors in the Introspection mechanism than in the SBDM mechanism.

As seen by comparing rows 3 and 4 of Table 3, inconsistent participants have slightly lower errors in the SBDM mechanism than the Introspection mechanism in each of the four cases with informative signals. However, none of these differences are significant.

When the signals are uninformative, the mean error for inconsistent participants in the Introspection mechanism is 8.53 while the mean error in the SBDM mechanism is 14.43. This difference is significant in a permutation test (p value \(= 0.003\)). The distribution of reports indicates a correct report of 50 is made in 73.70 percent of cases in the Introspection mechanism and in only 39.76 percent of cases in the SBDM mechanism. This difference is significant when we compare the average proportion of correct reports made in the two mechanisms in a permutation test using data from periods with uninformative signals (p value \(<0.001\)). The strong reduction in correct focal reports of 50 suggests that some individuals do not understand the truth-telling properties of the SBDM mechanism and misreport as a result.

4.3 Differences in the initial experiment and follow-up experiment

In our pre-analysis plan we committed to pooling the data from our original and follow-up experiments if there were no significant differences in errors in the full data set, the SBDM sample, or the Introspection sample. As seen in "Appendix A", we find no differences in the samples along these dimensions and have therefore used the pooled data as the basis for our evaluation of Hypotheses 1 and 2. In this section we deviate from our pre-analysis plan to discuss an important difference in the two samples as they relate to Hypothesis 1.

Result 6

The magnitude of the estimated interaction effect between the SBDM mechanism and probabilistic reasoning is much larger in the initial experiment than in the follow-up experiment.

Tables 5 and 6 in "Appendix A" show the mean error of reports for the initial experiment and follow-up experiment separately. As seen in Table 5, in the initial experiment’s Introspection treatment, the mean error of consistent participants is 14.58 and the mean error of inconsistent participants is 15.01. In the initial experiment’s SBDM treatment, the mean error for consistent participants is 10.45 and the mean error for inconsistent participants is 16.30. Thus, for the initial experiment, the difference-in-difference estimate related to Hypothesis 1 is \(-5.42\) (\(10.45 - 16.30 - (14.58 - 15.01)\)), which is significant in the one-sided synchronized permutation test described in Sect. 3 (p value \(= .009\)).

By contrast, as seen in Table 6 in "Appendix A", in the follow-up experiment’s Introspection treatment, the mean error for consistent participants is 11.32 and the mean error for inconsistent participants is 17.01. In the follow-up experiment’s SBDM treatment, the mean error for consistent participants is 10.32 while the mean error for inconsistent participants is 16.86. Thus, for the follow-up experiment, the difference-in-difference estimate related to Hypothesis 1 is \(-0.85\) (\(10.32 - 16.86 - (11.32 - 17.01)\)), which is not significant in the one-sided synchronized permutation test described in Sect. 3 (p value \(= 0.314\)).

As noted in Sect. 3.3.1, interpreting the difference-in-difference as a measure of SBDM-specific misreporting errors when using all decision problems relies on the assumption that any difference in errors between consistent and inconsistent participants that stems from inaccurate beliefs is similar for the two mechanisms. Thus, one potential reason for the difference in point estimates is that this assumption is violated in one of the two experiments.

To explore this issue, we also calculated the difference-in-difference estimates using only the easy decision problems in which signals were uninformative. These decision problems provide the cleanest estimate of SBDM-specific misreports because most individuals are likely to have correct latent beliefs. In these problems, the difference in point estimates diminishes but does not disappear: in the initial experiment, the difference-in-difference estimate is \(-6.41\) (p value \(= 0.029\)), while in the follow-up experiment, the point estimate is \(-2.68\) (p value \(= 0.150\)).

A second potential reason for the difference in point estimates is changes to the experimental environment. Covid-19 restrictions prevented us from using the lab, and our follow-up experiments were conducted online. While we worked hard to maintain identical protocols in the two experiments, it is possible that the online environment generated new sources of errors. As discussed in Appendix F, we find some evidence that this may be the case. In the follow-up data, a number of participants appear to be reporting their beliefs out of 40 (the total number of balls in the bucket) rather than out of 100. A conservative removal of the 16 most extreme outliers (individuals whose reports almost always fell below 50 or above 50) increases the magnitude of the difference-in-difference from \(-0.85\) to \(-1.74\) (p value \(= 0.163\)). However, this estimate is still smaller in magnitude than the point estimate from our original experiment using the same criterion for removing outliers (\(-5.45\); p value \(= 0.010\)).Footnote 17
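
As a rough illustration of how such participants might be identified (the screen below is a hypothetical criterion written for exposition; the actual procedure is described in Appendix F), a report that is almost never above 40 is suggestive of a participant counting balls rather than stating chances:

```python
import numpy as np

def looks_scaled_to_40(reports, cutoff=40, tol=0.9):
    """Flag a participant whose belief reports look like ball counts out
    of 40 rather than chances out of 100.

    reports: one participant's reports on the 0-100 scale. The participant
    is flagged if at least `tol` of the reports fall at or below `cutoff`.
    Both thresholds are illustrative choices, not the paper's criterion.
    """
    reports = np.asarray(reports, dtype=float)
    return (reports <= cutoff).mean() >= tol

print(looks_scaled_to_40([18, 20, 22, 25, 30, 15]))  # flagged: True
```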

Thus, restricting attention to easy decision problems or controlling for outliers in the follow-up sub-sample results in point estimates that are similar to the pooled difference-in-difference estimate we use throughout the paper. We cannot, however, fully explain the difference between the original and follow-up experiments. We hope that future replications can improve our understanding of this issue and clarify whether there are systematic differences in how belief elicitation mechanisms and incentives interact with lab and online environments.

4.4 Fluid intelligence and cognitive effort

In our follow-up experiment, we divided participants into high-ability and low-ability groups based on their performance on a short-form version of the Raven’s Advanced Progressive Matrices task, and into high-effort and low-effort groups based on their performance on an extended Cognitive Reflection Test (CRT). Our pre-analysis plan specified the following hypotheses:

Hypothesis 3

The SBDM mechanism is more sensitive to variation in fluid intelligence than the Introspection mechanism.

Hypothesis 4

The SBDM mechanism is more sensitive to variation in cognitive effort than the Introspection mechanism.

To test these hypotheses, we repeated the analysis used to test Hypothesis 1, but split participants into groups based on their classification in the Raven task and the CRT. Our pre-analysis plan called for a one-sided test of whether the difference in errors between high and low types is greater in the SBDM mechanism than in the Introspection mechanism.

Result 7

Both fluid intelligence and cognitive effort strongly predict errors in both the SBDM and Introspection mechanisms. However, neither mechanism is significantly more sensitive than the other to variation in fluid intelligence or cognitive effort.

Support for Result 7 is given in Table 4. Panel A of this table reports mean errors under the SBDM mechanism and the Introspection mechanism for high-ability and low-ability participants. We report errors separately for decision problems with informative signals and for decision problems with uninformative signals; in the last column, we report the mean error across all decision problems. Panel B is identical to Panel A except that it divides individuals into high-effort and low-effort groups based on the extended CRT.

Table 4 Fluid intelligence and cognitive effort

As seen in Panel A, the mean error in the SBDM mechanism is 11.12 for high-ability participants and 16.45 for low-ability participants. The mean error in the Introspection mechanism is 10.04 for high-ability participants and 17.98 for low-ability participants. The difference-in-difference estimate is therefore 2.61 (\(11.12 - 16.45 - (10.04 - 17.98)\)), which has the opposite sign from the one predicted in Hypothesis 3 and is not significant using the one-sided synchronized test (p value \(= 0.914\)).

As seen in Panel B, the mean error in the SBDM mechanism is 10.18 for high-effort participants and 17.30 for low-effort participants. The mean error in the Introspection mechanism is 10.00 for high-effort participants and 19.03 for low-effort participants. The difference-in-difference estimate is therefore 1.91 (\(10.18 - 17.30 - (10.00 - 19.03)\)), which has the opposite sign from the one predicted in Hypothesis 4 and is again not significant (p value \(= 0.868\)).

Although we do not observe an interaction effect in either case, we note that low-ability and low-effort individuals have very large errors relative to high-ability and high-effort ones. Thus, while we do not find a significant difference between the mechanisms in their sensitivity to these traits, both fluid intelligence and cognitive effort strongly predict belief errors. This finding is consistent with earlier papers in which higher Raven test scores are associated with fewer Bayesian updating errors (Charness et al. 2018) and with more accurate beliefs (Burks et al. 2009). It is also consistent with the finding of Schlag and Tremewan (2020) that focal reporting in the SBDM is correlated with scores on the Cognitive Reflection Test.

5 Discussion and conclusion

The Stochastic Becker-DeGroot-Marschak (SBDM) mechanism is a theoretically elegant way of eliciting incentive-compatible beliefs under a variety of risk preferences. However, the mechanism is complex, and there is concern that some participants may misunderstand its incentive properties. We use a two-part design in which we identify participants whose decision-making is consistent or inconsistent with probabilistic reasoning, and we elicit their beliefs in both easy and hard decision problems. Relative to Introspection, there is less variation in mean belief errors between easy and hard problems in the SBDM mechanism, but a greater difference in belief errors between consistent and inconsistent participants. These results suggest that while the SBDM mechanism encourages individuals to think more carefully about their beliefs, it is more sensitive to probabilistic reasoning skills. Our results show that mechanism complexity is an important consideration when choosing an elicitation mechanism and identify probabilistic reasoning as an important consideration when interpreting elicited beliefs.

By identifying different channels through which errors occur in the two mechanisms, we can better understand the mixed results from earlier studies that compare them. In particular, our finding that errors in the Introspection mechanism vary with task difficulty implies that any horse race between the two mechanisms is likely to be strongly task dependent and that task difficulty may be an important consideration in deciding whether to offer explicit incentives for beliefs.

Our finding that errors in both the SBDM and Introspection mechanisms vary with participants’ probabilistic reasoning ability, fluid intelligence, and cognitive effort suggests that researchers should be cautious when using individual beliefs to identify types. For example, in the literature on overconfidence, it is common to use the difference between an agent’s true ability and their reported belief about this ability as a proxy for overconfidence. If errors are correlated with cognitive ability, then the group classified as overconfident may also include a large set of low-ability types who struggle to optimize in other situations.

We see value in an independent replication of our experiments and in understanding whether there are differences in how belief elicitation mechanisms interact with lab and online environments. As seen in Sect. 4.3, the point estimate for the interaction effect related to probabilistic reasoning is large and significant in our initial lab-based experiments but small and not significant in our online follow-up experiments. Further, while the magnitude of the interaction effect in the follow-up experiments increases when the data is restricted to easy decision problems or when the most obvious outliers are removed, a difference in the magnitude of the interaction effect between the two samples persists.

How to improve the SBDM so as to reduce the impact of probabilistic reasoning remains an open question. Holt and Smith (2016) and Burfurd and Wilkening (2018) suggest that choice lists can reduce focal reports, but neither paper finds accuracy improvements from using multiple choice lists. Nonetheless, choice lists may be important for a subset of individuals, and it would be interesting to understand how they interact with cognition.

As an alternative, Hao and Houser (2012) suggest that clock auctions might play an important role if researchers can overcome the censoring that results from using a single ascending clock. This suggestion is consistent with the notion of obviously strategy-proof mechanisms (Li 2017). One potential solution would be to conduct both an ascending and a descending clock auction against a dummy player with a common cutoff point \(\hat{p}\) and pay the participant for the outcome of one of the two auctions. In the ascending clock auction, the clock probability \(z\) rises from zero to one, and the participant receives \(H_{A}L\) if \(z\) reaches \(\hat{p}\). If the participant drops out first, she receives \(H_{\hat{p}}L\). In the descending clock auction, the participant receives \(H_{\hat{p}}L\) if \(z\) reaches \(\hat{p}\) and \(H_{A}L\) if she drops out first. In both mechanisms, it is a dominant strategy to drop out at one’s true value. Further, for each participant, at least one of the two clocks involves no censoring.
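
To make the proposed payoff rule concrete, one way to formalize it (our notation, offered as a sketch rather than a complete specification) is to let \(d\) denote the participant’s planned drop-out point and \(p\) her true belief, where \(H_{A}L\) pays the high prize if event \(A\) occurs and \(H_{\hat{p}}L\) pays it with objective probability \(\hat{p}\):

\[
\text{Ascending clock: }\;
\text{outcome}(d)=
\begin{cases}
H_{A}L & \text{if } d \ge \hat{p},\\
H_{\hat{p}}L & \text{if } d < \hat{p},
\end{cases}
\qquad
\text{Descending clock: }\;
\text{outcome}(d)=
\begin{cases}
H_{\hat{p}}L & \text{if } d \le \hat{p},\\
H_{A}L & \text{if } d > \hat{p}.
\end{cases}
\]

Under any preferences that respect stochastic dominance, a participant with belief \(p\) weakly prefers \(H_{A}L\) to \(H_{\hat{p}}L\) exactly when \(p \ge \hat{p}\), so dropping out at \(d = p\) secures the preferred lottery in both clocks. Moreover, the drop-out point itself is observed without censoring whenever \(p < \hat{p}\) in the ascending clock and whenever \(p > \hat{p}\) in the descending clock, so each participant’s value is revealed in at least one of the two.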

It is also an open question how probabilistic reasoning and cognition interact with other elicitation methods, particularly those that are robust to heterogeneity in risk preferences. One alternative to the SBDM is to combine a quadratic scoring rule with a binary lottery procedure, which theoretically induces risk neutrality under subjective expected utility (the logic is sketched below). The binary lottery procedure has been found to generate better data than an unmodified quadratic scoring rule in objective settings (Harrison et al. 2013; Hossain and Okui 2013) and subjective settings (Harrison and Phillips 2014; Harrison et al. 2014, 2015, 2017), but not in settings in which subjective beliefs about others are elicited (Koh 2017). A recent paper by Danz et al. (2020) finds that transparent information about the incentives of the binary lottery procedure actually increases belief errors, suggesting that cognition may also be important for this mechanism. A second alternative is the frequency method of Schlag and Tremewan (2020), which elicits beliefs in terms of natural frequencies and can be used when multiple realisations of an outcome are available.
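
To see why the binary lottery procedure is robust to risk preferences, consider a standard formulation (a sketch in our notation, not necessarily the exact payoff rule used in the papers cited above). A report \(r\) about the probability of event \(E\) earns the quadratic score \(S(r) = 1 - (\mathbb{1}_{E} - r)^{2} \in [0,1]\), and the participant wins a fixed prize with probability \(S(r)\). A subjective expected utility maximizer with belief \(p\) therefore chooses \(r\) to maximize her probability of winning,

\[
p\left[1-(1-r)^{2}\right] + (1-p)\left[1-r^{2}\right],
\]

which is uniquely maximized at \(r = p\): because payment is a probability of a fixed prize, the curvature of the utility function over money drops out of the problem.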