Introduction

The rapidly expanding scientific literature calls for methods to synthesize research, such as the systematic review. Meta-analysis is an important part of a systematic review and refers to applying statistical methods to combine the findings of different studies on the same topic. The first step of a meta-analysis is to obtain a standardized effect size (e.g., a standardized mean difference or correlation coefficient) for each included study; these effect sizes are subsequently combined by means of meta-analysis methods to summarize the included studies. Meta-analysis is nowadays seen as the gold standard for research synthesis (Aguinis et al., 2011; Head et al., 2015) and is often used for policy making, as its results are seen as the best available evidence (Thompson & Sharp, 1999; Cordray & Morphy, 2009; Hedges & Olkin, 1985).

The random-effects model is the most commonly used statistical model in meta-analysis (Borenstein et al., 2010; 2009). In the random-effects model, each study is assumed to have an unknown true underlying effect size. The main parameter of interest in this model is the overall effect size of the studies included in the meta-analysis. However, estimation and statistical inference for the between-study variance in the random-effects model are just as important (Higgins et al., 2009) because both parameters focus on distinct and relevant aspects of a meta-analysis. For example, the overall effect size in a meta-analysis on the efficacy of a psychological treatment refers to the average efficacy of the treatment across studies, and the between-study variance quantifies the heterogeneity across studies.

The vast majority of meta-analyses use frequentist techniques for estimation and statistical inference. However, Bayesian meta-analysis methods have been proposed as well and have gained in popularity (Xu et al., 2008). Bayesian methods are especially well suited for analyzing meta-analytic data (Smith et al., 1995; Sutton & Abrams, 2001; Lunn et al., 2013; Turner & Higgins, 2019) because the multilevel structure of a random-effects meta-analysis can be straightforwardly taken into account. Moreover, estimation of the between-study variance with frequentist meta-analysis methods is imprecise in the common situation of a meta-analysis containing a small number of studies (Chung, Rabe-Hesketh, & Choi, 2013; Sidik & Jonkman, 2007; Kontopantelis, Springate, & Reeves, 2013). Bayesian meta-analysis methods are advantageous in this situation because (i) externally available information about the between-study variance can be incorporated in the prior distribution, and (ii) the methodology does not directly rely on large sample theory.

Meta-analysts generally want to estimate and conduct hypothesis tests for the parameters in the random-effects model. The vast majority of the literature on Bayesian meta-analysis methods has focused on parameter estimation using either empirical Bayes or fully Bayesian estimation (e.g., Turner et al., 2015; Rhodes et al., 2015; Normand, 1999; Lambert et al., 2002). However, hypothesis testing using Bayes factors, which quantify the evidence for one model relative to another (Kass & Raftery, 1995), has also been proposed. Berry (1998) proposed a Bayes factor for testing the null hypothesis of no between-study variance in a meta-analysis of studies using 2x2 contingency tables. Rouder and Morey (2011) developed a Bayes factor to test the null hypothesis of no effect in a meta-analysis of studies using a two-independent-groups design. Scheibehenne et al. (2017) and Gronau et al. (2017) proposed a Bayesian model averaging approach to compute an average Bayes factor for testing the null hypothesis of no effect. In this approach, an average Bayes factor is computed by averaging over posterior model probabilities obtained with the random-effects model and the equal-effect model (a.k.a. fixed-effect or common-effect model), in which the studies’ true effect sizes are assumed to be homogeneous.

The first contribution of our paper is that we propose, in contrast to existing Bayesian meta-analysis methods, to use a marginalized random-effects meta-analysis (MAREMA) model rather than the random-effects model. In this MAREMA model, the study-specific true effects are regarded as nuisance parameters and integrated out of the probability density function. The elimination of nuisance parameters via integration is common in integrated likelihood methods (e.g., Berger et al., 1999) and has recently been extended to marginal random intercept models (Mulder & Fox, 2013; 2019) and marginal item response theory models (Fox et al., 2017).

The proposed MAREMA model encompasses three important meta-analysis models. First, the equal-effect model is included if the between-study variance is equal to zero. Second, the random-effects model is included when the between-study variance is positive, which implies that the differences between studies’ effect sizes cannot be fully explained by sampling error. Third, the random-effects model in case of a negative between-study variance is also included. A negative between-study variance indicates that the differences between studies’ effect sizes are smaller than expected based on sampling error. A negative between-study variance may yield relevant insights for meta-analysts because it may indicate that the assumption of independence of primary studies in a meta-analysis is violated or that the computation of the effect sizes of the primary studies is incorrect (Ioannidis et al., 2006).

Our proposed methodology is also distinct from other Bayesian meta-analysis methods because we place a prior distribution on the proportion of total variance that can be attributed to between-study variance, rather than directly on the between-study variance parameter. This proportion is known as I2 (Higgins & Thompson, 2002; Higgins et al., 2003) in the meta-analysis literature, and it is frequently used to quantify the relative heterogeneity around the true effect size (Borenstein et al., 2017). An advantage of placing a prior on I2 is that it is a bounded parameter, which enables us to place a proper (noninformative) uniform prior to compute Bayes factors (note that Bayes factors generally cannot be computed using improper priors, while arbitrarily vague proper priors should also be avoided due to Bartlett’s phenomenon; Jeffreys, 1961; Lindley, 1957; Bartlett, 1957). Due to the uniformity of the prior, the proposed Bayes factors can be used for a default Bayesian meta-analysis without requiring external prior knowledge about the model parameters.

The proposed Bayes factors enable testing point and one-sided hypotheses. Examples of a point and a one-sided hypothesis are testing whether the overall effect size in a meta-analysis equals zero (i.e., H0 : μ = 0) or is larger than zero (i.e., H2 : μ > 0). Moreover, the proposed Bayes factors also enable testing multiple hypotheses simultaneously (e.g., H0 : μ = 0 vs. H1 : μ < 0 vs. H2 : μ > 0). Another attractive property is that the quantification of relative evidence between hypotheses is exact even in the extreme case of only two studies in a meta-analysis, as Bayes factors do not rely on large sample theory.

The outline of this paper is as follows. We continue by further introducing the MAREMA model and illustrate Bayesian estimation under the MAREMA model. Subsequently, we introduce Bayes factor testing under the MAREMA model and elaborate on the specification of the prior distributions. In Section “Bayesian hypothesis testing in examples”, the MAREMA model is used to compute Bayes factors for a meta-analysis on the efficacy of two treatments for post-traumatic stress disorder (PTSD) and a meta-analysis on the effect of a smartphone application and cessation support on long-term smoking abstinence. We conclude the paper with a discussion section.

Marginalized random-effects meta-analysis model

The conventional random-effects meta-analysis model (Borenstein et al., 2009; Konstantopoulos & Hedges, 2019) assumes that i = 1, 2, ..., k independent effect sizes are extracted from studies on the same topic. The statistical model of the random-effects model can be written as

$$ \begin{aligned} y_{i} \sim N(\theta_{i}, {\sigma_{i}^{2}}) \\ \theta_{i} \sim N(\mu, \tau^{2}) \end{aligned} $$
(1)

where yi is the observed effect size of the i th study, 𝜃i is the study-specific (unknown) true effect size, μ is the overall true effect size, τ2 is the between-study variance in true effect size, and \({\sigma _{i}^{2}}\) is the known sampling variance of the i th study, which is conventionally estimated in practice and then assumed to be known in the analysis. The random-effects model simplifies to an equal-effect model if τ2 = 0, because all studies then share a common true effect size. Note that distributions for the random effects other than the normal distribution have been proposed (e.g., Baker & Jackson, 2008; 2016; Lee & Thompson, 2008), but a normal distribution for the random effects remains the default in almost every random-effects meta-analysis.
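As a minimal sketch of the data-generating process in Eq. 1 (written in Python rather than the R used elsewhere in this paper, with parameter values chosen purely for illustration), one can first draw the study-specific true effects and then the observed effect sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values for illustration only.
k = 10          # number of studies
mu = 0.3        # overall true effect size
tau2 = 0.05     # between-study variance in true effect size
sigma2 = rng.uniform(0.02, 0.2, size=k)   # "known" sampling variances

# Eq. 1: study-specific true effects, then observed effect sizes.
theta = rng.normal(mu, np.sqrt(tau2), size=k)
y = rng.normal(theta, np.sqrt(sigma2))
```

Setting `tau2 = 0` in this sketch collapses the model to the equal-effect case, since all `theta` draws then equal `mu`.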

The study-specific true effects 𝜃i are generally treated as nuisance parameters in the random-effects model. We integrate these out of the random-effects model in Eq. 1 to obtain the MAREMA model, which is given by (see, for instance, Raudenbush & Bryk, 1985)

$$ y_{i} \sim N\left( \mu, {\sigma}_{i}^{2}+\tau^{2}\right). $$
(2)

Multiple estimators have been proposed for estimating the between-study variance τ2 (Veroniki et al., 2016; Langan et al., 2016). Estimates of τ2 cannot be compared across meta-analyses if these meta-analyses used different effect size measures. That is, a τ2 estimate from a meta-analysis of standardized mean differences cannot be compared to one from a meta-analysis of correlation coefficients; this was one of the reasons to develop I2, which is described next (Higgins & Thompson, 2002).

Quantifying heterogeneity using I 2

A commonly used way to quantify the relative heterogeneity in a meta-analysis is using I2 (Higgins & Thompson, 2002; Higgins et al., 2003),

$$ I^{2} = \frac{\tau^{2}}{\tau^{2} + \tilde{\sigma}^{2}} $$
(3)

where \(\tilde {\sigma }^{2}\) is the typical within-study sampling variance that is computed with

$$ \tilde{\sigma}^{2} = \frac{(k-1) \sum 1/{\sigma_{i}^{2}}}{\left( \sum 1/{{\sigma}_{i}^{2}}\right)^{2} - \sum 1/{\sigma_{i}^{4}}}. $$
(4)

I2 has an intuitive interpretation because it is the proportion of total variance that can be attributed to between-study variance in true effect size. Note that I2 resembles the intraclass correlation coefficient (ICC) that is routinely reported in multilevel analysis. The ICC indicates the proportion of total variance that can be attributed to the dependence of observations within level-2 units (e.g., Hox et al., 2018). However, a major difference between I2 and the ICC is that the total variance in the computation of I2 (i.e., \(\tau ^{2}+\tilde {\sigma }^{2}\)) is a function of the studies’ sample sizes, whereas the ICC is not a function of the sample size of the level-2 units. Hence, I2 artificially increases if the sample sizes of the primary studies increase while τ2 remains constant (Rücker et al., 2008).
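Equations 3 and 4 are straightforward to compute. The following Python sketch (illustrative only; the paper's own analyses use R) implements both:

```python
import numpy as np

def typical_sampling_variance(sigma2):
    # Eq. 4: typical within-study sampling variance (Higgins & Thompson, 2002)
    w = 1.0 / np.asarray(sigma2, dtype=float)   # inverse-variance weights
    k = len(w)
    return (k - 1) * w.sum() / (w.sum() ** 2 - (w ** 2).sum())

def I2(tau2, sigma2):
    # Eq. 3: proportion of total variance due to between-study variance
    s2_tilde = typical_sampling_variance(sigma2)
    return tau2 / (tau2 + s2_tilde)
```

A useful sanity check: when all studies have the same sampling variance, Eq. 4 reduces to that common variance, so with five studies of sampling variance 0.1 and τ² = 0.1, I² equals 0.5.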

Next, we reparameterize the MAREMA model in Eq. 2 using I2. We replace I2 with the Greek letter ρ to make explicit that it is an unknown parameter and can attain negative values. The MAREMA model can then equivalently be written as

$$ y_{i} \sim N(\mu, {\sigma^{2}_{i}}+\tilde{\sigma}^{2} \rho/(1-\rho)). $$
(5)

where \(\tilde {\sigma }^{2} \rho /(1-\rho )\) transforms ρ to its corresponding τ2. In Eq. 5, ρ must lie in the interval (ρmin,1) with

$$ \rho_{min} = \frac{-\sigma_{min}^{2}}{-\sigma^{2}_{min}+\tilde{\sigma}^{2}} $$
(6)

where \(\sigma ^{2}_{min}\) is the smallest sampling variance of the studies included in the meta-analysis. ρmin is the smallest possible value of the parameter ρ given the observed data and is always negative. Note that the special case ρ = 0 (i.e., the equal-effect model) does not lie on the boundary of the parameter space, which enables testing the hypothesis of homogeneous true effect size.
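The mapping from ρ to τ² in Eq. 5 and the lower bound in Eq. 6 can be sketched as follows (Python, with hypothetical sampling variances; not the paper's R code):

```python
import numpy as np

def rho_bounds_and_tau2(sigma2, rho):
    """Sketch of Eqs. 4-6: compute rho_min and map rho to tau^2."""
    sigma2 = np.asarray(sigma2, dtype=float)
    w = 1.0 / sigma2
    k = len(w)
    s2_tilde = (k - 1) * w.sum() / (w.sum() ** 2 - (w ** 2).sum())  # Eq. 4
    r_min = -sigma2.min() / (-sigma2.min() + s2_tilde)              # Eq. 6
    tau2 = s2_tilde * rho / (1.0 - rho)                             # Eq. 5
    return r_min, tau2
```

For example, with sampling variances (0.05, 0.1, 0.2) the typical sampling variance is 0.1, so ρmin = −0.05/(−0.05 + 0.1) = −1, and at ρ = 0.5 the implied τ² equals 0.1. At ρ = ρmin, the implied τ² equals −σ²min, so the smallest total variance σ²min + τ² is exactly zero, which is why ρmin is the boundary of the admissible range.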

Bayesian estimation under the MAREMA model

The MAREMA model can be estimated using flat priors if prior information is absent,

$$ \begin{aligned} \pi(\mu, \rho) & = \pi(\mu) \pi(\rho), \text{with} \\ \pi(\mu) & \propto 1 \\ \pi(\rho) & = U(\rho_{min}, 1) \end{aligned} $$
(7)

where U refers to the uniform distribution. Flat priors are used for estimation to minimize the impact of the priors on the results. We illustrate Bayesian estimation under the MAREMA model by analyzing data of two meta-analyses. The first meta-analysis, by Ho and Lee (2012), concerns the efficacy of eye movement desensitization and reprocessing (EMDR) therapy versus exposure-based cognitive behavior therapy (CBT) to treat PTSD. This meta-analysis consists of ten standardized mean differences (i.e., Hedges’ g), and a positive effect size indicates that EMDR therapy is more efficacious than CBT. The second meta-analysis, by Whittaker et al. (2019), concerns the difference between using a smartphone app for smoking cessation support and lower-intensity support on long-term abstinence. Three studies are included in this meta-analysis, and the log risk ratio is the effect size measure of interest. A positive log risk ratio indicates that using a smartphone app for smoking cessation support yields more long-term abstinence than lower-intensity support. These examples were selected to illustrate that the proposed methodology can be used for different effect size measures and for meta-analyses with only a small number of primary studies, which is especially common in medical research (Rhodes et al., 2015; Turner et al., 2015).

A Gibbs sampler is proposed for Bayesian estimation under the MAREMA model (see Appendix A). We also analyzed the data of these meta-analyses with frequentist random-effects meta-analysis using the restricted maximum likelihood estimator (Raudenbush, 2009) for estimating τ2 as implemented in the R package metafor (Viechtbauer, 2010). R code of these analyses is available at https://osf.io/jcge7/.
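The Gibbs sampler itself is described in Appendix A. As a rough stand-in (not the authors' sampler), the marginal posterior of ρ under the flat priors of Eq. 7 can be approximated on a grid, because for fixed ρ the likelihood in Eq. 5 is normal in μ and μ can be integrated out analytically. A Python sketch with hypothetical data:

```python
import numpy as np

def marema_log_lik(y, sigma2, mu, tau2):
    # Log-likelihood of the MAREMA model in Eq. 2
    v = sigma2 + tau2
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (y - mu) ** 2 / v)

# Hypothetical effect sizes and sampling variances (illustration only)
y = np.array([0.10, 0.30, 0.25, 0.40])
sigma2 = np.array([0.05, 0.08, 0.06, 0.10])

w = 1 / sigma2
k = len(y)
s2_tilde = (k - 1) * w.sum() / (w.sum() ** 2 - (w ** 2).sum())   # Eq. 4
rho_min = -sigma2.min() / (-sigma2.min() + s2_tilde)             # Eq. 6

# Marginal posterior density of rho under the flat priors of Eq. 7,
# with mu integrated out (flat prior on mu, normal likelihood).
rho_grid = np.linspace(rho_min + 1e-3, 1 - 1e-3, 400)
log_post = np.empty_like(rho_grid)
for j, rho in enumerate(rho_grid):
    tau2 = s2_tilde * rho / (1 - rho)                            # Eq. 5
    v = sigma2 + tau2
    mu_hat = np.sum(y / v) / np.sum(1 / v)                       # conditional posterior mean of mu
    # Integrating the normal likelihood over mu contributes sqrt(2*pi / sum(1/v))
    log_post[j] = (marema_log_lik(y, sigma2, mu_hat, tau2)
                   + 0.5 * np.log(2 * np.pi / np.sum(1 / v)))
dx = rho_grid[1] - rho_grid[0]
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx   # normalized posterior density of rho
```

This grid approximation is only practical because ρ is bounded on (ρmin, 1); the same computation on the unbounded τ² scale would require choosing an arbitrary upper limit.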

The posterior distributions of μ and ρ when fitting the MAREMA model to the meta-analyses by Ho and Lee (2012) (solid lines) and Whittaker et al. (2019) (dashed lines) are presented in Fig. 1. Remarkably, the posterior distributions of ρ (right panel of Fig. 1) are very wide for both meta-analyses. There is also considerable posterior support for negative ρ values. This could suggest that the random-effects model, which is employed for the frequentist analysis, may not be appropriate for these data. We reflect on the causes and implications of a negative value for ρ in the discussion section.

Fig. 1

Posterior distributions of μ (left panel) and ρ (right panel) for the meta-analyses by Ho and Lee (2012) (solid lines) and Whittaker et al. (2019) (dashed lines). The posterior distributions in the figure are smoothed using a logspline as implemented in the R package logspline (Kooperberg, 2020). ρmin = − 1.045 for the meta-analysis by Ho and Lee (2012) and ρmin = − 2.326 for the meta-analysis by Whittaker et al. (2019)

Parameter estimates obtained with the MAREMA model and also the frequentist random-effects model are presented in Table 1. The first row of Table 1 shows the results of the MAREMA model and the second row those of the frequentist random-effects model. The first three columns show the results of estimating μ and the last three columns those of estimating ρ. The metafor package was used for the frequentist meta-analysis, which does not report a standard error for \(\hat {\rho }\).

Table 1 Results of Bayesian estimation under the marginalized random-effects meta-analysis (MAREMA) model (using posterior means) and frequentist random-effects meta-analysis when estimating the parameters in the meta-analyses by Ho and Lee (2012) (first panel) and Whittaker et al. (2019) (second panel)

Parameter estimates of the MAREMA model and the frequentist meta-analysis for the meta-analysis by Ho and Lee (2012) were comparable. However, the estimates for μ were slightly larger under the MAREMA model relative to the frequentist estimate. Furthermore, as expected, the 95% credible interval for ρ under the MAREMA model was considerably wider than the 95% confidence interval of the frequentist meta-analysis, because the random-effects model does not allow negative values for ρ and therefore leaves less “room” for ρ to vary. To conclude, EMDR therapy was more efficacious than CBT for treating PTSD, and heterogeneity was imprecisely estimated close to zero (indicating homogeneity).

For the meta-analysis by Whittaker et al. (2019), parameter estimates for μ under the MAREMA model were approximately zero, whereas the frequentist meta-analytic estimate for μ was slightly larger. Furthermore, due to the skewness of the posterior of ρ under the MAREMA model, there is a considerable difference between the posterior mean and posterior mode, where the latter is close to the estimate under the frequentist random-effects model. To conclude, estimates based on the MAREMA model and frequentist meta-analysis differed slightly for the meta-analysis by Whittaker et al. (2019). These differences were probably caused by the meta-analysis containing only three studies.

Given the uncertainty in the unconstrained estimates it is particularly useful to test precise null hypotheses on the overall effect μ and the relative heterogeneity ρ. Hence, Bayesian hypothesis testing under the MAREMA model is discussed next.

Bayesian hypothesis testing under the MAREMA model

Testing hypotheses plays a fundamental role in scientific research in general and psychological science in particular. In this section, we propose multiple hypothesis tests for the mean μ and the relative between-study heterogeneity ρ separately.

We propose testing hypotheses for both μ and ρ under the MAREMA model. The following hypotheses are tested for μ:

$$ \begin{aligned} H_{0}: \mu = 0 \\ H_{1}: \mu < 0 \\ H_{2}: \mu > 0, \end{aligned} $$
(8)

where support for H0, H1, or H2 indicates that the overall effect μ is equal to zero, is negative, or is positive, respectively. For ρ we test the following hypotheses:

$$ \begin{aligned} H_{0}: \rho = 0 \\ H_{1}: \rho < 0 \\ H_{2}: \rho > 0, \end{aligned} $$
(9)

where support for H0 indicates a good fit of an equal-effect model, support for H2 indicates a good fit of a random-effects model, and support for H1 indicates a model in which there is less variance than expected based on sampling error (and thus a misfit of both the equal-effect and random-effects models).

Bayes factors are used for testing the proposed hypotheses (Jeffreys, 1961; Kass & Raftery, 1995). A Bayes factor quantifies the evidence in the data for one hypothesis relative to a contrasting hypothesis via the ratio of their marginal likelihoods,

$$ B_{12} = \frac{m_{1}(\mathbf{y})}{m_{2}(\mathbf{y})} $$
(10)

where y is the vector of observed effect sizes yi and m1 and m2 are the marginal likelihoods under H1 and H2, respectively. For example, B12 = 1 indicates that both hypotheses are equally supported by the data, whereas B12 = 20 indicates that there is 20 times more support in the data for H1 relative to H2.
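As a toy numerical illustration of Eq. 10 (Python; these two simple hypotheses are not the paper's models, just fully specified densities so the marginal likelihoods are closed-form):

```python
import math

def normal_pdf(x, mean, var):
    # Density of a normal distribution with the given mean and variance
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical single observed effect size under two simple hypotheses
y = 0.8
m1 = normal_pdf(y, 0.0, 1.0)   # marginal likelihood under H1: y ~ N(0, 1)
m2 = normal_pdf(y, 1.0, 1.0)   # marginal likelihood under H2: y ~ N(1, 1)
B12 = m1 / m2                  # Eq. 10
```

Here B12 = exp(−0.30) ≈ 0.74, i.e., mild evidence for H2 over H1; values above 1 would instead favor H1.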

Prior specification

In Bayesian hypothesis testing for the overall effect size, it is not recommended to use an arbitrarily vague prior, to avoid Bartlett’s paradox (Jeffreys, 1961; Lindley, 1957; Bartlett, 1957). Instead, following Mulder and Fox (2019), we propose a unit-information prior for μ conditional on ρ in combination with a proper uniform prior for ρ. Under the unconstrained MAREMA model this boils down to

$$ \begin{aligned} \pi_{u}(\mu, \rho) & = \pi_{u} (\mu | \rho) \pi_{u}(\rho), \text{with} \\ \pi_{u}(\mu | \rho) & = N\Big(0, k \left( \mathbf{1}^{\prime} \mathbf{\Sigma}_{\rho}^{-1} \mathbf{1}\right)^{-1}\Big) \\ \pi_{u}(\rho) & = U(\rho_{min}, 1). \end{aligned} $$
(11)

The unit-information prior π(μ|ρ) contains the amount of information of a single study (Zellner, 1986). This is visible in the variance of π(μ|ρ): the number of studies in a meta-analysis, k, is multiplied by the variance of \(\hat {\mu }\), which is \((\mathbf {1}^{\prime } \mathbf {\Sigma }_{\rho }^{-1} \mathbf {1})^{-1}\), where 1 is a column vector of ones and \(\mathbf {\Sigma }_{\rho } = \text {diag}\left ({{\sigma }_{1}^{2}}+\tau ^{2},...,{{\sigma }_{k}^{2}}+\tau ^{2}\right )\). Unit-information priors are commonly used for computing Bayes factors in model selection and hypothesis testing problems. For example, the well-known Bayesian information criterion (BIC; Schwarz, 1978) is based on an approximation of the marginal likelihood using a unit-information prior (Raftery, 1995; Kass & Wasserman, 1995). This class of priors is also employed in many other Bayesian testing scenarios (e.g., Liang et al., 2008; Rouder & Morey, 2012; Mulder et al., in press). The usefulness of unit-information priors lies in the fact that they cover a reasonable range of possible values for the model parameters. As the prior contains the information of a single study, these priors are neither too informative nor too vague (thus avoiding Bartlett’s paradox). The prior π(μ|ρ) depends on ρ because τ2 is included in the variance-covariance matrix Σρ. Note that the prior π(μ|ρ) differs from the prior used for Bayesian estimation under the MAREMA model in Section “Bayesian estimation under the MAREMA model”: we used an improper prior for estimation, whereas the prior π(μ|ρ) used for testing is proper. The proper prior π(ρ) is the same prior distribution as proposed for estimation under the MAREMA model.
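The variance of the unit-information prior, k(1'Σρ⁻¹1)⁻¹, can be computed in a few lines (a Python sketch, not the paper's implementation):

```python
import numpy as np

def unit_info_prior_var(sigma2, tau2):
    # Variance of the unit-information prior for mu given rho (Eq. 11):
    # k * (1' Sigma^{-1} 1)^{-1}, with Sigma = diag(sigma_i^2 + tau^2).
    # For a diagonal Sigma, 1' Sigma^{-1} 1 is simply sum(1 / (sigma_i^2 + tau^2)).
    v = np.asarray(sigma2, dtype=float) + tau2
    k = len(v)
    return k / np.sum(1.0 / v)
```

This makes the "information of a single study" interpretation concrete: with k = 4 studies that all have total variance 0.1, the estimator variance (1'Σ⁻¹1)⁻¹ is 0.1/4, so multiplying by k gives a prior variance of exactly 0.1, the variance contributed by one study.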

The unconstrained prior in Eq. 11 is used as building block for the priors under the hypotheses of interest. In particular, for the one-sided hypotheses, truncated priors are considered while the precise null hypotheses receive a point mass at the null value. Figure 2 illustrates the proposed prior distributions for testing unconstrained, one-sided, and point hypotheses under the MAREMA model. The left panel shows prior distributions for testing hypotheses regarding μ where ρ is left unconstrained. The dashed line refers to the prior distribution of the unconstrained hypothesis in Eq. 11. The asterisk at the top of the figure illustrates the point mass for testing the hypothesis H0 : μ = 0. The solid and dotted lines refer to the prior distributions for the one-sided hypotheses H1 : μ < 0 and H2 : μ > 0, respectively. Their heights are twice as large as the unconstrained prior to ensure that the distributions integrate to 1. The right panel shows prior distributions for testing hypotheses regarding ρ. The dashed, solid, and dotted lines refer to the prior distributions of the unconstrained (Hu), left-sided (H1 : ρ < 0), and right-sided (H2 : ρ > 0) hypotheses whereas the asterisk refers to the point mass for the hypothesis of no between-study variance (H0 : ρ = 0).

Fig. 2

Prior distributions of μ (left panel) and ρ (right panel) for Bayesian hypothesis testing under the MAREMA model. The different prior distributions refer to the hypotheses listed in Eqs. 8 and 9

Marginal likelihood

The marginal likelihoods of the different hypotheses differ with respect to the prior distributions. Hence, the marginal likelihoods of the different hypotheses can be computed by using different prior distributions in combination with adjusting the limits of integration.

For example, the marginal likelihood for the one-sided hypothesis H1 : μ < 0 with ρ unconstrained can be written as a function of the marginal likelihood under the unconstrained model,

$$ m_{1}(\textbf{y}) = \iint_{\mu<0} f(\textbf{y}|\mu,\rho)\pi_{1}(\mu,\rho)d\mu d\rho = m_{u}(\textbf{y}) \frac{P(\mu < 0 | \mathbf{y}, H_{u})}{P(\mu < 0 | H_{u})}, $$
(12)

where f(y|μ,ρ) is the likelihood function of the MAREMA model in Eq. 5, mu(y) is the marginal likelihood under the unconstrained model, P(μ < 0|y,Hu) and P(μ < 0|Hu) are the posterior and prior model probabilities for μ < 0 under the hypothesis Hu, and the prior under H1 can be written as a truncation of the unconstrained prior

$$ \pi_{1}(\mu,\rho) = \pi_{u}(\mu,\rho)I(\mu<0)/P(\mu < 0 | H_{u}), $$

where I(⋅) is the indicator function. Note here that \(P(\mu < 0 | H_{u})=\frac {1}{2}\) because the unconstrained prior for μ is centered around 0. The posterior probability in Eq. 12 can be computed as the proportion of unconstrained posterior draws satisfying μ < 0. Interested readers are referred to Appendix B for further computational details.
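The ratio of posterior to prior probabilities in Eq. 12 can be estimated directly from unconstrained posterior draws. A Python sketch (the draws here are a hypothetical stand-in; in practice they come from the Gibbs sampler under the MAREMA model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for unconstrained posterior draws of mu.
mu_draws = rng.normal(0.2, 0.15, size=100_000)

post_prob = np.mean(mu_draws < 0)   # estimate of P(mu < 0 | y, Hu)
prior_prob = 0.5                    # P(mu < 0 | Hu); the prior is centered at 0
B1u = post_prob / prior_prob        # Eq. 12: Bayes factor of H1 against Hu
```

With these illustrative draws, B1u is well below 1: the posterior places little mass on μ < 0, so the data shift evidence away from the one-sided hypothesis relative to the unconstrained model.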

Software: BFpack

The R package BFpack (Mulder et al., in press) contains functions for computing Bayes factors for a large set of statistical models (e.g., multivariate regression, generalized linear models, and correlation analysis). The main function “BF” takes as its argument a fitted model object, which determines the model under which Bayes factors are computed. To execute the Bayes factor test under the MAREMA model, the “BF” function needs as its argument an object returned by fitting a random-effects meta-analysis with the metafor package. The metafor package is popular software for conducting meta-analyses; it can be used with any effect size measure and requires the meta-analyst to supply the observed effect sizes of the primary studies and the corresponding sampling variances (or standard errors). Hence, researchers familiar with the metafor package can readily compute the proposed Bayes factors using the function “BF”. The “BF” function also returns unconstrained estimates of μ and ρ based on a Gibbs sampler.

Bayesian hypothesis testing in examples

We compute Bayes factors using the MAREMA model for the two examples introduced in Section “Bayesian estimation under the MAREMA model”. The MAREMA model was fitted to the two meta-analyses using the proposed unit-information prior on μ and uniform prior on ρ. We tested the three hypotheses for μ and ρ listed in Eqs. 8 and 9, respectively. We analyzed the data using R (R Core Team, 2020) and, in particular, the R packages metafor (Viechtbauer, 2010) and BFpack (Mulder et al., in press). R code illustrating how to compute Bayes factors and posterior probabilities for the hypotheses in the two examples is available at https://osf.io/ejfsv/.

The Bayes factors and posterior model probabilities for hypotheses on μ are presented in Table 2. For the meta-analysis by Ho and Lee (2012), the Bayes factor comparing H2 with H1 is the largest among the tested hypotheses, implying that the hypothesis H2 : μ > 0 is 15.810 times more likely than the hypothesis H1 : μ < 0. Moreover, the Bayes factor comparing H2 with H0 : μ = 0 equaled 3.779, implying that μ in this meta-analysis is likely larger than zero. The posterior probabilities (last row in Table 2; assuming equal prior probabilities) also suggested that a positive effect (H2) was most likely after observing the data, with a posterior probability of 0.753. The two-tailed frequentist hypothesis test of H0 : μ = 0 was not statistically significant using a significance level of 0.05 (z = 1.936, two-tailed p value = 0.053).

Table 2 Bayes factors and posterior model probabilities (P(Hq|y)) for hypotheses on μ. The results based on the meta-analysis by Ho and Lee (2012) are shown in the first columns and those by Whittaker et al. (2019) in the last columns of the table

The Bayes factors and posterior model probabilities for hypotheses on ρ are shown in the first columns of Table 3 for the study by Ho and Lee (2012). The Bayes factors comparing the hypothesis H0 : ρ = 0 with H1 : ρ < 0 and H2 : ρ > 0 indicated that H0 is approximately four and five times more likely, respectively. These results were corroborated by the posterior probability, which was largest for H0 (0.689). This indicates that there was most evidence for an equal-effect model. The commonly used Q-test for homogeneity in a frequentist meta-analysis was not statistically significant (Q(9) = 9.417, p = 0.4). Note here that a nonsignificant result does not imply evidence for the null, as p values cannot be used to quantify evidence for the null (p values are by definition uniformly distributed if H0 were true). To conclude, the EMDR treatment was observed to be on average more efficacious than the CBT treatment, and the effects are homogeneous across studies. However, due to the uncertainty in the posterior probabilities and the parameter estimates (see Section “Bayesian estimation under the MAREMA model”), more studies are required to draw more definite conclusions.

Table 3 Bayes factors and posterior model probabilities (P(Hq|y)) for hypotheses on ρ. The results based on the meta-analysis by Ho and Lee (2012) are shown in the first columns and those by Whittaker et al. (2019) in the last columns of the table

Bayes factors for testing hypotheses on μ for the meta-analysis by Whittaker et al. (2019) are shown in the last columns of Table 2. Hypothesis H0 : μ = 0 received more support than hypotheses H1 : μ < 0 and H2 : μ > 0, but there was no strong evidence for any of the hypotheses (the largest Bayes factor equaled 2.558). This absence of strong evidence was likely also caused by the meta-analysis consisting of only three studies. The posterior model probabilities (last row of Table 2) also showed that H0 was most likely (probability 0.537). Application of a two-tailed frequentist hypothesis test resulted in a nonsignificant result, and thus H0 could not be rejected (z = 0.349, two-tailed p value = 0.727).

The hypothesis H0 : ρ = 0 received 10.958 times more evidence than H1 : ρ < 0 and 2.901 times more evidence than H2 : ρ > 0. Moreover, the hypothesis H2 : ρ > 0 was 3.778 times more likely than H1 : ρ < 0. This corroborated the posterior model probabilities, which indicated that the effect sizes were most likely to be either homogeneous (probability 0.696) or heterogeneous (probability 0.240). Interestingly, the frequentist Q-test was statistically significant (Q(2) = 6.24, p = 0.044). This may be due to the fact that the significance test relies on large sample theory, which may not be realistic given that there are only three studies. Using the posterior probabilities to obtain conditional error probabilities, there is a probability of 0.210 + 0.254 = 0.464 that we would be wrong when concluding that the hypothesis of no effect is true, and a probability of 0.064 + 0.240 = 0.304 that we would be wrong when concluding that the hypothesis of homogeneous effects is true, given the observed data. Thus (as expected) more data would be needed to obtain more pronounced evidence about which hypothesis is likely to be true.

Discussion

The main goals of meta-analyses are estimating the overall effect and the heterogeneity in effect size as well as drawing inferences for these parameters. This paper proposes novel Bayesian estimation and hypothesis testing methods to achieve these goals. Our approach is novel compared to alternative Bayesian meta-analysis methods because the framework builds on the MAREMA model where the (nuisance) study-specific effects are integrated out. This MAREMA model encompasses both an equal-effect and a random-effects meta-analytic model, and also encompasses a model which assumes that there is less variance than under the equal-effect and random-effects models.

Another major contribution is that we place a uniform prior distribution on the proportion of the variance that can be explained by between-study heterogeneity in true effect size rather than on the between-study variance or standard deviation in true effect size directly, as is usually done in Bayesian meta-analyses (Berry, 1998; Scheibehenne et al., 2017; Gronau et al., 2017). This relative variance has a clear standardized scale (Higgins et al., 2003) which facilitates its interpretation. Furthermore, the bounded parameter allows specifying a proper noninformative uniform prior in case prior information is absent or when a default (reference) test is preferred. Other advantages are that this prior does not depend on the effect size measure used in the meta-analysis, and avoids the need of eliciting a prior scale for which the Bayes factor can be highly sensitive.

We illustrated the proposed Bayesian hypothesis testing and estimation in two illustrative examples. Both the estimation and hypothesis testing results revealed large posterior uncertainty regarding the heterogeneity in true effect size. The uncertainty can be explained by the relatively small number of studies, which is common in meta-analyses (Rhodes et al., 2015; Turner et al., 2015), and therefore the obtained quantifications of posterior uncertainty seemed reasonable. More convincing evidence for one of the tested hypotheses will be obtained if more studies become available and can be included in the meta-analysis. Note here that because Bayes factors are consistent, the evidence in favor of the true hypothesis would go to infinity as the number of studies in a meta-analysis tends to infinity. The frequentist test for between-study heterogeneity, on the other hand, was statistically significant even though the example only included three studies. This may be another example that p values tend to overestimate the evidence against a null hypothesis (e.g., Berger & Delampady, 1987; Sellke et al., 2001; Benjamin & Berger, 2019), and that Bayes factors and posterior probabilities may better capture the evidence in favor of or against statistical hypotheses in the case of small samples.

In both applications, evidence was found that the heterogeneity (ρ) may be negative. This implies that an equal-effect meta-analysis model (where ρ = 0) or a random-effects model (where ρ > 0) may not be appropriate, and might cause bias in the results. A negative ρ may also indicate a violation of the assumptions of the meta-analysis model, which is "corrected" by allowing ρ to be negative. For example, the reported within-study sampling variances σᵢ² may be overestimated, the assumption of independent primary studies' effect sizes in a meta-analysis may be violated, or the effect sizes of the primary studies may have been incorrectly computed (Ioannidis et al., 2006). More research is needed to explore the possible causes of a negative ρ and how it affects the results. A good starting point would be Nielsen et al. (2021), who explored the effect of negative intraclass correlations in a multilevel model.

We have only tested point hypotheses where the parameter was constrained to zero or one-sided hypotheses comparing hypotheses where the parameter was smaller or larger than zero. Other hypotheses that can be tested are combined hypotheses (e.g., H : μ > 0 & ρ > 0) or hypotheses testing a parameter against another hypothesized value than zero. For example, meta-analysts may want to test point hypotheses using the rules of thumb for a small, medium, and large effect size as defined for many common effect size measures (see Cohen, 1988, for rules of thumb for multiple effect size measures). Point hypotheses on the heterogeneity may be constrained to ρ = 0.25, 0.5, and 0.75 to resemble low, moderate, and high heterogeneity according to the thresholds proposed by Higgins et al. (2003). It is important to note that any hypothesis on the proportion of total variance that can be explained by heterogeneity in true effect size can directly be transformed to a hypothesis on the between-study variance in true effect size. For example, testing the hypothesis H : ρ = 0 is equivalent to testing H : τ² = 0. If a hypothesis on ρ is tested against a value other than zero, the equivalent hypothesis on τ² can be obtained by transforming ρ accordingly.
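As an illustration of this transformation, the following sketch assumes that ρ is the proportion ρ = τ²/(τ² + σ̃²) for some typical within-study sampling variance σ̃². Both the exact definition of the typical variance in the MAREMA model and the numerical value of σ̃² below are our assumptions for illustration only:

```python
# Hedged sketch: converting a hypothesized value of rho into the equivalent
# tau^2, assuming rho = tau2 / (tau2 + s2_typ) for some typical within-study
# sampling variance s2_typ. The definition of the typical variance and the
# value of s2_typ below are illustrative assumptions.

def rho_to_tau2(rho: float, s2_typ: float) -> float:
    """Invert rho = tau2 / (tau2 + s2_typ) for tau2 (requires rho < 1)."""
    return rho * s2_typ / (1.0 - rho)

def tau2_to_rho(tau2: float, s2_typ: float) -> float:
    """Proportion of total variance due to between-study heterogeneity."""
    return tau2 / (tau2 + s2_typ)

s2_typ = 0.04  # hypothetical typical within-study sampling variance

# Benchmark values for low, moderate, and high heterogeneity:
for rho in (0.25, 0.5, 0.75):
    tau2 = rho_to_tau2(rho, s2_typ)
    assert abs(tau2_to_rho(tau2, s2_typ) - rho) < 1e-12  # round trip
    print(rho, round(tau2, 4))
```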

We proposed a framework for Bayesian hypothesis testing and estimation using minimally informative default prior distributions, but our methodology can readily be extended to other prior distributions. Informative priors can be specified to incorporate external information about the heterogeneity in true effect size or overall effect from research in comparable fields. This may provide better estimates and statistical inferences, especially if the number of primary studies in the meta-analysis is small. For example, if prior information about the heterogeneity parameter ρ is available, this could be translated to an informative stretched beta prior distribution (following, for example, Mulder & Fox, 2019), while prior information about the effect size could be translated to an informative normal prior. By considering different prior distributions one can assess the robustness of the quantification of the relative evidence in the data between statistical hypotheses under the MAREMA model.
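A stretched beta prior can be sketched as a Beta(a, b) variable rescaled from (0, 1) to the support of ρ. The interval and shape parameters below are purely illustrative assumptions: in the MAREMA model the lower bound of ρ depends on the data, and Mulder and Fox (2019) discuss how prior information can be translated into the shape parameters.

```python
# Illustrative sketch of a stretched beta prior: a Beta(a, b) draw rescaled
# from (0, 1) to an interval (lo, hi). The interval (-0.2, 1.0) and the
# shape parameters (2, 5) are assumptions for illustration only.

import random

def stretched_beta_sample(a, b, lo, hi, rng):
    """One draw from a Beta(a, b) distribution stretched to (lo, hi)."""
    return lo + (hi - lo) * rng.betavariate(a, b)

rng = random.Random(1)
draws = [stretched_beta_sample(2.0, 5.0, -0.2, 1.0, rng) for _ in range(10_000)]

# The mean of Beta(2, 5) is 2/7, so the stretched mean is lo + (hi - lo) * 2/7.
expected = -0.2 + 1.2 * 2 / 7
print(round(sum(draws) / len(draws), 3), round(expected, 3))
```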

Meta-analysts are generally not only interested in estimating and drawing inferences about the overall effect size and the between-study variance, but they are also interested in studying whether systematic differences between primary studies can explain the between-study variance. In case of large between-study variance, examining whether systematic differences between studies exist might even be more insightful than focusing on the overall effect size (Rouder et al., 2019). The between-study variance in a meta-analysis can be explained by including moderator variables in a meta-analysis model in a so-called meta-regression (Thompson & Sharp, 1999; Van Houwelingen et al., 2002). Future research may focus on extending our MAREMA model such that Bayesian estimation and Bayes factor testing can be conducted when moderators are included.

Future research may also focus on extending our methods to more advanced meta-analysis models such as multilevel meta-analysis (van den Noortgate & Onghena, 2003; Konstantopoulos, 2011) and multivariate meta-analysis (Jackson et al., 2011; Hedges, 2019). These models relax the strong assumption of the conventional random-effects model that effect sizes in a meta-analysis have to be independent. Another avenue for future research is studying whether relaxing the assumptions of the random-effects model benefits Bayesian meta-analysis under the MAREMA model in particular and Bayesian meta-analysis in general. For example, the random-effects model assumes that the within-study sampling variances are known (van Aert & Jackson, 2019; Jackson, 2009; Konstantopoulos & Hedges, 2019), and we adopted this assumption in the MAREMA model. This assumption is not tenable in practice, which can be problematic if the sample size of the studies is small. Estimates of the within-study sampling variance are then imprecise, which can lead to biased parameter estimates and inaccurate statistical inferences. This strong assumption can be avoided in a Bayesian meta-analysis by taking the uncertainty in these variances into account by means of a prior distribution instead of using a plug-in estimate. A logical choice for a prior distribution on the within-study sampling variances is the inverse-gamma distribution.
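The idea in the last two sentences can be sketched as follows. If the sample variance in study i satisfies (nᵢ − 1)sᵢ²/σᵢ² ~ χ²(nᵢ − 1) and σᵢ² is given an inverse-gamma prior IG(a, b), the prior is conjugate and the posterior is IG(a + (nᵢ − 1)/2, b + (nᵢ − 1)sᵢ²/2). This is a standard result rather than the specific extension proposed here, and the numbers below are illustrative.

```python
# Hedged sketch: conjugate inverse-gamma updating of a within-study
# sampling variance, instead of plugging in the estimate s2 directly.
# Prior: sigma^2 ~ IG(a, b); data: sample variance s2 from n observations.
# Posterior: IG(a + (n - 1) / 2, b + (n - 1) * s2 / 2). Numbers illustrative.

def inv_gamma_posterior(a, b, n, s2):
    """Posterior (shape, scale) of sigma^2 after observing sample variance s2."""
    return a + (n - 1) / 2.0, b + (n - 1) * s2 / 2.0

a0, b0 = 2.0, 0.1  # weakly informative prior: prior mean b0 / (a0 - 1) = 0.1
n, s2 = 20, 0.09   # a small study: n = 20 with sample variance 0.09

a1, b1 = inv_gamma_posterior(a0, b0, n, s2)
post_mean = b1 / (a1 - 1)  # mean of IG(a1, b1), defined for a1 > 1
print(a1, b1, round(post_mean, 4))
# -> 11.5 0.955 0.091
```

The posterior mean shrinks the plug-in estimate s² toward the prior mean, with the amount of shrinkage shrinking as n grows.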

Another topic for future research is studying to what extent the proposed estimation and Bayes factor test under the MAREMA model are affected by publication bias. Publication bias implies that studies with statistically significant findings are more likely to be published than studies with non-significant findings. Consequently, studies with non-significant findings are more difficult to locate and are less likely to be included in a meta-analysis. Due to publication bias, effect sizes of primary studies and, in turn, also the overall effect size of a meta-analysis are most likely positively biased (Kraemer et al., 1998; van Assen et al., 2015; Lane & Dunlap, 1978). We expect estimation and inferences based on the MAREMA model to be inaccurate if publication bias is severe, and recommend that researchers also apply and report methods that correct for publication bias in this case.

To conclude, we have proposed Bayesian hypothesis testing and estimation using the MAREMA model. The proposed Bayes factors allow testing point and one-sided hypotheses for both the overall effect and the heterogeneity in true effect size. We hope that our methods, together with the easy-to-use software included in the R package BFpack, enable researchers to routinely use Bayesian methods in their meta-analyses.