1 Introduction

Bayesian hierarchical models (BHMs) have become ubiquitous in many fields of scientific enquiry because of their ability to flexibly model complex natural systems while remaining relatively easy to build and modify for the questions at hand (Hobbs and Hooten 2015). Berliner (1996) clarified the conceptualization of a BHM based on three submodels: (1) data, (2) process, and (3) parameter models. The data model accounts for measurement error or other differences between the observable data and the true natural process about which the researcher would like to make inference. The process model describes the state of nature using a random process to account for its unknown intricacies. Finally, the parameter model describes our uncertainty about the parameters governing this random process. The full model is built by conditioning each submodel on the one above it. Traditionally, inference for BHMs has been conducted using some form of Markov chain Monte Carlo (MCMC) because the model structure is ideally suited to it (Gelfand and Smith 1990; Gelfand and Ghosh 2015). If not for the development of MCMC methods, widespread use of BHMs may never have come to pass (Green et al. 2015). With a large amount of data or complex submodels, however, MCMC can be computationally challenging or infeasible to implement (Hooten et al. 2021; Wikle 2003). Herein, we propose a multistage method for making inference at all levels of a BHM using different software in each stage. By breaking the estimation procedure into multiple stages based on a partition of the full data, the method becomes computationally efficient. By allowing the most suitable software platform in the initial stage, it can be made more efficient still, not only in computing time, but also in development time, because software designed to facilitate each specific initial stage analysis can be used.

To alleviate the computational challenges of using MCMC to fit BHMs in “big data” situations, several avenues of approach have developed in the literature. One is the development of so-called two-stage methods that break the computation into pieces based on partitions of the data (Goudie et al. 2019; Hooten et al. 2016, 2021; Lunn et al. 2013; Mesquita et al. 2020; Scott et al. 2016). The first stage involves fitting separate data models to the partitions using MCMC; then, in the second stage, the separate MCMC samples are combined to produce an MCMC sample as if the partitions had been analyzed simultaneously with the full BHM. A similar approach is the divide-and-conquer approach (Srivastava et al. 2018), which averages posterior distributions from data partitions to approximate an analysis of the full data.

An alternative to default MCMC for computationally challenging BHMs is to use optimization methods to explore the posterior parameter space rather than stochastic sampling (Green et al. 2015). Optimization of the posterior density aims to find the maximum a posteriori (MAP) estimate rather than the posterior mean as a point estimate. The main benefit is that optimizing the posterior with respect to the parameters often requires fewer evaluations of the posterior density. That benefit comes at a cost, however, because asymptotic results are usually required to describe parameter uncertainty (Van der Vaart 2000).

In an effort to provide a computationally efficient and flexible method for fitting a general class of BHMs, conditionally independent hierarchical models (CIHMs; Kass and Steffey 1989), we consider combining the benefits of meta-analysis, two-stage MCMC, and posterior optimization into a multistage method that can be used to quickly fit BHMs in big data situations or when the data-level models are themselves complex. Although meta-analysis and two-stage MCMC are both accomplished in multiple stages, two-stage MCMC was initially designed to analyze a BHM, whereas meta-analysis seeks to summarize results from various analyses by forming them into a BHM, namely a multivariate normal linear mixed model (Gasparrini et al. 2012). We can use the meta-analysis approach to approximate the second-stage combination of initial stage inference in a two-stage MCMC. This allows us to use multiple approaches to approximate the posterior inference in the initial stage, including optimization methods, deterministic sampling, or MCMC if it is convenient.

In what follows, we provide notation and a working definition of the CIHM as we will use it. We show how the full posterior can be approximated using multivariate normal approximations to first-stage posterior densities. Then, we demonstrate how the first-stage results can be combined using the linear mixed model approach of standard meta-analysis (Gasparrini et al. 2012; Higgins et al. 2009) when the first-stage parameters are random effects. An optional third stage is introduced that retroactively revisits the first-stage analysis and can improve estimates once the second-stage results are obtained. The method is then demonstrated on two different types of environmental data: a simple generalized linear mixed model (GLMM) and an integrated population model (IPM) for demographic parameter inference.

2 Conditionally Independent Hierarchical Model

The description of BHMs by Berliner (1996) is very general, and we consider only a specific, but still very broad, class of hierarchical models (CIHMs; Gelfand and Ghosh 2015; Kass and Steffey 1989). The general form of a CIHM is

$$\begin{aligned} \begin{aligned} \text {Data-level model: }&\varvec{y}= (\varvec{y}_1',\dots ,\varvec{y}_n')' \sim \prod _{i=1}^n[\varvec{y}_i|\varvec{\theta }_i, \varvec{\gamma }_i, \varvec{\eta }], \\ \text {Unit-level effects: }&\varvec{\theta }= (\varvec{\theta }_1',\dots ,\varvec{\theta }_n')' \sim [\varvec{\theta }|\varvec{\psi }],\\&\varvec{\gamma }= (\varvec{\gamma }_1',\dots ,\varvec{\gamma }_n')' \sim \prod _{i=1}^n[\varvec{\gamma }_i], \\ \text {Population effects: }&(\varvec{\psi }',\varvec{\eta }')' \sim [\varvec{\psi },\varvec{\eta }], \end{aligned} \end{aligned}$$
(1)

where “[A|B]” is used to represent the conditional probability density or distribution function of A given B (Gelfand and Smith 1990), \(\varvec{y}_i\) is the observed data set (or data partition) for the ith unit (\(i=1,\dots ,n\)), \(\varvec{\theta }_i\) are random unit-level parameters, \(\varvec{\psi }\) are population-level parameters for \(\varvec{\theta }_i\), \(\varvec{\gamma }_i\) are distinct (relative to each \(\varvec{y}_i\)) unit-level parameters for modeling the ith data set, and \(\varvec{\eta }\) are global (fixed) population-level parameters. Typically, \(\varvec{\psi }\) and \(\varvec{\eta }\) are the most scientifically interesting; however, there can be interest in each of the \(\varvec{\theta }_i\) and \(\varvec{\gamma }_i\) depending on the situation. Note that the unit-level effects are allowed to be correlated, which is slightly more general than the specification of Kass and Steffey (1989).

To place the development of the multistage methodology in context, we demonstrate its use on two ecological data sets. The first uses a generalized linear mixed model (GLMM) to assess the effect of wolf (Canis lupus) presence on the occurrence of moose (Alces alces) browse in willow stands. Although these data are not large in the modern sense, we demonstrate a two-stage analysis of them because the model can easily be fit in its full hierarchical form for comparison. Another common group of models in the ecological literature is the integrated population model (IPM). Initially proposed by Besbeas et al. (2002), IPMs are routinely used to combine information from multiple sparse data sets for improved inference on shared parameters. To demonstrate the two-stage meta-analysis approach for IPM fitting, we analyze the classic northern lapwing (Vanellus vanellus) data from Besbeas et al. (2002).

2.1 Moose Browse in the Presence of Wolves (GLMM)

This example involves 2,914 binary responses, \(y_{ij}\), which indicate the presence of moose browsing on tree \(j=1,\dots ,n_i\) in plantation \(i=1,\dots ,24\). The original study assessed whether tree browsing is reduced in plantations exposed to high wolf activity (van Beeck Calkoen et al. 2018), while also accounting for inherent moose preferences related to tree height. To perform a similar analysis, we use a random effects model that treats each plantation as a unit to learn about the fixed effect of high wolf presence on moose browsing while accounting for natural variation in moose response across the 24 plantations. The CIHM is given by

$$\begin{aligned} \begin{aligned}&y_{ij} \sim \text {Bernoulli}(p_{ij}), \\&\text {logit}(p_{ij}) = \delta _0 + \delta _{\text {height}}h_{ij} + \delta _{\text {wolf}}w_i + u_{0i} + u_{\text {height},i} h_{ij},\\&\mathbf {u}_i = (u_{0i}, u_{\text {height},i}) \sim \text {N}(\mathbf {0},\text {diag}(\sigma _1^2, \sigma _2^2)), \end{aligned} \end{aligned}$$

where \(h_{ij}\) equals the height of the jth tree in the ith plantation and \(w_i\) is a binary indicator of high wolf presence in the ith plantation.

To use the two-stage approach, we must hierarchically center the model because \(\delta _0\) and \(\delta _{\text {wolf}}w_i\) are not separately identifiable within plantation units. Using hierarchical centering (Gelfand et al. 1996), the model can be rewritten as

$$\begin{aligned} \begin{aligned} \text {logit}(p_{ij})&= \theta _{0i} + \theta _{\text {height},i}h_{ij},\\ \varvec{\theta }_i&= (\theta _{0i}, \theta _{\text {height},i}) \sim \text {N}(\mathbf {L}_i\varvec{\delta }, \varvec{\Sigma }) , \end{aligned} \end{aligned}$$

where

$$\begin{aligned} \mathbf {L}_i = \left[ \begin{array}{ccc} 1 &{} 0 &{} w_i \\ 0 &{} 1 &{} 0 \end{array} \right] \text { and } \varvec{\Sigma }= \text {diag}(\sigma ^2_1, \sigma ^2_2). \end{aligned}$$

In terms of the CIHM specification (1), \(\varvec{\psi }= (\varvec{\delta }, \sigma ^2_1, \sigma ^2_2)\). The parameters \(\varvec{\gamma }\) and \(\varvec{\eta }\) are not used.
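
To make the hierarchically centered specification concrete, the following R sketch simulates data from the model above. All numeric values (effect sizes, variances, tree heights, and trees per plantation) are illustrative assumptions, not the study's values.

```r
## Simulate from the hierarchically centered moose browse GLMM;
## all numeric values below are hypothetical.
set.seed(42)
n_plant <- 24
delta <- c(-1, 0.5, -0.7)                 # (delta_0, delta_height, delta_wolf)
Sigma <- diag(c(0.5^2, 0.2^2))            # diag(sigma_1^2, sigma_2^2)
w <- rbinom(n_plant, 1, 0.5)              # high wolf presence indicator
dat <- do.call(rbind, lapply(seq_len(n_plant), function(i) {
  L_i <- rbind(c(1, 0, w[i]), c(0, 1, 0)) # maps delta to E(theta_i)
  theta_i <- as.numeric(L_i %*% delta) + rnorm(2, 0, sqrt(diag(Sigma)))
  h <- runif(100, 0.5, 3)                 # tree heights, arbitrary range
  p <- plogis(theta_i[1] + theta_i[2] * h)
  data.frame(plant = i, height = h, browse = rbinom(100, 1, p))
}))
```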

2.2 Lapwing Demographic Modeling (IPM)

The typical IPM (Schaub and Abadi 2011) has the CIHM form

$$\begin{aligned} {[}\varvec{y}] = \prod _{i=1}^n[\varvec{y}_i|\varvec{\gamma }_i, \varvec{\eta }], \text { and } [\varvec{\gamma }] = \prod _{i=1}^n[\varvec{\gamma }_i], \end{aligned}$$

where \(\varvec{\gamma }_i\) are data-type distinct parameters and \(\varvec{\eta }\) are common parameters. Traditionally, there are no random unit-level effects (\(\varvec{\theta }_i\)) in IPMs. Multiple data sets are used to obtain better estimates of the common parameters, \(\varvec{\eta }\), than can be obtained from any single \(\varvec{y}_i\).

The lapwing data consist of \(n=2\) data sets, a ring-recovery data set, \(\varvec{y}_1\), and an index measure of the abundance of adult female birds obtained from a generalized additive model of counts, \(\varvec{y}_2\). These data have been analyzed previously to demonstrate IPM methodology for ecological data (Besbeas et al. 2002; Besbeas and Morgan 2019; Brooks et al. 2004; Goudie et al. 2019).

Ring-recovery models are based on marking animals and releasing them back into the wild; in subsequent years, marked animals are recovered after they have died. The likelihood for this type of model is a product-multinomial, where the numbers of ring recoveries in the years following release are multinomially distributed with cell probabilities that are functions of \(\phi _t\) (for \(t=1963,\dots ,1997\)), the probability of survival from time t to \(t+1\), and \(\lambda _t\), the probability that an animal that died between times \(t-1\) and t is recovered. To assess the influence of weather on survival and trends in recovery, the ring-recovery parameters are modeled with

$$\begin{aligned} \begin{aligned} \text {logit}(\phi _{yt})&= \eta _{y0} + \eta _{y1} x_t, \\ \text {logit}(\phi _{at})&= \eta _{a0}+ \eta _{a1} x_t,\\ \text {logit}(\lambda _t)&= \gamma _{\lambda 0} + \gamma _{\lambda 1} t, \end{aligned} \end{aligned}$$
(2)

where \(\phi _{yt}\) is the survival of first-year birds, \(\phi _{at}\) is the survival of adult females, and \(x_t\) is the number of days below freezing.

The census index data consist of noisy measures of female adult lapwing abundance in the study area. To make inference about population dynamics, we use the state-space model of Besbeas et al. (2002),

$$\begin{aligned} \begin{aligned}&\mathbf {N}_{t+1} = \mathbf {T}_t\mathbf {N}_{t} + \varvec{\epsilon }_{t}\\&y_{t} \sim \text {N}(N_{at}, \sigma ^2), \end{aligned} \end{aligned}$$
(3)

where \(\mathbf {N}_t = (N_{yt}, N_{at})'\) contains the numbers of yearling and adult females, respectively, in year t, and \(y_t\) is the noisy count of adult females. The transition matrix, \(\mathbf {T}_t\), is parameterized in terms of \(\phi _t\) and the rate of offspring production, \(\rho _t\). The survival parameters are the same as in the ring-recovery model, but production is modeled on the log scale using

$$\begin{aligned} \log (\rho _t) = \gamma _{\rho 0} + \gamma _{\rho 1} t. \end{aligned}$$

Full model details are provided in online appendix S2.

The IPM melds the information in the two data sets to allow inference for all parameters of the joint model,

$$\begin{aligned} \begin{aligned} \varvec{\eta }&= (\eta _{y0}, \eta _{y1}, \eta _{a0}, \eta _{a1})', \\ \varvec{\gamma }_1&= (\gamma _{\lambda 0}, \gamma _{\lambda 1})' \text {, and } \\ \varvec{\gamma }_2&= (\gamma _{\rho 0}, \gamma _{\rho 1}, \log \sigma , \log N_{y0}, \log N_{a0})'. \end{aligned} \end{aligned}$$

Random unit-level \(\varvec{\theta }\) parameters are not traditionally used, but see the analysis in Sect. 4.2 for an alternative version. In each of the previous studies analyzing these data, bespoke code was written in MATLAB, JAGS (Plummer 2003), NIMBLE (de Valpine et al. 2017), or some other language to optimize the joint likelihood or execute an MCMC. For a rather simple IPM like this one, that is straightforward to code, but for more complicated capture–recapture or abundance models, it can become an impediment for practitioners. Many of the individual submodels in IPMs have software available for fitting them separately; for the ring-recovery data in this analysis, production-level software to fit the appropriate capture–recapture model has existed for over 20 years. Thus, we should be able to take advantage of this substantial development in capture–recapture data analysis.

3 Methods

We propose a two-stage method (extended to an optional third stage) for analyzing CIHMs of the form (1). In the first stage, each data set, \(\varvec{y}_i\), is analyzed individually according to its data-level model in the CIHM using the best method and software available for that particular data type. Then, in the second stage, the results are combined (melded) using the unit-level and population-level models to approximate the inference as if the entire CIHM had been fit in one step (e.g., via MCMC). In an optional third stage, we can revisit estimates of the distinct unit parameters, \(\varvec{\gamma }_i\), to see if further updating is possible.

3.1 Two-Stage Approximate Bayesian Inference

If we were to fit the CIHM in the traditional way, we would use summaries from the full posterior distribution

$$\begin{aligned} {[}\varvec{\psi }, \varvec{\eta }, \varvec{\gamma }, \varvec{\theta }|\varvec{y}] \propto [\varvec{y}|\varvec{\theta }, \varvec{\gamma }, \varvec{\eta }] [\varvec{\theta }|\varvec{\psi }][\varvec{\gamma }][\varvec{\psi },\varvec{\eta }]. \end{aligned}$$

Suppose, however, we only possess a single \(\varvec{y}_i\). Then, we might estimate \(\varvec{\vartheta }_i = (\varvec{\theta }_i', \varvec{\gamma }_i', \varvec{\eta }')'\) using the posterior distribution

$$\begin{aligned} {[}\varvec{\vartheta }_i|\varvec{y}_i] \propto [\varvec{y}_i|\varvec{\vartheta }_i] [\varvec{\vartheta }_i], \end{aligned}$$

where \([\varvec{\vartheta }_i]\) is a prior distribution that would be used in the absence of additional individuals. Because \([\varvec{\vartheta }_i|\varvec{y}_i] \propto [\varvec{y}_i|\varvec{\vartheta }_i][\varvec{\vartheta }_i]\), each data-level contribution satisfies \([\varvec{y}_i|\varvec{\vartheta }_i] \propto [\varvec{\vartheta }_i|\varvec{y}_i]/[\varvec{\vartheta }_i]\), so we can rewrite the full hierarchical model posterior as

$$\begin{aligned} {[}\varvec{\psi }, \varvec{\eta }, \varvec{\gamma }, \varvec{\theta }|\varvec{y}] \propto \left( \prod _{i=1}^n\frac{[\varvec{\vartheta }_i|\varvec{y}_i]}{[\varvec{\vartheta }_i]}\right) [\varvec{\theta }|\varvec{\psi }][\varvec{\gamma }][\varvec{\psi },\varvec{\eta }]. \end{aligned}$$
(4)

Stage I of the method we propose begins by finding estimates \(\hat{\varvec{\vartheta }}_i\) and covariance matrices \(\hat{\varvec{S}}_i\) such that we can approximate

$$\begin{aligned} \frac{[\varvec{\vartheta }_i|\varvec{y}_i]}{[\varvec{\vartheta }_i]} \propto [\varvec{y}_i|\varvec{\vartheta }_i] \approx \text {N}(\varvec{\vartheta }_i|\hat{\varvec{\vartheta }}_i, \hat{\varvec{S}}_i) \end{aligned}$$
(5)

for each data set or partition. This approximation is motivated by standard large-sample asymptotic results, which require \([\varvec{y}_i|\varvec{\vartheta }_i]\) to satisfy the usual regularity conditions on the likelihood function, namely that the “true” \(\varvec{\vartheta }_i\) is not on the boundary of the parameter space and that \([\varvec{y}_i|\varvec{\vartheta }_i]\) is twice continuously differentiable (Kass and Steffey 1989; Le Cam and Yang 2000). If the log-likelihood is approximately quadratic in \(\varvec{\vartheta }_i\), however, the approximation can work well even for modest sample sizes (Geyer 2005).

A natural choice for use in (5) is to take \(\hat{\varvec{\vartheta }}_i\) to be the maximum likelihood estimate (MLE) of \(\varvec{\vartheta }_i\) and \(\hat{\varvec{S}}_i\) to be the negative inverse Hessian of the log-likelihood. For now, we assume that the size of \(\varvec{y}_i\) is large enough, and the parameters identifiable enough, that \(\hat{\varvec{\vartheta }}_i\) is well defined and \(\hat{\varvec{S}}_i^{-1}\) is nonnegative definite. The main benefit of using the MLE is that there is virtually always software available for finding the MLE and its large-sample covariance matrix for common data-level models.
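
As a concrete illustration of (5), the following R sketch computes \(\hat{\varvec{\vartheta }}_i\) and \(\hat{\varvec{S}}_i\) for a single partition under a hypothetical Bernoulli data model; `y` (the 0/1 responses) and `H` (the design matrix for unit i) are assumed inputs.

```r
## Stage I for one partition: MLE and inverse observed information.
## `y` (0/1 responses) and `H` (design matrix) are hypothetical inputs.
negloglik <- function(theta, y, H) {
  p <- plogis(H %*% theta)
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}
fit <- optim(rep(0, ncol(H)), negloglik, y = y, H = H,
             method = "BFGS", hessian = TRUE)
theta_hat <- fit$par          # MLE of vartheta_i
S_hat <- solve(fit$hessian)   # negative inverse Hessian of the log-likelihood
```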

If \({{\varvec{\vartheta }}_{i}}\) is only weakly identified in \([\varvec{y}_i|\varvec{\vartheta }_i]\), then the MLE may be difficult to find. Thus, calculating an estimate from the posterior distribution \([\varvec{\vartheta }_i|\varvec{y}_{i}] \propto [\varvec{y}_{i}|\varvec{\vartheta }_i][\varvec{\vartheta }_i]\) might be desirable, where the temporary prior \([\varvec{\vartheta }_i]\) is chosen to make all parameters identifiable. In this case, one might use the maximum a posteriori (MAP) estimate obtained by optimizing \([{\varvec{\vartheta }}_{i}|{\varvec{y}}_{i}]\), with the inverse Hessian for uncertainty as with the MLE, or one could use a posterior sample (e.g., MCMC or importance sampling) to find the posterior mean and covariance matrix. Similar general regularity conditions apply to the MLE, MAP, or MCMC estimates with respect to asymptotic approximation of well-identified parameters. However, for nearly unidentifiable parameters in \([{\varvec{\vartheta }}_{i}|{\varvec{y}}_{i}]\), asymptotic approximations can be poor even for large samples because, no matter how large the data set is, the posterior will be nearly (or exactly) equal to the prior distribution. Therefore, because we are free to choose the temporary prior to aid stage I model fitting, we suggest using a normal prior

$$\begin{aligned} {[}{{\varvec{\vartheta }}_{i}}] = \text {N}({{\varvec{\vartheta }}_{i}}|{{\varvec{\vartheta }}_{0}}, {{\varvec{S}}_{0}}). \end{aligned}$$

A reparameterization of \({{\varvec{\vartheta }}_{i}}\) might be necessary such that a normal density is appropriate as a prior, but this will help the accuracy of the normal approximation in stage I even for poorly identified parameters. Note that the support of the temporary prior needs to be the same as the desired parameter model \([\varvec{\theta }_i|\varvec{\psi }][\varvec{\gamma }_i][\varvec{\eta }]\). With the use of the temporary prior, the approximation becomes

$$\begin{aligned} \frac{[{\varvec{\vartheta }}_{i}|{\varvec{y}}_{i}]}{[{\varvec{\vartheta }}_{i}]} \approx \frac{\text {N}({\varvec{\vartheta }}_{i}|{\check{\varvec{\vartheta }}}_{i}, {\check{\varvec{S}}}_{i})}{\text {N}({\varvec{\vartheta }}_{i}|{\varvec{\vartheta }}_{0}, {\varvec{S}}_{0})} \propto \text {N}({\varvec{\vartheta }}_{i}|{\hat{\varvec{\vartheta }}}_{i}, {\hat{\varvec{S}}}_{i}), \end{aligned}$$
(6)

where \({\check{\varvec{\vartheta }}}_{i}\) and \({\check{\varvec{S}}}_{i}\) are the posterior mode (or mean) and negative inverse Hessian (or covariance matrix) of \([{\varvec{\vartheta }_{i}}|{\varvec{y}_{i}}]\) and

$$\begin{aligned} {\hat{\varvec{S}}}_{i} = \left( {\check{\varvec{S}}}_{i}^{-1} - \varvec{S}_{0}^{-1}\right) ^{-1} \ \ \text {and}\ \ {\hat{\varvec{\vartheta }}}_{i} = {\hat{\varvec{S}}}_{i} \left( {\check{\varvec{S}}}_{i}^{-1}{\check{\varvec{\vartheta }}}_{i} - {{\varvec{S}}_{0}}^{-1}{{\varvec{\vartheta }}_{0}}\right) . \end{aligned}$$
(7)

This ratio was also used by Goudie et al. (2019) in an approximate two-stage MCMC method.
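
In practice, (7) amounts to a subtraction on the precision scale. A minimal R sketch, assuming the stage I mode `check_theta`, covariance `check_S`, and temporary prior moments `theta0` and `S0` are already available:

```r
## Remove the temporary normal prior via (7); all inputs hypothetical.
check_P <- solve(check_S)                 # stage I posterior precision
S_hat <- solve(check_P - solve(S0))       # corrected covariance
theta_hat <- S_hat %*% (check_P %*% check_theta - solve(S0) %*% theta0)
```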

Using (4) and (5) for \([\varvec{\vartheta }_{i}|\varvec{y}_{i}]/[\varvec{\vartheta }_{i}]\), the desired full posterior density function for all the parameters can be approximated (up to a proportional constant) by

$$\begin{aligned} {[}\varvec{\psi },\varvec{\eta }, \varvec{\gamma }, \varvec{\theta }|\varvec{y}] \approx \left( \prod _{i=1}^n\text {N}(\hat{\varvec{\vartheta }}_i|\varvec{\vartheta }_i, \hat{\varvec{S}}_i)\right) [\varvec{\theta }|\varvec{\psi }][\varvec{\gamma }] [\varvec{\psi },\varvec{\eta }]. \end{aligned}$$
(8)

Note that, in (8), we have put \(\varvec{\vartheta }_i\) in the mean position of the normal specification. Due to the symmetry of the normal density function, this is equivalent, but this way \(\hat{\varvec{\vartheta }}_i\) can be interpreted as an observation. In stage II, we can make full inference for all model parameters by maximizing the approximate full posterior (8) to obtain the posterior mode, or maximum a posteriori (MAP) estimate, \((\tilde{\varvec{\psi }}',\tilde{\varvec{\eta }}',\tilde{\varvec{\gamma }}',\tilde{\varvec{\theta }}')'\); the negative inverse Hessian of the log posterior density then estimates the posterior covariance matrix. Hats are used to represent stage I estimates and tildes to represent stage II estimates. For the remainder of the paper, we will assume that some measure of location, \(\hat{\varvec{\vartheta }}_i\), and scale, \(\hat{\varvec{S}}_i\), are available, where \(\hat{\varvec{S}}_i^{-1}\) is nonnegative definite. These can come from any combination of MLEs, MAPs, or MCMC samples; whatever software is most appropriate or available for the data model can be used, and different methods can be used for different stage I units.

3.2 Normal Random Unit-Level Effects

While there is no requirement that the unit-level random effects, \(\varvec{\theta }\), be normally distributed in (8), normality holds in the vast majority of applications, and \(\varvec{\theta }_i\) can often be reparameterized so that a normal distribution is appropriate. Thus, for the remainder of the paper, we assume that

$$\begin{aligned} {[}\varvec{\theta }|\varvec{\psi }] = \text {N}(\varvec{\theta }|\varvec{\mu }, \varvec{\Sigma }), \end{aligned}$$

where \(\varvec{\psi }= (\varvec{\mu }', \varvec{\sigma }')'\) and \(\varvec{\Sigma }\) is a covariance matrix parameterized by \(\varvec{\sigma }\). When this is the case, stage II becomes equivalent to fitting a multivariate normal random effects linear model.

First, we reparameterize the model using \(\varvec{\theta }_i = \varvec{\mu }+ \varvec{u}_i\) where \(\varvec{u}= (\varvec{u}_1',\dots ,\varvec{u}_n')'\sim \text {N}(\mathbf {0},\varvec{\Sigma })\), \(\varvec{\alpha }= (\varvec{\eta }', \varvec{\gamma }_1',\dots ,\varvec{\gamma }_n')'\), \(\varvec{\alpha }_i = (\varvec{\eta }', \varvec{\gamma }_i')'\), and \(\varvec{\beta }=(\varvec{\mu }', \varvec{\alpha }')'\). Now, the approximation (8) becomes

$$\begin{aligned} {[}\varvec{\mu },\varvec{\sigma }, \varvec{\eta }, \varvec{\gamma }, \varvec{u}|\varvec{y}] \approx \prod _{i=1}^n\left( \text {N}(\hat{\varvec{\vartheta }}_i|\varvec{X}_i\varvec{\beta }+\varvec{Z}_i\varvec{u}_i, \hat{\varvec{S}}_i)\right) \text {N}(\varvec{u}|\mathbf {0},\varvec{\Sigma }) [\varvec{\gamma }][\varvec{\mu }, \varvec{\sigma }][\varvec{\eta }], \end{aligned}$$
(9)

where \(\varvec{X}_i = \text {diag}(\varvec{I}, \varvec{K}_i)\), \(\varvec{I}\) is an identity matrix of dimension \(|\varvec{\mu }|\), \(\varvec{K}_i\) is a \(|\varvec{\alpha }_i|\times |\varvec{\alpha }|\) indicator matrix that selects the appropriate elements of \(\varvec{\alpha }\), and \(\varvec{Z}_i\) is a \(|\varvec{\vartheta }_i|\times |\varvec{\theta }_i|\) indicator matrix that selects \(\varvec{u}_i\) to align with the \(\varvec{\theta }_i\) portion of \(\varvec{\vartheta }_i\). If \(\varvec{\mu }\) is represented as a linear model, as in the moose browse example, then \(\varvec{I}\) is replaced with the appropriate design matrix (e.g., \(\varvec{L}_i\) in Sect. 2.1). After concatenating \(\varvec{\vartheta }= (\varvec{\vartheta }_1',\dots ,\varvec{\vartheta }_n')'\) and \(\varvec{X}= (\varvec{X}_1',\dots ,\varvec{X}_n')'\), forming the block-diagonal \(\varvec{Z}= \text {diag}(\varvec{Z}_1,\dots ,\varvec{Z}_n)\), and defining \(\varvec{\epsilon } \sim \text {N}(\mathbf {0}, \text {diag}(\hat{\varvec{S}}_1,\dots ,\hat{\varvec{S}}_n))\), finding the MAP of \((\varvec{\mu },\varvec{\sigma }, \varvec{\eta }, \varvec{\gamma }, \varvec{u})\) is equivalent to fitting the multivariate linear mixed model

$$\begin{aligned} \hat{\varvec{\vartheta }} = \varvec{X}\varvec{\beta }+ \varvec{Z}\varvec{u}+ \varvec{\epsilon }, \end{aligned}$$
(10)

subject to the prior distribution \([\varvec{\gamma }][\varvec{\mu }, \varvec{\sigma }][\varvec{\eta }]\).
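
For readers who prefer not to use TMB (Sect. 4), the stage II MAP fit of (10) can be sketched directly in R. The sketch below assumes the stacked stage I output `vartheta_hat`, the block-diagonal covariance `S_hat`, design matrices `X` and `Z`, `n_u` random effects, and, for simplicity, a single random-effect standard deviation with vague normal priors; none of these choices come from the paper's examples.

```r
## Stage II MAP for model (10) via direct optimization; a sketch only.
library(mvtnorm)

neg_log_post <- function(par, vartheta_hat, S_hat, X, Z, n_u) {
  beta    <- par[seq_len(ncol(X))]
  log_sig <- par[ncol(X) + 1]              # log random-effect SD (sigma)
  u       <- par[-seq_len(ncol(X) + 1)]    # stacked unit-level deviations
  resid   <- as.numeric(vartheta_hat - X %*% beta - Z %*% u)
  -dmvnorm(resid, sigma = S_hat, log = TRUE) -       # stage I pseudo-data
    sum(dnorm(u, 0, exp(log_sig), log = TRUE)) -     # [u | sigma]
    sum(dnorm(beta, 0, 10, log = TRUE))              # vague prior on beta
}
fit <- optim(rep(0, ncol(X) + 1 + n_u), neg_log_post,
             vartheta_hat = vartheta_hat, S_hat = S_hat, X = X, Z = Z,
             n_u = n_u, method = "BFGS", hessian = TRUE)
V_tilde <- solve(fit$hessian)              # approximate posterior covariance
```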

In the analyses of the two ecological data sets, we use the R package TMB to fit (9); but because (9) is just a multivariate linear mixed model, any software that can account for the parameter prior distributions can be used. TMB uses Laplace's method to find the MAP of \(\varvec{\psi }= (\varvec{\mu },\varvec{\sigma })\), \(\varvec{\eta }\), and \(\varvec{\gamma }\) using the marginal posterior \([\varvec{\mu },\varvec{\sigma }, \varvec{\eta }, \varvec{\gamma }|\varvec{y}]\), and then a generalized delta method (Kass and Steffey 1989) is employed to estimate \(\varvec{u}\) while accounting for uncertainty in the other parameters (Kristensen et al. 2016; Skaug and Fournier 2006). At first, it might seem as if we are adding another layer of approximation by using the Laplace method; however, given the normal approximation already made in (8) and the normality of the unit-level parameters, the Laplace approximation to the marginal posterior density of \((\varvec{\mu },\varvec{\sigma }, \varvec{\eta }, \varvec{\gamma })|\varvec{y}\) obtained from (9) is actually exact because the log posterior is quadratic in \(\varvec{u}\) (Goutis and Casella 1999). After finding the MAP, the negative inverse Hessian of the log of (9) is traditionally used to estimate the covariance matrix of the full model, which is what we use in the examples that follow.

3.3 Stage III: Revisiting Stage I

Using the description to this point, we are set to perform two-stage inference for a CIHM. However, we noticed in our initial analysis of the lapwing data (Sect. 4.2) that when \(\varvec{S}_i^{-1} \approx \mathbf {0}\) (relative to \(\varvec{\vartheta }_i\)) for a particular unit in stage I, updating \(\hat{\varvec{\gamma }}_i\) might not be as efficient as desired. Therefore, we propose revisiting stage I estimation to update \(\varvec{\gamma }_i\) given the results of the second stage using the generalized delta method initially proposed by Kass and Steffey (1989). Note that we are not proposing that stage III is always necessary, but that it is an option, especially if \(\varvec{\gamma }_i\) is not well identified in the data model alone.

For a heuristic example of lack of updating for \(\hat{\varvec{\gamma }}_i\), consider the simple two-parameter model structure:

$$\begin{aligned} \varvec{y}_1 \sim [\varvec{y}_1|\varvec{\eta }]\ \text { and } \varvec{y}_2 \sim [\varvec{y}_2|\varvec{\eta },\varvec{\gamma }_2]. \end{aligned}$$
(11)

Further, suppose that \(\varvec{\eta }\) and \(\varvec{\gamma }_2\) are jointly only vaguely identifiable in \([\varvec{y}_2|\varvec{\eta },\varvec{\gamma }_2]\), whereas \(\varvec{\eta }\) is fully identifiable in \([\varvec{y}_1|\varvec{\eta }]\). In stage I, we can obtain well-informed estimates \(\hat{\varvec{\eta }}_1\) and \(\hat{\varvec{S}}_1\). However, for \(\varvec{y}_2\), \(\hat{\varvec{S}}_2^{-1}\) will be small. In the extreme case that \(\hat{\varvec{S}}_2^{-1} \rightarrow \varvec{0}\), one can show (through the generalized least-squares estimate of \(\varvec{\beta }= (\varvec{\eta },\varvec{\gamma }_2)\) in (10)) that \(\tilde{\varvec{\eta }} \rightarrow {\hat{\varvec{\eta }}}_1\) for any \({\hat{\varvec{\eta }}}_2\), whereas \(\tilde{\varvec{\gamma }}_2 \rightarrow {\hat{\varvec{\gamma }}}_2\), remaining unchanged from stage I. So, we see satisfactory updating of \(\hat{\varvec{\eta }}_2\) but no updating of \(\hat{\varvec{\gamma }}_2\) in stage II. This case is admittedly pathological, but it illustrates that the Hessian matrix of \([\varvec{y}_2|\varvec{\eta }=\hat{\varvec{\eta }}_2, \hat{\varvec{\gamma }}_2]\) may not be the best representation of the likelihood shape at the full MAP.

Stage III proceeds by first marginalizing over \(\varvec{\gamma }_i\). This can be done analytically when possible; more generally, based on the normal approximation made in stage I, once \({\hat{\varvec{\vartheta }}}_{i}\) and \(\hat{\varvec{S}}_i\) have been calculated, the rows and columns associated with \(\varvec{\gamma }_i\) can simply be dropped when moving to stage II. Let \(\varvec{\xi }\) represent all parameters except \(\varvec{\gamma }_i\); after stage II, we obtain the marginal posterior approximation:

$$\begin{aligned} {[}\varvec{\xi }|\varvec{Y}] \approx \text {N}(\varvec{\xi }\mid \tilde{\varvec{\xi }}, \widetilde{\varvec{V}}_{\xi }), \end{aligned}$$

where \(\widetilde{\varvec{V}}_{\xi }\) is the negative inverse of the Hessian of the log-posterior distribution represented by the stage II model (9) and the associated prior distributions. If \(\varvec{\gamma }_i\) is a nuisance parameter of little interest, we can stop here, and there is no need for stage III. However, if an estimate of \(\varvec{\gamma }_i\) is desired, we can obtain an updated mean and covariance matrix for the full posterior approximation such that

$$\begin{aligned} {[}\varvec{\gamma }_i ,\varvec{\xi }|\varvec{Y}] \approx \text {N}((\varvec{\gamma }_i ,\varvec{\xi }) \mid \bar{\varvec{\xi }}, \overline{\varvec{V}}), \end{aligned}$$

where \(\bar{\varvec{\xi }} = (\bar{\varvec{\gamma }}_i,\tilde{\varvec{\xi }})\), \(\bar{\varvec{\gamma }}_i\) is the value that maximizes the conditional posterior \([\varvec{\gamma }_i|\tilde{\varvec{\theta }}_i, \tilde{\varvec{\eta }}, \varvec{y}_i]\), and \(\overline{\varvec{V}}\) is an updated posterior covariance matrix. To calculate \(\overline{\varvec{V}}\), we need to define the matrices:

$$\begin{aligned} \begin{aligned} \mathbf {H}_{\gamma \gamma }&= \frac{\partial ^2}{\partial \varvec{\gamma }_i \partial \varvec{\gamma }_i'} \left\{ \log [\varvec{y}_i|\tilde{\varvec{\theta }}_i, \tilde{\varvec{\eta }}, \bar{\varvec{\gamma }}_i] + \log [\bar{\varvec{\gamma }}_i] \right\} \text { and } \\ \mathbf {H}_{\gamma \xi }&= \frac{\partial ^2}{\partial \varvec{\gamma }_i \partial \varvec{\xi }_i'} \log [\varvec{y}_i|\tilde{\varvec{\theta }}_i, \tilde{\varvec{\eta }}, \bar{\varvec{\gamma }}_i]. \end{aligned} \end{aligned}$$

Now, the full covariance matrix can be approximated by (Kass and Steffey 1989; Skaug and Fournier 2006):

$$\begin{aligned} \begin{aligned} \overline{\varvec{V}}&= \left[ \begin{array}{cc} -\mathbf {H}_{\gamma \gamma }^{-1} &{} \varvec{0}\\ \varvec{0}&{} \varvec{0}\end{array}\right] + \mathbf {J} \widetilde{\varvec{V}}_{\xi } \mathbf {J}', \text { where } \\ \mathbf {J}&= \left[ \begin{array}{c} -\mathbf {H}_{\gamma \gamma }^{-1}\mathbf {H}_{\gamma \xi } \\ \mathbf {I}_{\xi } \end{array}\right] . \end{aligned} \end{aligned}$$

This may not be applicable in all situations because one must be able to evaluate \([\varvec{\gamma }_i|\tilde{\varvec{\theta }}_i, \tilde{\varvec{\eta }}, \varvec{y}_i]\) up to a proportionality constant, as well as its gradient and Hessian, for various parameter values. This may not be practical with most production software used in stage I.
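
When the required derivatives are available, the update can be carried out numerically. Below is a sketch using the numDeriv package, assuming hypothetical functions `log_cond_post(gamma)` (the log of \([\varvec{\gamma }_i|\tilde{\varvec{\theta }}_i, \tilde{\varvec{\eta }}, \varvec{y}_i]\) up to a constant) and `log_lik(gamma, xi)` (the data-model log-likelihood), a starting value `gamma_start`, and stage II output `xi_tilde` and `V_xi_tilde`.

```r
## Stage III covariance update via the generalized delta method; a sketch.
library(numDeriv)

gamma_bar <- optim(gamma_start, function(g) -log_cond_post(g),
                   method = "BFGS")$par        # conditional posterior mode
H_gg <- hessian(log_cond_post, gamma_bar)      # H_{gamma gamma}
H_gx <- jacobian(function(xi)                  # H_{gamma xi}: cross-derivatives
          grad(function(g) log_lik(g, xi), gamma_bar), xi_tilde)
J <- rbind(-solve(H_gg) %*% H_gx, diag(length(xi_tilde)))
V_bar <- J %*% V_xi_tilde %*% t(J)             # second term of the formula
k <- seq_along(gamma_bar)
V_bar[k, k] <- V_bar[k, k] - solve(H_gg)       # add the -H_{gamma gamma}^{-1} block
```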

One could ask whether \(\varvec{\theta }_i\) might suffer the same lack of updating in stage II as \(\varvec{\gamma }_i\) when the likelihood is generally flat. However, this is much less likely because there is a unit-level model, \([\varvec{\theta }|\varvec{\psi }]\), that is being explicitly fit in the second stage. This allows information in other data sets to directly influence and update \(\varvec{\theta }_i\) at stage II. In contrast, \(\varvec{y}_j\), \(j\ne i\), can only contribute to updates of \(\hat{\varvec{\gamma }}_i\) via the off-diagonal elements of \(\hat{\varvec{S}}_i^{-1}\).

4 Model Fitting and Results

4.1 Moose Browse GLMM

To effectively use the two-stage approach in this example, it was necessary to use a stage I temporary prior, (6) and (7), because browsing is rare (or absent) in some of the plantations. We used the normal g-prior (Hanson et al. 2014), where \(\varvec{\theta }_0 = \mathbf {0}\), \(\varvec{S}_0 = \pi ^2 n_i (\mathbf {H}_i'\mathbf {H}_i)^{-1}/6\), and \(\mathbf {H}_i\) is the usual design matrix with one column for the intercept and another containing tree height (\(h_{ij}\)). This prior produces \(p_{ij}\) that are approximately uniform(0,1) distributed for a randomly selected \(h_{ij}\). The effect of this prior was removed before stage II using (7).
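
Constructing this temporary prior is a one-liner once the plantation design matrix is formed; a sketch with hypothetical tree heights `h_i` for plantation i:

```r
## Normal g-prior for stage I (Hanson et al. 2014); `h_i` is hypothetical.
H_i <- cbind(1, h_i)                          # intercept and tree height
S_0 <- pi^2 * nrow(H_i) * solve(t(H_i) %*% H_i) / 6
theta_0 <- rep(0, 2)
```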

For stage I model fitting, we used two approaches. In the first version (A), we fitted each individual model using the R package mgcv (Wood 2011) to obtain the posterior mode and Hessian-based covariance matrix from \([\varvec{\theta }_i|\varvec{y}_i] \propto [\varvec{y}_i|\varvec{\theta }_i][\varvec{\theta }_i]\). In the second version (B), we used the approximate posterior mean and covariance matrix obtained from a deterministic posterior sampling procedure detailed in Johnson et al. (2011) (see online appendix S1). This deterministic procedure estimates the posterior mean and variance without asymptotic results. For version B, the likelihood must be calculated with bespoke code; we are not aware of any production-level software that returns the value of the posterior density for logistic regression, which makes this version more challenging in practice than the first, but we explore it here nonetheless. In stage II, we used TMB (Kristensen et al. 2016) to fit the linear mixed model in (10). Code for this example, as well as the IPM example, is provided in an online supplement.
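
A minimal sketch of the version A workflow, fitting each plantation separately with mgcv and collecting the stage I moments; the temporary g-prior correction (7) is omitted for brevity, and `dat` is a hypothetical data frame like the one simulated in Sect. 2.1.

```r
## Stage I, version A: per-plantation logistic fits with mgcv (sketch).
library(mgcv)
stage1 <- lapply(split(dat, dat$plant), function(d) {
  fit <- gam(browse ~ height, family = binomial, data = d)
  list(theta = coef(fit),    # stage I location estimate
       S     = vcov(fit))    # Bayesian posterior covariance from mgcv
})
vartheta_hat <- unlist(lapply(stage1, `[[`, "theta"))   # stacked estimates
```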

We examined predictive estimates \(\tilde{\varvec{\theta }}_i = \varvec{X}_i\tilde{\varvec{\beta }} + \tilde{\varvec{u}}_i\), fixed effects \(\tilde{\varvec{\beta }}\), and variance parameters \((\log \tilde{\sigma }_1, \log \tilde{\sigma }_2)\) with respect to a full MCMC analysis. Both versions of the two-stage approach produced results similar to a gold standard MCMC analysis using all the data simultaneously (Figs. 1 and 2). Heuristically, for both \(\varvec{\theta }_i\) and \(\varvec{\Sigma }\), the mean and covariance matrix obtained from stage I, version B (deterministic posterior sample), produced estimates more similar to the full-model MCMC estimates than those derived from the Bayesian MAP fitting with mgcv, but only slightly so. Scientific inference for any of the three methods would be identical, especially for the fixed parameters \(\varvec{\beta }\).

4.2 Lapwing IPM

To demonstrate how we might utilize production software, in the first stage, we fitted the ring-recovery model using MARK (White and Burnham 1999) via the user interface R package RMark (Laake 2013). MARK provides the MLE \(\hat{\varvec{\alpha }}_1 = (\hat{\varvec{\eta }}_1', \hat{\varvec{\gamma }}_1')'\) and its large-sample covariance matrix \(\hat{\varvec{S}}_1\). For the state-space model of the female count data, we used bespoke model code written for analysis with the R package TMB (Kristensen et al. 2016); however, the state-space model is the simpler of the two data models in terms of programming complexity. From TMB, we obtain the MAP estimate \(\hat{\varvec{\alpha }}_2 = (\hat{\varvec{\eta }}_2', \hat{\varvec{\gamma }}_2')'\) and its large-sample covariance matrix \(\hat{\varvec{S}}_2\). Because the survival and production parameters are not well identified in the state-space model, we used vaguely informative temporary stage I prior distributions. For \((\eta _{y0},\eta _{y1})\) and \((\eta _{a0},\eta _{a1})\), we used independent normal g-priors to induce approximately uniform priors on \(\phi _{yt}\) and \(\phi _{at}\) (Hanson et al. 2014). For the production parameters, \((\gamma _{\rho 0},\gamma _{\rho 1})\), we used a normal g-prior such that \(\rho _t\) lies within [0.01, 2] with prior probability \(\approx 0.95\). Based on other analyses, this prior covers a fairly broad range of values for the average number of females produced per female. For the remaining parameters, temporary priors were not used. In stage II, we used the linear mixed model in (10) along with the population prior distributions:

$$\begin{aligned} \begin{aligned} {[}\eta _{y0}, \eta _{y1}, \eta _{a0}, \eta _{a1}]&= \text {N}(\mathbf {0}, 10^2\varvec{I}),\\ [\gamma _{\lambda 0}, \gamma _{\lambda 1}]&= \text {N}(\mathbf {0}, 10^2\varvec{I}),\ [\gamma _{\rho 0}, \gamma _{\rho 1}] = \text {N}(\mathbf {0}, 10^2\varvec{I}) \\ [\log N_{y0}, \log N_{a0}]&= \text {N}(0, 1000^2\varvec{I}), \text { and }[\log \sigma ] \propto \text {constant (improper)}. \end{aligned} \end{aligned}$$

We also consider an additional complexity beyond the traditional IPM specification. One of the tenets of the IPM is that the multiple data models share common parameters of scientific interest, in our case the survival parameters, \(\varvec{\eta }\). It seems reasonable, however, that differences in data collection procedures or model specification might imply separate “true” values of the survival parameters; that is, \(\hat{\varvec{\eta }}_i\) is actually estimating \(\varvec{\eta }_i\ne \varvec{\eta }\). One might therefore consider each \([\varvec{y}_i|\varvec{\eta },\varvec{\gamma }_i]\) to be an imperfect source of information about \(\varvec{\eta }\), which is how meta-analysts would traditionally envision combining these separate studies. Therefore, we also fit a model with random effects \(\varvec{\theta }_i \sim \text {N}(\varvec{\eta }, \sigma ^2_\theta \varvec{I})\). If the available data suggest common survival parameters between the two data sets, then \(\tilde{\sigma }_\theta \) will be small and each stage II estimate will satisfy \(\tilde{\varvec{\theta }}_i \approx \tilde{\varvec{\eta }}\). However, if \(\tilde{\sigma }_\theta \gg 0\), then a fixed \(\varvec{\eta }\) model is probably inappropriate and there are possible biases in one or both of the data sets. Thus, \(\sigma _\theta \) serves as a regulator for assessing the degree to which the data sets support a fixed \(\varvec{\eta }\) value. For this model, we used \([\sigma _\theta ] = \text {Exponential}(1)\) in addition to the previous prior specifications.

Fig. 1

Random effect inference for Example I using two-stage and full MCMC methods. In the two-stage analyses, this was estimated using \(\tilde{\varvec{\theta }}_i = \varvec{X}_i\tilde{\varvec{\beta }} + \tilde{\varvec{u}}_i\), where the tilde represents the second-stage estimates

Fig. 2

Fixed effect and variance parameter inference for Example I using two-stage and full MCMC methods

To judge the quality of the posterior approximation, we compared our results to those presented by Besbeas and Morgan (2019), hereafter B&M. Table 2 (p. 483) of that publication provides the results of a full joint analysis for several different models and inference methods. Our specification is most similar to their “MLE KF” model, that is, a model where \(\log N_{y0}\) and \(\log N_{a0}\) are explicitly estimated. We compare our results only to that model (Fig. 3), but the other results are very similar. All of the multistage approximate fitting methods produced results very close to the B&M results (Fig. 3); a full table of the multistage results is available in Appendix B. For the two- and three-stage fixed \(\varvec{\eta }\) models, estimates of \(\varvec{\eta }\) were virtually indistinguishable from those given by B&M, in both point estimates and standard deviations. The band-reporting parameters, \(\varvec{\gamma }_1\), are also nearly identical. From stage II to stage III, \(\bar{\gamma }_{\rho 1}\) became more similar to the B&M results, while \(\bar{\gamma }_{\rho 0}\) became less similar. Although both are close to the B&M results overall, a marked improvement from stage II to III is evident in the \(\log \sigma \), \(\log N_{y0}\), and \(\log N_{a0}\) parameters, with all of them becoming nearly equivalent to the B&M estimates after stage III.

Fig. 3

Parameter estimates for the lapwing IPM model. The B&M 2019 results were taken from Table 2 of Besbeas and Morgan (2019) and represent estimates from single-stage estimation of the full model. In the three-stage RE model, \(\varvec{\theta }_i \sim \text {N}(\varvec{\eta },\sigma _{\theta }^2\varvec{I})\). Parameter estimates are also provided in online supplement Table S1

Fig. 4

Estimates of year- and age-specific survival for lapwings. The age classes are represented in the rows: adult and first year. The columns illustrate estimates from the two different models. The first column is the traditional IPM where the survival parameters are common between the data sets (i.e., \(\varvec{\eta }\) is fixed). In the second column, we modeled \(\varvec{\theta }_i \sim \text {N}(\varvec{\eta }, \sigma _{\theta }^2\varvec{I})\), where \(\sigma _{\theta }\) is estimated in the second stage

We now turn our attention to the random effects version of the model. The estimate \(\tilde{\sigma }_\theta = 0.017\) (0.006–0.049) is small, but not trivial. Thus, there is some evidence that each data set contains different information about the true survival process. The population-level effect of frost days on survival becomes slightly steeper and more uncertain for young animals (Fig. 3). For adult animals, frost days have less effect on survival, but this is, again, more uncertain. The increase in uncertainty and the change in the frost effect are plainly evident when examining survival on the natural scale (Fig. 4). Although they did not fit separate survival processes, Goudie et al. (2019) noted this discrepancy as well in their two-stage MCMC analysis of these same data.

5 Discussion

We have proposed a multistage method for estimating parameters in a broad class of BHMs. Our approach is computationally efficient and flexible, drawing inspiration from traditional meta-analysis, where individual studies using differing methods contribute unit-level parameter estimates that are then combined using a CIHM format. In stage I, practitioners can use software expressly designed for a particular data model. We used traditional GLM/GAM software in the first example, and a mixture of production software (MARK/RMark) and model-templating software (TMB) in the IPM example. This reduces development time because software expressly designed for the data at hand is being used; analysts do not have to create bespoke MCMC code that will often be inferior to production-level software in ease of use and computational speed. In addition, as with other two-stage methods, the stage I analyses are “embarrassingly parallel,” so they can be executed in parallel if necessary, further increasing efficiency. In stage II, the multivariate linear mixed model is very quick to fit because the dimension of the problem is substantially reduced relative to the data models; the linear mixed model represents a compression of the information in the various data sets about the parameters. Finally, the generalized delta method allows one to revisit the estimates of distinct parameters, which is usually not possible in traditional meta-analysis because the raw data are not available.

This efficiency comes at a cost, however, because it relies on approximations in two places: the normal approximation in stage I and the normal approximation of the stage II posterior distribution. We do not view this as a detriment because the method is designed for computationally challenging situations, which usually involve a large amount of data and hence a substantial amount of information about the parameters; that is, exactly the situations for which large-sample approximations were developed. Both of the models in our examples could be fitted readily in a single-stage BHM analysis. Even so, the multistage method produced estimates very close to the “gold standard” single-stage MCMC estimates. The two-stage approach may not be optimal in all settings; thus, if one is capable of fitting the full CIHM in one step, that would be the recommended course of action. We suggest our two-stage approach for instances where computation or development of software to fit the full model at once is impractical.

Even though the approximations produced good results in our examples, we should examine their seriousness and the options practitioners have for checking the degree of error they introduce. The first approximation is the Gaussian approximation of the data-level likelihood shape in the first stage. This approximation is justified by the standard large-sample results that coincide for MLEs and posterior modes (MAPs) and means. There are some structural situations with stage I models where the assumptions underlying those results would be questionable; Gelman et al. (2013, Section 4.2) present several scenarios where the assumptions of large-sample theory fail to be met. Most of these situations can be fixed with appropriate reparameterization or prior selection, for example, specifying parameters on the log or logit scale or using the Gaussian temporary prior in (6).

If the data-level models are sufficiently regular, one may still want to assess whether the sample size is sufficient to warrant a Gaussian approximation. We note that, in a variety of previous studies, this assumption has often worked well with little technical justification (e.g., Geyer 2005; Goudie et al. 2019; Scott et al. 2016). If the validity of the MLE or MAP estimates and the Hessian-derived covariance are in question, one can use bootstrap bias and variance corrections to improve the results (Geyer 2005; Scott et al. 2016), or increase the per-unit data size by combining the original data partitions where appropriate. To ascertain whether the sample size is too small for the normal density to be an appropriate shape, one could also perform an MCMC analysis of some of the smallest units to see how the posterior density behaves. If the approximating normal posterior is used as the proposal distribution in a simple independence Metropolis–Hastings algorithm, then the acceptance rate can be used to judge the closeness of the approximation; for example, an acceptance rate of 0.9 (i.e., 90% of proposals accepted) would imply that the stage I approximation is very close to the actual stage I posterior shape (a code sketch of this diagnostic follows the mixture model below). Michelot et al. (2019) use this type of diagnostic for assessing the adequacy of Euler approximations for Langevin diffusion models. Because the partitions are designed such that the individual data models are relatively easy to evaluate, this should pose little challenge for at least a few small partitions. If, after diagnostic checking, a normal approximation appears inappropriate, Gelman et al. (2013) suggest a normal mixture,

$$\begin{aligned} {[}\varvec{y}_i|\varvec{\vartheta }_i] \approx \sum _{k=1}^{K_i}\omega _{ik}\text {N}(\varvec{\vartheta }_{i}|\varvec{\vartheta }_{ik}^*, \varvec{S}_{ik}^*), \end{aligned}$$

where \(\sum _k \omega _{ik} = 1\) and \(\{\varvec{\vartheta }_{ik}^*, \varvec{S}_{ik}^*\}\) are selected in an advantageous way in stage I. If \(\varvec{\theta }_i \sim \text {N}(\varvec{\mu }, \varvec{\Sigma })\), then we can follow the same development as in Sect. 3.2 to obtain a stage II model that is a mixture of linear mixed models. That is, the approximate posterior would be

$$\begin{aligned} {[}\varvec{\mu },\varvec{\sigma }, \varvec{\eta }, \varvec{\gamma }, \varvec{u}|\varvec{y}] \propto \prod _{i=1}^n\left( \sum _{k=1}^{K_i}\omega _{ik}\text {N}(\varvec{\vartheta }_{ik}^*|\varvec{X}_i\varvec{\beta }+\varvec{Z}_i\varvec{u}_{ik}, \varvec{S}_{ik}^*)\right) \text {N}(\varvec{u}|\mathbf {0},\varvec{\Sigma }) [\varvec{\gamma }][\varvec{\mu }, \varvec{\sigma }][\varvec{\eta }]. \end{aligned}$$

Although it is no longer a traditional linear mixed model, it is still relatively easy to fit because the weights are known.
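
The acceptance-rate diagnostic described earlier is simple to implement. Below is a sketch for one small unit, assuming a hypothetical function `log_post(theta)` that evaluates the stage I log posterior up to a constant, together with the stage I moments `theta_hat` and `S_hat`.

```r
## Independence Metropolis-Hastings diagnostic for the stage I normal
## approximation; an acceptance rate near 1 indicates a close approximation.
library(mvtnorm)
set.seed(1)
n_iter <- 5000
theta_cur <- theta_hat
lw_cur <- log_post(theta_cur) -
  dmvnorm(theta_cur, theta_hat, S_hat, log = TRUE)   # log importance weight
accept <- 0
for (k in seq_len(n_iter)) {
  theta_prop <- as.numeric(rmvnorm(1, theta_hat, S_hat))
  lw_prop <- log_post(theta_prop) -
    dmvnorm(theta_prop, theta_hat, S_hat, log = TRUE)
  if (log(runif(1)) < lw_prop - lw_cur) {            # MH accept/reject
    theta_cur <- theta_prop; lw_cur <- lw_prop; accept <- accept + 1
  }
}
accept / n_iter   # e.g., ~0.9 implies a very accurate approximation
```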

In stage II, when using the generalized delta method for estimating \(\varvec{\theta }\) and the stage III method for reassessing \(\varvec{\gamma }\), we make the additional assumption that \([\varvec{\mu },\varvec{\sigma },\varvec{\eta },\varvec{\gamma }|\varvec{y}]\) is approximately normal. Given that we have already assumed the stage I posteriors are approximately normal, this adds little new error to the analysis: for a given value of the random effect variance parameters, \(\varvec{\sigma }\), with normal random effects and the original normal approximation, the stage II posterior is multivariate normal (assuming normal priors on \(\varvec{\beta }\)). However, if needed, a bootstrap, MCMC, or another method could be used for the stage II inference instead of the MAP and Hessian.

We developed the proposed multistage method under the assumptions of the CIHM (1); however, we can relax some of the independence assumptions at the data level, and the method is still valid in its current form. First, the data do not need to be strictly independent at the unit level. That is, the multistage method can still be applied when the data models can be partitioned as

$$\begin{aligned} \varvec{y}\sim \prod _i[\varvec{y}_i|\varvec{y}_{-i}, \varvec{\theta }_i, \varvec{\gamma }_i, \varvec{\eta }], \end{aligned}$$

where \(\varvec{y}_{-i} = \{\varvec{y}_1,\dots ,\varvec{y}_{i-1},\varvec{y}_{i+1},\dots ,\varvec{y}_n\}\). Having a meaningful partition of the data is not the important part of any multistage analysis; rather, it is the ability to partition the data-level model that matters (Hooten et al. 2021). In stage I, these dependent data models can be fit individually in the same way.

Hierarchical Bayesian models have brought a wealth of analytical capability for extracting scientific signal out of complex and noisy data. Moreover, they provide an honest assessment of uncertainty by accounting for many different forms of stochasticity and relationships in the observable data. By assembling relatively simple components, a complex model can emerge. With the advent of modern data collection capabilities, default methods for inference have started to become obsolete. Gelfand and Ghosh (2015) note that BHMs have been a great benefit to science, but the current rate of data acquisition and model complexity are “beginning to stretch our computational capabilities” and future inference will have to be the result of “an artful mixture of model specification and approximation.” Hierarchical conditioning provides a beneficial way to build large complex models; thus, it is sensible that inferential computing for these models may also be best accomplished in a hierarchical fashion (McCaslin et al. 2021).