1 Introduction

Importance sampling is a class of procedures for computing expectations using draws from a proposal distribution that is different from the distribution over which the expectation was originally defined (Robert and Casella 2013). A primary field of application for importance sampling is Bayesian statistics, where we commonly sample from the posterior distribution of a probabilistic model because the distribution is not available in closed form. After generating a sample from the posterior distribution, it is commonplace to use it as a proposal distribution for computing a large number of expectations over closely related distributions for tasks such as the bootstrap and leave-one-out cross-validation (Gelfand et al. 1992; Gelfand 1996; Peruggia 1997; Epifani et al. 2008; Vehtari et al. 2017; Giordano et al. 2019). However, in the presence of influential observations in the data, or target distributions that are difficult to approximate, such importance sampling procedures may be inefficient or inaccurate. To avoid explicitly generating Monte Carlo draws from each closely related distribution, it is desirable to have adaptive importance sampling methods that can exploit the information in the already generated posterior draws in a computationally efficient manner.

Fig. 1 Bivariate density plots from the posterior distribution of the logistic regression model for the Ovarian data. The posterior is 3075-dimensional and highly multimodal.

The contributions of this paper can be summarized as follows:

  • We present a novel implicitly adaptive importance sampling method. The method adapts the importance sampling proposal distribution implicitly by applying affine transformations to a Monte Carlo sample from the proposal distribution.

  • We propose specific adaptations and estimators for simple Monte Carlo sampling as well as standard and self-normalized importance sampling. We show that our proposed double adaptation framework for self-normalized importance sampling significantly improves the accuracy of existing adaptive importance sampling methods in many settings.

  • We also propose to use an existing importance sampling convergence diagnostic as an acceptance and stopping criterion for the adaptive method, and discuss its applicability beyond importance sampling.

The proposed method does not require any tuning from the user and is easily automated and applied to a variety of different problems. Because it can be used with arbitrary proposal distributions, it is most beneficial for complex distributions that would be difficult to capture with a parametric form. We demonstrate its usefulness with Bayesian leave-one-out cross-validation (LOO-CV) and the probabilistic programming framework Stan (Carpenter et al. 2017).

As an illustrative example, we show a Bayesian model posterior that is used in the experiments in Sect. 3.4. Figure 1 shows bivariate density plots of the marginal distributions of three pairs of parameters in the full data posterior distribution. In total, the model has 3075 parameters. We can use the Monte Carlo sample from the full data posterior as an importance sampling proposal distribution for computing cross-validation scores over posterior distributions where single observations have been left out, and apply our proposed adaptive method to improve accuracy at a small additional computational cost. Because the posterior is high-dimensional and multimodal, standard adaptive methods that use parametric proposal distributions may require multiple proposal distributions to be efficient, which is computationally more costly and further complicates the choice of appropriate proposal distributions. The \(\lambda \) parameters in Fig. 1 are local shrinkage parameters of a logistic regression model with a regularized horseshoe prior on the regression coefficients (Piironen and Vehtari 2017b). The parameters are constrained to be positive, so they are sampled in logarithmic space with dynamic Hamiltonian Monte Carlo (Hoffman and Gelman 2014; Betancourt 2017). More details are given in Sect. 3.4.

1.1 Overview of importance sampling

Let us consider an inference problem where a vector of unknown parameters has a probability density function \(p (\varvec{\theta })\). Our task is to estimate integrals of the form

$$\begin{aligned} \mu = \mathbb {E}_p [h(\varvec{\theta })] = \int h(\varvec{\theta }) p (\varvec{\theta }) \mathrm {d} \varvec{\theta } , \end{aligned}$$
(1)

where \(h (\varvec{\theta })\) is some function of the parameters \(\varvec{\theta }\) that is integrable with respect to \(p (\varvec{\theta })\). These kinds of integrals are ubiquitous in Bayesian inference, where quantities of interest are computed as expectations over the inferred posterior distribution of the model. However, the same formulation is used for many other problems, such as rare event estimation (Rubino and Tuffin 2009), optimal control (Kappen and Ruiz 2016), and signal processing (Bugallo et al. 2015). Using a set of independent draws \(\{ \varvec{\theta }^{(s)} \}_{s = 1}^S\) from \(p (\varvec{\theta })\), the simple Monte Carlo estimator of \(\mu \) is

$$\begin{aligned} \hat{\mu }_{\text {MC}} = \frac{1}{S} \sum _{s = 1}^S h (\varvec{\theta }^{(s)}) \, , \, \text {when} \, \, \varvec{\theta }^{(s)} \sim p (\varvec{\theta }) . \end{aligned}$$

In this work, we use the term draw to represent a single \(\varvec{\theta }^{(s)}\), and the term sample to represent a set of draws \(\{ \varvec{\theta }^{(s)} \}_{s = 1}^S\). If the expectation \(\mu \) exists, the simple Monte Carlo estimator is a consistent and unbiased estimator of \(\mu \), converging towards \(\mu \) asymptotically by the strong law of large numbers. If also the expectation of \(h^2\) is finite, the central limit theorem holds, and the error of the simple Monte Carlo estimator decreases at the asymptotic rate \(\mathscr {O} (S^{-1/2})\). Given finite variance, a similar convergence rate holds for uniformly ergodic Markov chains (e.g. Roberts and Rosenthal 2004).
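
As a concrete illustration (in R, which is also used for the experiments in Sect. 3), the simple Monte Carlo estimator is just a sample mean; the toy target distribution and function below are our own choices, not from the paper.

```r
# Simple Monte Carlo: estimate mu = E[theta^2] for theta ~ N(0, 1), true value 1
set.seed(1)
S <- 1e5
theta <- rnorm(S)            # S independent draws from p
mu_hat_mc <- mean(theta^2)   # error decreases at rate O(S^(-1/2))
```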

In some cases, it is not possible or it is expensive to generate draws from \(p (\varvec{\theta })\), but the expectation \(\mu \) is still of interest. In this case, we may generate a sample from a proposal distribution \(g (\varvec{\theta })\) and compute the expectation of Eq. (1) using the standard importance sampling estimator

$$\begin{aligned} \mathbb {E}_p [h(\varvec{\theta })]\approx & {} \hat{\mu }_{\text {IS}} = \frac{1}{S} \sum _{s = 1}^S \frac{ p (\varvec{\theta }^{(s)} )}{g (\varvec{\theta }^{(s)})} h (\varvec{\theta }^{(s)}) \nonumber \\= & {} \frac{1}{S} \sum _{s = 1}^S w^{(s)} h (\varvec{\theta }^{(s)}) , \quad \text {when} \, \, \varvec{\theta }^{(s)} \sim g (\varvec{\theta }) . \end{aligned}$$
(2)

Here \(w^{(s)}\) are called importance weights or importance ratios, and they measure the mismatch between \(p (\varvec{\theta })\) and \(g (\varvec{\theta })\) for a specific draw \(\varvec{\theta }^{(s)}\). In principle, the proposal distribution can be any probability distribution that is positive whenever \(p (\varvec{\theta }) h (\varvec{\theta }) \ne 0\); in particular, it suffices that its support covers the support of the target distribution \(p (\varvec{\theta })\). The standard importance sampling estimator \(\hat{\mu }_{\text {IS}}\) is also a consistent and unbiased estimator of \(\mu \) as long as the expectation \(\mu \) exists. Its variance depends largely on the choice of the proposal distribution \(g (\varvec{\theta })\). For a good choice, the variance can be smaller than the variance of the simple Monte Carlo estimator, but it can also be much larger, or even infinite, if the choice is less ideal. In the context of this paper, we consider the simple Monte Carlo estimator simply as a special case of standard importance sampling, where the proposal distribution is \(p (\varvec{\theta })\), the distribution over which the expectation is defined.

A commonly used alternative estimator is the self-normalized importance sampling (SNIS) estimator

$$\begin{aligned} \hat{\mu }_{\text {SNIS}}= & {} \frac{ \sum _{s = 1}^S \frac{ p (\varvec{\theta }^{(s)} )}{g (\varvec{\theta }^{(s)})} h (\varvec{\theta }^{(s)})}{\sum _{s = 1}^S \frac{ p (\varvec{\theta }^{(s)} )}{g (\varvec{\theta }^{(s)})}} \nonumber \\= & {} \frac{ \sum _{s = 1}^S w^{(s)} h (\varvec{\theta }^{(s)})}{\sum _{s = 1}^S w^{(s)}} , \quad \text {when} \, \, \varvec{\theta }^{(s)} \sim g (\varvec{\theta }) . \end{aligned}$$
(3)

This estimator is more generally applicable, because it can be used even if the normalization constants of the densities \(p(\varvec{\theta })\) or \(g(\varvec{\theta })\) are not known. The self-normalized estimator is also consistent, but has a small bias of order \(\mathscr {O} (1/S)\) (Owen 2013). All three Monte Carlo estimators introduced above are consistent, and thus converge to the true value \(\mu \) asymptotically as \(S \rightarrow \infty \), provided that \(\mu \) itself exists. However, there are many cases where these estimators have poor pre-asymptotic behaviour despite their asymptotically guaranteed convergence (Vehtari et al. 2019c). That is, convergence for any achievable finite number of draws S may be so poor that we cannot obtain sufficiently accurate results in reasonable time. We discuss this issue in Sect. 2.3.

To unify the notation and nomenclature of the different Monte Carlo estimators, we define the ratio of the target density \(p (\varvec{\theta } )\) and the proposal density \(g (\varvec{\theta } )\) as the common importance weights because they do not depend on the function \(h (\varvec{\theta } )\) whose expectation is computed:

$$\begin{aligned} w = w(\varvec{\theta }) = \frac{p(\varvec{\theta })}{g(\varvec{\theta })} . \end{aligned}$$
(4)

Analogously, we define the product

$$\begin{aligned} v= v(\varvec{\theta }) = \frac{p(\varvec{\theta })}{g(\varvec{\theta })} h(\varvec{\theta }) \end{aligned}$$
(5)

as the expectation-specific importance weights for the expectation \(\mathbb {E}_p [h(\varvec{\theta })]\). With this notation, both the simple Monte Carlo and standard importance sampling estimators are defined as the sample mean of the expectation-specific weights \(v\). On the other hand, the self-normalized importance sampling estimator in Eq. (3) is defined as the ratio of the sample means of \(v\) and w.
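
To make the unified notation concrete, the following R sketch computes the common weights w of Eq. (4), the expectation-specific weights v of Eq. (5), and the resulting estimators; the standard normal target, Student-\(t_3\) proposal, and \(h(\theta ) = \theta ^2\) are our own illustrative choices.

```r
set.seed(1)
S <- 1e5
theta <- rt(S, df = 3)                  # draws from the proposal g = Student-t3
w <- dnorm(theta) / dt(theta, df = 3)   # common importance weights, Eq. (4)
v <- w * theta^2                        # expectation-specific weights, Eq. (5)
mu_hat_is   <- mean(v)                  # standard IS estimator, Eq. (2)
mu_hat_snis <- sum(v) / sum(w)          # self-normalized estimator, Eq. (3)
```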

1.2 Multiple importance sampling

In this section, we briefly discuss multiple importance sampling, which forms the basis for many existing adaptive importance sampling techniques (Cornuet et al. 2012; Martino et al. 2015; Bugallo et al. 2017). Multiple importance sampling refers to the case of sampling independently from many proposal distributions (Hesterberg 1995; Veach and Guibas 1995; Owen and Zhou 2000). Let us denote the J proposal distributions as \(\{ g_1, \ldots , g_J \}\) and the number of draws from each as \(\{ S_1, \ldots , S_J \}\) such that \(\sum _{j = 1}^J S_j = S\). The multiple importance sampling estimator is a weighted combination of the individual importance sampling estimators:

$$\begin{aligned} \hat{\mu }_{\text {MIS}}= & {} \sum _{j = 1}^J \frac{1}{S_j} \sum _{s = 1}^{S_j} \beta _j (\varvec{\theta }^{(j, s)}) \frac{h (\varvec{\theta }^{(j, s)}) p(\varvec{\theta }^{(j, s)})}{g_j (\varvec{\theta }^{(j, s)})} , \\&\quad \text {when} \, \, \varvec{\theta }^{(j,s)} \sim g_j (\varvec{\theta }) , \end{aligned}$$

where \(\{ \beta _j \}_{j = 1}^J\) is a partition of unity, i.e. for every \(\varvec{\theta }\), \(\beta _j (\varvec{\theta }) \ge 0\) and \(\sum _{j = 1}^J \beta _j (\varvec{\theta }) = 1\). With different ways of choosing the weighting functions \(\beta _j\), one can vary between locally emphasizing one of the proposal distributions \(g_j\), or considering them in a balanced way for every value of \(\varvec{\theta }\).

The weighting functions are commonly chosen using a balance heuristic

$$\begin{aligned} \beta _j (\varvec{\theta }) = \frac{S_j g_j (\varvec{\theta })}{\sum _{k = 1}^{J} S_k g_k (\varvec{\theta })} , \end{aligned}$$

whose variance is provably no larger than the variance of any other weighting scheme plus a term that goes to zero as the smallest \(S_j \rightarrow \infty \) (Veach and Guibas 1995). The balance heuristic is also a quite natural way of combining the draws from different proposal distributions, as the importance weights for all draws are computed as if they were sampled from the same mixture distribution \(g_{\alpha } (\varvec{\theta })\):

$$\begin{aligned} w_{\text {DM-MIS}}^{(j,s)} = \frac{ p (\varvec{\theta }^{(j,s)} )}{g_{\alpha } (\varvec{\theta }^{(j,s)})} = \frac{ p (\varvec{\theta }^{(j,s)} )}{\sum _{j = 1}^J \alpha _j g_j (\varvec{\theta }^{(j,s)})} , \, \, \alpha _j = \frac{S_j}{S} .\nonumber \\ \end{aligned}$$
(6)

With these weights, the multiple importance sampling estimator is then computed using the usual equations of standard [Eq. (2)] or self-normalized [Eq. (3)] importance sampling.

Weights computed using Eq. (6) are sometimes called deterministic mixture weights, whereas an alternative is to only evaluate a single proposal distribution in the denominator:

$$\begin{aligned} w_{\text {s-MIS}}^{(j,s)} = \frac{ p (\varvec{\theta }^{(j,s)} )}{ g_j (\varvec{\theta }^{(j,s)})} \, , \, \text {when} \, \, \varvec{\theta }^{(j,s)} \sim g_j (\varvec{\theta }) . \end{aligned}$$

Deterministic mixture weighting requires more evaluations of the proposal densities, but the variance of the resulting estimator is lower (Elvira et al. 2019). There are techniques for improving the efficiency of the balance heuristic (Havran and Sbert 2014; Elvira et al. 2015, 2016; Sbert et al. 2016; Sbert and Havran 2017; Sbert and Elvira 2019). In this work, we use the balance heuristic because of its simplicity and empirically demonstrated good performance.
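
As a small illustration of the balance heuristic, the R sketch below computes the deterministic mixture weights of Eq. (6) for two Gaussian proposals; the target, the proposals, and the function are our own choices.

```r
set.seed(1)
# Target N(0, 1); two Gaussian proposals with means -2 and 2, S_j = 500 each
Sj <- c(500, 500)
theta <- c(rnorm(Sj[1], mean = -2), rnorm(Sj[2], mean = 2))
alpha <- Sj / sum(Sj)
# Balance heuristic: weight every draw against the same mixture density
g_mix <- alpha[1] * dnorm(theta, mean = -2) + alpha[2] * dnorm(theta, mean = 2)
w_dm <- dnorm(theta) / g_mix         # deterministic mixture weights, Eq. (6)
mu_hat_mis <- mean(w_dm * theta^2)   # standard IS estimate of E[theta^2]
```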

1.3 Adaptive importance sampling

Adaptive importance sampling is a general term that refers to an iterative process for updating a single or multiple proposal distributions to approximate a given target distribution. The details of the adaptation can vary in multiple ways, but most methods consist of three steps: (i) generating draws from the proposal distribution(s), (ii) computing the importance weights of the draws, and (iii) adapting the proposal distribution(s).

Adaptive importance sampling methods can be categorized in multiple ways, for example based on the type and number of proposal distributions, weighting scheme, and adaptation strategy. Most methods use one or more parametric proposal distributions, such as Gaussian or Student-t distributions. Typical adaptation strategies are resampling or moment estimation based on the importance weights. A good review of many different methods and their classification is presented in Bugallo et al. (2017). For discussion about the convergence of adaptive importance sampling methods, see, for example, Feng et al. (2018) and Akyildiz and Míguez (2019).

Some notable recent algorithms are adaptive multiple importance sampling (AMIS; Cornuet et al. 2012) and adaptive population importance sampling (APIS; Martino et al. 2015), which both use multiple proposal distributions, and weighting based on deterministic mixture weights. For the adaptation, they rely on weighted moment estimation to adapt the mean (and possibly covariance) of the proposal distributions. Population Monte Carlo algorithms are another class of adaptive importance sampling methods, which typically use weighted resampling as the means of adaptation (Cappé et al. 2004, 2008; Elvira et al. 2017).

2 Importance weighted moment matching

In this section, we present our proposed implicit adaptive importance sampling method, importance weighted moment matching (IWMM). We start from the assumption that we have a Monte Carlo sample and we are computing an expectation of some function as in Eq. (1). The sample can be from an arbitrary importance sampling proposal distribution, or from the actual distribution over which the expectation is defined. As with any adaptive importance sampling method, our motivation is that the accuracy of the estimate based on the current sample is not sufficient. The proposed framework is most beneficial when the sample is from a relatively good, complex proposal distribution that has no closed form and is expensive to sample from. Situations like this arise often in Bayesian inference, when a Monte Carlo sample from the full data posterior distribution has already been generated, and model evaluations using cross-validation or bootstrap are of interest (Gelfand et al. 1992; Gelfand 1996; Peruggia 1997; Epifani et al. 2008; Vehtari et al. 2017; Giordano et al. 2019). In this case, implicit adaptation of the proposal distribution can exploit the existing sample and improve Monte Carlo accuracy at a small computational cost.

There are similar approaches that adapt proposal distributions nonparametrically using, e.g. kernel density estimates (Zhang 1996). With the implicit adaptation, we avoid both the resampling and density estimation steps. Unbiased path sampling by Rischard et al. (2018) can also use arbitrary proposal distributions, but their approach requires a considerable amount of tuning from the user.

2.1 Target of adaptation

Let us recap the three general steps of adaptive importance sampling, which are (i) generating draws from the proposal distribution(s), (ii) computing the importance weights of the draws, and (iii) adapting the proposal distribution(s) based on the weights. In our proposed method, step (i) is omitted because we do not resample during adaptation, and instead use the same sample that is transformed directly. For this reason, the method can be used with any Monte Carlo sample whose probability density function is known.

For step (ii), unlike most adaptive importance sampling methods, we are not primarily interested in perfectly adapting the proposal distribution to the distribution over which the expectation is defined, which is often called the target distribution in the importance sampling literature. While this is a reasonable goal in many cases, in Sects. 3.1 and 3.2 we show examples where sampling from the target distribution itself leads to extremely biased estimates. Instead, we are mainly interested in adapting to the theoretical optimal proposal distribution of a given expectation \(\mathbb {E}_p [h(\varvec{\theta })]\), which depends on three things: the distribution \(p (\varvec{\theta })\) over which the expectation is defined, the function \(h (\varvec{\theta })\) whose expectation is computed, and the Monte Carlo estimator that is used. For the standard importance sampling estimator, the optimal proposal distribution is proportional to (Kahn and Marshall 1953)

$$\begin{aligned} g_{\text {IS}}^{\text {opt}} (\varvec{\theta }) \propto p (\varvec{\theta }) \, | h(\varvec{\theta }) | , \end{aligned}$$
(7)

and for the self-normalized importance sampling estimator, it is (Hesterberg 1988)

$$\begin{aligned} g_{\text {SNIS}}^{\text {opt}} (\varvec{\theta }) \propto p (\varvec{\theta }) \, | h(\varvec{\theta }) - \mathbb {E}_p [h(\varvec{\theta })] | . \end{aligned}$$
(8)

The more complicated form for the self-normalized estimator is due to the requirement of accurate estimation of both the numerator and denominator of Eq. (3) simultaneously.

In order to approach the optimal proposal distribution, we define the importance weights for adaptation as follows: When using the standard importance sampling (or simple Monte Carlo) estimator, we use the absolute values of the expectation-specific weights of Eq. (5) for adaptation, because they quantify the mismatch between the current proposal and the optimal proposal density. For the self-normalized importance sampling estimator, we recommend separate adaptations for the numerator and the denominator, using the absolute values of the expectation-specific weights and the common importance weights, respectively.

The results of the two adaptations are combined with multiple importance sampling to approximate the optimal proposal density of Eq. (8). To combine the two adaptations into an efficient proposal distribution, we use an approximation based on superimposing a simpler distribution on top of the optimal proposal:

$$\begin{aligned} g_{\text {SNIS}}^{\text {split}} (\varvec{\theta }) \propto | h(\varvec{\theta }) | p (\varvec{\theta }) + \mathbb {E}_p [h(\varvec{\theta })] p (\varvec{\theta }) . \end{aligned}$$
(9)

We call this the split proposal density, because it splits the piecewise defined density of Eq. (8) into two clear components. The first component is proportional to Eq. (7) and is thus approximated with the adaptation using the absolute expectation-specific weights. The second component is proportional to \(p (\varvec{\theta })\) and is reached with the adaptation using the common weights.

Equation (9) is a convenient approximation to the optimal proposal of self-normalized importance sampling: it has similar tails, but is simpler to sample from because it consists of two clear components, whereas the density in Eq. (8) can easily be multimodal even when the expectation is defined over a unimodal distribution. The drawback of this approximation is that it places unnecessary probability mass in areas where \(h(\varvec{\theta }) \approx \mathbb {E}_p [h(\varvec{\theta })] \), thus losing some efficiency. However, generally the more distinct \(p (\varvec{\theta })\) is from \(p (\varvec{\theta }) |h (\varvec{\theta })|\), the smaller \(\mathbb {E}_p [h(\varvec{\theta })]\) becomes, and hence the approximation becomes closer to the optimal form. Fortunately, these are the cases when adaptive importance sampling techniques are most needed. In Fig. 4 in “Appendix C.1”, we show an example of this phenomenon.

Because Eq. (9) is a sum of two terms, it is essentially a multiple importance sampling proposal distribution with two components. The number of Monte Carlo draws that should be allocated to each component depends on the properties of \(p(\varvec{\theta })\) and \(h(\varvec{\theta })\). A conservative choice is to allocate the same number to both terms. When \(h(\varvec{\theta })\) is nonnegative, this is actually the optimal allocation, because both terms in Eq. (9) integrate to \(\mathbb {E}_p [h(\varvec{\theta })]\). With this allocation, multiple importance sampling with the balance heuristic is safe in the sense that the asymptotic variance of the estimator is never larger than 2 times the variance of standard importance sampling using the better component (He and Owen 2014). We note that the double adaptation and combination via Eq. (9) is also possible and beneficial with existing parametric adaptive importance sampling methods. We demonstrate this in the experiments section.
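
To illustrate how the two adaptations are combined, the sketch below assumes the two adapted proposals are available as explicit densities (here placeholder functions); in the actual method they are the implicit densities of the transformed samples, known up to a common constant.

```r
# Combine draws from the two components of the split proposal, Eq. (9), with
# equal allocation and balance-heuristic weights, then estimate with SNIS.
# theta1: draws adapted towards |h(theta)| p(theta); theta2: towards p(theta).
# dg1, dg2: the component densities; log_p: (unnormalized) log target density.
snis_split <- function(theta1, theta2, dg1, dg2, log_p, h) {
  theta <- c(theta1, theta2)
  g_mix <- 0.5 * dg1(theta) + 0.5 * dg2(theta)   # equal-allocation mixture, Eq. (6)
  w <- exp(log_p(theta)) / g_mix                 # common mixture weights
  sum(w * h(theta)) / sum(w)                     # SNIS estimate, Eq. (3)
}
```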

2.2 Affine transformations

Step (iii) of adaptive importance sampling is the adaptation of the current proposal distribution(s). Two commonly used adaptation techniques are weighted resampling and moment estimation (Bugallo et al. 2017). In parametric adaptive importance sampling methods, weighted moment estimation is used to update the location and scale parameters of the parametric proposal distribution. We employ a similar idea, but instead directly transform the Monte Carlo sample using an affine transformation. This enables adaptation of proposal distributions which do not have a location-scale parameterisation or even a closed form representation. The only requirement is that the (possibly unnormalized) probability density is computable. This property makes it useful in many practical situations where a Monte Carlo sample has been generated with probabilistic programming tools, or other Markov chain Monte Carlo methods. By using a reversible transformation, we can compute the probability density function of the transformed draws in the adapted proposal with the Jacobian of the transformation.

In this work, we consider simple affine transformations, because both the transformation and its Jacobian are computationally cheap. Consider approximating the expectation \(\mathbb {E}_p [h(\varvec{\theta })]\) with a set of draws \(\{ \varvec{\theta }^{(s)} \}_{s = 1}^S\) from an arbitrary proposal distribution \(g (\varvec{\theta })\) (which can also be \(p (\varvec{\theta })\) itself). For a specific draw \(\varvec{\theta }^{(s)}\), a generic affine transformation consists of a square matrix \(\mathbf {A}\) representing a linear map, and a translation vector \(\mathbf {b}\):

$$\begin{aligned} T: \varvec{\theta }^{(s)} \mapsto \mathbf {A} \varvec{\theta }^{(s)} + \mathbf {b} =: \breve{\varvec{\theta }}^{(s)}. \end{aligned}$$
(10)

Because the transformation is affine and the same for all draws, the new implicit density \(g_T\) evaluated at every \(\breve{\varvec{\theta }}^{(s)}\) changes by a constant, namely the inverse of the determinant of the Jacobian, \(|\mathbf {J}_T|^{-1} = \left| \frac{\mathrm {d} T (\varvec{\theta })}{\mathrm {d} \varvec{\theta }} \right| ^{-1}\). Note that to compute the inverse, the matrix \(\mathbf {A}\) must be invertible. After the transformation, the implicit probability density of the adapted proposal \(g_T\) for the transformed draw \(\breve{\varvec{\theta }}^{(s)}\) is

$$\begin{aligned} g_T(\breve{\varvec{\theta }}^{(s)}) = g(\varvec{\theta }^{(s)}) |\mathbf {J}_T|^{-1} . \end{aligned}$$

If the original proposal density g was known only up to an unknown normalizing constant, the adapted proposal \(g_T\) has that same unknown constant. This is crucial in order to be able to use the split proposal distribution of Eq. (9) for self-normalized importance sampling.
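
In log space this bookkeeping amounts to subtracting a single constant; a minimal R sketch:

```r
# Implicit log-density of draws transformed by T(theta) = A %*% theta + b:
# log g_T(T(theta)) = log g(theta) - log|det(A)|, the same constant for all draws.
log_g_T <- function(log_g_draws, A) {
  log_g_draws - as.numeric(determinant(A, logarithm = TRUE)$modulus)
}
```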

To reduce the mismatch between a Monte Carlo sample \(\{ \varvec{\theta }^{(s)} \}_{s = 1}^S\) and a given adaptation target, we consider three affine moment matching transformations with varying degrees of simplicity. The transformations are similar in spirit to the warp transformations used for bridge sampling by Meng and Schilling (2002). The importance weights used in the transformations can be either the common weights or the expectation-specific weights, depending on what the adaptation target is, as discussed in Sect. 2.1. We define the first transformation, \(T_1\), to match the mean of the sample to its importance weighted mean:

$$\begin{aligned} \breve{\varvec{\theta }}^{(s)}= & {} T_1 ( \varvec{\theta }^{(s)} ) = \varvec{\theta }^{(s)} - \overline{\varvec{\theta }} + \overline{\varvec{\theta }}_{w}, \\ \overline{\varvec{\theta }}= & {} \frac{1}{S} \sum _{s=1}^S \varvec{\theta }^{(s)} , \\ \overline{\varvec{\theta }}_{w}= & {} \frac{\sum _{s=1}^S w^{(s)} \varvec{\theta }^{(s)}}{\sum _{s=1}^S w^{(s)}} . \end{aligned}$$

We define \(T_2\) to match the marginal variance in addition to the mean:

$$\begin{aligned} \breve{\varvec{\theta }}^{(s)}= & {} T_2 ( \varvec{\theta }^{(s)} ) = \mathbf {v}_{w}^{1/2} \circ \mathbf {v}^{-1/2} \circ ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }} ) + \overline{\varvec{\theta }}_{w}, \\ \mathbf {v}= & {} \frac{1}{S} \sum _{s=1}^S ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }} ) \circ ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }} ) , \\ \mathbf {v}_{w}= & {} \frac{\sum _{s=1}^S w^{(s)} ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }}_{w} ) \circ ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }}_{w} ) }{\sum _{s=1}^S w^{(s)}} , \end{aligned}$$

where \(\circ \) refers to a pointwise product of the elements of two vectors. The final transformation, \(T_3\), matches the covariance and the mean:

$$\begin{aligned} \breve{\varvec{\theta }}^{(s)}= & {} T_3 ( \varvec{\theta }^{(s)} ) = \mathbf {L}_{w} \mathbf {L}^{-1} ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }} ) + \overline{\varvec{\theta }}_{w}, \\ \mathbf {L} \mathbf {L}^{\mathsf {T}}= & {} \varvec{\Sigma } = \frac{1}{S} \sum _{s=1}^S ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }} ) ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }} )^{\mathsf {T}} , \\ \mathbf {L}_{w} \mathbf {L}_{w}^{\mathsf {T}}= & {} \varvec{\Sigma }_{w} = \frac{\sum _{s=1}^S w^{(s)} ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }}_{w} ) ( \varvec{\theta }^{(s)} - \overline{\varvec{\theta }}_{w} )^{\mathsf {T}} }{\sum _{s=1}^S w^{(s)}} . \end{aligned}$$
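
A compact R sketch of the three transformations for a draws matrix (rows are draws) and arbitrarily scaled positive weights; following \(T_3\), the matrix square roots are taken as Cholesky factors.

```r
# draws: S x d matrix of draws; w: positive importance weights of length S
t1 <- function(draws, w) {                       # match the mean
  w <- w / sum(w)
  sweep(draws, 2, colSums(w * draws) - colMeans(draws), `+`)
}
t2 <- function(draws, w) {                       # match marginal variances and mean
  w <- w / sum(w)
  m <- colMeans(draws); m_w <- colSums(w * draws)
  v   <- colMeans(sweep(draws, 2, m)^2)          # marginal variances
  v_w <- colSums(w * sweep(draws, 2, m_w)^2)     # weighted marginal variances
  centered <- sweep(draws, 2, m)
  sweep(sweep(centered, 2, sqrt(v_w / v), `*`), 2, m_w, `+`)
}
t3 <- function(draws, w) {                       # match covariance and mean
  w <- w / sum(w)
  m <- colMeans(draws); m_w <- colSums(w * draws)
  centered <- sweep(draws, 2, m)
  L   <- t(chol(crossprod(centered) / nrow(draws)))           # Sigma   = L L'
  L_w <- t(chol(crossprod(sqrt(w) * sweep(draws, 2, m_w))))   # Sigma_w = L_w L_w'
  sweep(centered %*% t(L_w %*% solve(L)), 2, m_w, `+`)
}
```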

If the weights are available with the correct normalization, the weighted moments can be computed using standard importance sampling, but for a more general case, we show the self-normalized estimators of the weighted moments. When relying on self-normalized importance sampling, we recommend two separate adaptations, as discussed in Sect. 2.1. We perform both adaptations separately with the full Monte Carlo sample, but for the multiple importance sampling estimator of Eq. (9) we split the existing sample into two equally sized parts to avoid causing bias from using the same draws twice.

The three affine transformations are ordered from simple to complex in terms of the effective sample size required to accurately compute the moments (Kong 1992; Martino et al. 2017; Chatterjee and Diaconis 2018; Elvira et al. 2018). Particularly in the third transformation, the weighted covariance can be practically impossible to compute accurately if the variance of the weight distribution is large. For this reason, we first iterate only \(T_1\) repeatedly, and move on to \(T_2\) and \(T_3\) only when \(T_1\) is no longer helping. To determine this, we use finite sample diagnostics, which are discussed next.

2.3 Stopping criteria and diagnostics

Even if an (adaptive) importance sampling procedure has good asymptotic properties or the used proposal distribution guarantees finite variance by construction, its pre-asymptotic behaviour can be poor. Because of this, finite sample diagnostics are extremely important for assessing pre-asymptotic behaviour. For example, Vehtari et al. (2019c) demonstrate importance sampling cases with asymptotically finite variance, but pre-asymptotic behaviour indistinguishable from cases with unbounded importance weights or infinite variance. Vehtari et al. (2019c) propose a finite sample diagnostic based on fitting a generalized Pareto distribution to the upper tail of the distribution of the importance weights. Because the shape parameter k of the generalized Pareto distribution theoretically determines the number of its finite moments, the fitted distribution and its shape parameter estimate \(\hat{k}\) are useful for estimating the practical pre-asymptotic convergence rate. The authors propose \(\hat{k} = 0.7\) as an upper limit of practically useful pre-asymptotic convergence. The stability of (self-normalized) importance sampling can be improved by replacing the largest weights with order statistics of the generalized Pareto distribution already estimated for diagnostic purposes.

The Pareto \(\hat{k}\) diagnostic can also be used as a stopping criterion for adaptive importance sampling methods in order to not run the adaptation excessively long and waste computational resources. In addition to that, in the importance weighted moment matching method we use the diagnostic for estimating whether a specific transformation improves the proposal distribution or not.

Table 1 Total computational costs of different adaptive importance sampling algorithms after T iterations. S represents the number of draws sampled per iteration from each proposal distribution, except for IWMM which does not resample. N represents the number of proposal distributions.

We use the Pareto diagnostic as follows. First, we compute the Pareto \(\hat{k}\) diagnostic value for the original (common or expectation-specific) weights. After a transformation (\(T_1\), \(T_2\) or \(T_3\)), we recompute the diagnostic and accept the transformation only if the diagnostic value has decreased; if it has, the weights and diagnostic value are updated. We begin the adaptation by repeating transformation \(T_1\), and only when it is no longer accepted, we move on to attempt transformation \(T_2\), and eventually \(T_3\). As a criterion for stopping the whole algorithm, we use the diagnostic value \(\hat{k} = 0.7\), recommended by Vehtari et al. (2019c) as a practical upper limit for useful accuracy.

The full importance weighted moment matching algorithm for standard importance sampling or simple Monte Carlo sampling is presented in Algorithm 1. When using self-normalized importance sampling, the algorithm is very similar, with the exception of having two separate adaptations and combining them with multiple importance sampling in the end. It is presented as Algorithm 2 in “Appendix A”.

[Algorithm 1: importance weighted moment matching for standard importance sampling and simple Monte Carlo sampling; pseudocode figure not reproduced.]
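
Since the pseudocode figure is not reproduced here, the following R sketch shows the core loop for the \(T_1\) phase, using psis() from the loo package for the Pareto \(\hat{k}\) diagnostic; the function and argument names outside loo are our own.

```r
library(loo)  # provides psis() and its Pareto k-hat diagnostic

# Sketch of the T1 phase of Algorithm 1. draws: S x d matrix from g;
# log_p: function(draws) -> (possibly unnormalized) log target density;
# log_g: vector of log proposal densities at the initial draws.
# T1 is a pure translation, so its Jacobian is 1 and log_g stays fixed.
iwmm_t1 <- function(draws, log_p, log_g, k_threshold = 0.7, max_iter = 20) {
  lw <- log_p(draws) - log_g
  k <- psis(lw, r_eff = 1)$diagnostics$pareto_k
  for (iter in seq_len(max_iter)) {
    if (k <= k_threshold) break               # practically useful accuracy reached
    w <- exp(lw - max(lw)); w <- w / sum(w)   # self-normalized weights
    shift <- colSums(w * draws) - colMeans(draws)
    draws_new <- sweep(draws, 2, shift, `+`)  # transformation T1
    lw_new <- log_p(draws_new) - log_g
    k_new <- psis(lw_new, r_eff = 1)$diagnostics$pareto_k
    if (k_new >= k) break                     # reject T1; would move on to T2/T3
    draws <- draws_new; lw <- lw_new; k <- k_new
  }
  list(draws = draws, log_weights = lw, pareto_k = k)
}
```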

2.4 Computational cost

The computational costs of several popular adaptive importance sampling methods are compared in Table 1. We show the methods which also use moment estimation and are thus most similar to the proposed importance weighted moment matching method. For a more exhaustive comparison, see the review paper by Bugallo et al. (2017). Because of the implicit adaptation in IWMM, the proposal density needs to be computed only once at the beginning of the algorithm. Thus, the computational complexity of IWMM is smaller than that of even the simplest single-proposal adaptive importance sampling methods, making it well suited for problems where proposal evaluations are expensive. We note that IWMM could also retain the proposal distributions of consecutive transformations, in a similar fashion to adaptive multiple importance sampling, to increase performance at the cost of increased computational complexity (Cornuet et al. 2012). However, this may cause bias because there is no resampling. We leave this as a possible direction for future research.

If the importance weights have large variance, the moment matching transformations can be noisy because of inaccurate computation of the weighted moments. There are two principal ways to remedy this. First, increasing the number of draws generally increases the accuracy of the computed moments. Second, the importance weights used for computing the weighted moments can be regularized with truncation or smoothing methods (Ionides 2008; Koblents and Míguez 2015; Miguez et al. 2018; Vehtari et al. 2019c; Bugallo et al. 2017). In the experiments section, we demonstrate that the accuracy of the moment matching can be improved with Pareto smoothing from Vehtari et al. (2019c).
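
With the loo package, the Pareto smoothed weights can be obtained directly from the psis object and used for the weighted moments; a brief sketch with toy inputs of our own (r_eff = 1 assumes independent draws):

```r
library(loo)
set.seed(1)
draws <- matrix(rnorm(4000 * 2), ncol = 2)          # toy S x d sample from g = N(0, I)
lw <- rowSums(dnorm(draws, mean = 0.5, log = TRUE)
              - dnorm(draws, log = TRUE))           # toy log ratios p/g
ps <- psis(lw, r_eff = 1)                           # fit generalized Pareto to the tail
lw_s <- weights(ps, log = TRUE, normalize = TRUE)   # Pareto smoothed log weights
w_s <- as.vector(exp(lw_s))
theta_bar_w <- colSums(w_s * draws)                 # stabilized weighted mean for T1
```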

Another shortcoming of the method is that the adaptation target is not always well characterized by its first and second moments, and the target and proposal distributions can differ in several characteristics, such as tail thickness, correlation structure, or number of modes. For complex targets, more elaborate transformations may be needed to reach a good enough proposal distribution. That being said, it is not necessary to match all characteristics of the proposal distribution to the target for importance sampling to be effective.

3 Experiments

In this section, the proposed implicit adaptation method is illustrated with a variety of numerical experiments using leave-one-out cross-validation (LOO-CV) as an example application. With both simulated and real data sets, we evaluate the predictive performance of different Bayesian models using leave-one-out cross-validation, and demonstrate the improvements that the implicit adaptation methods can provide. We also compare it to existing adaptive importance sampling methods that use parametric proposal distributions.

All of the simulations were done in R (R Core Team 2020), and the models were fitted using rstan, the R interface to the Bayesian inference package Stan (Carpenter et al. 2017; Stan Development Team 2018). To sample from the posterior of each model, we ran four Markov chains using a dynamic Hamiltonian Monte Carlo (HMC) algorithm (Hoffman and Gelman 2014; Betancourt 2017), which is the default in Stan. We monitor convergence of the chains with the split-\(\widehat{R}\) potential scale reduction factor from Vehtari et al. (2019b) and by checking for divergent transitions, a diagnostic specific to adaptive HMC. We note that the finite sample behaviour of Monte Carlo integrals depends on the algorithm used to generate the sample. For example, if one uses an MCMC algorithm less efficient than HMC, the resulting Monte Carlo approximations will generally be worse than those illustrated in the next sections. The R and Stan code for the experiments and the data sets used are available on GitHub (https://github.com/topipa/iter-mm-paper).

Because probabilistic programming tools generally give only unnormalized posterior densities, we mostly focus on self-normalized importance sampling. As the default case, we take the situation that Monte Carlo draws are available from the full data posterior distribution, and these are adapted using our proposed method. We note that leave-one-out cross-validation in this setting is a special case such that the double adaptation which is discussed in Sect. 2.1 is not needed even when using self-normalized importance sampling. The split proposal of Eq. (9) is still used, but the other term uses the full data posterior draws. To help the reader in understanding or implementing the methods, we have presented the basics of Bayesian leave-one-out cross-validation as well as instructions for implementing the proposed methods in “Appendix B”. In addition to importance sampling, we also discuss simple Monte Carlo sampling results when sampling from each leave-one-out posterior explicitly.

By default, we use Pareto smoothing to stabilize importance weights, but we also present results without smoothing (Vehtari et al. 2017, 2019c). This enables us to also monitor the reliability of the Monte Carlo estimates using the Pareto \(\hat{k}\) diagnostics. We show that the diagnostics accurately identify convergence problems in not only importance sampling, but also when using the simple Monte Carlo estimator or adaptive importance sampling algorithms. Based on Vehtari et al. (2019c), we use \(\hat{k} = 0.7\) as an upper threshold to indicate practically useful finite sample convergence rate.

We compare our proposed method to several existing adaptive importance sampling methods. For comparison, we chose algorithms that are conceptually similar to our proposed implicit adaptation method. As the first comparison, we have generic adaptive importance sampling methods, which use a single proposal distribution and adapt the location and scale parameters of this distribution using weighted moment estimation. As the proposal distribution we use either a multivariate Gaussian distribution or a Student-\(t_3\) distribution. Moreover, we test these algorithms both in the traditional way of adapting using the common importance weights, and using our proposed double adaptation, resulting in 4 different algorithms. To compare to a more powerful and computationally expensive algorithm, we chose adaptive multiple importance sampling (AMIS; Cornuet et al. 2012), which uses multiple proposal distributions and deterministic mixture weighting, increasing the number of proposal distributions over time. For this algorithm, we likewise test 4 versions, with either Gaussian or Student-\(t_3\) distributions, and with and without double adaptation. We start all the parametric adaptive methods with the mean and covariance estimated from a sample from the full data posterior. Also for the parametric adaptive importance sampling methods, we use the Pareto \(\hat{k}\) diagnostic to determine when to stop the algorithm. For all eight algorithms, we adapt both the mean and covariance when feasible, but for very high-dimensional distributions we only adapt the mean, because otherwise the adaptation is unstable given the used sample sizes.

Sections 3.1 and 3.2 show low-dimensional examples where the function h whose expectation is being computed gets large values in the tails of the distribution over which the expectation is being computed. These cases highlight the importance of our proposed double adaptation, as the target densities are available in unnormalized form. Sections 3.3 and 3.4 show correlated and high-dimensional examples which are significantly more difficult. In Sect. 3.4, the distribution over which the expectation is defined is also multimodal. In these cases, we demonstrate the usefulness of using a complex nonparametric proposal distribution instead of Gaussian or Student-t densities.

3.1 Experiment 1: Gaussian data with a single outlier

In this section, we demonstrate with a simple example what happens when we try to assess the predictive performance of a misspecified model. We emphasize that even though this is a simple example, it still provides valuable insight for real world data and models as evaluating misspecified models is an integral part of any Bayesian modelling process. In terms of Monte Carlo sampling, this is an example of an expectation (1) where the largest values of the function h are in the tails of the target distribution p.

We generate 29 observations from a standard normal distribution, and manually set the value of a 30th observation to introduce an outlier. This mimics a situation where the true data generating mechanism has thicker tails than the assumed observation model. Keeping the randomly generated observations fixed, we repeat the experiment for different values of the outlier, ranging from \(y_{30} = 0\) to \(y_{30} = 20\). We model the data with a Gaussian distribution with unknown mean and variance, generate draws from the model posterior, and evaluate the predictive ability of the model using leave-one-out cross-validation.

For all 30 observations, represented jointly by the vector \(\mathbf {y}\), the model is thus

$$\begin{aligned} \mathbf {y} \sim \text {Normal} (\mu ,\sigma ^2) \end{aligned}$$

with mean \(\mu \) and standard deviation \(\sigma \). We set improper uniform priors on \(\mu \) and \(\log (\sigma )\). In this model, the posterior predictive distribution \(p (\widetilde{y} \mid \mathbf {y})\) is known analytically, and is a Student t-distribution with \(n-1\) degrees of freedom, mean at the mean of the data, and scale \(\sqrt{1 + 1/n}\) times the standard deviation of the data, where n is the number of observations. Thus, we can compute the Bayesian LOO-CV estimate for the single left out point analytically via

$$\begin{aligned} \text {elpd}_{\text {loo},i} = \log p (\widetilde{y} = y_i \mid \mathbf {y}_{-i}). \end{aligned}$$
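
The analytical \(\text {elpd}_{\text {loo},i}\) follows directly from the posterior predictive stated above; a short R sketch (our own):

```r
# Analytical log predictive density of held-out y_i under the normal model
# with uniform priors on mu and log(sigma): a Student-t posterior predictive
# with n - 1 degrees of freedom based on the n remaining observations.
elpd_loo_i <- function(y, i) {
  y_rest <- y[-i]
  n <- length(y_rest)
  s <- sqrt(1 + 1 / n) * sd(y_rest)   # predictive scale
  dt((y[i] - mean(y_rest)) / s, df = n - 1, log = TRUE) - log(s)
}
set.seed(1)
y <- c(rnorm(29), 10)   # 29 standard normal draws plus an outlier at y_30 = 10
elpd_loo_i(y, 30)
```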

The left plot of Fig. 2 shows the computed \(\widehat{\text {elpd}}_{\text {loo},30}\) estimates for the 30th observation based on different sampling methods, which are compared to the analytical \(\text {elpd}_{\text {loo},30}\) values as the outlier value is varied. When the outlier becomes more and more different from the rest of the observations and the analytical \(\text {elpd}_{\text {loo},30}\) decreases, both the simple Monte Carlo estimate from the true leave-one-out posterior and the PSIS estimate from the full data posterior become increasingly biased due to insufficient accuracy in the tails of the posterior predictive distribution. The same happens to adaptive importance sampling using a single Gaussian proposal (AIS-G), and to a smaller extent when using a Student-\(t_3\) proposal (AIS-t). Our proposed importance weighted moment matching from either the full posterior (PSIS+MM) or the leave-one-out posterior (naive+MM) aligns almost perfectly with the analytical solution. AIS-G and AIS-t also give very accurate results when using our proposed double adaptation. Similarly, the results of all 4 AMIS algorithms align well with the analytical solution and are omitted from Fig. 2 for improved readability. While not shown in the plot, PSIS+MM also gives highly biased results if the split proposal of Eq. (9) is omitted. In “Appendix C”, we show the results of a similar experiment, where the randomly generated points \(y_{1}\) to \(y_{29}\) are re-generated at every repetition, to show that the results are not specific to this particular data realization.

Fig. 2 Computed \(\widehat{\text {elpd}}_{\text {loo},30}\) estimates of the left out observation \(y_{30}\) for the normal model for different values between \(y_{30} = 0\) and \(y_{30} = 20\). The black crosses depict the analytical results. The sampling results are averaged from 100 independent Stan runs, and the error bars represent \(95 \%\) intervals of the mean across these runs. The dashed line at \(\hat{k} = 0.7\) presents the diagnostic threshold indicating practically useful finite sample convergence rate.

The right plot of Fig. 2 shows the Pareto \(\hat{k}\) diagnostic values corresponding to the different algorithms. The diagnostic values are computed from both common and expectation-specific weights, and the larger is reported. The plot shows that both moment matching algorithms have \(\hat{k} < 0.7\), which indicates good finite sample accuracy. For all of the other algorithms, the diagnostic value grows over 0.7 as the problem becomes more difficult, which correlates well with the biased results in the left plot. Of the AIS algorithms, the Student-\(t_3\) proposal distribution yields much smaller bias than the Gaussian proposal due to its much thicker tails. Still, the Pareto \(\hat{k}\) diagnostic indicates poor finite sample convergence. When looking at the importance weights of the individual runs, it is indeed clear that the result is based on only a few Monte Carlo draws from the thick tails of the Student-\(t_3\) distribution. Because of that, the variance between different runs is large. In the most difficult case, when \(y_{30} = 20\) and \(\text {elpd}_{\text {loo},30} = -44.3\), the variance of the estimated \(\text {elpd}_{\text {loo},30}\) for \(\text {AIS-}t\) is more than 1000 times higher than for PSIS+MM.

Figure 2 highlights the importance of our proposed double adaptation when some densities are available only in unnormalized form. All of the proposal distributions that we compared fail without the double adaptation and the split proposal of Eq. (9). The Student-t proposal does quite well, but it has high variance because it relies on only a few draws from the tails. In higher-dimensional situations, it also fails sooner, as we show later. AMIS gives good results even without the double adaptation because it was started from an initial distribution based on the mean and covariance of the full posterior, and it retains all earlier proposal distributions. Because the initial distribution is already close to the target of the second adaptation, the second adaptation is not needed for the AMIS algorithms in this low-dimensional example.

3.2 Experiment 2: Poisson regression with outliers

In the second experiment, we illustrate with a real data set how poor finite sample convergence can cause significant errors when estimating the predictive performance of models. The data are from Gelman and Hill (2006), where the authors describe an experiment that was performed to assess how efficiently a pest management system reduces the number of roaches. The target variable y describes the number of roaches caught in a set of traps in each apartment. The model includes an intercept plus three regression predictors: the number of roaches before treatment, an indicator variable for the treatment or control group, and an indicator variable for whether the building is restricted to elderly residents. We fit a Poisson regression model with a log link to the data set. The traps were held in the apartments for different periods of time, so the measurement time is included by adding its logarithm as an offset to the linear predictor. The model has only 4 parameters, so this is again a quite simple example.
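
For orientation, the model can be written in a few lines; the sketch below uses rstanarm for brevity (the paper's experiments use hand-written Stan code instead, and the variable names follow the roaches data set shipped with rstanarm):

```r
library(rstanarm)   # provides stan_glm() and the Gelman & Hill roaches data
data(roaches)
# Poisson regression with a log link; log measurement time as an offset
fit <- stan_glm(y ~ roach1 + treatment + senior,
                family = poisson(link = "log"),
                offset = log(exposure2),
                data = roaches, chains = 4, refresh = 0)
loo_fit <- loo(fit)   # PSIS-LOO; reports the Pareto k-hat diagnostics
plot(loo_fit)         # k-hat value for each left-out observation
```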

On the left side of Fig. 3, we show the computed \(\widehat{\mathrm {elpd}}_{\mathrm {loo}}\) estimates averaged from 100 independent Stan runs as a function of the number of posterior draws S. On the right side, the mean of the largest Pareto \(\hat{k}\) diagnostic values out of all of the observations is presented. The diagnostic is always computed from both the common and expectation-specific weights, and the larger is reported. There is a large difference between the PSIS and naive estimates, and they approach each other very slowly when increasing S, which is due to the poor convergence rate indicated by the high Pareto \(\hat{k}\) values in the right side plot. Importance weighted moment matching from either the full posterior or the leave-one-out posteriors gives reliable estimates with very small error already from 1000 draws. The accuracy is confirmed by the updated Pareto \(\hat{k}\) values, which are always below 0.7. For the single-proposal parametric methods, using the Student-\(t_3\) proposal distributions together with our proposed double adaptation (AIS-t \(\times 2\)) gives good results from 2000 draws onwards, but the rest of the methods give highly biased results even with \(S = 64000\). Contrary to the previous example, now even the double adaptation converges extremely slowly when using a Gaussian proposal distribution (AIS-G \(\times 2\)), which indicates that the posterior distribution is non-Gaussian. All 4 versions of the AMIS algorithm had Pareto \(\hat{k}\) values below 0.7 already with 1000 draws, and had elpd estimates almost indistinguishable from the importance weighted moment matching results. These are omitted from Fig. 3 for improved readability.

Fig. 3 Left: Computed \(\widehat{\mathrm {elpd}}_{\mathrm {loo}}\) estimates over the whole Roach data set as a function of the number of posterior draws S. Right: The mean of the largest Pareto \(\hat{k}\) diagnostic values among the observations. The results are averaged from 100 independent Stan runs, and the error bars represent \(95 \%\) intervals of the mean across these runs. The dashed line at \(\hat{k} = 0.7\) presents the diagnostic threshold indicating practically useful finite sample convergence rate. The largest Pareto \(\hat{k}\) for AIS-t is around 10, and it is left out of the right plot for clarity.

3.3 Experiment 3: linear regression with correlated predictor variables

In the previous examples, the used models were quite simple and had a small number of parameters. In this and the following sections, we study the limitations of the importance weighted moment matching method by considering models with more parameters and correlated or non-Gaussian posteriors. The two previous experiments showed that the Pareto \(\hat{k}\) diagnostic is a reliable indicator of finite sample accuracy for adaptive importance sampling methods. To demonstrate the performance and computational cost of the different adaptive algorithms, we report the number of leave-one-out (LOO) folds where the algorithms fail to decrease the \(\hat{k}\) diagnostic value below 0.7. In order to get reliable results for these failed LOO folds, the user should generate new MCMC draws from the LOO posterior, which can be very costly. We fit all models to the full data set, and report the number of leave-one-out folds where the \(\hat{k}\) diagnostic value is above 0.7 when using the full data posterior directly as a proposal distribution. These are reported in the column PSIS in Table 2. We run the moment matching algorithm for all these LOO folds, and report how many \(\hat{k}\) values are still above 0.7 (PSIS+MM). Similarly, we run the 8 parametric adaptive methods for the same LOO folds. In Table 2, we only show the best performing parametric methods, which are AMIS with double adaptation using either Gaussian or Student-\(t_3\) proposals (AMIS \(\times 2\) and AMIS-t \(\times 2\)). In the lower part of Table 2, we report run times in seconds for all the reported algorithms. The run times are based on single core runs with an Intel Xeon X5650 2.67 GHz processor.

For this experiment, we simulated data from a linear regression model. The data consist of \(n = 60\) observations of one outcome variable and 30 predictors that are correlated with each other with a correlation coefficient of \(\rho = 0.8\). Three of the true regression coefficients are nonzero, and the rest are all zero. Independent Gaussian noise was added to the outcomes \(\mathbf {y}\). Because the predictors are strongly correlated, importance sampling leave-one-out cross-validation is difficult, and we get multiple high Pareto \(\hat{k}\) values when using the full data posterior as the proposal distribution. The results in Table 2 show that already with 2000 posterior draws, the moment matching algorithm is able to decrease the Pareto \(\hat{k}\) values of all LOO folds below 0.7. In contrast, none of the parametric algorithms ever succeed in reducing the \(\hat{k}\) values below 0.7, even when increasing the number of draws to 8000. This highlights the difficulty of adapting to a highly correlated distribution. Because the moment matching starts from the full data posterior sample, which is similarly correlated, it can successfully improve the proposal distribution at a small cost. The AMIS algorithms were run for 10 iterations to limit the computational cost. With more iterations, they should eventually succeed, but at a high computational cost. In Table 3 in “Appendix C”, we show results for importance weighted moment matching without Pareto smoothing of the importance weights. The results are slightly worse compared to the Pareto smoothing case.
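
The simulated design is easy to reproduce; a sketch (the values of the nonzero coefficients and the noise scale are our assumptions, as the text does not specify them):

```r
set.seed(1)
n <- 60; p <- 30; rho <- 0.8
Sigma <- matrix(rho, p, p); diag(Sigma) <- 1   # equicorrelated predictors
X <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta <- c(rep(1, 3), rep(0, p - 3))            # 3 nonzero coefficients (values assumed)
y <- as.vector(X %*% beta + rnorm(n))          # independent Gaussian noise
```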

3.4 Experiment 4: binary classification in a small n large p data set

In the fourth experiment, we have a real microarray Ovarian cancer classification data set with a large number of predictors and a small number of observations. The data set has been used as a benchmark by several authors (e.g. Schummer et al. 1999; Hernández-Lobato et al. 2010, and references therein). The data consist of 54 measurements with 1536 predictor variables. We fit a logistic regression model using a regularized horseshoe prior (Piironen and Vehtari 2017b) on the regression coefficients because we expect many of them to be zero. This data set and model are difficult for several reasons. First, because the number of observations is quite low, leaving out single observations changes the posterior significantly, indicated by a large number of high Pareto \(\hat{k}\) values. Second, because the number of parameters in the model is 3075, moment matching in the high-dimensional space is difficult. Third, the posterior distribution of several parameters is multimodal, as illustrated in Fig. 1. Because of the multimodality, we used Monte Carlo chains of length 1000, and increased the number of chains when increasing S.

When the model is fitted to the full data, Table 2 shows the number of LOO folds with \(\hat{k} > 0.7\) before and after moment matching. The results show that already with 1000 draws, PSIS+MM is able to reduce the \(\hat{k}\) of many LOO folds below 0.7. Investing more computational resources by collecting more posterior draws increases the moment matching accuracy, and more LOO folds can be improved. However, even with 8000 posterior draws, some folds have \(\hat{k} > 0.7\) after moment matching, and thus the \(\widehat{\text {elpd}}_{\text {loo}}\) estimate may not be reliable. Again, none of the parametric adaptive methods succeed in reducing the Pareto \(\hat{k}\) values below 0.7 in 10 iterations. The lower part of Table 2 also shows the significantly higher computational time of the AMIS algorithms compared to importance weighted moment matching.

Table 2 Upper part: Numbers of LOO folds with Pareto \(\hat{k}\) diagnostic above 0.7 when the models are fitted to the full data set (lower is better). Lower part: Average run times in seconds for different algorithms. Column PSIS corresponds to using the full data posterior directly as the proposal distribution. Column PSIS+MM corresponds to importance weighted moment matching. Column AMIS \(\times 2\) corresponds to adaptive multiple importance sampling with our proposed double adaptation using Gaussian proposal distributions. Column AMIS-t \(\times 2\) is the same, but using Student-\(t_3\) proposals.

The data set and model used are complex enough that naive LOO-CV, fitting the model to each LOO fold separately, takes a nontrivial amount of time. Without parallelization, a single model fit using Stan took an average of 27 minutes when generating 4000 posterior draws. Naive LOO-CV would thus be costly, as fitting the model 54 times would take around 24 hours. With the same hardware, standard PSIS took less than a second, but refitting the 34.8 (on average) problematic LOO folds would take more than 15 hours. For the problematic LOO folds, the total run time of PSIS+MM was only 26 minutes on average. This is less time than a single model fit, while decreasing the number of required refits from 34.8 to 16.2 on average, which shows that importance weighted moment matching is computationally efficient.

4 Conclusion

We proposed a method for improving the accuracy of Monte Carlo approximations to integrals via importance sampling and importance weighted moment matching. By matching the moments of an existing Monte Carlo sample to its importance weighted moments, the proposal distribution is implicitly modified and improved. The method is easy to use and automate for different applications because it has no parameters that require tuning. We proposed separate adaptation schemes and estimators for different importance sampling estimators. In particular, we proposed a novel double adaptation scheme that is beneficial for many existing adaptive importance sampling methods when relying on the self-normalized importance sampling estimator.

We also showed that the Pareto diagnostic from Vehtari et al. (2019c) is able to detect poor finite sample convergence for different Monte Carlo estimators and adaptive algorithms when taking into account both the common and expectation-specific importance weights. Furthermore, it is useful as a stopping criterion in adaptive importance sampling methods, reducing computational cost by not running the algorithm excessively long.

We evaluated the efficacy of the proposed methods in self-normalized importance sampling leave-one-out cross-validation (LOO-CV), and demonstrated that they can often increase the accuracy of model assessment and even surpass naive LOO-CV that requires expensive refitting of the model. Moreover, in complex or high-dimensional cases we demonstrated that our proposed method has much better performance compared to existing adaptive importance sampling methods that use Gaussian or Student-\(t_3\) proposal distributions. Additionally, our method has a small computational cost as it does not require recomputing proposal densities during iterations. We also showed that our proposed double adaptation scheme for self-normalized importance sampling is crucial for cases where the function whose expectation is being computed has large values in the tails of the distribution over which the expectation is computed. We showed that the double adaptation can also significantly improve the performance of existing parametric adaptive importance sampling methods.

The performance of the proposed implicit adaptation method depends highly on the goodness of the initial proposal distribution. Bayesian leave-one-out cross-validation or bootstrap are examples where the full data posterior distribution is already a good proposal, and moment matching can improve performance with a small computational cost. In the most complex cases, the simple affine transformations proposed in this work are not enough to produce a good proposal distribution, and more complex methods may be required. Such methods are left for future research.