1 Introduction

Bayesian inference in data assimilation (DA) has been researched for several decades and is frequently applied in the petroleum industry today. Due to the typically vast computational cost of the forward simulation problem, classic Bayesian Monte Carlo methods, such as Markov chain Monte Carlo (MCMC) and importance sampling, are not applicable. Therefore, approximate Bayesian methods, which require fewer computational resources, are applied in most real-world cases. It is convenient to separate these methods into derivative-based and derivative-free approaches. Derivative-based approximations include Randomized Maximum Likelihood (RML, Kitanidis 1995; Oliver et al. 1996), distributed Gauss–Newton solvers (Gao et al. 2017), and EDA-4DVar (Carrassi et al. 2018).

However, derivative-free methods are more commonly used, due to the general lack of adjoint code in commercial simulators and in large-scale models. Many modern derivative-free methods are based on the ensemble Kalman filter (EnKF, Evensen 2004). During the last decade, iterative ensemble methods have become the methods of choice, such as the iterative EnKF/EnKS (Bocquet and Sakov 2014), the ensemble RML (Chen and Oliver 2013), and the ensemble smoother with multiple data assimilations (ESMDA, Emerick and Reynolds 2013). Unfortunately, in the general (non-linear) case, none of these ensemble methods converge to the true posterior distribution in the limit of an infinite sample size. It is possible to nest some of these methods in the importance sampling framework to achieve asymptotic optimality (Stordal and Elsheikh 2015; Stordal 2015; Stordal and Karlsen 2017), but at a considerable cost. Other approximate approaches to Bayesian sampling that go beyond traditional MCMC include sampling via optimal transport (El Moselhy and Marzouk 2012; Reich 2013; Marzouk et al. 2017) and implicit sampling (Skauvold et al. 2019).

In this work, the recently published Stein variational gradient descent method (SVGD, Liu and Wang 2016) is applied to history matching for the first time. The kernels introduced into the algorithm here are more appropriate for high-dimensional applications than the commonly used Gaussian kernel. In addition, an alternative derivative-free implementation is discussed. Previous applications of SVGD include Bayesian logistic regression (Liu and Wang 2016), training of neural networks (Feng et al. 2017), sequential filtering of the Lorenz-63 and -96 models (Pulido and van Leeuwen 2019; Pulido et al. 2019), and inference on simple linear and nonlinear partial differential equations (PDEs) using a subspace Hessian projection (Chen et al. 2019) and a Gauss–Newton formulation (Detommaso et al. 2018).

The outline of this paper is as follows: Sect. 2 introduces the SVGD theory. In Sect. 3, the implementation is presented with and without an adjoint code, together with the choice of kernels for high-dimensional applications. Examples with a toy model, the Lorenz systems, and a reservoir model are given in Sect. 4. A summary and a discussion of future work conclude the paper in Sect. 5.

2 Stein Variational Gradient Descent

Let \(X\in {\mathscr {X}} \subseteq {\mathbb {R}}^{N_x}\) be the unknown vector of interest. In Bayesian statistics, it is treated as random. Denote its (prior) probability density function (PDF) by p(x). It is observed through the noise-corrupted measurements

$$\begin{aligned} Y={\mathcal {H}}(X)+\epsilon , \end{aligned}$$
(1)

where \(\epsilon \in {\mathbb {R}}^{N_y}\) is a random noise vector, and \({\mathcal {H}}\) is a (possibly nonlinear) mapping \({\mathcal {H}}:{\mathscr {X}}\rightarrow {\mathscr {Y}}\subseteq {\mathbb {R}}^{N_y}\). The likelihood, \(p(y\vert x)\), follows from the distribution of \(\epsilon \), whose framing as an additive error is mere convention. The posterior is given by Bayes’ rule: \(p(x\vert y) \propto p(y\vert x) \, p(x)\). For complex problems, such as large scale inversion and DA, Bayesian inference often takes the form of (approximate) sampling from the posterior.

Given two densities p and q, the Stein discrepancy can be used to quantify their difference. It is defined as

$$\begin{aligned} {\mathbb {S}}(q,p)=\underset{f\in {\mathscr {F}}}{\max }\left( {\mathbb {E}}_q[(\partial _x \log p(X)) \, f(X)+\partial _x f(X)]\right) , \end{aligned}$$
(2)

where \({\mathscr {F}}\) is a set of test functions, \(f:{\mathscr {X}}\rightarrow {\mathbb {R}}\), that are sufficiently smooth and satisfy

$$\begin{aligned} {\mathbb {E}}_p[\bigl (\partial _x \log p(X)\bigr ) \, f(X)+\partial _x f(X)]=0. \end{aligned}$$
(3)

That is, f is in the Stein class of p (Liu et al. 2016). The variational problem in Eq. 2 is computationally intractable in most cases, and therefore Chwialkowski et al. (2016) and Liu et al. (2016) introduced the kernelized Stein discrepancy, where the function space \({\mathscr {F}}\) is taken to be the unit ball of a reproducing kernel Hilbert space (RKHS), \({\mathscr {H}}\), for which analytical solutions may be obtained. The Kullback–Leibler (KL) divergence from q to p is given by

$$\begin{aligned} D_{\text {KL}}(q \Vert p)=\int q(x)\log \left( \frac{q(x)}{p(x)}\right) \,dx. \end{aligned}$$
(4)

It was related to kernelized Stein discrepancy by Liu and Wang (2016), who showed that if X has density q, then the KL divergence to p from the PDF \(q_{{\mathcal {T}}}\) of the transformation

$$\begin{aligned} {\mathcal {T}}(x)=x+\epsilon \phi (x), \end{aligned}$$
(5)

has a derivative with respect to \(\epsilon \) at \(\epsilon =0\), \(\frac{d}{d \epsilon } D_{\text {KL}}(q_{{\mathcal {T}}}\Vert p)\big \vert _{\epsilon =0}\), whose negative value (the steepest descent) is maximized over the unit ball of \({\mathscr {H}}\) by

$$\begin{aligned} \phi (\cdot )={\mathbb {E}}_{q}[K(X,\cdot ) \, \partial _x \log p(X)+\partial _x K(X,\cdot )], \end{aligned}$$
(6)

where \(K(x,x')\) is the unique kernel defining \({\mathscr {H}}\).

In the context of Bayesian inference, the posterior of Bayes’ rule may be approximated by a KL-minimizing algorithm (Liu and Wang 2016): SVGD starts with a sample \(\{x_0^i\}_{i=1}^N\) from the prior density p(x) and iteratively updates \(x_k^i\) using the empirical (Monte Carlo) version of Eq. 6 inserted into Eq. 5

$$\begin{aligned} x_{k+1}^i=x_k^i+\epsilon _k N^{-1}\sum _{j=1}^N\left( K(x_k^j,x_k^i)\partial _x \log p(x_k^j\vert y)+\partial _x K(x_k^j,x_k^i)\right) , \end{aligned}$$
(7)

for all particles, or ensemble members, \(i=1,\ldots ,N\). If the kernels are chosen to be Gaussian, i.e. \(K(x,x')=\exp (-\frac{1}{2}\left\Vert x-x'\right\Vert _2^2)\), then Eq. 7 reduces to

$$\begin{aligned} x_{k+1}^i=x_k^i+\epsilon _kN^{-1}\sum _{j=1}^N K(x_k^j,x_k^i)\left( \partial _x \log p(x_k^j\vert y)+(x_k^i-x_k^j)\right) . \end{aligned}$$
(8)

The sum in Eq. 8 is a kernel-weighted average over the ensemble of two contributions: the gradient term guides the sample points towards the maximum of the log posterior, while the second term repulses sample points that are too close to each other. Equation 8 presupposes the use of simple gradient descent optimization. In practice, both Liu and Wang (2016) and Pulido and van Leeuwen (2019) used an adaptive subgradient variant (ADAGRAD, Duchi et al. 2011), which is also the choice made here.
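To make the update concrete, the following minimal sketch implements Eq. 8 with a unit-bandwidth Gaussian kernel and ADAGRAD-scaled steps. The callable grad_log_post, the default parameter values, and all names are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def svgd(x, grad_log_post, n_iter=100, step=0.1, eps=1e-6):
    """Minimal SVGD sketch (Eq. 8): Gaussian kernel, ADAGRAD step scaling.

    x             : (N, Nx) array of particles, initialized from the prior.
    grad_log_post : callable mapping an (N, Nx) array to the gradients
                    d/dx log p(x | y) of each particle, shape (N, Nx).
    """
    N = x.shape[0]
    accum = np.zeros_like(x)                          # ADAGRAD accumulator
    for _ in range(n_iter):
        g = grad_log_post(x)                          # (N, Nx)
        diff = x[:, None, :] - x[None, :, :]          # diff[i, j] = x_i - x_j
        K = np.exp(-0.5 * np.sum(diff**2, axis=-1))   # Gaussian kernel matrix
        # Eq. 8: kernel-weighted gradients (attraction) plus repulsion term
        phi = (K @ g + np.sum(K[:, :, None] * diff, axis=1)) / N
        accum += phi**2                               # ADAGRAD: per-coordinate scaling
        x = x + step * phi / (np.sqrt(accum) + eps)
    return x
```

The kernel bandwidth is fixed to one here; in practice a bandwidth heuristic, or the p-kernel of Sect. 3.3, would be used.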

It is interesting to note that if the kernel is degenerate, then the SVGD algorithm produces N copies of the MAP estimate (or N local optima in the general non-convex case). It was shown in Liu (2017) that the continuous-time infinite-sample limit of the density induced by SVGD satisfies the Vlasov equation (Vlasov 1961)

$$\begin{aligned} \partial _t \mu _t=-\nabla \cdot (\phi _{\mu _t}\,\mu _t). \end{aligned}$$
(9)

Using Stein’s lemma (Stein 1972) it may be shown that \(\partial _t \mu _t=0\) for \(\mu _t=p(x\vert y)\), meaning that the SVGD algorithm can be viewed as a particle flow with the posterior as a stationary distribution. Furthermore, Zhang et al. (2018) showed that SVGD can be combined with Langevin dynamics to obtain a stochastic particle-flow version of SVGD that may avoid the issue of particles collapsing to a single point or getting stuck in a local mode. This is not discussed further, as it did not have any impact on the problems considered here. For more details the reader is referred to Zhang et al. (2018) and references therein.

3 Implementation

In the following it is assumed that the prior and the measurement error are both Gaussian. Hence

$$\begin{aligned} \partial _x \log p(x\vert y)={\mathbf {C}}_x^{-1}(\mu -x)+{\mathbf {H}}^{\top }{\mathbf {C}}_{\epsilon }^{-1}(y-{\mathcal {H}}(x)), \end{aligned}$$
(10)

where \({\mathbf {H}}\) is the sensitivity matrix (the derivative of \({\mathcal {H}}\) with respect to x) and \({\mathbf {C}}_x\) and \({\mathbf {C}}_{\epsilon }\) are the covariance matrices of X and \(\epsilon \), respectively. Computing the data-mismatch term involving \({\mathbf {H}}\) is challenging since it requires an adjoint code. It is therefore of great interest to study alternative implementations of SVGD using approximate gradients. It should also be mentioned that Han and Liu (2018) presented a weighted SVGD wherein the gradient is computed using a surrogate density instead of the target density. In the following, however, the focus is on using either an adjoint code or an ensemble approximation of \({\mathcal {H}}\).
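For illustration, a possible evaluation of Eq. 10 with dense covariance matrices is sketched below; forward and sensitivity are hypothetical callables standing in for the simulator and for an adjoint-based (or approximate) sensitivity computation.

```python
import numpy as np

def grad_log_post(x, mu, Cx, y, Ceps, forward, sensitivity):
    """Gradient of the Gaussian log posterior, Eq. 10, at a single particle x.

    forward(x)     : simulated data H(x), shape (Ny,)     [hypothetical callable]
    sensitivity(x) : sensitivity matrix H, shape (Ny, Nx) [hypothetical callable]
    """
    H = sensitivity(x)
    prior_term = np.linalg.solve(Cx, mu - x)                 # C_x^{-1} (mu - x)
    data_term = H.T @ np.linalg.solve(Ceps, y - forward(x))  # H^T C_eps^{-1} (y - H(x))
    return prior_term + data_term
```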

3.1 Adjoint Implementation

In large scale PDE-constrained optimization for the solution of inverse problems (e.g. in reservoir history matching) it is convenient to formulate \({\mathcal {H}}\) as depending on two separate variables, \({\mathcal {H}} = {\mathcal {H}}(\xi , x)\), where the dynamic ones, \(\xi \), depend deterministically on the fixed (but unknown) parameter fields x through \(g(\xi ,x) = 0\), where g is the set of (discretized) forward model equations. Implicit differentiation yields \( \partial _x \xi = - (\partial _\xi g)^{-1} \, \partial _x g \), so that the (total) derivative of \({\mathcal {H}}\) with respect to x, required in Eq. 10, is given by

$$\begin{aligned} {\mathbf {H}}&= \partial _\xi {\mathcal {H}} \; \partial _x \xi + \partial _x {\mathcal {H}} \end{aligned}$$
(11)
$$\begin{aligned}&= - \partial _\xi {\mathcal {H}} \; (\partial _\xi g)^{-1} \; \partial _x g + \partial _x {\mathcal {H}}. \end{aligned}$$
(12)

Using the shorthand

$$\begin{aligned} w = {\mathbf {C}}_{\epsilon }^{-1}(y-{\mathcal {H}}), \end{aligned}$$
(13)

the last term in Eq. 10 becomes

$$\begin{aligned} {\mathbf {H}}^{\top } w&= \big ( - \partial _\xi {\mathcal {H}} \; (\partial _\xi g)^{-1} \partial _x g + \partial _x {\mathcal {H}} \big )^\top w \end{aligned}$$
(14)
$$\begin{aligned}&= - (\partial _x g)^\top \Big \{(\partial _\xi g)^{-\top } \bigl [ (\partial _\xi {\mathcal {H}})^{\top } w\bigr ]\Big \} + (\partial _x {\mathcal {H}})^\top w. \end{aligned}$$
(15)

The use of brackets in Eq. 15 seems heavy-handed. However, in contrast to the explicit computation of \( {\mathbf {H}} \) in Eq. 12 (known in the literature as the forward, or direct, method (Rodrigues 2006; Anterion et al. 1989), or the gradient simulator (Oliver et al. 2008)), the ordering of the brackets in Eq. 15 yields a sequence of computations, now involving the transposed system, with a cost proportional to a single simulation. This sequence of operations is known in the literature as the adjoint, or backward, method (Rodrigues 2006; Chavent et al. 1975).

A major hurdle in applying the adjoint method (as well as the forward/direct method) is writing the code for the computation of the partial derivatives of g and \({\mathcal {H}}\) with respect to both \(\xi \) and (mainly) x. In this study, automatic differentiation (Bendtsen and Stauning 1996) is applied.
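As a sketch of the adjoint sequence in Eq. 15, the computation of \({\mathbf {H}}^{\top } w\) can be organized around a single transposed linear solve. The sparse Jacobians below are assumed to be supplied, e.g. by automatic differentiation, and all names are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import spsolve

def adjoint_HTw(dg_dxi, dg_dx, dH_dxi, dH_dx, w):
    """Compute H^T w by the adjoint (backward) method of Eq. 15.

    dg_dxi, dg_dx : sparse (scipy) Jacobians of the model equations g w.r.t. xi and x
    dH_dxi, dH_dx : Jacobians of the observation operator w.r.t. xi and x
    w             : weighted residual C_eps^{-1} (y - H(x)) of Eq. 13
    """
    lam = spsolve(dg_dxi.T.tocsc(), dH_dxi.T @ w)   # one transposed ("adjoint") solve
    return -dg_dx.T @ lam + dH_dx.T @ w
```

By contrast, the forward method of Eq. 12 would require one such solve per column of \(\partial _x g\), i.e. per parameter.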

3.2 Adjoint-Free Implementation

Zhou (2008) showed that, by the reproducing property of \(f\in {\mathscr {H}}\),

$$\begin{aligned} \partial _x f(x) = \langle f, \partial _{x'} K(x,\cdot )\rangle _{{\mathscr {H}}}. \end{aligned}$$
(16)

However, the inner product on \({\mathscr {H}}\) is given by

$$\begin{aligned} \langle f,g\rangle _{{\mathscr {H}}}=\sum _{i=1}^{\infty }\frac{\langle f,\phi _i\rangle _{L_2}\langle g,\phi _i\rangle _{L_2}}{\lambda _i}, \end{aligned}$$
(17)

where \(\phi _i\) and \(\lambda _i\) are the eigenfunctions and eigenvalues of K. This is not trivial to compute, due to the high-dimensional integrals and the computation of eigenfunctions.

An alternative formulation was used by Pulido et al. (2019) where the gradient of \({\mathcal {H}}\) was approximated by Monte Carlo integration

$$\begin{aligned} \tilde{{\mathbf {H}}}(x)=N^{-1}\sum _{i=1}^N{\mathcal {H}}(x^i) \, \partial _{x'} K(x,x^i), \end{aligned}$$
(18)

using the sample available from the SVGD algorithm at each iteration. A further normalization was later introduced to improve the approximation by dividing each value of \(\tilde{{\mathbf {H}}}\) by the sum of the kernels at that point. The approximation was based on the claim that Eq. 18 would converge to

$$\begin{aligned} \widehat{{\mathbf {H}}}(x) =\int {\mathcal {H}}(x') \, \partial _{x'} K(x,x') \,d x', \end{aligned}$$
(19)

for \({\mathcal {H}}\in {\mathscr {H}}\) as N goes to infinity. However, this uses the plain \(L_2\) inner product,

$$\begin{aligned} \langle {\mathcal {H}},K\rangle =\int {\mathcal {H}}(x') \, K(x,x')\, d x', \end{aligned}$$
(20)

which is not the RKHS inner product (Eq. 17), so Eq. 19 is incorrect. However, if it is assumed that \({\mathcal {H}}(x')K(x,x')\) vanishes in the limits or at the boundary (in the case of finite support), then integration by parts yields

$$\begin{aligned} \int {\mathcal {H}}(x') \, \partial _{x'} \, K(x,x')\, d x' = -\int {\mathbf {H}}(x') \, K(x,x')\, d x' =-{\mathbb {E}}_{K(x,\cdot )}[{\mathbf {H}}]. \end{aligned}$$
(21)

Hence, up to its sign, Eq. 19 can be seen as a kernel-smoothed derivative of \({\mathcal {H}}\) around the point x, provided the kernel is normalized (integrates to one).

Furthermore, their unsatisfactory results are a consequence not only of the approximation in Eq. 19, but also of the fact that the sample is drawn from a particular distribution and not uniformly in space. Hence the notion that Eq. 18 converges to the integral in Eq. 19 is also incorrect: since the samples are drawn from a distribution q, Eq. 18 converges to \({\mathbb {E}}_q[{\mathcal {H}}\,\partial _{x'}K]\) unless the integrand is normalized by q. An alternative approach is therefore to draw the sample from the kernel density, \({\widehat{q}}(x) = N^{-1}\sum _i K(x,x^i)\), at each iteration, yielding the Monte Carlo estimate of Eq. 19 given by

$$\begin{aligned} {\mathbf {H}}(x) \approx -\sum _i {\mathcal {H}}(x^i)\frac{\partial _{x'} K(x,x^i)}{\sum _j K(x^i,x^j)}. \end{aligned}$$
(22)
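A minimal transcription of Eq. 22 with the unit-bandwidth Gaussian kernel \(K(x,x')=\exp (-\frac{1}{2}\left\Vert x-x'\right\Vert _2^2)\) is sketched below; the ensemble responses are assumed to be precomputed, and the names are illustrative.

```python
import numpy as np

def approx_sensitivity(x, X, HX):
    """Adjoint-free estimate (Eq. 22) of the sensitivity matrix at the point x.

    x  : (Nx,)   point at which the sensitivity is required
    X  : (N, Nx) current SVGD ensemble
    HX : (N, Ny) precomputed forward responses H(x^i)
    """
    K_xi = np.exp(-0.5 * np.sum((x - X)**2, axis=1))              # K(x, x^i)
    dK = (x - X) * K_xi[:, None]                                  # d/dx' K(x, x^i), (N, Nx)
    pairwise = np.exp(-0.5 * np.sum((X[:, None] - X[None])**2, axis=-1))
    denom = pairwise.sum(axis=1)                                  # sum_j K(x^i, x^j)
    # - sum_i H(x^i) (outer product) dK_i / denom_i  ->  (Ny, Nx)
    return -np.einsum('iy,ix,i->yx', HX, dK, 1.0 / denom)
```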

Alternatively, Pulido et al. (2019) suggest using an average ensemble approximation for the derivative, as in the EnKF. This does not allow for local information, and the algorithm falls into the ensemble-based category where quasi-linearity of the model is assumed.

Finally, since \(\tilde{{\mathbf {H}}}\) is estimated using a sample, it will suffer from Monte Carlo errors. These can, just like in the EnKF and its variants, be reduced by introducing localization: a user-defined correlation matrix \(\rho \) can be used to taper the estimate in Eq. 22 via the Schur product \(\rho \circ \tilde{{\mathbf {H}}}(x)\).

3.3 Extension to p-Kernels

In the literature discussed above, almost all applications selected \(K(x,x')\) to be a Gaussian kernel. The choice of kernel may not be critical in low-dimensional problems. For high-dimensional problems, however, the Gaussian kernel is not able to separate points locally, due to the curse of dimensionality, and it produces a degeneracy analogous to the weight collapse seen in importance sampling. To overcome this issue, scaled kernels were used in both Liu and Wang (2016) and Detommaso et al. (2018). In many cases, a simple scaling of the kernel bandwidth is not sufficient to separate points in high dimensions (increasing the bandwidth results in almost uniform weights), and the alternative approach of using p-kernels (Francois et al. 2005) is therefore chosen here. The kernel is specified by

$$\begin{aligned} K(x,x')=\exp \bigl (-\bigl (d(x,x')/\sigma \bigr )^{p}\bigr ). \end{aligned}$$
(23)

The special case \(p=2\) and \(d(x,x')=\left\Vert x-x'\right\Vert _2\) reduces the p-kernel to a Gaussian kernel. As the dimension increases, the pairwise distances (\(i\ne j\)), whose squares are \(\chi ^{2}\)-distributed (up to scaling) if X is Gaussian, tend to cluster away from zero; hence the term with \(i=j\) in Eq. 8 increasingly dominates unless the bandwidth of the kernel is chosen to be very large. In that case the kernel does not separate points, and its local property is lost. To overcome this, p-kernels keep the kernel value close to 1 until it reaches the lower tail of the distance distribution of the sample, and then let it decay appropriately. Specifically, the parameters p and \(\sigma \) are here set to match the \(\alpha \)-percentile, \(d_{\alpha }\), and the \((1{-}\alpha )\)-percentile, \(d_{1-\alpha }\), of the empirical distribution of distances, meaning

$$\begin{aligned} K(0,d_{\alpha })=1-\alpha , \quad K(0,d_{1-\alpha })=\alpha , \end{aligned}$$
(24)

which has the following solution

$$\begin{aligned} p=\frac{\log \left( \frac{\log {\alpha }}{\log {(1-\alpha )}}\right) }{\log \left( \frac{d_{1-\alpha }}{d_{\alpha }}\right) };\quad \sigma =\frac{d_{1-\alpha }}{(-\log {\alpha })^{1/p}}. \end{aligned}$$
(25)
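For reference, Eq. 25 can be transcribed directly as follows, fitting p and \(\sigma \) to the empirical pairwise distances of a sample; \(\alpha =0.05\) is the value used in Fig. 1, and the function names are ours.

```python
import numpy as np
from scipy.spatial.distance import pdist

def p_kernel_params(X, alpha=0.05):
    """Fit the p-kernel exponent p and scale sigma (Eq. 25) to an ensemble X of shape (N, Nx)."""
    d = pdist(X)                                    # pairwise Euclidean distances (i != j)
    d_lo, d_hi = np.quantile(d, [alpha, 1 - alpha])
    p = np.log(np.log(alpha) / np.log(1 - alpha)) / np.log(d_hi / d_lo)
    sigma = d_hi / (-np.log(alpha))**(1 / p)
    return p, sigma

def p_kernel(x, xp, p, sigma):
    """Evaluate the p-kernel of Eq. 23 for two vectors x and xp."""
    return np.exp(-(np.linalg.norm(x - xp) / sigma)**p)
```

When the distances concentrate (\(d_{1-\alpha } \approx d_{\alpha }\)), the fitted p becomes large, which is the behavior discussed in Sect. 4.2.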

In Fig. 1, the distribution of Euclidean distances for a standard Gaussian random vector of dimension 1000 is plotted, together with a set of Gaussian kernels with different bandwidths and the p-kernel fitted with \(\alpha =0.05\). The problem of using Gaussian kernels is evident: either the kernel function dies out before it reaches the sampled distances, or the bandwidth is so large that it does not separate the points, in which case uniform weights could in practice replace the kernel.

Fig. 1

p-kernel (red) and Gaussian kernels with different bandwidths together with the sample distribution of distances

4 Examples

4.1 Univariate Example

The first test is to reproduce the toy example in Pulido et al. (2019) in order to validate our derivative approximation (Eq. 22) and demonstrate the sampling properties of SVGD. In the simple toy problem of Pulido et al. (2019), the prior is given by a univariate Gaussian density with mean 0.5 and variance of 1. The measurement model is \(y=x^2+\epsilon \), where \(\epsilon \) is a zero-mean Gaussian random variable with a variance of 0.5. The actual measurement is set to \(y=3\). Although \(x^2\) is not in the Gaussian RKHS (Minh 2010), the reproducing property is still valid for the derivative in this particular case. That is, \(\int u^2 \partial _x K(x,u)\, d u=2x\).

Fig. 2

The exact and approximate derivatives of the model \(y=x^2\)

Figure 2 shows the derivative approximation of Eq. 18 and its normalized version (blue and red), reproduced from Pulido et al. (2019), together with the true derivative (black) and our proposed method (cyan) of Eq. 22, all evaluated on the prior sample. The results clearly show the inadequacy of the approximation (blue line) in Eq. 18, which in this case is not even monotonic. The normalized version (red line) is just a damped version and exhibits the same non-monotonic behavior. However, it can also be seen that the corrected Monte Carlo estimate is slightly biased, due to the deficiency of kernel density estimation (Silverman 1986) in the tails of the prior density.

Fig. 3

Posterior and posterior estimates in the toy model, using two different ensemble sizes, N

In Fig. 3 we plot density estimates, including SVGD using the exact (magenta) and the new approximate (cyan) derivative of Eq. 22. The standard EnKF solution (green) and the prior (red) are also included. The sample sizes are 100 and 1000. The error in SVGD is due to the sampling error and the bias in the kernel density estimation. The SVGD with the (new) approximate derivative has an additional bias due to the error in the derivative estimate, seen as the left and right skewness in Fig. 3. Note that this error is reduced with increased sample size.

4.2 Data Assimilation Tests with the Lorenz Systems

Pulido and van Leeuwen (2019) tested the SVGD in the sequential filtering setting (described below) of data assimilation (DA, Wikle and Berliner 2007), applying it in the analysis step by constituting the prior from the ensemble using Gaussian mixtures. They called this the mapping particle filter (MPF); however, the SVGD label is maintained here. Their results suggest that the SVGD filter achieves the accuracy of the (bootstrap) particle filter on the Lorenz-63 system, and that of the EnKF on the Lorenz-96 system.

Unfortunately, their results have limited relevance because, in their configuration, the Lorenz systems, used as coarse prototypes of atmospheric dynamics, are inundated with noise (see appendix A), so that the extended Kalman filter (EKF, Jazwinski 1970), or even 3D-Var, can achieve optimal accuracy at much lower cost. This issue is rectified here by benchmarking the SVGD filter (i.e. the MPF) with more standard settings.

The Lorenz-63 system (Lorenz 1963; Sakov et al. 2012) is given by the \(N_x=3\) coupled ordinary differential equations

$$\begin{aligned} \dot{x}&= \sigma (y - x), \end{aligned}$$
(26)
$$\begin{aligned} \dot{y}&= r x - y - x z, \end{aligned}$$
(27)
$$\begin{aligned} \dot{z}&= x y - b z, \end{aligned}$$
(28)

with parameter values \(r=28\), \(\sigma =10\), and \(b=8/3\). The true trajectory, x(t), is computed using the fourth-order Runge–Kutta scheme, with time steps of 0.01 time units, and no model noise. Observations of the entire state vector (i.e. \({\mathbf {H}}={\mathbf {I}}_{N_x}\)) are taken \(\varDelta t_\text {Obs} = 0.25\) time units apart with a noise variance of 2 (i.e. \({\mathbf {C}}_{\epsilon }=2 {\mathbf {I}}_{N_x}\)).

The Lorenz-96 system (Lorenz 1996; Ott et al. 2004) is given by the coupled ordinary differential equations

$$\begin{aligned} {\dot{x}}_m = \left( x_{m+1} - x_{m-2} \right) x_{m-1} - x_{m} + F, \end{aligned}$$
(29)

for \(m = 1,\ldots ,N_x\), with periodic boundary conditions, and a constant “force” of \(F=8\). These are integrated using the fourth-order Runge–Kutta scheme, with time steps of 0.05 time units, and no model noise. The state dimension is set to \(N_x=10\) (rather than the typical value of 40) so that the particle filter is practical. Observations of the entire state vector (i.e. \({\mathbf {H}}={\mathbf {I}}_{N_x}\)) are taken \(\varDelta t_\text {Obs} = 0.1\) time units apart with unit noise variance (i.e. \({\mathbf {C}}_{\epsilon }={\mathbf {I}}_{N_x}\)).
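For completeness, the experiment dynamics can be reproduced with the tendencies below and a standard fourth-order Runge–Kutta step; the parameter values match those stated above, while the function names are ours.

```python
import numpy as np

def lorenz63(x, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Tendency of the Lorenz-63 system, Eqs. 26-28."""
    return np.array([sigma * (x[1] - x[0]),
                     r * x[0] - x[1] - x[0] * x[2],
                     x[0] * x[1] - b * x[2]])

def lorenz96(x, F=8.0):
    """Tendency of the Lorenz-96 system, Eq. 29, with periodic boundary conditions."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(f, x, dt):
    """One fourth-order Runge-Kutta step of length dt (dt = 0.01 or 0.05 above)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
```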

The methods are assessed by their accuracy, as measured by the root-mean-square error (RMSE)

$$\begin{aligned} \text {RMSE}(t) = \sqrt{\frac{1}{N_x} \big \Vert x(t) - {\bar{x}}(t)\big \Vert ^2_2}, \end{aligned}$$
(30)

which is recorded following each analysis of the latest observation y(t). After the experiment, the instantaneous \(\text {RMSE}(t)\) values are averaged over all \(t>20\).

Comparison of the benchmark performance of SVGD is made to that of the EnKF (Hunt et al. 2004) and the bootstrap particle filter (with universal resampling, triggered if \(\Vert w \Vert ^{-2} < N/2\), where w is the vector of weights). In addition, baseline methods are included for context. Their analysis estimates, \(x_a\), are computed as follows: \( {\bar{x}}\) for Climatology, \( {\bar{x}} + {\mathbf {K}}(\mathbf {{\overline{C}}})\, [y - {\bar{x}}]\) for Optimal Interpolation, and \( x_f + {\mathbf {K}}(c {\mathbf {I}})\, [y - x_f]\) for 3D-Var. Here, \({\bar{x}}\) and \(\mathbf {{\overline{C}}}\) are the mean and covariance of the (invariant measure of the) system dynamics, \({\mathbf {K}}({\mathbf {C}}) = {\mathbf {C}}{\mathbf {H}}^\top ({\mathbf {H}}{\mathbf {C}}{\mathbf {H}}^\top + {\mathbf {C}}_{\epsilon })^{-1}\) is a gain matrix, \(x_f\) is the model forecast of the previous \(x_a\), and c is a scaling factor subject to tuning.
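These baseline analyses can be written compactly as below, for a general linear observation operator (here \({\mathbf {H}}={\mathbf {I}}\)); xbar and Cbar stand for the climatological mean and covariance, and all names are illustrative.

```python
import numpy as np

def gain(C, H, Ceps):
    """Gain matrix K(C) = C H^T (H C H^T + C_eps)^{-1}."""
    S = H @ C @ H.T + Ceps
    return np.linalg.solve(S.T, H @ C).T        # equals C H^T S^{-1} for symmetric C

def baseline_analyses(y, H, Ceps, xbar, Cbar, xf, c):
    """Analysis estimates x_a for Climatology, Optimal Interpolation, and 3D-Var."""
    clim = xbar
    oi = xbar + gain(Cbar, H, Ceps) @ (y - H @ xbar)
    var3d = xf + gain(c * np.eye(len(xf)), H, Ceps) @ (y - H @ xf)
    return clim, oi, var3d
```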

The RMSE averages of each method are computed for a range of ensemble sizes, N, and plotted as curves in Figs. 4 and 5. The plotted scores are the lowest obtained among a large number of tuning settings, selected for optimality at each point. For the PF, the tuning parameter is the bandwidth (scaling) of the regularizing post-resample jitter, whose covariance is computed from the weighted ensemble. For the EnKF (Hunt et al. 2004), the tuning parameters are (i) the post-analysis inflation factor and (ii) whether or not to apply random, covariance-preserving, post-analysis rotations (Sakov and Oke 2008). For the SVGD, the tuned parameters are (i) the number of iterations (maximum: 100), (ii) the step size for ADAGRAD, (iii) the bandwidth of the isotropic Gaussian kernels of the priors’ mixture, and (iv) the bandwidth of the isotropic Gaussian kernels defining the RKHS. In the case of p-kernels, the latter bandwidth is determined automatically by Eq. 25, along with the p parameter.

Fig. 4

Benchmarks of filter accuracy (RMSE) from synthetic DA experiments on the Lorenz-63 system, plotted as functions of the ensemble size, N. For the SVGD (label “Stein”) filter, one curve is plotted for each tuning value tested for the kernel bandwidth. Note that the vertical scale is compressed above 2

Fig. 5

Benchmarks of filter accuracy (RMSE) from synthetic DA experiments on the Lorenz-96 system, plotted as functions of the ensemble size, N. For the SVGD (label “Stein”) filter, one curve is plotted for each tuning value tested for the kernel bandwidth. Note that the vertical scale is compressed above 2

As can be seen in both Figs. 4 and 5, the abscissa (N-axis) can be roughly partitioned into three segments with regard to the ranking of the methods. For intermediate ensemble sizes, N, the EnKF obtains the lowest RMSE score among all of the methods. For the largest ensemble size, \(N=400\), the particle filter obtains the best score, while in the case of Lorenz-63, the SVGD filter achieves the same score as the EnKF (which it already obtained with \(N=8\)). Tests with \(N>400\) were not affordable due to the cost of SVGD. For small N, the SVGD filter obtains a lower RMSE than both the particle filter and the EnKF. However, as this score is not better than that of 3D-Var using a background matrix proportional to the identity, it does not hold much significance.

Furthermore, the p-kernel modification generally scores worse than SVGD with Gaussian kernels and tuned bandwidths. This is not very surprising, because the dimensionality is low, preempting their rationale. Moreover, an investigation reveals that, for small ensemble sizes, the p-kernel approach has a tendency to use very large values of p. For example, the corresponding curve in Fig. 4 spikes at \( N=3 \). The reason is that, sometimes, the three pairwise distances between the three ensemble members are almost equal, but far from zero, yielding, on average, \(p=22\). By contrast, this does not occur at \(N=2\), because then there is only one distance, nor does it occur when N is very large and the distances rarely coincide.

In summary, the SVGD filter of Pulido and van Leeuwen (2019), with or without our p-kernel modification, while functional, does not achieve the same accuracy as standard DA methods. It is important to remember that the SVGD filter involves many iterations, quite a few tuning parameters, and linear algebra with \(N_x \times N_x\) matrices, all of which add to its cost. It seems likely that the disappointing performance of SVGD stems from several issues. Firstly, it is limited by the precision of the prior approximations, which come from Gaussian mixtures and may suffer from the curse of dimensionality or from using simple, identity covariances. Secondly, the optimization process may not be very successful. In cursory experiments, other optimization routines were tested, as were more iterations and pre-conditioning, all of which could sometimes yield improved performance. A thorough examination is left for future work.

4.3 Reservoir History Matching Example

This section presents the results of SVGD with Gaussian kernels and p-kernels on a synthetic reservoir case. The reservoir is a two-dimensional, incompressible, two-phase (oil-water) flow model with 21 by 21 grid cells. There is a production well in each corner cell and an injection well in the center (see Fig. 6). The porosity is assumed to be known in each grid cell, with a value of 0.3. The forward model equations, as well as the adjoint model for this example, are fully described in de Moraes et al. (2018).

Fig. 6

The synthetic inverted five-spot model used in the numerical experiments. One of the 1000 permeability realizations is shown

The permeability field is considered unknown, with a Gaussian prior distribution. There are 1000 realizations available for this model (Jansen 2011), which specify the prior mean and covariance. Figure 7 shows four of the permeability realizations from the ensemble. This is the same setup as in Stordal and Elsheikh (2015).

Fig. 7

Four different permeability realizations from the ensemble of 1000 members

The observed data are generated from a synthetic truth (of the permeability field), chosen at random as one realization from the ensemble. Specifically, the observations are the water rates resulting from the simulation of 10 years of the truth, observed every six months, with a 5% white noise level. For more details on the reservoir settings, the reader is referred to Jansen (2011) and de Moraes et al. (2018).

The ADAGRAD method for optimization was implemented with a step size of 0.1 and an autocorrelation of 0.9, which are the standard settings for ADAGRAD. The maximum number of iterations was set to 50. The experiments were conducted with both Gaussian and p-kernels, and with ensemble sizes \(N=100\) and \(N=1000\).

The data match obtained with 100 members is shown in Fig. 8, and that obtained with 1000 members in Fig. 9.

Fig. 8

Data predictions from prior (green) and posterior (red) ensemble with 100 members. The blue circles represent the observed data. The y-axis represents water rate (\(q_w\)) in m\(^3\)/day and the x-axis represents time in days. Data match using Gaussian kernel (a) and p-kernel (b)

Fig. 9

Data predictions from prior (green) and posterior (red) ensemble with 1000 members. The blue circles represent the observed data. The y-axis represents water rate (\(q_w\)) in m\(^3\)/day and the x-axis represents time in days. Data match using Gaussian kernel (a) and p-kernel (b)

As can be observed from the results, history matching with both 100 members (Fig. 8) and 1000 members (Fig. 9) provides a reasonably good match for each individual well. Even though the prior ensemble mostly underestimates the water rate of wells PROD1 and PROD3 (top-left and bottom-right panels, respectively) and overestimates that of PROD2 and PROD4 (top-right and bottom-left panels, respectively), the posterior data predictions capture the observed data well. Additionally, better data predictions are obtained with 1000 members. Furthermore, underestimation of uncertainty is often a problem in history matching. It is therefore interesting to see that the posterior ensemble using the p-kernel retains a much larger uncertainty than the posterior ensemble using the Gaussian kernel. This is even clearer around the water break-through of wells PROD2 and PROD4 (top-right and bottom-left panels of Fig. 9b).

Fig. 10

Marginal PDFs of permeability. Each curve represents one ensemble member. Top: 100 members. Bottom: 1000 members. Left: Gaussian kernel. Right: p-kernel. The prior is shown in green, and the posterior in red. The blue curves represent the true (reference) marginal PDF

Next, the history matching results are further investigated by comparing the permeability marginal PDFs conditioned to production data, shown in Fig. 10. In general, the conditioned marginal PDFs for both the Gaussian kernel and the p-kernel provide a reasonably good representation of uncertainty. While subtle, one may note that the conditioned PDFs using the p-kernel (right-hand panels of Fig. 10) represent the uncertainty better than those using the Gaussian kernel, mainly in terms of the permeability range (horizontal axis) and the spread of the bottom-row PDFs, which use the 1000-member ensemble.

Comparisons with RML (Oliver et al. 1996), in order to further assess whether p-kernels better represent different high-dimensional geological settings, are left for future work.

Even though the results presented here indicate that SVGD is a promising alternative for reservoir uncertainty quantification, it is only feasible, even for this relatively small example, with an efficient gradient computation strategy. The same holds for other data assimilation and uncertainty quantification strategies, such as RML (Chen and Oliver 2012). Derivative-free formulations have been devised to overcome the adjoint implementation challenges (e.g. EnRML), and the performance of both formulations (derivative-based and derivative-free) in an SVGD setting should be investigated in reservoir history matching problems.

5 Conclusions

The Stein variational gradient descent (SVGD) algorithm was extended to p-kernels, and we discussed derivative-based and derivative-free implementations. With an adjoint code, the algorithm was applied to subsurface petroleum history matching for the first time. The results showed that it can obtain a reasonable data match with both Gaussian kernels and p-kernels, but that the posterior uncertainty was larger using the p-kernels, demonstrating the potential usefulness of p-kernels in higher dimensions.

The SVGD was also tested on two small chaotic dynamical systems. For these models the measurement operator is linear, so the sensitivity matrix is trivial. However, as the prior distribution is unknown, it has to be approximated using kernel density estimation. The results showed that the SVGD is outperformed by the EnKF for intermediate ensemble sizes and by the particle filter for large ensemble sizes, tempering the impression given by Pulido and van Leeuwen (2019). This discrepancy is most likely a result of using a less noisy system (the dynamics in Pulido and van Leeuwen (2019) were entirely driven by noise) and of the fact that the posterior sample from SVGD cannot overcome the deficiencies of the density estimate of the prior. Further analysis and improvements are left for future work.

The reservoir example indicated more potential for the SVGD algorithm. In particular, the uncertainty quantification of the posterior seemed to improve when using the p-kernel instead of the Gaussian kernel. A more detailed investigation of the uncertainty quantification properties of SVGD, in particular when compared to RML or ensemble smoothers, is left for future work. In addition, the ensemble approximation of the derivative for reservoir models will also be investigated in the future, in combination with localization.