1 Introduction

Statistical inference for logistic regression and classification problems is well studied from both the parameter optimisation and the Bayesian inference perspective [5]. While many classical optimisation and Markov chain Monte Carlo (MCMC) methods are applicable, efficient inference in the context of semi-parametric models still poses computational challenges. In this paper, we address this challenge by investigating the extension of coupling-of-measures and ensemble transform methodologies [29] from the squared error loss function to the cross entropy loss function typically used for logistic regression and classification.

More specifically, previous work on ensemble methods for Bayesian inference, such as the ensemble Kalman filter (EnKF) [8] and the feedback particle filter (FPF) [35, 38, 39], has almost exclusively focused on the squared error loss function of the form

$$\begin{aligned} \varPsi _{\mathrm{data}}(\theta ) = \frac{1}{2} (g(\theta ) - t)^{\mathrm{T}} \varGamma ^{-1} (g(\theta ) -t), \end{aligned}$$
(1)

with \(g:{\mathbb {R}}^D \rightarrow {\mathbb {R}}^N\) a forward map, \(t\in {\mathbb {R}}^N\) the data, \(\varGamma \in {\mathbb {R}}^{N\times N}\) the measurement error covariance matrix, and \(\theta \in {\mathbb {R}}^D\) the parameters to be estimated. Notable exceptions include the application of ensemble Kalman inversion (EKI) [17] and the modified EnKF formulation of [13] to the training of neural networks with a cross entropy loss function. While these methods seek to minimise a regularised cross entropy loss function, the more recent work [14] has also investigated ensemble-based sampling methods for Bayesian inference in the context of logistic regression and classification. This work relies on the time-stepping of appropriate stochastic differential equations (SDEs) and is in line with several other ensemble-based sampling methods such as the ensemble preconditioned MCMC methods of [19], the ensemble Kalman sampler (EKS) [9], and affine invariant Langevin dynamics (ALDI) [10]. Contrary to such invariance-of-measures based SDE methodologies, the approach taken in this paper is instead founded on the homotopy approach to Bayesian inference, as first formulated in a time-continuous framework in [6, 26], and which is close in spirit to the iterative application of the EnKF [31] and parameter estimation methods based on sequential Monte Carlo (SMC) methods [4].

In addition to expanding the homotopy-based approach to logistic regression, we address two further important concepts, namely affine-invariance of the proposed methods [10, 11, 19] and sub-sampling of data points, as widely used in stochastic gradient descent [20]. Both concepts are used to improve the computational efficiency of optimisation and sampling methods. Take, for example, a two-dimensional Gaussian random variable with mean zero and covariance matrix

$$\begin{aligned} \varSigma = \left( \begin{array}{cc} \epsilon ^2 &{}\quad 0 \\ 0 &{}\quad 1 \end{array} \right) . \end{aligned}$$

A random walk Metropolis–Hastings algorithm will sample inefficiently whenever \(\epsilon ^2\) is vastly different from one. An affine-invariant modification, on the other hand, will sample this problem as efficiently as if \(\epsilon = 1\) [11]. Sub-sampling replaces the exact gradient of a cost functional by a cheaper-to-evaluate stochastic approximation, which agrees with the exact gradient in expectation.

We also discuss the possibility of derivative-free implementations [8,9,10, 17], localisation [8, 29] via dropouts [32], and efficient linearly implicit time-stepping methods [2].

Our main contributions with regard to Bayesian homotopy methods for logistic regression are:

  • an extension of affine-invariant ensemble transform approaches, such as the EnKF, to logistic regression,

  • an affine-invariant generalisation of the FPF and its application to logistic regression,

  • an extension of data sub-sampling (mini-batches) to Bayesian homotopy methods.

In a further step, we combine these homotopy approaches with SDE-based sampling methods in order to derive

  • a derivative-free and affine-invariant SDE-based sampling method for logistic regression.

We demonstrate the appropriateness of the proposed methods by means of a set of numerical experiments. Extensions to nonlinear and multi-class logistic regression [5] are also discussed. We also briefly discuss the application to sigmoidal Cox processes [1].

The layout of this paper is as follows. The required mathematical background material on logistic regression is collected in Sect. 2. The homotopy approach to Bayesian inference is summarised in Sect. 3. There we also present an affine-invariant formulation of the homotopy approach and discuss and analyse data sub-sampling in the spirit of [20]. Section 4 develops three different algorithmic approaches for the implementation of homotopy-based Bayesian inference for logistic regression. More specifically, we propose an affine-invariant modification of the FPF and two extensions of the EnKF to logistic regression. We also discuss robust numerical implementations combining dropouts [32] with localisation [8, 29] and linearly implicit time-stepping methods [2]. Section 5 combines SDE-based sampling methods with a homotopy-based drift term in order to derive a gradient-free and affine-invariant algorithm for Bayesian inference. While the proposed method is only exact for Gaussian measures, it can be used for faster equilibration and approximate inference provided the posterior distribution is close to Gaussian. Numerical results are presented and evaluated in Sect. 6; they provide a proof-of-concept, while a more detailed exploration is left to follow-up work. Several possible extensions of the proposed methods, namely multi-class classification, nonlinear logistic regression, and sigmoidal Cox processes, are discussed in Sect. 7, followed by a summary statement in Sect. 8.

2 Mathematical Problem Formulation

This paper is motivated by the classical logistic regression problem arising from classification into \(L>1\) classes, denoted by \(C_l\), over an input space \(x \in {\mathbb {R}}^{J}\) [5]. In order to simplify the exposition, we start with the binary classification problem, that is, \(L=2\) and focus on the linear case. The extension to nonlinear and multi-class logistic regression is discussed in Sect. 7.

The posterior probability of class \(C_1\) is modelled by a logistic sigmoid

$$\begin{aligned} \sigma (a) = \frac{1}{1+\exp (-a)} \end{aligned}$$

acting on a linear combination of x-dependent and vector-valued features \(\phi _x \in {\mathbb {R}}^D\) so that

$$\begin{aligned} \pi (C_1|\phi _x) = \sigma \left( \theta ^{\mathrm{T}} \phi _x \right) . \end{aligned}$$
(2)

The model has D adjustable parameters \(\theta \in {\mathbb {R}}^{D}\). The probability of the complementary class \(C_2\) is given by \(\pi (C_2|\phi _x) = 1-\pi (C_1|\phi _x)\). Given a data set \(\{(x_n,t_n)\}_{n=1}^N\) with labels \(t_n \in \{0,1\}\), the negative log-likelihood function is given by the cross-entropy function

$$\begin{aligned} \varPsi _{\mathrm{data}} (\theta ) = - \sum _{n=1}^N \left\{ t_n \ln y_n + (1-t_n)\ln (1-y_n)\right\} , \end{aligned}$$
(3)

where \(y_n := y_{x_n}(\theta )\) with

$$\begin{aligned} y_x(\theta ) := \sigma \left( \theta ^{\mathrm{T}} \phi _x \right) . \end{aligned}$$
(4)

Here \(t_n = 1\) for samples \(x_n\) which are assigned to class \(C_1\) and \(t_n = 0\) for samples from the complementary class \(C_2\).

We introduce the further shorthand

$$\begin{aligned} \phi _n = \phi _{x_n} \end{aligned}$$

and, using

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}a} \sigma (a) = \sigma (a) (1-\sigma (a)), \end{aligned}$$

obtain the gradient of the cross-entropy function

$$\begin{aligned} \nabla _\theta \varPsi _{\mathrm{data}} (\theta ) = \sum _{n=1}^N (y_n-t_n) \phi _n . \end{aligned}$$
(5)

We rewrite this gradient more compactly as

$$\begin{aligned} \nabla _\theta \varPsi _{\mathrm{data}}(\theta ) = \varPhi \,(y(\theta )-t) \end{aligned}$$

using

$$\begin{aligned} y(\theta ) = (y_1,\ldots ,y_N)^{\mathrm{T}} \in {\mathbb {R}}^N, \qquad \qquad t = (t_1,\ldots ,t_N)^{\mathrm{T}} \in {\mathbb {R}}^N, \end{aligned}$$

and

$$\begin{aligned} \varPhi = \left( \phi _1,\ldots ,\phi _N \right) \in {\mathbb {R}}^{D\times N}. \end{aligned}$$

The term \(y(\theta )-t\) is often called the innovation in the Kalman filter literature. The Hessian matrix of second-order derivatives is given by

$$\begin{aligned} D^2_\theta \varPsi _{\mathrm{data}}(\theta ) = \sum _{n=1}^N \phi _n y_n (1-y_n) \phi _n^{\mathrm{T}} = \varPhi R(\theta ) \varPhi ^{\mathrm{T}}, \end{aligned}$$

where \(R(\theta ) \in {\mathbb {R}}^{N\times N}\) denotes the diagonal matrix with entries \(r_{nn} = y_n (1-y_n)\).
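
For illustration, a minimal NumPy sketch of the loss (3), its gradient (5), and the Hessian \(\varPhi R(\theta )\varPhi ^{\mathrm{T}}\) could read as follows; the function names and the random test data are ours and not part of the method description.

```python
import numpy as np

def sigmoid(a):
    # logistic sigmoid sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_terms(theta, Phi, t):
    """Negative log-likelihood (3), gradient (5), and Hessian for
    linear logistic regression.

    theta : (D,) parameter vector
    Phi   : (D, N) feature matrix with columns phi_n
    t     : (N,) labels in {0, 1}
    """
    y = sigmoid(Phi.T @ theta)                     # y_n = sigma(theta^T phi_n)
    loss = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
    grad = Phi @ (y - t)                           # Phi (y(theta) - t)
    R = np.diag(y * (1 - y))                       # diagonal matrix R(theta)
    hess = Phi @ R @ Phi.T                         # Phi R(theta) Phi^T
    return loss, grad, hess

# small usage example with random data
rng = np.random.default_rng(0)
D, N = 3, 20
Phi = rng.normal(size=(D, N))
t = rng.integers(0, 2, size=N).astype(float)
loss, grad, hess = cross_entropy_terms(np.zeros(D), Phi, t)
```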

Let us denote a minimiser of \(\varPsi _{\mathrm{data}}(\theta )\) by \(\theta _{\mathrm{MLE}}\). We note that the maximum likelihood estimators (MLEs) are not uniquely determined in the over-parametrised setting, that is, when \(D \gg N\). For that reason and in order to avoid overfitting when \(D \approx N\), we introduce a Gaussian prior probability density function (PDF) over the parameters via

$$\begin{aligned} \pi _{\mathrm{prior}}(\theta ) \propto \exp (-\varPsi _{\mathrm{prior}}(\theta )) \end{aligned}$$

with

$$\begin{aligned} \varPsi _{\mathrm{prior}}(\theta ) := \frac{1}{2} (\theta -m_{\mathrm{prior}})^{\mathrm{T}} \varSigma _{\mathrm{prior}}^{-1}(\theta -m_{\mathrm{prior}}), \end{aligned}$$

that is, the posterior parameter distribution \(\pi _{\mathrm{post}}\) is given by

$$\begin{aligned} \pi _{\mathrm{post}}(\theta ) \propto \exp (-\varPsi _{\mathrm{post}}(\theta )), \quad \varPsi _{\mathrm{post}}(\theta ):= \varPsi _{\mathrm{data}}(\theta ) + \varPsi _{\mathrm{prior}}(\theta ). \end{aligned}$$
(6)

Here the prior mean \(m_{\mathrm{prior}} \in {\mathbb {R}}^D\) and covariance matrix \(\varSigma _{\mathrm{prior}} \in {\mathbb {R}}^{D\times D}\) need to be chosen appropriately.

It is known that \(\varPsi _{\mathrm{post}}\) is a strongly convex function since \(D^2_\theta \varPsi _{\mathrm{data}}(\theta ) \ge 0\) and, hence, the MAP estimator, that is,

$$\begin{aligned} \theta _{\mathrm{MAP}} = \arg \min _{\theta \in {\mathbb {R}}^{\mathrm{D}}} \varPsi _{\mathrm{post}}(\theta ), \end{aligned}$$

is uniquely defined. Furthermore, assuming the availability of sufficiently many data points, that is \(N\gg 1\), the posterior is well approximated by a Gaussian with mean \(\theta _{\mathrm{MAP}}\) and covariance matrix

$$\begin{aligned} \varSigma _{\mathrm{MAP}} = \left( D^2_\theta \varPsi _{\mathrm{post}}(\theta _{\mathrm{MAP}})\right) ^{-1}, \end{aligned}$$

where

$$\begin{aligned} D^2_\theta \varPsi _{\mathrm{post}}(\theta ) = \varPhi R(\theta ) \varPhi ^{\mathrm{T}} + \varSigma _{\mathrm{prior}}^{-1} \in {\mathbb {R}}^{D \times D} \end{aligned}$$

denotes the Hessian matrix of second-order derivatives. This is an implication of the Bernstein–von Mises theorem [33].
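
A small sketch of the resulting Laplace approximation, computing \(\theta _{\mathrm{MAP}}\) by a plain Newton iteration and \(\varSigma _{\mathrm{MAP}}\) from the Hessian above, is given below; the fixed iteration number is an arbitrary illustrative choice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_approximation(Phi, t, m_prior, Sigma_prior, iters=50):
    """Newton iteration for theta_MAP of (6) and the Laplace covariance
    Sigma_MAP = (Phi R Phi^T + Sigma_prior^{-1})^{-1}."""
    prior_prec = np.linalg.inv(Sigma_prior)
    theta = m_prior.copy()
    for _ in range(iters):
        y = sigmoid(Phi.T @ theta)
        grad = Phi @ (y - t) + prior_prec @ (theta - m_prior)
        hess = Phi @ np.diag(y * (1 - y)) @ Phi.T + prior_prec
        theta = theta - np.linalg.solve(hess, grad)   # Newton update
    y = sigmoid(Phi.T @ theta)
    Sigma_map = np.linalg.inv(Phi @ np.diag(y * (1 - y)) @ Phi.T + prior_prec)
    return theta, Sigma_map
```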

3 The Homotopy Approach to Bayesian Inference

While there are many Monte Carlo methods available for sampling from the posterior \(\pi _{\mathrm{post}}\), we focus in this paper on ensemble transform methods based on the homotopy approach [6, 26]

$$\begin{aligned} \pi _\tau (\theta ) \propto \exp (-\tau \varPsi _{\mathrm{data}}(\theta )) \,\pi _{\mathrm{prior}}(\theta ) \end{aligned}$$
(7)

with \(\tau \in [0,1]\), that is, \(\pi _0 = \pi _{\mathrm{prior}}\) and \(\pi _{\mathrm{post}} = \pi _1\). The PDFs \(\pi _\tau \) satisfy the evolution equation

$$\begin{aligned} \partial _\tau \pi _\tau = - (\varPsi _{\mathrm{data}} - \pi _\tau \left[ \varPsi _{\mathrm{data}}\right] )\,\pi _\tau , \end{aligned}$$
(8)

where \(\pi _\tau [f]\) denotes the expectation value of a function \(f:{\mathbb {R}}^D \rightarrow {\mathbb {R}}\) under the PDF \(\pi _\tau \).

In order to derive the associated evolution equations for random variables \(\theta _\tau \sim \pi _\tau \), we introduce potentials \(V_\tau :{\mathbb {R}}^D \rightarrow {\mathbb {R}}\), which are defined by the elliptic partial differential equation (PDE)

$$\begin{aligned} \nabla _\theta \cdot (\pi _\tau \varSigma _\tau \nabla _\theta V_\tau ) = -(\varPsi _{\mathrm{data}} - \pi _\tau \left[ \varPsi _{\mathrm{data}}\right] ) \,\pi _\tau , \end{aligned}$$
(9)

where \(\varSigma _\tau \) denotes the covariance matrix of \(\pi _\tau \), that is,

$$\begin{aligned} \varSigma _\tau = \pi _\tau \left[ (\theta -\pi _\tau [\theta ])(\theta -\pi _\tau [\theta ])^{\mathrm{T}} \right] . \end{aligned}$$

Hence the evolution equations are given by

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \theta _\tau = -\varSigma _\tau \nabla _\theta V_\tau (\theta _\tau ), \qquad \theta _0 \sim \pi _{\mathrm{prior}} , \end{aligned}$$
(10)

such that

$$\begin{aligned} \theta _1 \sim \pi _{\mathrm{post}}. \end{aligned}$$
(11)

See [26] for further details.

Remark 1

If different models \({\mathcal {M}}_i\), \(i=1,\ldots ,I_M\), need to be compared via their Bayes factors \(\mathbb {P}[{\mathcal {M}}_i]\) [16], then the homotopy approach (8) can also be used and gives rise to the evolution equation

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \mathbb {P}_\tau = - \mathbb {P}_\tau \, \pi _\tau \left[ \varPsi _{\mathrm{data}}\right] , \end{aligned}$$

\(\tau \in [0,1]\), which is to be solved for each model \({\mathcal {M}}_i\) with \(\mathbb {P}_0\) its prior probability, \(\pi _\tau \) its transformed parameter PDFs, as defined by (8), and \(\varPsi _{\mathrm{data}}\) the model’s negative log-likelihood function. It then holds that \(\mathbb {P}[{\mathcal {M}}_i] = \mathbb {P}_1\).

Remark 2

It is, of course, feasible to use a stopping time different from \(\tau = 1\) in the homotopy approach (10). This has already been hinted at in the context of optimisation and Tikhonov regularisation in [26] and has subsequently been fully explored by the EKI methodology. See [15, 17] for a detailed description in the context of machine learning as well as for an extensive literature survey.

3.1 Affine-Invariance

Note that any vector field \(F_\tau (\theta )\), which satisfies

$$\begin{aligned} \nabla _\theta \cdot (\pi _\tau F_\tau ) = - (\varPsi _{\mathrm{data}} - \pi _\tau \left[ \varPsi _{\mathrm{data}}\right] )\,\pi _\tau , \end{aligned}$$

will lead to an associated evolution equation

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \theta _\tau = -F_\tau (\theta _\tau ), \qquad \theta _0 \sim \pi _{\mathrm{prior}} \end{aligned}$$
(12)

such that (11) holds. The specific choice \(F_\tau = \varSigma _\tau \nabla _\theta V_\tau \) in (9) is motivated by the concept of affine-invariance [10, 11], which we explain next.

Let

$$\begin{aligned} \theta = A {\overline{\theta }} + b \end{aligned}$$
(13)

denote an invertible affine transformation between two variables \(\theta \in {\mathbb {R}}^D\) and \({\overline{\theta }}\in {\mathbb {R}}^D\). Furthermore, the Bayesian inference problem for the variable \(\theta \) is defined by \(\varPsi _{\mathrm{data}}\) and \(\varPsi _{\mathrm{prior}}\) and the corresponding problem for the transformed variable \({\overline{\theta }}\) by

$$\begin{aligned} {\overline{\varPsi }}_{\mathrm{data}}({\overline{\theta }}) := \varPsi _{\mathrm{data}}(A{\overline{\theta }} + b), \qquad {\overline{\varPsi }}_{\mathrm{prior}}({\overline{\theta }}) := \varPsi _{\mathrm{prior}}(A{\overline{\theta }} + b). \end{aligned}$$

Hence the PDFs \({\overline{\pi }}_\tau \) for \({\overline{\theta }}\) satisfy

$$\begin{aligned} {\overline{\pi }}_\tau ({\overline{\theta }}) = \pi _\tau (A{\overline{\theta }} + b) \,|A|,\qquad \tau \in [0,1] . \end{aligned}$$

Definition 1

A homotopy formulation (12) is called affine-invariant if the associated vector fields \(F_\tau \) and \({\overline{F}}_\tau \) satisfy

$$\begin{aligned} F_\tau (A{\overline{\theta }} + b) = A {\overline{F}}_\tau ({\overline{\theta }}) \end{aligned}$$
(14)

for any suitable choice of \(A \in {\mathbb {R}}^{D\times D}\) and \(b \in {\mathbb {R}}^D\), which implies \(\theta _\tau = A {\overline{\theta }}_\tau + b\) for all \(\tau >0\) if it holds at \(\tau =0\).

Lemma 1

The choice

$$\begin{aligned} F_\tau (\theta ) = \varSigma _\tau \nabla _\theta V_\tau (\theta ) \end{aligned}$$
(15)

leads to an affine-invariant homotopy formulation and the transformed vector field is given by

$$\begin{aligned} {\overline{F}}_\tau ({\overline{\theta }}) = {\overline{\varSigma }}_\tau \nabla _{{\overline{\theta }}} {\overline{V}}_\tau ({\overline{\theta }}) \end{aligned}$$

with the potential \({\overline{V}}_\tau \) satisfying the elliptic PDE

$$\begin{aligned} \nabla _{{\overline{\theta }}} \cdot ({\overline{\pi }}_\tau {\overline{\varSigma }}_\tau \nabla _{{\overline{\theta }}} {\overline{V}}_\tau ) = -{\overline{\pi }}_\tau ( {\overline{\varPsi }}_{\mathrm{data}} - {\overline{\pi }}_\tau \left[ {\overline{\varPsi }}_{\mathrm{data}}\right] ). \end{aligned}$$
(16)

Proof

Since the covariance matrices satisfy

$$\begin{aligned} \varSigma _\tau = A {\overline{\varSigma }}_\tau A^{\mathrm{T}}, \end{aligned}$$

the PDE (9) transforms into the PDE (16) under the affine transformation (13). Hence we have

$$\begin{aligned} {\overline{V}}_\tau ({\overline{\theta }}) = V_\tau (A{\overline{\theta }} + b) \end{aligned}$$

and obtain

$$\begin{aligned} \varSigma _\tau \nabla _\theta V_\tau (\theta )&= A {\overline{\varSigma }}_\tau A^{\mathrm{T}} \nabla _\theta V_\tau (A{\overline{\theta }}+b) \\&= A {\overline{\varSigma }}_\tau \nabla _{{\overline{\theta }}} V_\tau (A{\overline{\theta }}+b) \\&= A {\overline{\varSigma }}_\tau \nabla _{{\overline{\theta }}} {\overline{V}}_\tau ({\overline{\theta }}) , \end{aligned}$$

which verifies (14). \(\square \)

We note that the affine invariance of (10) is maintained under the forward Euler time discretisation

$$\begin{aligned} \theta _{\tau _{k+1}} = \theta _{\tau _k} - \varDelta \tau \varSigma _{\tau _k} \nabla _\theta V_{\tau _k}( \theta _{\tau _k}) \end{aligned}$$

with step-size \(\varDelta \tau >0\) and \(\tau _k = k\varDelta \tau \), \(k = 0,\ldots ,K-1\) such that \(K\varDelta \tau = 1\). This immediately follows from property (14).

Remark 3

The choice of the vector field \(F_\tau (\theta )\) in (15) is not unique. It is, for example, possible to add any drift term \(\varSigma _\tau \nabla _\theta U_\tau (\theta )\) with the potential \(U_\tau \) chosen such that

$$\begin{aligned} \nabla _\theta \cdot (\pi _\tau \varSigma _\tau \nabla _\theta U_\tau ) = 0. \end{aligned}$$
(17)

A possible choice is

$$\begin{aligned} U_\tau (\theta ) = \log \pi _\tau (\theta ) + \tau \varPsi _{\mathrm{data}}(\theta ) - \log \pi _{\mathrm{prior}}(\theta ) \end{aligned}$$

and the associated operator on the left hand side of (17) corresponds to the nonlinear Fokker–Planck equation arising from the EKS/ALDI mean-field evolution equations [9, 10] for sampling from the PDF \(\pi _\tau \). Because of (7), \(U_\tau (\theta ) = \text{ const. }\) and (17) follows trivially. Upon formally replacing the drift term \(\varSigma _\tau \nabla _\theta U_\tau (\theta )\) by its EKS/ALDI stochastic mean-field representation, (10) turns into the affine-invariant stochastic evolution equation

$$\begin{aligned} {\mathrm{d}} \theta _\tau = -\varSigma _\tau \nabla _\theta \left\{ V_\tau (\theta _\tau ) + \tau \varPsi _{\mathrm{data}}(\theta _\tau ) - \log \pi _{\mathrm{prior}}(\theta _\tau )\right\} {\mathrm{d}}\tau + \varSigma ^{1/2}_\tau {\mathrm{d}}W_\tau , \end{aligned}$$
(18)

where \(W_\tau \) denotes standard D-dimensional Brownian motion. The associated marginal PDF \(\pi _\tau \) still satisfies (8) provided \(\pi _0 = \pi _{\mathrm{prior}}\) at initial time \(\tau =0\).

3.2 Data Sub-sampling

If the data sets are large, then it makes sense to randomly sub-sample the data and to only use these subsets in the computation of the gradient (5). More precisely, let \(\varDelta \) denote a subset of \(\{1,\ldots ,N\}\) of cardinality \(N' \le N\), selected randomly without replacement. Given such a mini-batch \(\varDelta \), the full gradient (5) is replaced by its approximation

$$\begin{aligned} \nabla _\theta \varPsi _{\mathrm{data}} (\theta ) \approx \frac{N}{N'} \sum _{n\in \varDelta }(y_n-t_n)\phi _n . \end{aligned}$$

More abstractly, data sub-sampling in terms of mini-batches gives rise to random negative log-likelihood functions \(\varPsi _{\mathrm{data}}^\gamma \), where \(\gamma \) characterises the chosen mini-batch \(\varDelta ^\gamma \) and is the realisation of an appropriate random variable \(\varGamma \) such that

$$\begin{aligned} \mathbb {E}_\varGamma [\varPsi _{\mathrm{data}}^\gamma (\theta )] = \varPsi _{\mathrm{data}}(\theta ) \end{aligned}$$
(19)

for all parameter values \(\theta \in \mathbb {R}^D\). See [20] for a detailed investigation of data sub-sampling in the context of stochastic gradient descent and related methods. If \(\varPsi _{\mathrm{data}}^\gamma \) is now being used in (9) instead of \(\varPsi _{\mathrm{data}}\), then we denote the resulting potential by \(V_\tau ^\gamma \), which is also unbiased, that is,

$$\begin{aligned} \mathbb {E}_\varGamma [\nabla _\theta V_\tau ^\gamma (\theta )] = \nabla _\theta V_\tau (\theta ) \end{aligned}$$

due to (19). Let us now time-step the implied mean-field ODEs (10) by the forward Euler method, that is,

$$\begin{aligned} \theta _{\tau _{k+1}} = \theta _{\tau _k} - \varDelta \tau \varSigma _{\tau _k} \nabla _\theta V_{\tau _k}^{\gamma _k}( \theta _{\tau _k}) , \end{aligned}$$
(20)

where the \(\gamma _k\)’s are independent realisations of the random variable \(\varGamma \). A formal application of the modified equation analysis of [20] leads to the stochastic differential equation (SDE)

$$\begin{aligned} {\mathrm{d}}\theta _\tau = -\varSigma _\tau \left\{ \nabla _\theta V_\tau (\theta _\tau )\,{\mathrm{d}}\tau - (\varDelta \tau \varOmega _\tau (\theta _\tau ))^{1/2} {\mathrm{d}}W_\tau \right\} , \end{aligned}$$
(21)

where \(W_\tau \) denotes standard D-dimensional Brownian motion and

$$\begin{aligned} \varOmega _\tau (\theta ) := \text{ cov}_\varGamma \left[ \nabla _\theta V_\tau ^{\gamma }(\theta ) \right] . \end{aligned}$$

Here \(\text{ cov}_\varGamma [f^\gamma ]\) denotes the covariance matrix of a vector-valued random function \(f^\gamma \in {\mathbb {R}}^D\) with respect to the random variable \(\varGamma \). We conclude from (21) that the approximation error inflicted on the posterior distribution at \(\tau = 1\) can be made arbitrarily small by reducing the time-step \(\varDelta \tau \) of the forward Euler method (20). While a rigorous mathematical analysis is left to future work, we discuss an application to Cox point processes in Sect. 7.3 below and analyse a Gaussian approximation in the Appendix in more detail.
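
A minimal sketch of the sub-sampled gradient used in (20), drawing the mini-batch \(\varDelta \) uniformly without replacement so that the unbiasedness condition (19) holds, could look as follows; the helper name is ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def minibatch_gradient(theta, Phi, t, batch_size, rng):
    """Unbiased mini-batch approximation of the gradient (5):
    (N / N') * sum_{n in Delta} (y_n - t_n) phi_n,
    with Delta drawn uniformly without replacement."""
    N = t.size
    idx = rng.choice(N, size=batch_size, replace=False)
    y_batch = sigmoid(Phi[:, idx].T @ theta)
    return (N / batch_size) * (Phi[:, idx] @ (y_batch - t[idx]))
```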

4 Ensemble Transform Algorithms

We now discuss several Monte Carlo implementations of the proposed affine-invariant homotopy approach to Bayesian inference for logistic regression. All methods start with a set of M samples \(\theta _0^{(i)}\), \(i=1,\ldots ,M\), from the prior distribution \(\pi _0 = \pi _{\mathrm{prior}}\). The three methods considered here differ in the evolution equations for the ensemble \(\{\theta _\tau ^{(i)}\}\). We introduce the empirical mean

$$\begin{aligned} {\hat{m}}_\tau = \frac{1}{M} \sum _{i=1}^M \theta _\tau ^{(i)} \end{aligned}$$

and the empirical covariance matrix

$$\begin{aligned} {{\hat{\varSigma }}}_\tau = \frac{1}{M-1} \sum _{i=1}^M (\theta _\tau ^{(i)}-{\hat{m}}_\tau )(\theta _\tau ^{(i)}-{\hat{m}}_\tau )^{\mathrm{T}} \end{aligned}$$

for \(\tau \ge 0\). We also introduce the matrix \(\varTheta _\tau \in {\mathbb {R}}^{D \times M}\) of ensemble deviations

$$\begin{aligned} \varTheta _\tau = \left( \theta _\tau ^{(1)}-{\hat{m}}_\tau , \theta _\tau ^{(2)}-{\hat{m}}_\tau , \ldots , \theta _\tau ^{(M)}-{\hat{m}}_\tau \right) , \end{aligned}$$
(22)

which leads to the compact representation

$$\begin{aligned} {{\hat{\varSigma }}}_{\tau } = \frac{1}{M-1} \varTheta _\tau \varTheta _\tau ^{\mathrm{T}} \end{aligned}$$

of the empirical covariance matrix. The empirical measure associated with the ensemble \(\{\theta _\tau ^{(i)}\}\) at time \(\tau \) is denoted by \({\hat{\pi }}_\tau \) implying, for example,

$$\begin{aligned} {\hat{m}}_\tau = {\hat{\pi }}_\tau [\theta ]. \end{aligned}$$

More generally,

$$\begin{aligned} {\hat{\pi }}_\tau [f] = \frac{1}{M} \sum _{i=1}^M f(\theta _\tau ^{(i)}) \end{aligned}$$

for an observable \(f(\theta )\). We describe an affine-invariant extension of the FPF next. The key building block is a diffusion map approximation to the PDE (9).
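
The empirical quantities introduced above can be computed directly from the ensemble matrix; the following short sketch (our own notation) returns \({\hat{m}}_\tau \), \(\varTheta _\tau \), and \({{\hat{\varSigma }}}_\tau \).

```python
import numpy as np

def ensemble_statistics(theta_ens):
    """Empirical mean, deviations (22), and covariance of an ensemble.

    theta_ens : (D, M) array whose columns are the ensemble members."""
    M = theta_ens.shape[1]
    m_hat = theta_ens.mean(axis=1, keepdims=True)    # empirical mean
    Theta = theta_ens - m_hat                        # ensemble deviations (22)
    Sigma_hat = Theta @ Theta.T / (M - 1)            # empirical covariance
    return m_hat.ravel(), Theta, Sigma_hat
```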

4.1 Diffusion Maps and the FPF

The left-hand side of the elliptic PDE (9) gives rise to the elliptic operator

$$\begin{aligned} \varDelta _\pi \phi := \frac{1}{\pi } \nabla _\theta \cdot (\pi \varSigma \nabla _\theta \phi ) \end{aligned}$$

with covariance matrix

$$\begin{aligned} \varSigma = \pi \left[ (\theta -\pi [\theta ])(\theta -\pi [\theta ])^{\mathrm{T}} \right] \end{aligned}$$

and (9) can be rewritten as

$$\begin{aligned} -\varDelta _{\pi _\tau } V_\tau = \varPsi _{\mathrm{data}} - \pi _\tau \left[ \varPsi _{\mathrm{data}}\right] . \end{aligned}$$

Following the FPF methodology, as carefully described in [34], we regularise the elliptic PDE problem (9) through the fixed-point equation

$$\begin{aligned} {{\tilde{V}}}_\tau = P_\epsilon {{\tilde{V}}}_\tau + \epsilon (\varPsi _{\mathrm{data}} - \pi _\tau \left[ \varPsi _{\mathrm{data}}\right] ) \end{aligned}$$

with regularisation parameter \(\epsilon >0\) and semigroup \(P_\epsilon = e^{\epsilon \varDelta _{\pi _\tau }}\). One then replaces \(V_\tau \) by \({{\tilde{V}}}_\tau \) in the evolution Eq. (10).

We note that \(P_\epsilon \) is the semi-group associated with the mean-field SDE

$$\begin{aligned} {\mathrm{d}}S_\epsilon = \varSigma _\tau \nabla _\theta \log \pi _\tau (S_\epsilon ) \,{\mathrm{d}}\epsilon + \sqrt{2} \varSigma _\tau ^{1/2} {\mathrm{d}}W_\epsilon , \end{aligned}$$
(23)

where \(W_\epsilon \), \(\epsilon \ge 0\), is standard Brownian motion in \({\mathbb {R}}^D\). The SDE (23) is affine invariant and is closely related to the EKS/ALDI mean-field equations [9, 10].

The diffusion map approximation \(T_\epsilon \) of \(P_\epsilon \) is defined as follows:

$$\begin{aligned} T_\epsilon f(\theta ) = \frac{1}{n_\epsilon (\theta )} \int _{{\mathbb {R}}^D} k_\epsilon (\theta ,\theta ')f(\theta ') \pi _\tau (\theta ') {\mathrm{d}}\theta ' , \end{aligned}$$

where \(n_\epsilon (\theta ) := \int k_\epsilon (\theta ,\theta ')\pi _\tau (\theta '){\mathrm{d}}\theta '\) is the normalisation constant,

$$\begin{aligned} k_\epsilon (\theta ,\theta ') := \frac{g_\epsilon (\theta ,\theta ')}{\sqrt{\int g_\epsilon (\theta ,\theta '') \pi _\tau (\theta ''){\mathrm{d}} \theta ''} \sqrt{\int g_\epsilon (\theta ',\theta '') \pi _\tau (\theta ''){\mathrm{d}}\theta ''}} \end{aligned}$$

and

$$\begin{aligned} g_\epsilon (\theta ,\theta ') := \exp \left( -\frac{1}{4\epsilon } (\theta -\theta ')^{\mathrm{T}} \varSigma _\tau ^{-1} (\theta -\theta ') \right) \end{aligned}$$
(24)

is the Gaussian kernel associated with the pure diffusion in (23).

The next step is to replace \(\pi _\tau \) by the ensemble-induced empirical measure \({{\hat{\pi }}}_\tau \), which results in the empirical approximation

$$\begin{aligned} T_\epsilon ^{(M)} f(\theta ) := \frac{1}{n_\epsilon ^{(M)} (\theta )} \sum _{i=1}^M k_\epsilon ^{(M)} (\theta ,\theta ^{(i)}_\tau )f(\theta ^{(i)}_\tau ) , \end{aligned}$$

where \(n_\epsilon ^{(M)}(\theta ) := \sum _i k_\epsilon ^{(M)} (\theta ,\theta ^{(i)}_\tau )\) is the normalisation constant, and

$$\begin{aligned} k_\epsilon ^{(M)} (\theta ,\theta ') := \frac{g^{(M)}_\epsilon (\theta ,\theta ')}{\sqrt{\sum _i g^{(M)}_\epsilon (\theta ,\theta ^{(i)}_\tau )} \sqrt{\sum _i g^{(M)}_\epsilon (\theta ',\theta ^{(i)}_\tau )}} . \end{aligned}$$

Here \(g^{(M)}_\epsilon \) denotes the Gaussian kernel (24) with the covariance matrix \(\varSigma _\tau \) replaced by its empirical estimate \({{\hat{\varSigma }}}_\tau \). One finally introduces the \(M\times M\) Markov matrix \({\mathrm{T}}\) with entries

$$\begin{aligned} {\mathrm{T}}_{ij} = \frac{1}{n_\epsilon ^{(M)} (\theta _\tau ^{(i)})} k_\epsilon ^{(M)} (\theta _\tau ^{(i)},\theta _\tau ^{(j)}) \end{aligned}$$
(25)

and the finite-dimensional fixed-point equation

$$\begin{aligned} \tilde{\mathrm{V}} = {\mathrm{T}}\tilde{\mathrm{V}} + \epsilon \,\varDelta {\varPsi }_{\mathrm{data}} \end{aligned}$$

with

$$\begin{aligned} \varDelta {\varPsi }_{\mathrm{data}} := \left( \varPsi _{\mathrm{data}}(\theta _\tau ^{(1)})- {\hat{\pi }}_\tau [\varPsi _{\mathrm{data}}],\ldots , \varPsi _{\mathrm{data}}(\theta _\tau ^{(M)})- {\hat{\pi }}_\tau [\varPsi _{\mathrm{data}}]\right) ^{\mathrm{T}} . \end{aligned}$$

Given the solution vector \(\tilde{\mathrm{V}} \in {\mathbb {R}}^M\) with entries \(\tilde{\mathrm{V}}^{(j)}\), \(j=1,\ldots ,M\), the approximation to the drift term in the ODE (10) is now provided as follows [34]:

$$\begin{aligned} {\hat{\varSigma }}_\tau \nabla _\theta {\tilde{V}}_\tau (\theta ) := {\hat{\varSigma }}_\tau \nabla _\theta \left\{ \frac{1}{n_\epsilon ^{(M)} (\theta )} \sum _{j=1}^M k_\epsilon ^{(M)} (\theta ,\theta ^{(j)}_\tau ) \left( \tilde{\mathrm{V}}^{(j)} + \epsilon \,\varDelta \varPsi _{\mathrm{data}}^{(j)} \right) \right\} . \end{aligned}$$

Lemma 2

The affine-invariant FPF drift term for the ensemble members \(\theta _\tau ^{(i)}\), \(i=1,\ldots ,M\), is given by

$$\begin{aligned} {\hat{\varSigma }}_\tau \nabla _\theta {\tilde{V}}_\tau (\theta _\tau ^{(i)}) = \sum _{j=1}^M s_{ij} \theta _\tau ^{(j)} \end{aligned}$$
(26)

with coefficients

$$\begin{aligned} s_{ij} = \frac{1}{2\epsilon } {\mathrm{T}}_{ij}\left( r_j - \sum _{k=1}^M {\mathrm{T}}_{ik} r_k \right) , \qquad r_j := \tilde{\mathrm{V}}^{(j)} + \epsilon \,\varDelta \varPsi _{\mathrm{data}}^{(j)} . \end{aligned}$$

Proof

The calculations leading to (26) are identical to those for the standard FPF [34] taking note that the covariance matrix \({\hat{\varSigma }}_\tau \) cancels out. \(\square \)

The resulting interacting particle system can be propagated in time using the forward Euler method with step size \(\varDelta \tau = 1/K\), that is,

$$\begin{aligned} \theta _{\tau _{k+1}}^{(i)} = \theta _{\tau _k}^{(i)} - \varDelta \tau {\hat{\varSigma }}_{\tau _k} \nabla _\theta {\tilde{V}}_{\tau _k} (\theta _{\tau _k}^{(i)}) , \qquad i = 1,\ldots ,M, \end{aligned}$$
(27)

\(k=0,\ldots ,K-1\), with the right-hand side defined by (26). Also note that \(M\rightarrow \infty \) leads formally to the mean-field forward Euler method (20).
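
A possible NumPy realisation of the diffusion-map drift (26) and its ingredients (24)–(25) is sketched below. The fixed-point equation is solved here by simple iteration with a fixed number of sweeps, which is one of several possible choices and not prescribed by the text; constant shifts in \(\tilde{\mathrm{V}}\) do not affect the drift (26).

```python
import numpy as np

def fpf_drift(theta_ens, psi_vals, eps, fp_iters=100):
    """Diffusion-map FPF drift (26) for all ensemble members.

    theta_ens : (D, M) ensemble; psi_vals : (M,) values Psi_data(theta^(i));
    eps : kernel bandwidth."""
    D, M = theta_ens.shape
    m_hat = theta_ens.mean(axis=1, keepdims=True)
    Theta = theta_ens - m_hat
    Sigma_hat = Theta @ Theta.T / (M - 1)
    Sigma_inv = np.linalg.pinv(Sigma_hat)

    # Gaussian kernel (24) with the empirical covariance
    diff = theta_ens[:, :, None] - theta_ens[:, None, :]          # (D, M, M)
    sq = np.einsum('dij,de,eij->ij', diff, Sigma_inv, diff)
    g = np.exp(-sq / (4.0 * eps))

    # symmetric normalisation and Markov matrix (25)
    row = g.sum(axis=1, keepdims=True)
    k = g / np.sqrt(row * row.T)
    T = k / k.sum(axis=1, keepdims=True)

    # fixed-point iteration for V = T V + eps * dPsi
    dpsi = psi_vals - psi_vals.mean()
    V = np.zeros(M)
    for _ in range(fp_iters):
        V = T @ V + eps * dpsi
        V = V - V.mean()          # constant shifts do not change the drift

    # coefficients s_ij and drift (26)
    r = V + eps * dpsi
    s = (T * (r[None, :] - (T @ r)[:, None])) / (2.0 * eps)
    return theta_ens @ s.T        # column i holds the drift for member i
```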

Remark 4

We note that the Markov matrix \({\mathrm{T}}\), defined by (25), is non-reversible, in contrast to the underlying stochastic process (23). This problem can be addressed by a bi-stochastic approximation, which also leads to improved convergence rates. See [37] for more details.

While the FPF leads to a consistent approximation in the limit \(M\rightarrow \infty \) and \(\epsilon \rightarrow 0\), its application is limited to low-dimensional problems. We next describe two approximations that extend to high-dimensional inference problems.

4.2 Second-Order Methods

One can derive evolution equations for the mean \(m_\tau \) and the covariance matrix \(\varSigma _\tau \) of \(\pi _\tau \) from Eq. (8). Furthermore, if one assumes that \(\pi _\tau \) is well approximated by a Gaussian PDF with mean \(m_\tau \) and covariance matrix \(\varSigma _\tau \), the following evolution equations for the mean and the covariance matrix arise.

Lemma 3

If the temporal PDF \(\pi _\tau \) in (9) is Gaussian with mean \(m_\tau \) and covariance matrix \(\varSigma _\tau \), then their evolution equations are given by

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } m_\tau = -\varSigma _\tau \,\pi _\tau [ \nabla _\theta \varPsi _{\mathrm{data}}] = - \varSigma _\tau \varPhi (\pi _\tau [y] - t), \end{aligned}$$

and

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \varSigma _\tau = -\varSigma _\tau \, \pi _\tau [ D^2_\theta \varPsi _{\mathrm{data}}] \varSigma _\tau = - \varSigma _\tau \varPhi \, \pi _\tau [R] \varPhi ^{\mathrm{T}}\, \varSigma _\tau , \end{aligned}$$

respectively.

Proof

The evolution equations are derived using the following well-known identity for Gaussian PDFs \(\pi _\tau \) (see, for example, [23]):

$$\begin{aligned} \pi _\tau [\theta \, (f(\theta )- \pi _\tau [f])] = \varSigma _\tau \pi _\tau [\nabla _\theta f] \end{aligned}$$
(28)

as well as

$$\begin{aligned} \pi _\tau [(\theta -\pi _\tau [\theta ])(\theta -\pi _\tau [\theta ])^{\mathrm{T}} (f(\theta )-\pi _\tau [f])] = \varSigma _\tau \pi _\tau [D^2_\theta f] \varSigma _\tau \end{aligned}$$
(29)

for any scalar-valued smooth function \(f:{\mathbb {R}}^D \rightarrow {\mathbb {R}}\). \(\square \)

In order to derive corresponding equations for the ensemble \(\{\theta _\tau ^{(i)}\}\), we introduce

$$\begin{aligned} {\hat{y}}_\tau := {\hat{\pi }}_\tau [y] = \frac{1}{M} \sum _{j=1}^M y_\tau ^{(j)}, \qquad y_\tau ^{(j)} := y\left( \theta _\tau ^{(j)}\right) \in {\mathbb {R}}^N , \end{aligned}$$

as well as

$$\begin{aligned} {\hat{R}}_\tau := {\hat{\pi }}_\tau [R] = \frac{1}{M}\sum _{j=1}^M R_\tau ^{(j)}, \end{aligned}$$

where the \(R_\tau ^{(j)} \in {\mathbb {R}}^{N\times N}\) are diagonal matrices with entries \(r_{nn}^{(j)} = y_{\tau ,n}^{(j)}\left( 1-y_{\tau ,n}^{(j)}\right) \) for \(n=1,\ldots ,N\) with

$$\begin{aligned} y_{\tau ,n}^{(j)} = \sigma \left( (\theta _\tau ^{(j)})^{\mathrm{T}} \phi _{x_n}\right) \in {\mathbb {R}}. \end{aligned}$$

We finally obtain the evolution equations

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } {\hat{m}}_\tau = - {\hat{\varSigma }}_\tau \varPhi ({\hat{y}}_\tau - t) \end{aligned}$$
(30)

for the ensemble mean and

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \varTheta _\tau = -\frac{1}{2}{\hat{\varSigma }}_\tau \varPhi {\hat{R}}_\tau \varPhi ^{\mathrm{T}} \varTheta _\tau \end{aligned}$$
(31)

for the ensemble deviations (22). Note that these equations can be used to propagate the ensemble \(\{\theta _\tau ^{(i)}\}\) regardless of the Gaussian assumption made for their derivation. In the following subsection, we introduce further approximations which lead to an affine-invariant gradient flow structure.
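
One forward Euler step of the second-order equations (30)–(31) can be sketched as follows; the ensemble is reassembled as mean plus deviations, and the function name is ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def second_order_step(theta_ens, Phi, t, dtau):
    """One forward Euler step of (30)-(31): the empirical mean and the
    ensemble deviations are updated separately."""
    D, M = theta_ens.shape
    m_hat = theta_ens.mean(axis=1)
    Theta = theta_ens - m_hat[:, None]
    Sigma_hat = Theta @ Theta.T / (M - 1)

    Y = sigmoid(Phi.T @ theta_ens)               # (N, M), column j: y(theta^(j))
    y_hat = Y.mean(axis=1)                       # ensemble-averaged predictions
    R_hat = np.diag((Y * (1 - Y)).mean(axis=1))  # averaged diagonal matrices R^(j)

    m_new = m_hat - dtau * Sigma_hat @ Phi @ (y_hat - t)                      # (30)
    Theta_new = Theta - 0.5 * dtau * Sigma_hat @ Phi @ R_hat @ Phi.T @ Theta  # (31)
    return m_new[:, None] + Theta_new
```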

4.3 Ensemble Kalman–Bucy Filter

We now turn to a formulation with gradient flow structure in the spirit of the ensemble Kalman–Bucy filter (EnKBF) for quadratic loss functions (1) [3, 26, 28]. Provided that

$$\begin{aligned} {\hat{y}}_\tau = {\hat{\pi }}_\tau [y] \approx y({\hat{m}}_\tau ) = y({\hat{\pi }}_\tau [\theta ]) \end{aligned}$$

and

$$\begin{aligned} y_\tau ^{(i)} = y\left( \theta _\tau ^{(i)}\right) \approx {\hat{y}}_\tau + {\hat{R}}_\tau \varPhi ^{\mathrm{T}} (\theta _\tau ^{(i)}-{\hat{m}}_\tau ) , \end{aligned}$$

one formally obtains the approximation

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \theta _\tau ^{(i)} = - \frac{1}{2} {\hat{\varSigma }}_\tau \varPhi \,(y_\tau ^{(i)}+ y({\hat{m}}_\tau ) - 2t) \end{aligned}$$
(32)

to the second-order Eqs. (30)–(31).

Lemma 4

The EnKBF equations (32) are of affine-invariant gradient structure

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \theta _\tau ^{(i)} = -{\hat{\varSigma }}_\tau \nabla _{\theta ^{(i)}} \varPsi _{\mathrm{KBF}} (\{\theta _\tau ^{(j)}\}) , \end{aligned}$$
(33)

with potential

$$\begin{aligned} \varPsi _{\mathrm{KBF}}(\{\theta ^{(j)}\}) = \frac{1}{2}\sum _{j=1}^M \varPsi _{\mathrm{data}}(\theta ^{(j)}) + \frac{M}{2} \varPsi _{\mathrm{data}}({\hat{m}}_\tau ) . \end{aligned}$$

Proof

The lemma can be verified by direct calculation and taking note of

$$\begin{aligned} \nabla _{\theta ^{(i)}} \varPsi _{\mathrm{data}}({{\hat{m}}}_\tau ) = \frac{1}{M} \varPhi \,(y({{\hat{m}}}_\tau )-t) \end{aligned}$$

in particular. \(\square \)

Equation (33) also possesses an affine-invariant gradient-flow structure in the space of probability measures in the mean-field limit \(M\rightarrow \infty \). See [9, 10, 28] for details.

Remark 5

The authors of [17] proposed a different modification of the EnKF for classification problems. In our notation, it corresponds to

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \theta _\tau ^{(i)} = -\sum _{n=1}^N \left\{ \frac{1}{M-1}\sum _{j=1}^M \left( y_{\tau ,n}^{(j)}- \frac{1}{M}\sum _{l=1}^M y_{\tau ,n}^{(l)} \right) \left( \frac{t_n}{y_{\tau ,n}^{(i)}}- \frac{1-t_n}{1- y_{\tau ,n}^{(i)}} \right) \theta _\tau ^{(j)} \right\} . \end{aligned}$$

The modified EnKF formulation of [13] uses an implementation that is closer to ours but does not actually propagate ensembles. Also recall that both methods are designed to minimise the loss function \(\varPsi _{\mathrm{data}}(\theta )\) rather than to sample from the posterior PDF (6), which is the subject of this paper.

Remark 6

In addition to the EnKBF formulation considered so far, there exists another variant which is based on stochastic perturbations in the innovation [8, 26]. We briefly explain how to extend this alternative formulation to logistic regression. We recall that the ensemble deviations \(\varTheta _\tau \) satisfy the evolution Eq. (31). The same propagation of the first and second-order moments is achieved by the stochastic equations

$$\begin{aligned} {\mathrm{d}}(\theta _\tau - m_\tau ) = - \varSigma _\tau \varPhi \left\{ \pi _\tau [R] \varPhi ^{\mathrm{T}}(\theta _\tau -m_\tau )\,{\mathrm{d}}\tau + \pi _\tau [R]^{1/2}{\mathrm{d}}W_\tau \right\} \end{aligned}$$

in the mean-field limit \(M\rightarrow \infty \), where \(W_\tau \) denotes standard N-dimensional Brownian motion. This suggests to replace the finite ensemble size formulation (32) by the interacting system of SDEs

$$\begin{aligned} {\mathrm{d}}\theta _\tau ^{(i)}=-{{\hat{\varSigma }}}_\tau \varPhi \left( y_\tau ^{(i)}\,{\mathrm{d}}\tau + {\hat{R}}_\tau ^{1/2} \,{\mathrm{d}}W_\tau ^{(i)} - t \,{\mathrm{d}}\tau \right) . \end{aligned}$$

Here \(W_\tau ^{(i)}\), \(i=1,\ldots ,M\), denote independent N-dimensional Brownian motions.

Again, the forward Euler method (27) can be used to solve (32) or (33) in time. However, the inherent stiffness of the equations of motion can lead to very small time-steps [2]. This issue and the rank deficiency of \({{\hat{\varSigma }}}_\tau \) for \(M\le D\) are addressed in the following subsection.

4.4 Dropout and Time-Stepping

We now discuss two numerical issues that are common to all ensemble-based methods considered in this section.

First, we observe that whenever \(M \le D\), the empirical covariance matrix \({{\hat{\varSigma }}}_\tau \) is rank-deficient and the ensemble \(\{\theta _\tau ^{(i)}\}\) is propagated in the subspace defined by the ensemble at initial time. The concept of covariance localisation has been pioneered in the geoscience community in order to lift this restriction [8, 29]. In the context of logistic regression, we suggest instead to utilise the idea of dropout training as pioneered in the machine learning community [32]. We employ the concept of dropout as follows: in each time step, a randomly chosen subset of the entries of \(\theta _\tau ^{(i)}\), \(i=1,\ldots ,M\), is set to zero. The resulting modified matrix of ensemble deviations (22) is denoted by \({{\tilde{\varTheta }}}_\tau \). The empirical covariance matrix is now defined by

$$\begin{aligned} {{\hat{\varSigma }}}_\tau = \frac{1}{(1-\mu )(M-1)} {{\tilde{\varTheta }}}_\tau {\tilde{\varTheta }}_\tau ^{\mathrm{T}} , \end{aligned}$$
(34)

where \(\mu \in (0,1)\) denotes the fraction of entries randomly set to zero. All other aspects of the previously considered algorithms remain unaltered. The estimator (34) underestimates the cross-correlations. An unbiased estimator requires an additional rescaling of the off-diagonal entries by \(1/(1-\mu )\).
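
A minimal sketch of the dropout estimator (34) is given below; whether the empirical mean is recomputed from the modified ensemble is an implementation choice left open by the text, and we recompute it here.

```python
import numpy as np

def dropout_covariance(theta_ens, mu, rng):
    """Dropout estimator (34): a random fraction mu of the entries of each
    ensemble member is set to zero before the deviations are formed."""
    D, M = theta_ens.shape
    mask = rng.random((D, M)) > mu                 # keep each entry with prob. 1 - mu
    theta_drop = theta_ens * mask
    Theta_tilde = theta_drop - theta_drop.mean(axis=1, keepdims=True)
    return Theta_tilde @ Theta_tilde.T / ((1.0 - mu) * (M - 1))
```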

Remark 7

There are alternative methods for breaking the sub-space property of ensemble transform methods in case of \(M\le D\). For example, the authors of [17] suggest randomly perturbing the ensemble members after each time step of EKI. In our context, this approach can be viewed as combining the EnKBF with diffusion at low temperature. However, it is not clear that diffusion is sufficient to effectively explore the full parameter space in the context of the homotopy approach considered in this paper.

Second, we return to the issue of efficient time-stepping of the interacting particle systems. We had previously considered the forward Euler method (27). However, the method can encounter severe step-size restrictions due to its restricted domain of stability. Following [2], we therefore consider the following tamed version of the forward Euler discretisation (27) for the EnKBF formulation from Sect. 4.3. Starting from (32) we employ

$$\begin{aligned} \theta _{\tau _{k+1}}^{(i)} = \theta _{\tau _{k}}^{(i)} - \frac{\varDelta \tau }{2} {\hat{\varSigma }}_{\tau _k} \varPhi \left( I+\varDelta \tau {\hat{R}}_{\tau _k} \varPhi ^{\mathrm{T}} {\hat{\varSigma }}_{\tau _k} \varPhi \right) ^{-1} (y_{\tau _k}^{(i)}+ y({\hat{m}}_{\tau _k}) - 2t) \end{aligned}$$
(35)

with step size \(\varDelta \tau >0\) for \(k=0,\ldots ,K-1\). If the necessary matrix inversions become too computationally expensive, one can diagonalise the inverse as proposed in [2]. The same modification can be applied to the second-order formulation (30)–(31).
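
The tamed step (35) can be implemented along the following lines; the dropout covariance (34) could be substituted for the plain empirical covariance used in this sketch, and the function name is ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tamed_enkbf_step(theta_ens, Phi, t, dtau):
    """One linearly implicit (tamed) step (35) of the EnKBF (32)."""
    D, M = theta_ens.shape
    N = t.size
    m_hat = theta_ens.mean(axis=1)
    Theta = theta_ens - m_hat[:, None]
    Sigma_hat = Theta @ Theta.T / (M - 1)

    Y = sigmoid(Phi.T @ theta_ens)                 # predictions y^(i), shape (N, M)
    y_mean = sigmoid(Phi.T @ m_hat)                # y evaluated at the ensemble mean
    R_hat = np.diag((Y * (1 - Y)).mean(axis=1))

    A = np.eye(N) + dtau * R_hat @ Phi.T @ Sigma_hat @ Phi
    innovation = Y + y_mean[:, None] - 2.0 * t[:, None]
    return theta_ens - 0.5 * dtau * Sigma_hat @ Phi @ np.linalg.solve(A, innovation)
```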

5 Ensemble Transform Langevin Dynamics

While we have focused on the homotopy approach in the previous section, we now combine the homotopy approach with overdamped Langevin dynamics in order to sample from the desired target distribution (6) as \(\tau \rightarrow \infty \). Let us therefore turn the transport-based differential equation (10) into the McKean–Vlasov SDE

$$\begin{aligned} {\mathrm{d}} \theta _\tau = -\varSigma _\tau \left\{ \nabla _\theta V_\tau (\theta _\tau ) + \frac{1}{2} \varSigma _{\mathrm{prior}}^{-1}(\theta _\tau +m_\tau -2m_{\mathrm{prior}})\right\} {\mathrm{d}}\tau + \varSigma _\tau ^{1/2} {\mathrm{d}}W_\tau \end{aligned}$$
(36)

by adding diffusion and a prior-related drift term to (10). This SDE is motivated by the affine-invariant EKS [9] and ALDI [10] variants of standard Langevin dynamics. Note that the crucial differences are that (i) EKS/ALDI uses the gradient \(\nabla _\theta \varPsi _{\mathrm{prior}}\) of the prior in the drift term while (36) contains

$$\begin{aligned} \frac{1}{2} \varSigma _{\mathrm{prior}}^{-1}(\theta _\tau +m_\tau -2m_{\mathrm{prior}}) \end{aligned}$$

instead, and (ii) in a similar vein, the potential \(\varPsi _{\mathrm{data}}(\theta )\) is replaced in (36) by a time-dependent potential \(V_\tau \), which is defined as the solution of the elliptic PDE (9). These modifications are motivated by two facts: (i) one can find efficient gradient-free time-stepping methods for the SDE (36), and (ii) the SDE (36) samples from the posterior PDF in the case of Gaussian distributions, as we demonstrate next.

Lemma 5

If the initial \(\pi _0\) is Gaussian and \(\varPsi _{\mathrm{data}}\) is a potential of the form

$$\begin{aligned} \varPsi _{\mathrm{data}}(\theta ) = \frac{1}{2} (G\theta - t)^{\mathrm{T}} \varGamma ^{-1} (G\theta - t), \end{aligned}$$

then the marginal PDFs \(\pi _\tau \) of (36) remain Gaussian for all \(\tau >0\), and the SDE (36) has a stationary measure which is given by \(\pi _\infty (\theta ) \propto \exp (-\varPsi _{\mathrm{post}}(\theta ))\).

Proof

Under the stated assumption, we have

$$\begin{aligned} -\frac{1}{\pi _\tau (\theta )}\nabla _\theta \cdot (\pi _\tau (\theta ) \varSigma _\tau \nabla _\theta V_\tau (\theta )) = (\theta -m_\tau )^{\mathrm{T}} \nabla _\theta V_\tau (\theta ) - \nabla _\theta \cdot \varSigma _\tau \nabla _\theta V_\tau (\theta ) \end{aligned}$$

as well as

$$\begin{aligned} \varPsi _{\mathrm{data}}(\theta ) - \pi _\tau [\varPsi _{\mathrm{data}}]&= \frac{1}{2} (G\theta -t)^{\mathrm{T}} \varGamma ^{-1} (G\theta -t) -\frac{1}{2} \pi _\tau [\theta ^{\mathrm{T}}G^{\mathrm{T}} \varGamma ^{-1} G\theta ] + m_\tau ^{\mathrm{T}}G^{\mathrm{T}}\varGamma ^{-1} t - \frac{1}{2} t^{\mathrm{T}}\varGamma ^{-1} t\\&= \frac{1}{2} (\theta -m_\tau )^{\mathrm{T}} G^{\mathrm{T}} \varGamma ^{-1}(G\theta + Gm_\tau - 2t) - \frac{1}{2} \pi _\tau [\theta ^{\mathrm{T}}G^{\mathrm{T}}\varGamma ^{-1} G\theta ] + \frac{1}{2} m_\tau ^{\mathrm{T}} G^{\mathrm{T}} \varGamma ^{-1} G m_\tau . \end{aligned}$$

Equating both expressions leads us to

$$\begin{aligned} \nabla _\theta V_\tau (\theta ) = \frac{1}{2} G^{\mathrm{T}} \varGamma ^{-1} (G\theta + Gm_\tau - 2t). \end{aligned}$$
(37)

The evolution equations arising from (36) for the mean \(m_\tau \) and the covariance matrix \(\varSigma _\tau \) are therefore given by

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } m_\tau&= -\varSigma _\tau \left( G^{\mathrm{T}} \varGamma ^{-1}(Gm_\tau -t ) + \varSigma _{\mathrm{prior}}^{-1}(m_\tau - m_{\mathrm{prior}})\right) \\&= - \varSigma _\tau \varSigma _{\mathrm{post}}^{-1}(m_\tau - m_{\mathrm{post}}) \end{aligned}$$

and

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau }\varSigma _\tau&= -\varSigma _\tau \left( G^{\mathrm{T}} \varGamma ^{-1}G + \varSigma _{\mathrm{prior}}^{-1}\right) \varSigma _\tau +\varSigma _\tau \\&= - \varSigma _\tau \varSigma _{\mathrm{post}}^{-1} \varSigma _\tau + \varSigma _\tau , \end{aligned}$$

respectively. The associated equilibrium solutions are

$$\begin{aligned} m_\infty = m_{\mathrm{post}} := m_{\mathrm{prior}} - \varSigma _{\mathrm{post}}G^{\mathrm{T}} \varGamma ^{-1} (Gm_{\mathrm{prior}} - t) \end{aligned}$$

and

$$\begin{aligned} \varSigma _\infty = \varSigma _{\mathrm{post}} := \left( G^{\mathrm{T}} \varGamma ^{-1}G + \varSigma _{\mathrm{prior}}^{-1}\right) ^{-1}, \end{aligned}$$

as desired. We also note that the linearity of the gradient (37) implies that the PDFs \(\pi _\tau \) remain Gaussian. Finally, (37) is indeed the gradient of the potential

$$\begin{aligned} V_\tau (\theta ) = \frac{1}{4} (G\theta - t)^{\mathrm{T}} \varGamma ^{-1} (G\theta - t) + \frac{1}{2} \theta ^{\mathrm{T}} G^{\mathrm{T}} \varGamma ^{-1} (G m_\tau - t) , \end{aligned}$$

which concludes the proof. \(\square \)

Remark 8

One can replace the McKean–Vlasov SDE (36) by the following formulation

$$\begin{aligned} {\mathrm{d}} \theta _\tau = -\varSigma _\tau \nabla _\theta U_\tau (\theta _\tau ) \,{\mathrm{d}}\tau + \varSigma _\tau ^{1/2} {\mathrm{d}}W_\tau \,, \end{aligned}$$
(38)

where the potential \(U_\tau \) now satisfies the elliptic PDE

$$\begin{aligned} \nabla _\theta \cdot (\pi _\tau \varSigma _\tau \nabla _\theta U_\tau ) = -\pi _\tau (\varPsi _{\mathrm{post}} - \pi _\tau \left[ \varPsi _{\mathrm{post}}\right] ) \end{aligned}$$

instead of (9). Both formulations are equivalent for Gaussian PDFs \(\pi _\tau \). While (36) is closer to the homotopy approach of Sect. 4, formulation (38) has a closer resemblance to overdamped Langevin dynamics.

5.1 Asymptotic Behaviour

The case of a general negative log-likelihood function \(\varPsi _{\mathrm{data}}\) is more delicate. However, if the associated posterior (6) is non-Gaussian but is well approximated by a Gaussian, and this also applies to all intermediate PDFs \(\pi _\tau \), that is,

$$\begin{aligned} \pi _\tau (\theta ) \approx {{\tilde{\pi }}}_\tau (\theta ) \propto \exp \left( -\frac{1}{2}(\theta - m_\tau )^{\mathrm{T}}\varSigma _\tau ^{-1}(\theta - m_\tau ) \right) , \end{aligned}$$
(39)

then following the arguments from Sect. 4.2, the first two moments satisfy

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } m_\tau = -\varSigma _\tau \,{\tilde{\pi }}_\tau [\nabla _\theta \varPsi _{\mathrm{post}}] \end{aligned}$$
(40)

and

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \varSigma _\tau = -\varSigma _\tau \,{{\tilde{\pi }}}_\tau [D_\theta ^2 \varPsi _{\mathrm{post}} ] \varSigma _\tau + \varSigma _\tau . \end{aligned}$$
(41)

The equation for the covariance matrix is stable if \({{\tilde{\pi }}}_\tau [D_\theta ^2 \varPsi _{\mathrm{post}} ]\) is positive-definite for all \(\tau \ge 0\). This is, for example, the case for logistic regression.

We now discuss the limiting \(\tau \rightarrow \infty \) behaviour of (36) in some more detail. Its equilibrium distributions are characterised by the following lemma.

Lemma 6

Stationary measures \(\pi _\infty \) of the mean-field equation (36) satisfy the PDE

$$\begin{aligned}&\frac{1}{2} \nabla _\theta \cdot \left( \pi _\infty \varSigma _\infty \left\{ \nabla _\theta \log \pi _\infty + \varSigma _{\mathrm{prior}}^{-1}(\theta +m_\infty -2m_{\mathrm{prior}}) \right\} \right) \\&\quad = \pi _\infty (\varPsi _{\mathrm{data}}- \pi _\infty [\varPsi _{\mathrm{data}}]) \end{aligned}$$

or, in terms of expectation values with respect to test functions \(\phi \),

$$\begin{aligned}&\frac{1}{2} \pi _\infty \left[ \nabla _\theta \phi \cdot \varSigma _\infty \left\{ \nabla _\theta \log \pi _\infty + \varSigma _{\mathrm{prior}}^{-1}(\theta +m_\infty -2m_{\mathrm{prior}}) \right\} \right] \nonumber \\&\quad = \pi _\infty \left[ \phi \left( \varPsi _{\mathrm{data}}-\pi _\infty \left[ \varPsi _{\mathrm{data}} \right] \right) \right] . \end{aligned}$$
(42)

Proof

The lemma follows from the Fokker–Planck equation associated to (36) and the PDE (9). \(\square \)

Again making the Gaussian approximation (39), formally setting \(\tau = \infty \) and recalling Remark 8, we first obtain from (42) the identity

$$\begin{aligned} \frac{1}{2} {{\tilde{\pi }}}_\infty \left[ \nabla _\theta \phi \cdot \varSigma _\infty \nabla _\theta \log {{\tilde{\pi }}}_\infty \right] = {\tilde{\pi }}_\infty \left[ \phi \left( \varPsi _{\mathrm{post}}- {{\tilde{\pi }}}_\infty \left[ \varPsi _{\mathrm{post}} \right] \right) \right] . \end{aligned}$$

Then,

$$\begin{aligned} \frac{1}{2} {{\tilde{\pi }}}_\infty [ \nabla _\theta \log {\tilde{\pi }}_\infty ] = 0 = {{\tilde{\pi }}}_\infty [ \nabla _\theta \varPsi _{\mathrm{post}}] \end{aligned}$$
(43)

for \(\phi (\theta ) = \theta -m_\infty \), and

$$\begin{aligned} {{\tilde{\pi }}}_\infty [ D^2_\theta \log {{\tilde{\pi }}}_\infty ] = \varSigma _\infty ^{-1} = {{\tilde{\pi }}}_\infty [ D^2_\theta \varPsi _{\mathrm{post}}] \end{aligned}$$
(44)

for \(\phi (\theta ) = (\theta -m_\infty )(\theta -m_\infty )^{\mathrm{T}}\), which provide a self-consistent system of equations for the mean \(m_\infty \) and the covariance matrix \(\varSigma _\infty \) of the Gaussian approximation (39). Here we have again used (28) and (29), respectively.

Note that (43) and (44) are compatible with (40) and (41), respectively. However, (43) and (44) only require Gaussianity (or near-Gaussianity) of the equilibrium distribution \(\pi _\infty \).

5.2 Numerical Implementation

The most appealing aspect of our modified overdamped Langevin dynamics is the possibility of implementing the SDEs (36) or (38), respectively, in a derivative-free manner, that is, without explicit knowledge of the potential \(V_\tau \) and its gradient.

We describe a suitable time-stepping method in detail for the formulation (36) utilising methods from sequential data assimilation, which are well suited to deal with the \(-\varSigma _\tau \nabla _\theta V_\tau \) term in the SDE (36). In other words, we numerically solve the SDE (36) with a step size \(\varDelta \tau \) by alternating in each time step between (i) a data assimilation step (such as SMC or an ensemble transform filter [29]) with negative log-likelihood

$$\begin{aligned} l(\theta ) = \varDelta \tau \varPsi _{\mathrm{data}}(\theta ) \end{aligned}$$

and (ii) the SDE

$$\begin{aligned} {\mathrm{d}}\theta _\tau = -\frac{1}{2}\varSigma _\tau \varSigma _{\mathrm{prior}}^{-1}(\theta _\tau +m_\tau -2m_{\mathrm{prior}}) \,{\mathrm{d}}\tau + \varSigma _\tau ^{1/2} {\mathrm{d}} W_\tau \end{aligned}$$
(45)

over the same time-interval \(\varDelta \tau \). More precisely, given \(\theta _{\tau _k} \sim \pi _{\tau _k}\) we first find

$$\begin{aligned} {{\tilde{\theta }}}_{\tau _{k+1}} \sim \exp (-\varDelta \tau \varPsi _{\mathrm{data}})\,\pi _{\tau _k} \end{aligned}$$
(46)

followed by solving (45) with initial \({\tilde{\theta }}_{\tau _{k+1}}\) over the time-interval \(\varDelta \tau \) in order to obtain \(\theta _{\tau _{k+1}} \sim \pi _{\tau _{k+1}}\) for \(k=0,\ldots ,K-1\). Note that (45) can be solved robustly by the tamed forward Euler-Maruyama method

$$\begin{aligned} \theta _{\tau _{k+1}}&= {{\tilde{\theta }}}_{\tau _{k+1}} - \frac{\varDelta \tau }{2} {{\tilde{\varSigma }}}_{\tau _{k+1}} \left( \varSigma _{\mathrm{prior}} + \varDelta \tau {{\tilde{\varSigma }}}_{\tau _{k+1}} \right) ^{-1} ({{\tilde{\theta }}}_{\tau _{k+1}}+ {{\tilde{m}}}_{\tau _{k+1}} -2m_{\mathrm{prior}}) \\&\quad + \sqrt{\varDelta \tau } {{\tilde{\varSigma }}}_{\tau _{k+1}}^{1/2} \varXi _{k+1} \end{aligned}$$

with \(\varXi _{k+1}\) a D-dimensional standard Gaussian random variable. As for ALDI, the \(\varSigma _\tau ^{1/2}\) multiplying the Brownian motion in (45) requires a correction term in order to sample from the correct distribution for finite ensemble sizes M. See [10] for details.
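
A sketch of this tamed Euler–Maruyama step is given below; it omits the finite-ensemble correction term of [10] mentioned above and uses a symmetric eigendecomposition for \({{\tilde{\varSigma }}}_{\tau _{k+1}}^{1/2}\), which is one possible choice.

```python
import numpy as np

def tamed_prior_drift_step(theta_tilde, m_prior, Sigma_prior, dtau, rng):
    """Tamed Euler-Maruyama step for the prior-related SDE (45), applied
    after the data assimilation transform (46)."""
    D, M = theta_tilde.shape
    m_tilde = theta_tilde.mean(axis=1)
    Theta = theta_tilde - m_tilde[:, None]
    Sigma_tilde = Theta @ Theta.T / (M - 1)

    drift_mat = Sigma_tilde @ np.linalg.inv(Sigma_prior + dtau * Sigma_tilde)
    shifted = theta_tilde + m_tilde[:, None] - 2.0 * m_prior[:, None]

    # matrix square root of the empirical covariance for the noise term
    w, U = np.linalg.eigh(Sigma_tilde)
    sqrt_Sigma = U @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ U.T
    noise = sqrt_Sigma @ rng.standard_normal((D, M))

    return theta_tilde - 0.5 * dtau * drift_mat @ shifted + np.sqrt(dtau) * noise
```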

Our numerical implementations of (46) in terms of ensembles \(\{\theta _{\tau _k}^{(i)}\}\) rely on ensemble transform filters of the form [29]

$$\begin{aligned} {{\tilde{\theta }}}_{\tau _{k+1}}^{(j)} = \sum _{i=1}^M \theta _{\tau _k}^{(i)} s_{ij} \end{aligned}$$
(47)

for \(j =1,\ldots ,M\) with appropriate coefficients \(s_{ij}\), which satisfy

$$\begin{aligned} \sum _{i=1}^M s_{ij} = 1. \end{aligned}$$
(48)

We collect the coefficients \(s_{ij}\) in the \(M\times M\) matrix S.

Lemma 7

An ensemble transform filter is affine-invariant (compare Sect. 3.1) if the associated coefficients \({\overline{s}}_{ij}\) for the filter in the transformed variable \({\overline{\theta }}\) satisfy \(s_{ij} = {\overline{s}}_{ij}\) for all \(i,j = 1,\ldots ,M\).

Proof

If

$$\begin{aligned} \theta _{\tau _k}^{(i)} = A{\overline{\theta }}_{\tau _k}^{(i)}+b \end{aligned}$$

at \(\tau _k\), then

$$\begin{aligned} \theta _{\tau _{k+1}}^{(j)} = \sum _{i=1}^M \theta _{\tau _k}^{(i)}s_{ij} = \sum _{i=1}^M (A {\overline{\theta }}_{\tau _k}^{(i)} + b)s_{ij} = A \sum _{i=1}^M {\overline{\theta }}_{\tau _k}^{(i)} {\overline{s}}_{ij} + b = A {\overline{\theta }}_{\tau _{k+1}}^{(j)} + b , \end{aligned}$$

and we conclude that

$$\begin{aligned} \theta _{\tau _{k+1}}^{(i)} = A{\overline{\theta }}_{\tau _{k+1}}^{(i)}+b \end{aligned}$$

at \(\tau _{k+1}\) for all \(i=1,\ldots ,M\). \(\square \)

The EnKF and the nonlinear ensemble transform filter (NETF) of [36], for example, are affine-invariant. More specifically, the NETF leads to the transformation matrix

$$\begin{aligned} S = w_{\tau _k} \mathrm{1}^{\mathrm{T}}_M + \sqrt{M} \left( \mathrm{diag}\,(w_{\tau _k}) - w_{\tau _k} w_{\tau _k}^{\mathrm{T}}\right) ^{1/2} \end{aligned}$$

where \(w_{\tau _k} \in {\mathbb {R}}^M \) is the vector of normalised importance weights with entries

$$\begin{aligned} w_{\tau _k}^{(i)} = \frac{\exp (-\varDelta \tau \varPsi _{\mathrm{data}}(\theta _{\tau _k}^{(i)}))}{\sum _{j=1}^M \exp (-\varDelta \tau \varPsi _{\mathrm{data}}(\theta _{\tau _k}^{(j)}))}, \end{aligned}$$

\(\mathrm{1}_M \in {\mathbb {R}}^M\) is the vector of ones, and \(\text{ diag }\,(w_{\tau _k}) \in {\mathbb {R}}^{M\times M}\) is the diagonal matrix with diagonal entries \(w_{\tau _k}\). The affine invariance follows from the invariance of the importance weights \(w_{\tau _k}\). The affine invariance of the EnKF has been previously demonstrated, for example, in [22].
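
A minimal sketch of the NETF transform (47) with the above matrix S could read as follows; the log-weight shift is a standard numerical safeguard and not part of the filter definition.

```python
import numpy as np

def netf_transform(theta_ens, psi_vals, dtau):
    """NETF transform matrix S and transformed ensemble (47) for the
    likelihood increment l(theta) = dtau * Psi_data(theta)."""
    D, M = theta_ens.shape
    logw = -dtau * psi_vals
    w = np.exp(logw - logw.max())
    w = w / w.sum()                                   # normalised importance weights

    A = np.diag(w) - np.outer(w, w)                   # symmetric and positive semi-definite
    vals, vecs = np.linalg.eigh(A)
    sqrtA = vecs @ np.diag(np.sqrt(np.maximum(vals, 0.0))) @ vecs.T

    S = np.outer(w, np.ones(M)) + np.sqrt(M) * sqrtA  # transform matrix
    return theta_ens @ S, S                           # columns: transformed members (47)
```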

Remark 9

The ensemble transform particle filter (ETPF) of [27, 29] can also be made affine-invariant by using the cost function

$$\begin{aligned} V(S) = \frac{1}{2}\sum _{i,j=1}^M s_{ij}\, (\theta _{\tau _k}^{(i)}-\theta _{\tau _k}^{(j)})^{\mathrm{T}} {\hat{\varSigma }}_{\tau _k}^{-1} (\theta _{\tau _k}^{(i)}-\theta _{\tau _k}^{(j)}) \end{aligned}$$
(49)

subject to the constraints \(s_{ij}\ge 0\), (48), and

$$\begin{aligned} \frac{1}{M}\sum _{j=1}^M s_{ij} = w_{\tau _k}^{(i)} \end{aligned}$$
(50)

in the optimal transport definition of the coefficients \(s_{ij}\) in (47). Again, the affine invariance follows from the affine invariance of both the cost function V(S) and the importance weights \(w_{\tau _k}\).

6 Numerical Example

In this section, we provide a couple of relatively simple numerical illustrations for the methods proposed in this paper. We start with a low-dimensional toy problem for which all proposed methods can easily be implemented and tested.

Example 1

We follow the example from [5] Section 4.2.1. Specifically, consider the Gaussian likelihood functions

$$\begin{aligned} \pi (x|C_i) \propto \exp \left( -\frac{1}{2} (x-\mu _i)^{\mathrm{T}} B^{-1}(x-\mu _i)\right) \end{aligned}$$

for inputs x belonging to class \(C_i\), \(i=1,2\). The implied posterior PDF \(\pi (C_1|\phi _x)\) is given by (2) with \(\phi _x = (x^{\mathrm{T}},1)^{\mathrm{T}}\), that is, \(D = J+1\), and the true parameter value is

$$\begin{aligned} \theta = \left( \begin{array}{c} B^{-1}(\mu _1-\mu _2)\\ -\frac{1}{2}\mu _1^{\mathrm{T}} B^{-1}\mu _1 + \frac{1}{2} \mu _2^{\mathrm{T}}B^{-1} \mu _2 + \ln \frac{\pi (C_1)}{\pi (C_2)} \end{array} \right) . \end{aligned}$$
(51)

Furthermore, given data points \((x_n,t_n)\), \(n=1,\ldots ,N\), the maximum likelihood estimators for \(\pi (C_1)\), \(\mu _i\), and B are given by

$$\begin{aligned}&{{\hat{\pi }}}(C_1) = \frac{N_1}{N}, \qquad N_1 = \sum _{n=1}^N t_n, \\&{{\hat{\mu }}}_1 = \frac{1}{N_1}\sum _{n=1}^N t_n x_n,\qquad {{\hat{\mu }}}_2 = \frac{1}{N-N_1} \sum _{n=1}^N (1-t_n)x_n , \end{aligned}$$

and

$$\begin{aligned} {{\hat{B}}} = \frac{1}{N} \left\{ \sum _{n=1}^N t_n (x_n-{\hat{\mu }}_1)(x_n-{{\hat{\mu }}}_1)^{\mathrm{T}} + \sum _{n=1}^N (1-t_n)(x_n-{\hat{\mu }}_2)(x_n-{{\hat{\mu }}}_2)^{\mathrm{T}} \right\} , \end{aligned}$$

respectively. See Section 4.2.2 in [5] for details. In this example, however, we directly estimate the true parameter value (51) using Bayesian inference.

We consider the special case \(J = 2\) and use \(B = \mathrm{I}\), \(\mathrm{I}\) the identity matrix, \(\mu _1 = (-1,-1)^{\mathrm{T}}\), \(\mu _2 = (2,2)^{\mathrm{T}}\), and \(\pi (C_1) = \pi (C_2) = 1/2\) to generate \(N=100\) data points. We note that the chosen values for \(\mu _1\), \(\mu _2\), and B lead to

$$\begin{aligned} \theta = \left( \begin{array}{c} -3 \\ -3 \\ 3 \end{array} \right) . \end{aligned}$$
(52)
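As a consistency check, the following Python sketch generates data as described above, evaluates the maximum likelihood estimators, and confirms that (51) reduces to (52) for the chosen \(\mu _1\), \(\mu _2\), and B (variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
B = np.eye(2)
mu1, mu2 = np.array([-1.0, -1.0]), np.array([2.0, 2.0])
N = 100

# draw class labels with pi(C_1) = pi(C_2) = 1/2; t_n = 1 marks class C_1
t = rng.integers(0, 2, size=N)
x = (np.where(t[:, None] == 1, mu1, mu2)
     + rng.multivariate_normal(np.zeros(2), B, size=N))

# maximum likelihood estimators
N1 = t.sum()
pi1_hat = N1 / N
mu1_hat = (t[:, None] * x).sum(axis=0) / N1
mu2_hat = ((1 - t)[:, None] * x).sum(axis=0) / (N - N1)
r1, r2 = x - mu1_hat, x - mu2_hat
B_hat = (np.einsum('n,ni,nj->ij', t, r1, r1)
         + np.einsum('n,ni,nj->ij', 1 - t, r2, r2)) / N

# true parameter value (51) for the chosen mu_1, mu_2, B and equal priors
Binv = np.linalg.inv(B)
theta_true = np.append(Binv @ (mu1 - mu2),
                       -0.5 * mu1 @ Binv @ mu1 + 0.5 * mu2 @ Binv @ mu2
                       + np.log(0.5 / 0.5))
print(theta_true)   # [-3. -3.  3.], i.e. the value (52)
```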

We now test the proposed algorithms using (i) an informative Gaussian prior with mean (52) and covariance matrix \(\varSigma _{\mathrm{prior}} = \mathrm{I}\), and (ii) a less informative Gaussian prior with mean \(m_{\mathrm{prior}} = 0\) and covariance matrix \(\varSigma _{\mathrm{prior}} = 4\,\mathrm{I}\). We time step the affine-invariant FPF, the second-order filter (30)–(31), and the EnKBF (32) with the forward Euler method using a step size \(\varDelta \tau = 10^{-3}\). A further reduction of the step size did not change the results in a statistically significant manner. The McKean–Vlasov SDE (36) is also implemented with the NETF as the inner data assimilation step. Furthermore, we implemented the ALDI method [10] for comparison, since it is known to exactly sample from the posterior PDF (6). Both ALDI and the McKean–Vlasov SDE (38) were run up to time \(\tau = 10\) with step size \(\varDelta \tau = 10^{-2}\), at which point the interacting particle systems were considered to be in equilibrium.

The implementations of the FPF and of the SDE (36) explicitly involve the evaluation of the negative log-likelihood function (3). We found that we needed to replace (4) by

$$\begin{aligned} y_n = 0.99 \,\sigma (\theta ^{\mathrm{T}}\phi _{x_n}) + 0.005 \end{aligned}$$

in order to avoid numerical instabilities due to exceedingly small values of \(\ln y_n\). The band-width, \(\epsilon \), in (24) has been set to \(\epsilon = 0.1\).

All experiments were repeated \(L = 1000\) times in order to average out random sampling effects. The numerically computed posterior means and the spectral norms of the posterior covariance matrices can be found in Tables 1 and 2, respectively, for the informative prior, and in Tables 3 and 4, respectively, for the less informative prior. Reference values from an ALDI simulation are provided in the captions. While all methods reproduce those reference values very well in the case of the informative prior, this picture changes for the less informative prior, where one expects a stronger nonlinear and non-Gaussian behaviour of the associated Bayesian inference problem. One finds that the performance of the FPF improves as the ensemble size increases, while the second-order method and the EnKBF suffer from a systematic bias. We also find that the McKean–Vlasov SDE formulation (36) behaves rather well over the full range of ensemble sizes. We conclude that all tested methods are able to qualitatively reproduce the exact parameter value (52) even in the case of the less informative prior. The results also indicate that the posterior distribution is close to Gaussian, while the intermediate PDFs \(\pi _\tau \) must deviate significantly from a Gaussian distribution.

In terms of computational cost, the EnKBF and the second-order method perform comparably and are by far the least expensive for all ensemble sizes. While the McKean–Vlasov SDE is twice as expensive as the EnKBF at \(M = 50\), this ratio goes up to a factor of five at \(M = 400\). The FPF, on the other hand, is ten times as expensive as the EnKBF at \(M = 50\), which increases steeply to a factor of one hundred at \(M = 400\).

Table 1 Informative prior: We display the ensemble mean averaged over \(L=1000\) repeated experiments for the four proposed methods as a function of the ensemble size M
Table 2 Informative prior: We display the spectral norm of the final ensemble covariance matrices averaged over \(L=1000\) repeated experiments for the four proposed methods as a function of the ensemble size M
Table 3 Less informative prior: We display the ensemble mean averaged over \(L=1000\) repeated experiments for the four proposed methods as a function of the ensemble size M
Table 4 Less informative prior: We display the spectral norm of the final ensemble covariance matrices averaged over \(L=1000\) repeated experiments for the four proposed methods as a function of the ensemble size M

We now consider a high-dimensional extension of the previous example, for which we further investigate the performance of the EnKBF (33) in terms of the tamed time-stepping method (35) for ensemble sizes \(M\le D\), localisation via dropouts, and data sub-sampling.

Example 2

We consider the simple feature map

$$\begin{aligned} \phi _x = x, \end{aligned}$$

that is \(J=D\) such that

$$\begin{aligned} y_x(\theta ) = \sigma (\theta ^{\mathrm{T}}x) \end{aligned}$$
Table 5 We display the \(l_2\)-difference between the posterior ensemble means and the true parameter value averaged over \(L=1000\) repeated experiments and its standard deviation for three different implementations of the EnKBF as a function of the ensemble size M
Table 6 We provide the spectral norm of the computed posterior covariance matrices in the setting described in Table 5

for \(D=50\). The data is generated by first drawing \(\theta = \theta _{\mathrm{ref}}\) from the standard normal distribution in \({\mathbb {R}}^D\). Next, \(N=1000\) data points are generated by randomly drawing \(x_n\), \(n= 1,\ldots ,N\), again from a standard normal distribution in \({\mathbb {R}}^D\). Those points are assigned the label \(t_n = 1\) with probability \(y_{x_n}(\theta _{\mathrm{ref}})\), and \(t_n = 0\) otherwise. We use the tamed time-stepping method (35) of the EnKBF with step size \(\varDelta \tau = 1/200\) for parameter inference. The initial ensemble \(\theta _0^{(i)}\), \(i=1,\ldots ,M\), is drawn from the prior Gaussian distribution with mean \(m_{\mathrm{prior}}=0\) and the covariance matrix \(\varSigma _{\mathrm{prior}}\) equal to the identity matrix. We vary the ensemble size M between \(M=20\) and \(M = 100\) and use dropout localisation for the computation of the empirical covariance matrices \({{\hat{\varSigma }}}_{\tau _n}\) following (34). More specifically, an entry of \(\theta _{\tau _n}^{(i)}\) is set to zero if \(\eta <0.5\), where the \(\eta \)’s are i.i.d. uniform random variables from the interval [0, 1]. We report in Table 5 the \(l_2\)-difference between \(\theta _{\mathrm{ref}}\) and the ensemble mean \({{\hat{m}}}_\tau \) at final time \(\tau = 1\). We also state the spectral norm of \({{\hat{\varSigma }}}_\tau \) at the final time in Table 6.

In a second set of experiments, we apply mini-batching to the localised EnKBF with random batches of size \(N' = 100\). We also compare the results to those from the standard EnKBF formulation (32). Furthermore, ALDI is run for a sample size of \(M=100\) in order to provide the numerical benchmark values listed in the captions of Tables 5 and 6. All results have been averaged over \(L = 1000\) independent simulations. We also report the standard deviations in addition to the averaged values. Our numerical results indicate that localisation by dropout is effective for small ensemble sizes, and that mini-batching does not degrade performance while providing significant computational savings. As for the previous example, the EnKBF is able to capture the posterior mean quite well, while the posterior variances are systematically underestimated. Ensemble inflation, as widely used for the EnKF [8, 29], could help to further improve the variance estimates. We found that our results are insensitive to the choice of the step size \(\varDelta \tau \) in (35). However, changing the threshold value \(\mu = 0.5\) in the dropout localisation (34) affects the results significantly. Smaller values of \(\mu \) lead in particular to reduced \(l_2\)-differences for larger ensemble sizes, M, while being less effective for smaller M. See Table 7.
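A possible reading of the dropout localisation used above is sketched below; since (34) is not restated here, the precise scaling is an assumption and the function name is illustrative:

```python
import numpy as np

def dropout_covariance(theta, rate=0.5, rng=None):
    """Sketch of dropout localisation for the empirical covariance: entries of
    each ensemble member are set to zero independently with probability `rate`
    before the covariance is formed. This follows the verbal description in the
    text; the precise scaling used in (34) may differ.

    theta : (D, M) ensemble stored column-wise
    """
    rng = np.random.default_rng() if rng is None else rng
    theta_drop = theta * (rng.random(theta.shape) >= rate)
    dev = theta_drop - theta_drop.mean(axis=1, keepdims=True)
    return dev @ dev.T / (theta.shape[1] - 1)
```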

Table 7 Same results as displayed in Table 5 except that the dropout rate has been reduced from \(\mu =0.5\) to \(\mu = 0.2\). While the estimation errors increase for the smallest ensemble sizes, the reduced rate improves the performance for larger ensemble sizes

7 Generalisations and Extensions

In this section, we discuss possible extensions of the proposed methods to nonlinear logistic regression including derivative-free implementations, multi-class logistic regression, and sigmoidal Cox processes.

7.1 Nonlinear Logistic Regression

We extend the linear logistic regression from Sect. 2 to the more general problem with \(y_x(\theta ) := \sigma (f_x(\theta ))\) for given functions \(f_x:{\mathbb {R}}^D \rightarrow {\mathbb {R}}\). Hence the gradient \(\nabla _\theta y_x(\theta )\) becomes

$$\begin{aligned} \nabla _\theta y_x(\theta ) = \phi _x(\theta ) \,(1-y_x(\theta ))\,y_x(\theta ), \end{aligned}$$

where the previously considered feature maps \(\phi _x\) now also depend on the parameters \(\theta \), that is,

$$\begin{aligned} \phi _x(\theta ) := \nabla _\theta f_x(\theta ). \end{aligned}$$

The gradient of \(\varPsi _{\mathrm{data}}\) is thus explicitly given by

$$\begin{aligned} \nabla _\theta \varPsi _{\mathrm{data}}(\theta ) = \sum _{n=1}^N \phi _{x_n}(\theta )\,(y_{x_n}(\theta )-t_n) \end{aligned}$$

and the Hessian matrix of second-order derivatives becomes

$$\begin{aligned} D^2_\theta \varPsi _{\mathrm{data}}(\theta ) = \sum _{n=1}^N \left\{ \phi _{x_n}(\theta ) \,y_{x_n}(\theta )\,(1-y_{x_n}(\theta ))\,\phi _{x_n}(\theta )^{\mathrm{T}} + D_\theta \phi _{x_n}(\theta )\,(y_{x_n}(\theta )-t_n) \right\} . \end{aligned}$$
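For concreteness, a minimal Python sketch of the gradient \(\nabla _\theta \varPsi _{\mathrm{data}}\) for nonlinear logistic regression reads as follows, assuming user-supplied callables for \(f_x\) and for \(\phi _x(\theta ) = \nabla _\theta f_x(\theta )\) (function names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_Psi_data(theta, xs, ts, f, grad_f):
    """Sketch of the gradient of Psi_data for nonlinear logistic regression.

    f      : callable f(x, theta) -> scalar, i.e. f_x(theta)
    grad_f : callable grad_f(x, theta) -> (D,) array, i.e. phi_x(theta)
    """
    g = np.zeros_like(theta)
    for x_n, t_n in zip(xs, ts):
        y_n = sigmoid(f(x_n, theta))
        g += grad_f(x_n, theta) * (y_n - t_n)
    return g
```

In the linear case \(f_x(\theta ) = \theta ^{\mathrm{T}}\phi _x\), this reduces to the gradient expression used for linear logistic regression.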

We now discuss a derivative-free implementation of the EnKBF (32) for nonlinear logistic regression. The key difference to linear logistic regression is that the matrix \(\varPhi \in {\mathbb {R}}^{D\times N}\) now depends on \(\theta \) and is given by

$$\begin{aligned} \varPhi (\theta ) = (\nabla _\theta f_{x_1}(\theta ),\ldots ,\nabla _\theta f_{x_N}(\theta )). \end{aligned}$$

In order to avoid the computation of the gradients of \(f_x\), one replaces \(\varSigma _\tau \varPhi (\theta _\tau )\) in (32) by the empirical correlation matrix

$$\begin{aligned} C_\tau := \frac{1}{M-1} \sum _{i=1}^M(\theta _\tau ^{(i)}-{\hat{m}}_\tau )\left( f_{x_1}(\theta _\tau ^{(i)}), \ldots , f_{x_N}(\theta _\tau ^{(i)})\right) \in {\mathbb {R}}^{D\times N} \end{aligned}$$

between \(\theta _\tau \) and \(f_{x_n}(\theta _\tau )\) for \(n=1,\ldots ,N\). See [8] for the basic idea in the context of the EnKF and also [13, 17] for the specific application of this idea in the context of machine learning. See also the recent work [24, 30] on tuneable derivative-free approximations. We also note that this substitution becomes exact in the case of linear functions \(f_x(\theta )\). Such a linear approximation is, for example, justified for functions \(f_x(\theta )\) defined by sufficiently wide neural networks [18].
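The derivative-free substitution can be sketched as follows; the ensemble layout and the function name are illustrative, and the forward evaluations \(f_{x_n}(\theta ^{(i)})\) are assumed to be precomputed:

```python
import numpy as np

def empirical_correlation(theta, f_values):
    """Sketch of the derivative-free surrogate C_tau for Sigma_tau Phi(theta_tau).

    theta    : (D, M) ensemble stored column-wise
    f_values : (N, M) array with entries f_{x_n}(theta^{(i)})
    """
    M = theta.shape[1]
    dev = theta - theta.mean(axis=1, keepdims=True)   # ensemble anomalies
    # centring f_values would not change the result since the anomalies sum to zero
    return dev @ f_values.T / (M - 1)                 # (D, N) matrix C_tau
```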

7.2 Multi-class Logistic Regression

The extension from two-class to multi-class classification is straightforward; see [5]. The posterior probabilities for L classes \(C_l\), \(l=1,\ldots ,L\), are given by

$$\begin{aligned} \pi (C_l|\phi _x) = \frac{\exp \left( a_{x,l}\right) }{\sum _{j=1}^L \exp \left( a_{x,j}\right) }, \end{aligned}$$

where the activation is given by

$$\begin{aligned} a_{x,l} := \theta _l^{\mathrm{T}}\phi _x . \end{aligned}$$

The adjustable parameters are now \(\theta = (\theta _1^{\mathrm{T}},\ldots ,\theta _L^{\mathrm{T}})^{\mathrm{T}} \in {\mathbb {R}}^{L D}\). The negative log-likelihood function becomes

$$\begin{aligned} \varPsi _{\mathrm{data}}(\theta ) = - \sum _{n=1}^N \sum _{l=1}^L t_{nl} \ln y_{nl}(\theta ) \end{aligned}$$

with

$$\begin{aligned} y_{nl}(\theta ) := \frac{\exp \left( a_{x_n,l} \right) }{\sum _{j=1}^L \exp \left( a_{x_n,j}\right) } \end{aligned}$$

and data \((x_n,\{t_{nl}\}_{l=1}^L)\), where \(t_{nl} \in \{0,1\}\) subject to \(\sum _{l=1}^L t_{nl} = 1\) for all \(n=1,\ldots ,N\). The gradients satisfy

$$\begin{aligned} \nabla _{\theta _l} \varPsi _{\mathrm{data}}(\theta ) = \sum _{n=1}^N (y_{nl}-t_{nl}) \phi _n , \end{aligned}$$

\(l=1,\ldots ,L\). The Hessian matrix of second-order derivatives has \(D\times D\) block entries

$$\begin{aligned} D_{\theta _l} D_{\theta _j} \varPsi _{\mathrm{data}}(\theta ) = \sum _{n=1}^N y_{nl}(\delta _{lj} - y_{nj}) \phi _n \phi _n^{\mathrm{T}} \end{aligned}$$

with \(\delta _{lj} = 1\) if \(l=j\) and \(\delta _{lj} = 0\) otherwise. We conclude that these formulas closely resemble the corresponding expressions for binary classification and, hence, all previously considered algorithms easily generalise to multi-class regression.
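A compact Python sketch of the multi-class negative log-likelihood and its gradient, with the parameters stored row-wise in an \(L\times D\) matrix (an illustrative layout), is:

```python
import numpy as np

def softmax_nll_and_grad(Theta, Phi, T):
    """Sketch of the multi-class negative log-likelihood and its gradient.

    Theta : (L, D) parameter matrix, row l holds theta_l
    Phi   : (N, D) feature matrix, row n holds phi_{x_n}
    T     : (N, L) one-hot labels t_{nl}
    """
    A = Phi @ Theta.T                        # activations a_{x_n, l}
    A -= A.max(axis=1, keepdims=True)        # stabilise the softmax
    Y = np.exp(A)
    Y /= Y.sum(axis=1, keepdims=True)        # class probabilities y_{nl}
    nll = -np.sum(T * np.log(Y))
    grad = (Y - T).T @ Phi                   # row l is nabla_{theta_l} Psi_data
    return nll, grad
```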

7.3 Sigmoidal Cox Processes

We consider Cox processes (CPs) on a domain \({\mathcal {X}} \subset {\mathbb {R}}^J\). Following [1], the CP is parametrised by an intensity function \(\lambda _x:{\mathcal {X}}\rightarrow {\mathbb {R}}^+\) which we represent using a sigmoidal random feature model, that is,

$$\begin{aligned} \lambda _x(\theta ) = \lambda ^*\sigma (\theta ^{\mathrm{T}} \phi _x), \end{aligned}$$

where \(\lambda ^*>0\) provides the upper threshold value for the intensity function. Note that a Gaussian process is used in [1], while we rely here on the closely related random feature maps \(\phi _x\) [21], which integrate more easily into the homotopy approach and, in particular, the EnKF [12].

See [7] for efficient Bayesian inference for CPs using variable augmentation. Here we suggest an alternative approach which uses an unbiased estimator of the negative log-likelihood within the forward Euler time-stepping (20).

More specifically, given N events \(x_n\), \(n=1,\ldots ,N\), the negative log-likelihood function is provided by

$$\begin{aligned} \varPsi _{\mathrm{data}}(\theta ) = \int _{\mathcal {X}} \lambda _x (\theta ) {\mathrm{d}}x - \sum _{n=1}^N \ln \lambda _{x_n}(\theta ) . \end{aligned}$$

Following the notation of Sect. 3.2 and using I uniformly distributed samples \({{\hat{x}}}_i\), \(i=1,\ldots ,I\), on \({\mathcal {X}}\), we obtain the random unbiased estimator

$$\begin{aligned} \varPsi _{\mathrm{data}}^\gamma (\theta ) = \frac{\lambda ^*}{I} \sum _{i=1}^I y_{{{\hat{x}}}_i}(\theta ) - \sum _{n=1}^N \ln y_{x_n}(\theta ) - N \ln \lambda ^*\end{aligned}$$
(53)

of the intractable negative log-likelihood \(\varPsi _{\mathrm{data}}\). Here we also used the previously introduced notation (4), that is, \(\lambda _x(\theta ) = \lambda ^*y_x(\theta )\).
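A minimal Python sketch of the estimator (53) reads as follows; the helpers feature_map and domain_sampler are assumptions standing in for the random feature model and for uniform sampling on \({\mathcal {X}}\):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def psi_data_estimate(theta, events_phi, lam_star, feature_map, domain_sampler,
                      I, rng):
    """Sketch of the unbiased estimator (53) of the Cox-process negative
    log-likelihood.

    events_phi     : (N, D) features phi_{x_n} of the observed events
    feature_map    : maps (I, J) domain samples to (I, D) random features
    domain_sampler : draws I uniform samples from the domain X
    """
    phi_hat = feature_map(domain_sampler(I, rng))        # (I, D) features
    integral_term = lam_star / I * sigmoid(phi_hat @ theta).sum()
    data_term = -np.sum(np.log(sigmoid(events_phi @ theta)))
    return integral_term + data_term - len(events_phi) * np.log(lam_star)
```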

We now apply the homotopy approach (10) to the associated Bayesian inference problem. Following the discussion in Sect. 3.2, the evolution converges to the correct posterior if in each time step (20) one uses a different realisation \(\gamma _{\tau _k}\) of the unbiased estimator (53) instead of \(\varPsi _{\mathrm{data}}\) in (9) and takes the limit \(\varDelta \tau \rightarrow 0\).

Example 3

Let us briefly discuss a related problem in a strictly Gaussian setting. We start from the intractable negative log-likelihood function

$$\begin{aligned} \varPsi _{\mathrm{data}}(\theta ) := \frac{1}{2}\int _0^1 \left( \theta ^{\mathrm{T}} \phi _x -t\right) ^2 {\mathrm{d}}x. \end{aligned}$$

Its gradient is provided by

$$\begin{aligned} \nabla _\theta \varPsi _{\mathrm{data}}(\theta ) = \int _0^1 \phi _x \left( \theta ^{\mathrm{T}} \phi _x -t\right) {\mathrm{d}}x \end{aligned}$$

and the associated mean-field EnKBF becomes

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}}\tau } \theta _\tau = -\frac{1}{2} \varSigma _\tau \int _0^1 \phi _x \left( \left( \theta _\tau + m_\tau \right) ^{\mathrm{T}} \phi _x -2t \right) {\mathrm{d}}x . \end{aligned}$$

So far, everything is exact and no approximation has been made. We now replace the integral by its Monte Carlo approximation using I randomly sampled \({\hat{x}}_{i,k}\sim \mathrm{U}[0,1]\), \(i=1,\ldots ,I\), and introduce a time step \(\varDelta \tau \) leading to

$$\begin{aligned} \theta _{\tau _{k+1}} = \theta _{\tau _k} - \frac{\varDelta \tau }{2I}\varSigma _{\tau _k} \sum _{i=1}^I \phi _{{{\hat{x}}}_{i,k}} \left( \left( \theta _{\tau _k}+m_{\tau _k}\right) ^{\mathrm{T}} \phi _{{{\hat{x}}}_{i,k}} - 2t \right) , \end{aligned}$$

where we re-sample the \({\hat{x}}_{i,k}\)’s in each time step \(\tau _k\), \(k=0,\ldots ,K-1\). A theoretical investigation now boils down to studying the limit \(\varDelta \tau \rightarrow 0\), which can be performed along the lines of the modified equation analysis of [20]. We note that the approximation error can be reduced by either increasing I or decreasing the step size \(\varDelta \tau \).
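A Python sketch of one such randomised time step, with the mean-field quantities \(m_\tau \) and \(\varSigma _\tau \) replaced by their ensemble counterparts, could look as follows (names illustrative):

```python
import numpy as np

def enkbf_minibatch_step(theta, t, feature_map, dtau, I, rng):
    """Sketch of one randomised time step for Example 3, with the mean-field
    quantities m_tau and Sigma_tau approximated by the ensemble mean and
    covariance.

    theta       : (D, M) ensemble stored column-wise
    t           : scalar data value
    feature_map : callable mapping a sample x in [0, 1] to a (D,) feature phi_x
    """
    m = theta.mean(axis=1, keepdims=True)
    Sigma = np.atleast_2d(np.cov(theta))               # (D, D) ensemble covariance
    x_hat = rng.random(I)                              # fresh uniform samples
    Phi = np.stack([feature_map(x) for x in x_hat], axis=1)   # (D, I)
    resid = (theta + m).T @ Phi - 2.0 * t              # (M, I) residuals
    return theta - dtau / (2.0 * I) * (Sigma @ (Phi @ resid.T))
```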

8 Conclusions

We have presented affine-invariant extensions of the popular EnKF and related methods such as the FPF to logistic regression. In addition, we have proposed a novel SDE-based affine-invariant and derivative-free sampling method which utilises the robust ensemble transform approach for Bayesian inference as an inner time-stepping method. We have also introduced a number of algorithmic choices such as localisation via dropout, linearly implicit time-stepping methods, and mini-batching the data, which help to improve the efficiency of the proposed methods. While we have only considered two simple numerical examples, future work will consider more challenging applications such as the training of the Sentence Gestalt model of natural language comprehension [25]. A numerical exploration of the affine-invariant stochastic homotopy formulation (18) would also be of practical and theoretical interest.

A number of theoretical questions also remain to be answered such as the convergence of the ensemble methods to their mean-field limits, the impact of the various approximations on the computed posterior representations, and the ergodicity and convergence to equilibrium properties of (36) and (38), respectively. Finally, an embedding of the ensemble transform Langevin dynamics into a Markov chain Monte Carlo framework would allow for an exact sampling from the posterior.