1 Introduction

Exploration of high-dimensional probability distributions is a fundamental task in statistical physics, machine learning, uncertainty quantification, econometrics, and beyond. In many practical scenarios, high-dimensional random variables of interest follow intractable probability measures that exhibit nonlinear interactions and concentrate near lower-dimensional sub-manifolds. In such settings, one cannot directly simulate the random variables of interest but may be able to evaluate the unnormalised density function pointwise.

Suppose we have an intractable target probability measure \(\nu _{\pi }\) with the unnormalised density function \(\pi (x)\) over a parameter space \({\mathcal {X}} \subseteq \mathbb {R}^d\), for example, the posterior measure in a Bayesian inference problem. Various approaches have been proposed to characterise \(\nu _{\pi }\) using some reference probability measure \(\mu \) defined over \({\mathcal {U}} \subseteq \mathbb {R}^d\), from which independent and identically distributed random variables can be drawn, e.g. a uniform or a Gaussian measure. For example, Markov chain Monte Carlo (MCMC) methods [36, 53] generate a Markov chain of random variables converging to \(\nu _{\pi }\) using \(\mu \) as the proposal; and importance sampling and/or sequential Monte Carlo [32, 48] characterise \(\nu _{\pi }\) using weighted samples drawn from \(\mu \). The recently developed transport map idea, e.g. [4, 17, 39, 43, 50], offers new insights for this task by identifying a measurable mapping, \(T: {\mathcal {U}} \mapsto {\mathcal {X}}\), such that the pushforward of \(\mu \), denoted by \(T_\sharp \, \mu \), is a close approximation to \(\nu _{\pi }\). The mapping T can then be used either to accelerate classical sampling methods such as MCMC or to improve the efficiency of importance sampling. In this work, we generalise the tensor-train (TT) approach of [17] to offer an order-preserving and multi-layered construction of transport maps that is suitable for high-dimensional random variables with nonlinear interactions and concentrated density functions.

1.1 Outline and Contributions

The TT-based construction of [17] realises the mapping T via a separable TT decomposition [47] of the target density function. Since the separable tensor decomposition enables the marginalisation of the target density at a computational cost scaling linearly in the dimension of the random variables, it offers a computationally viable way of approximating marginal and conditional density functions of the target measure. In turn, the cumulative distribution functions (CDFs) of the marginals and conditionals define the Rosenblatt transport that can couple the target measure with the uniform reference measure. Section 2 presents the relevant background of the Rosenblatt transport, the functional form of the TT decomposition of multivariate functions [2, 25, 26, 28], and the TT-based construction of the inverse Rosenblatt transport.

The TT-based construction faces several challenges. First, the TT decomposition of the non-negative target density function often cannot preserve non-negativity after rank truncation. Consequently, the resulting Rosenblatt transport may lose monotonicity. Second, the TT decomposition works best when the correlations between random variables are local, i.e. the correlation decays with the distance between the indices of the variables. In the extreme case of independent random variables, the joint density factorises into the product of marginal densities. However, high-dimensional random variables of interest often have concentrated density functions and exhibit complicated nonlinear interactions. In such cases, one may need a TT with high ranks to approximate the target probability density with sufficient accuracy, which in turn requires a rather large number of target density evaluations during the TT construction.

In Sect. 3, we overcome the first challenge by proposing a new construction of the inverse Rosenblatt transport that approximates the square root of the target density in the TT format and then builds the marginal and conditional densities from the square of the TT approximation. The resulting squared inverse Rosenblatt transport (SIRT) is order- and smoothness-preserving. In addition, utilising the squared structure of the approximation, we also establish error bounds for SIRT in terms of various statistical divergences within the f-divergence family. These bounds are useful for bounding the statistical efficiency of posterior characterisation algorithms such as MCMC and importance sampling.

Fig. 1

Illustration of DIRT. The top row shows a sequence of bridging measures approaching the target measure \(\nu _{\pi }\). Each layer of DIRT identifies an incremental mapping that couples the reference measure (bottom row) with the pullback of the bridging measure under the existing composition of mappings (middle row), which admits a simpler structure for constructing the TT decomposition

In Sect. 4, we circumvent the second challenge by introducing a multi-layer deep inverse Rosenblatt transport (DIRT) that builds a composition of SIRTs guided by a sequence of bridging measures with increasing complexity. We illustrate this idea in Fig. 1. At each layer of DIRT, we aim to obtain a composition of SIRTs, denoted by \(T_0 \circ T_1 \circ \cdots \circ T_k\), such that the pushforward of the reference measure under this composition is a close approximation of the kth bridging measure \(\nu _k\). The existing composition \(T_0 \circ T_1 \circ \cdots \circ T_k\) offers a nonlinear transformation of coordinates that can effectively capture the correlations and support of the next bridging measure \(\nu _{k+1}\). As a result, the density of the pullback measure, \((T_0 \circ T_1 \circ \cdots \circ T_k)^\sharp \, \nu _{k+1}\), can have a much simpler structure for building the TT decomposition than the density of \(\nu _{k+1}\). We can then factorise the density of \((T_0 \circ T_1 \circ \cdots \circ T_k)^\sharp \, \nu _{k+1}\) to define the incremental mapping \(T_{k+1}\) such that \((T_{k+1})_\sharp \, \mu = (T_0 \circ T_1 \circ \cdots \circ T_k)^\sharp \, \nu _{k+1}\). This way, DIRT is capable of characterising random variables with concentrated density functions by factorising a sequence of less complicated density functions in transformed coordinates. To further improve the efficiency, we also present strategies that employ general reference measures, rather than the uniform reference measure, to avoid complicated boundary layers during the DIRT construction. Moreover, we can show that the DIRT construction is robust to TT approximation errors in various statistical divergences, in the sense that the error bound on a range of divergences is a linear combination of the errors of the TT decompositions involved in the DIRT construction process.

In Sect. 5, we integrate SIRT and DIRT into existing MCMC and importance sampling methods to further reduce the estimation and sampling bias due to TT approximation errors. In Sect. 6, we demonstrate the efficiency and various aspects of DIRT on several Bayesian inverse problems governed by ordinary differential equations (ODEs) and partial differential equations (PDEs). Using a predator–prey dynamical system (Sect. 6.1), we benchmark the impact of various tuning parameters of the functional TT decomposition, such as the TT rank, the number of collocation points and the choice of the reference measure, on the accuracy of DIRT. Using an elliptic PDE (Sect. 6.3), we compare the single-layered SIRT with DIRT, and DIRT shows a clear advantage over its single-layered counterpart in both computational efficiency and accuracy. In the same example, we also demonstrate the efficiency of TT with the Fourier basis, compared to that with the piecewise-linear basis, on posterior measures that become increasingly concentrated as the number of measurements grows and the measurement noise decreases. Furthermore, we can vary the discretisation of the underlying ODE or PDE models from layer to layer to accelerate the DIRT construction. For an example involving a computationally expensive parabolic PDE (Sect. 6.4), we employ models with increasingly refined grids to construct a DIRT that would otherwise be computationally infeasible to build.

1.2 Related Work

Apart from building transport maps using TT decompositions, most other methods approximate the transport map T by solving an optimisation problem in which T minimises some statistical divergence between the target \(\nu _{\pi }\) and the pushforward \(T_\sharp \, \mu \). The mapping T often has a triangular structure, which is computationally efficient for evaluating the Jacobian and the inverse of T, and can be represented using polynomials [4, 43, 50, 51], kernel functions [15, 37], invertible neural networks [7, 9, 10, 35, 49, 52], etc. In this setting, the objective function has to be approximated using a Monte Carlo average and minimised by some (stochastic) gradient-based method. Depending on the objective function and how samples are obtained, the resulting methods may have very different structures.

  • Density approximation. The work of [4, 43, 51] aims to minimise the Kullback–Leibler (KL) divergence of the pushforward \(T_\sharp \, \mu \) from the target \(\nu _{\pi }\), in which case the pushforward density naturally approximates the target density. In this case, the KL divergence is approximated using the Jacobian of T and the target density function evaluated at samples drawn from the analytically tractable reference measure. The resulting optimisation problem may be highly nonlinear and non-convex. In each optimisation iteration, the target density function has to be re-evaluated as the reference samples are transported by the updated map. Our TT-based methods also rely on approximations to the target density. However, TT approximations employ highly efficient deterministic sampling algorithms such as TT-Cross [46], which require neither gradients nor Monte Carlo sampling. This way, TT-based methods may need fewer density evaluations to accurately approximate the target density.

  • Density estimation. The strategy adopted by normalising flows (e.g. [7, 9, 10, 35, 49, 52]) and the work of [50, 61, 63] offer an alternative that can bypass evaluations of the target density. Instead, these methods assume the availability of samples drawn from the target measure and construct objective functions using a given set of target samples. Many of these methods, particularly neural networks, were originally designed to approximate high-dimensional distributions of naturally available samples, such as images. However, in our context, the intractable target random variable \(X\) cannot be simulated directly. One has to assume that there exists an auxiliary random variable \(Y\) such that the density function of \(X\) is given by a conditional density \(\pi (x|y)\) and the pair of joint random variables \((X, Y)\) can be simulated directly. This way, density estimation can be employed to first build a mapping from some higher-dimensional reference measure to the joint measure of \((X, Y)\) and then obtain the mapping T by conditioning on a particular realisation of \(Y=y\). We provide a concrete example of normalising flows and compare it with DIRT in Sect. 6.1.

  • Greedy methods. In between the fully data-driven density estimation and the function-driven density approximation is the greedy strategy, which includes the Stein variational gradient descent method [37], its Newton variant [15], and lazy maps [4]. While greedy methods build transport maps sharing a similar composition structure with DIRT, they obtain the composition of mappings by iteratively minimising the KL divergence of the pushforward measure under the current composition of maps from the target \(\nu _{\pi }\). To ease the optimisation burden, the class of mappings used in each layer of a greedy method is often restricted, for example, to a reproducing kernel Hilbert space with Gaussian kernels [15, 37] or sparse low-order polynomials [4]. As a result, greedy methods often need a rather large number of layers to accurately approximate concentrated target densities and hence may lead to a large number of computationally costly target density evaluations. Compared to the greedy strategy, the use of bridging measures allows DIRT to construct the TT decomposition in each layer to arbitrary accuracy. The total error of DIRT also accumulates linearly in the number of layers. We provide a numerical comparison of the performance of DIRT and the Stein variational Newton method [15] in Sect. 6.1.

2 Background

In this section, we first introduce some notation and assumptions used throughout the paper. Then, we review the inverse Rosenblatt transport method that offers an algebraically exact transformation from the reference measure to the target measure. We will also discuss the role of the functional TT decomposition in the numerical construction of the (approximate) inverse Rosenblatt transport.

2.1 Notation

We consider probability measures that are absolutely continuous with respect to the Lebesgue measure. Suppose that a mapping \(S: {\mathcal {X}} \mapsto {\mathcal {U}}\) is a diffeomorphism and a probability measure \(\nu \) has a density \(p(x)\). Then the pushforward of \(\nu \) under S, denoted by \(S_\sharp \nu \), has the density:

$$\begin{aligned} S_\sharp p(u) = \big (p \circ S^{-1}\big )(u) \, \big | \nabla _{u} S^{-1}(u)\big |. \end{aligned}$$
(1)

Similarly, given a probability measure \(\lambda \) with a density \(q(u)\), the pullback of \(\lambda \) under S, denoted by \(S^\sharp \lambda \), has the density:

$$\begin{aligned} S^\sharp q(x) = \big (q \circ S \big )(x) \, \big | \nabla _{x} S(x)\big |. \end{aligned}$$
(2)
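As a simple numerical illustration of the pushforward density formula (1), the following sketch (a hypothetical one-dimensional example, not taken from the paper) pushes a standard Gaussian through the diffeomorphism \(S(x) = \tanh (x)\) and checks that the resulting density integrates to one and matches a Monte Carlo histogram of transformed samples.

```python
import numpy as np
from scipy import stats, integrate

# Hypothetical diffeomorphism S: R -> (-1, 1), its inverse and the Jacobian of the inverse.
S = np.tanh
S_inv = np.arctanh
dS_inv = lambda u: 1.0 / (1.0 - u**2)               # |d S^{-1}(u) / du|

p = stats.norm.pdf                                   # density of nu (standard Gaussian)

# Pushforward density of nu under S, cf. Eq. (1).
push_density = lambda u: p(S_inv(u)) * np.abs(dS_inv(u))

# Checks: the pushforward density integrates to one and agrees (up to Monte Carlo
# error) with a histogram of S(X) for X ~ nu.
mass, _ = integrate.quad(push_density, -1.0, 1.0)
x = stats.norm.rvs(size=100_000, random_state=0)
hist, edges = np.histogram(S(x), bins=50, range=(-1, 1), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print(mass, np.max(np.abs(hist - push_density(centres))))
```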

The shorthand \(X\sim \nu \) is used to refer to a random variable \(X\) with law \(\nu \). For a \(\nu \)-integrable function \(q: {\mathcal {X}}\mapsto \mathbb {R}\), the expectation of q is denoted by \(\nu (q) = \int q(x) \nu (\mathrm{d}x) \).

We assume the parameter space \({\mathcal {X}}\) and the reference space \({\mathcal {U}}\) can be expressed as Cartesian products \({\mathcal {X}} = {\mathcal {X}}_1 \times {\mathcal {X}}_2 \times \cdots \times {\mathcal {X}}_d\) and \({\mathcal {U}} = {\mathcal {U}}_1 \times {\mathcal {U}}_2 \times \cdots \times {\mathcal {U}}_d\), respectively, where \({\mathcal {X}}_k \subseteq \mathbb {R}\) and \({\mathcal {U}}_k = [0, 1]\). Using product-form Lebesgue measurable weighting functions \(\lambda (x) = \prod _{i = 1}^{d} \lambda _i(x_i) \) and \(\omega (u)= \prod _{i = 1}^{d} \omega _i(u_i)\), the weighted \(L^p\) norms on \({\mathcal {X}}\) and \({\mathcal {U}}\) can be expressed as

$$\begin{aligned} \big \Vert {f} \big \Vert _{L_\lambda ^{p}({\mathcal {X}})} = \left( \int _{\mathcal {X}} \left| f(x)\right| ^{p} \lambda (x) \mathrm{d}x\right) ^\frac{1}{p} \quad \text {and} \quad \big \Vert {g} \big \Vert _{L_\omega ^{p}({\mathcal {U}})} = \left( \int _{\mathcal {U}} \left| g(u)\right| ^{p} \omega (u) \mathrm{d}u\right) ^\frac{1}{p} , \end{aligned}$$

respectively. We define constants \(\lambda _i({\mathcal {X}}_i) = \int _{{\mathcal {X}}_i} \lambda _i(x_i) \mathrm{d}x_i\) for \(i = 1, ..., d\) and \(\lambda ({\mathcal {X}}) = \prod _{i = 1}^d \lambda _i({\mathcal {X}}_i)\). Likewise, we also define \(\omega _i({\mathcal {U}}_i)\) for \(i = 1, ..., d\) and \(\omega ({\mathcal {U}})\).

For a vector \(x\in \mathbb {R}^d\) and an index \(k \in \mathbb {N}\) such that \(1< k < d\), we express the first \(k-1\) coordinates and the last \(d-k\) coordinates of \(x\) as

$$\begin{aligned} x_{<k} \equiv [x_1, ..., x_{k-1}]^\top , \quad \text {and} \quad x_{>k} \equiv [x_{k+1}, ..., x_{d}]^\top , \end{aligned}$$

respectively. Similarly, we write \(x_{\le k} = (x_{<k}, x_k)\), \(x_{\ge k} = (x_{k}, x_{>k})\), \(x_{\le 1} = x_1\), and \(x_{\ge d} = x_{d}\). For any non-negative function \(\pi \in L^1_{\lambda }({\mathcal {X}})\), we define its marginal functions as

$$\begin{aligned} \pi _{\le k}(x_{\le k}) \equiv \int \pi (x_{\le k}, x_{>k}) \, \Big ({\textstyle \prod _{i = k+1}^{d} \lambda _i(x_i)}\Big )\,\mathrm{d}x_{>k}, \quad \text {for} \quad 1 \le k < d, \end{aligned}$$
(3)

with \(\pi _{\le d}(x_{\le d}) = \pi (x)\). The marginal functions should not be confused with \(\pi _1(x)\), \(\pi _2(x)\), \(...\), \(\pi _k(x)\), where the subscript indexes a sequence of functions on \({\mathcal {X}}\).

2.2 Inverse Rosenblatt Transport

We start with a d-dimensional uniform reference probability measure, \(\mu _\mathrm{uni}\), defined on the unit hypercube \({\mathcal {U}} = [0,1]^d\), which has the probability density function (PDF) \(f_\mathrm{uni}(u) = 1\). We aim to characterise a target probability measure \(\nu _{\pi }\) with the PDF

$$\begin{aligned} f_{X}(x) = \frac{1}{z}\, \pi (x)\,\lambda (x) \quad \text {and}\quad z = \int _{{\mathcal {X}}} \pi (x) \lambda (x) \mathrm{d}x. \end{aligned}$$
(4)

Here, \(\pi \in L_\lambda ^{1}({\mathcal {X}})\) is the unnormalised density function (with respect to the weight \(\lambda \)) that is non-negative, i.e. \(\pi (x) \ge 0, \forall x\in {\mathcal {X}}\), and z is the normalising constant that is often unknown.

Let \(X:=(X_1,...,X_d)\) be the target d-dimensional random variable with law \(\nu _{\pi }\) and \(U\) be the reference d-dimensional random variable with law \(\mu _\mathrm{uni}\). The Rosenblatt transport offers a viable way of constructing a map \(F: \mathbb {R}^d \mapsto \mathbb {R}^d\) such that \(F(X) = U\). As explained in [6, 58, 64], the principle of the Rosenblatt transport is the following. For \(1 \le k \le d\), we denote the marginal PDF of the k-dimensional random variable \(X_{\le k}:=(X_1,...,X_k)\) by

$$\begin{aligned} f_{X_{\le k}} ( x_{\le k} )= \frac{1}{z}\,\pi _{\le k}(x_{\le k}) \, \Big (\textstyle \prod _{i = 1}^{k} \lambda _i(x_i)\Big ) \end{aligned}$$

and the PDF of the conditional random variable \(X_{k} | X_{< k}\) by

$$\begin{aligned} f_{X_{k} | X_{< k}}( x_k | x_{<k}) = \frac{f_{X_{\le k}}( x_{<k}, x_k)}{f_{X_{< k}}( x_{<k})} = \frac{\pi _{\le k}( x_{<k}, x_k)}{\pi _{< k}( x_{<k})}\, \lambda _k(x_k). \end{aligned}$$

This way, the CDF of \(X_{1}\) and the conditional CDF of \(X_{k} | X_{< k}\) can be expressed as

$$\begin{aligned} F_{X_1}(x_1 ) = \int _{-\infty }^{x_1} f_{X_{1}}( x_1^\prime )\,\mathrm{d}x_{1}^\prime \quad \text {and} \quad F_{X_{k} | X_{< k}}(x_k | x_{<k}) = \int _{-\infty }^{x_k} f_{X_{k} | X_{< k}}( x_k^\prime | x_{<k}) \, \mathrm{d}x_{k}^\prime , \end{aligned}$$
(5)

respectively. Under mild assumptions [6], the following sequence of transformations

$$\begin{aligned} u_1 = F_{X_1}(x_1), \quad u_2 = F_{X_{2} | X_{1}}(x_2 | x_{1}), \quad \ldots , \quad u_d = F_{X_{d} | X_{< d}}(x_d | x_{<d}) \end{aligned}$$
(6)

defines uniquely a monotonically increasing map \(F: {\mathcal {X}} \mapsto {\mathcal {U}}\) in the form of

$$\begin{aligned} F(x) = \left[ F_{X_1}(x_1), \cdots , F_{X_{k} | X_{< k}}(x_k | x_{< k}), \cdots , F_{X_{d} | X_{< d}}(x_d | x_{<d})\right] ^\top , \end{aligned}$$
(7)

such that the random variable \(U= F(X)\) is uniformly distributed in the unit hypercube \([0,1]^d\). Since the kth component of F is a scalar-valued function depending on only the first k variables, that is, \(F_{X_{k} | X_{< k}}: \mathbb {R}^{k} \mapsto \mathbb {R}\), the map F has a lower-triangular form. Furthermore, the map F (as well as its inverse) is almost surely differentiable and satisfies

$$\begin{aligned} F^\sharp f_\mathrm{uni}(x) = \big (f_\mathrm{uni}\circ F \big )(x)\,\big | \nabla _{x} F(x) \big | = \big | \nabla _{x} F(x) \big | = f_{X}(x) \end{aligned}$$

\(\nu _{\pi }\)-almost surely.

Suppose one can compute the Rosenblatt transport. Then, it provides a viable way of characterising the target measure: one can first generate a uniform random variable \(U\sim \mu _\mathrm{uni}\) and then apply the inverse Rosenblatt transport (IRT)

$$\begin{aligned} X= F^{-1} \big (U\big ) \end{aligned}$$

to obtain a corresponding target random variable \(X\sim \nu _{\pi }\). The inverse Rosenblatt transport \(T \equiv F^{-1}: {\mathcal {U}} \mapsto {\mathcal {X}}\) is also lower-triangular and can be constructed by successively inverting the Rosenblatt transport for \(k = 1, ..., d\):

$$\begin{aligned} x= T(u) \equiv \left[ F_{X_1}^{-1}(u_1 ), ..., F_{X_{k} | X_{< k}}^{-1}(u_k | x_{<k}), ..., F_{X_{d} | X_{< d}}^{-1}(u_d | x_{<d}) \right] ^\top . \end{aligned}$$
(8)

The evaluation of each \(F_{X_{k} | X_{< k}}^{-1}(u_k | x_{<k})\) requires inverting only a scalar-valued monotone function \(u_k = F_{X_{k} | X_{< k}}(x_k | x_{<k})\), where \(x_{<k}\) is already determined in the first \(k-1\) steps. Using the change-of-variables formula, the expectation of a function \(h: {\mathcal {X}} \mapsto \mathbb {R}\) can be expressed as

$$\begin{aligned} \nu _{\pi }(h) = \mu _\mathrm{uni}(h\circ T). \end{aligned}$$

This way, the expectation over the intractable target probability measure can be expressed as the expectation over a reference uniform probability measure, and thus many efficient high-dimensional quadrature methods such as sparse grids [5] and quasi-Monte Carlo [16] may apply.
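The following sketch illustrates (5)-(8) with a brute-force, grid-based construction in \(d = 2\) for a hypothetical banana-shaped density; it is only meant to make the order of operations concrete, since the dense grid is exactly what the TT-based construction of Sect. 2.4 avoids in higher dimensions.

```python
import numpy as np

# Hypothetical banana-shaped unnormalised density on a truncated domain.
def pi(x1, x2):
    return np.exp(-0.5 * x1**2 - 0.5 * (x2 - 0.5 * x1**2) ** 2)

n = 400
grid = np.linspace(-4.0, 6.0, n)
X1, X2 = np.meshgrid(grid, grid, indexing="ij")
dens = pi(X1, X2) + 1e-12            # small floor keeps the discrete CDFs increasing

# Marginal of x_1 and its CDF F_{X_1}, via Riemann sums, cf. (5).
F1 = np.cumsum(dens.sum(axis=1))
F1 /= F1[-1]

rng = np.random.default_rng(0)
u = rng.uniform(size=(10_000, 2))    # reference samples U ~ mu_uni

# Invert F_{X_1}, then the conditional CDF F_{X_2|X_1} on the nearest grid slice, cf. (8).
x1 = np.interp(u[:, 0], F1, grid)
idx = np.clip(np.searchsorted(grid, x1), 0, n - 1)
x2 = np.empty_like(x1)
for j, i in enumerate(idx):
    F2 = np.cumsum(dens[i, :])
    x2[j] = np.interp(u[j, 1], F2 / F2[-1], grid)

samples = np.column_stack([x1, x2])  # approximately distributed according to nu_pi
```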

2.3 Functional Tensor-Train

For high-dimensional target measures, it may not be computationally feasible to compute the marginal densities \(\pi _{\le k}\), and hence the marginal and conditional CDFs in (5), for building the inverse Rosenblatt transport. To overcome this challenge, a recent work [17] employed the TT decomposition [47] to factorise the density of the target measure in a separable form, which leads to a computationally scalable method for building the inverse Rosenblatt transport. Here, we first discuss the basics of the TT decomposition of a multivariate function.

Since multivariate functions can be viewed as continuous analogues of tensors [28], one can factorise the unnormalised density function using the functional form of the TT decomposition [2, 25, 26]. Given a multivariate function \(h: {\mathcal {X}} \mapsto \mathbb {R}\), where \({\mathcal {X}} = {\mathcal {X}}_1 \times {\mathcal {X}}_2 \times ...\times {\mathcal {X}}_d\), TT approximates \(h(x)\) as

$$\begin{aligned} h(x) \approx {\tilde{h}}(x) \equiv \sum _{\alpha _0=1}^{r_0}\sum _{\alpha _1 = 1}^{r_1} \cdots \sum _{\alpha _d = 1}^{r_d} {{\mathsf {H}}}^{(\alpha _0, \alpha _1)}_{1} (x_1) \cdots {{\mathsf {H}}}^{(\alpha _{k-1}, \alpha _{k})}_{k} (x_{k}) \cdots {{\mathsf {H}}}^{(\alpha _{d-1}, \alpha _{d})}_{d} (x_{d}) , \end{aligned}$$
(9)

with \(r_0 = r_d = 1\), where the summation ranges \(r_0, r_1, ..., r_d\) are called TT ranks. Each univariate function \( {{\mathsf {H}}}^{(\alpha _{k-1}, \alpha _{k})}_{k} (x_{k}) : {\mathcal {X}}_k \mapsto \mathbb {R}\) is represented as a linear combination of a set of \(n_k\) basis functions \(\{\phi _{k}^{(1)}(x_k), ...,\phi _{k}^{(n_k)}(x_k)\}\). This way, we have

$$\begin{aligned} {{\mathsf {H}}}^{(\alpha _{k-1}, \alpha _{k})}_{k} (x_{k}) = \sum _{i =1}^{n_k}\phi _{k}^{(i)}(x_k) \,{{{\mathbf {\mathsf{{A}}}}}}_k [\alpha _{k-1}, i, \alpha _k], \end{aligned}$$
(10)

where \({{{\mathbf {\mathsf{{A}}}}}}_k \in \mathbb {R}^{r_{k-1}\times n_k \times r_k}\) is a coefficient tensor. Examples of the basis functions include piecewise polynomials, orthogonal functions, radial basis functions, etc. In general, the TT decomposition \({\tilde{h}}(x)\) is only an approximation to the original function \(h(x)\) because of the truncation of the TT ranks and the finite sets of basis functions used for representing each \( {{\mathsf {H}}}^{(\alpha _{k-1}, \alpha _{k})}_{k} (x_{k}) \).
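As a minimal illustration of (9)-(11), the following sketch evaluates a functional TT at a point from hypothetical (randomly generated) coefficient tensors \({{{\mathbf {\mathsf{{A}}}}}}_k\) and a Legendre basis; all sizes and names are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import legendre

# Hypothetical sizes: d variables, n basis functions and TT rank r per coordinate.
d, n, r = 4, 5, 3
ranks = [1] + [r] * (d - 1) + [1]                    # r_0 = r_d = 1
rng = np.random.default_rng(1)
A = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]

def basis(xk):
    """phi_k^{(1..n)}(x_k): Legendre polynomials P_0, ..., P_{n-1} evaluated at x_k."""
    return legendre.legvander(np.atleast_1d(xk), n - 1)[0]

def tt_core(A_k, xk):
    """The k-th TT core H_k(x_k) of (10), an r_{k-1} x r_k matrix."""
    return np.einsum("i,aib->ab", basis(xk), A_k)

def tt_eval(x):
    """h_tilde(x) = H_1(x_1) ... H_d(x_d), cf. (11)."""
    out = np.ones((1, 1))
    for k in range(d):
        out = out @ tt_core(A[k], x[k])
    return out.item()

print(tt_eval([0.1, -0.4, 0.7, 0.0]))
```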

Remark 1

For each k, grouping all the univariate functions \( {{\mathsf {H}}}^{(\alpha _{k-1}, \alpha _{k})}_{k} (x_{k}) \), we have a matrix-valued function \({\mathsf {H}}_{k}(x_k): {\mathcal {X}}_k \mapsto \mathbb {R}^{r_{k-1} \times r_k}\) that is commonly referred to as the k-th TT core. This way, the TT decomposition can also be expressed in the matrix form

$$\begin{aligned} {\tilde{h}}(x) = {\mathsf {H}}_{1}(x_1) \cdots {\mathsf {H}}_{k}(x_k) \cdots {\mathsf {H}}_{d}(x_d). \end{aligned}$$
(11)

We follow the MATLAB notation to denote vector-valued functions consisting of the \(\alpha _k\)th column and \(\alpha _{k-1}\)th row of \({\mathsf {H}}_{k}(x_k)\) by \({\mathsf {H}}_{k}^{(\,:\,, \alpha _k)}(x_k): {\mathcal {X}}_k \mapsto \mathbb {R}^{r_{k-1} \times 1}\) and \({\mathsf {H}}_{k}^{(\alpha _{k-1}, \,:\,)}(x_k): {\mathcal {X}}_k \mapsto \mathbb {R}^{1 \times r_{k}}\), respectively. In some situations, it is convenient to represent the TT decomposition with grouped coordinates. For example, we can write TT as the functional analogue of the compact singular value decomposition (SVD):

$$\begin{aligned} {\tilde{h}}(x) = \sum _{\alpha _k=1}^{r_k} {\mathsf {H}}_{\le k}^{(\alpha _k)}(x_{\le k}) \, {\mathsf {H}}_{> k}^{(\alpha _k)}(x_{>k}), \end{aligned}$$
(12)

where

$$\begin{aligned} {\mathsf {H}}_{\le k}^{(\alpha _k)}(x_{\le k})&= {\mathsf {H}}_{1}(x_1) \cdots {\mathsf {H}}_{k-1}(x_{k-1})\,{\mathsf {H}}_{k}^{(\,:\,,\alpha _k )}(x_k): {\mathcal {X}}_1 \times \cdots \times {\mathcal {X}}_k \mapsto \mathbb {R}, \end{aligned}$$
(13)
$$\begin{aligned} {\mathsf {H}}_{> k}^{(\alpha _k)}(x_{>k})&= {\mathsf {H}}_{k+1}^{(\alpha _k,\,:\,)}(x_{k+1})\,{\mathsf {H}}_{k+2}(x_{k+2}) \cdots {\mathsf {H}}_{d}(x_{d}): {\mathcal {X}}_{k+1} \times \cdots \times {\mathcal {X}}_d \mapsto \mathbb {R}. \end{aligned}$$
(14)

Given a multivariate function, its TT decomposition can be computed using alternating linear schemes such as the classical alternating least squares method (e.g. [34, 46]), density matrix renormalization group methods [30, 47, 65], and the alternating minimal energy method [18] together with the cross approximation [22,23,24, 38] or the empirical interpolation [1, 8]. In Appendix 1, we detail the cross algorithm used for constructing the functional TT decomposition.

2.4 A TT-Based Inverse Rosenblatt Transport

Using the functional TT previously discussed, now we review the TT-based construction of the inverse Rosenblatt transport [17]. Suppose one has the (approximate) TT decomposition \({\tilde{\pi }}(x)\) of the unnormalised target density \(\pi (x)\) in the form of

$$\begin{aligned} {\tilde{\pi }}(x_1, x_2, ..., x_d) = {\mathsf {F}}_{1}(x_1) \cdots {\mathsf {F}}_{k}(x_k) \cdots {\mathsf {F}}_{d}(x_d), \end{aligned}$$

where \({\mathsf {F}}_{k}(x_k) : {\mathcal {X}}_k \mapsto \mathbb {R}^{r_{k-1} \times r_k}\) is the kth TT core. Then, we can approximate the target PDF by

$$\begin{aligned} {\tilde{f}}_{X}(x) = \frac{1}{{{\tilde{c}}}}\, {{\tilde{\pi }}}(x) \, \lambda (x), \quad \text {where}\quad {{{\tilde{c}}}} = \int _{{\mathcal {X}}} {{\tilde{\pi }}}(x) \lambda (x) \mathrm{d}x. \end{aligned}$$
(15)

Proposition 1

For \(k < d\), the kth marginal PDF is given by

$$\begin{aligned} {\tilde{f}}_{X_{\le k}}(x_{\le k}) = \frac{1}{{{\tilde{c}}}} \, {\tilde{\pi }}_{\le k}(x_{\le k}) \, \Big ({\textstyle \prod _{i = 1}^{k}} \lambda _i(x_i)\Big ), \end{aligned}$$

where \({\tilde{\pi }}_{\le k}(x_{\le k}) = {\textsf {F} }_{1}(x_{1}) \cdots {\textsf {F} }_{k}(x_{k}) {\bar{{\textsf {F} }}}_{k+1} \cdots {\bar{{\textsf {F} }}}_{d}\), \({{{\tilde{c}}}} = {\bar{{\textsf {F} }}}_{1} \cdots {\bar{{\textsf {F} }}}_{d}\), and the matrices \({\bar{{\textsf {F} }}}_{k}\) are the integrated TT cores

$$\begin{aligned} {\bar{{\textsf {F} }}}_{k} = \int _{{\mathcal {X}}_k} {{\textsf {F} }}_{k} (x_k) \, \lambda _{k}(x_{k})\, \mathrm{d}x_{k} \in \mathbb {R}^{r_{k-1} \times r_k}, \quad \text {for} \quad k = 1, ..., d. \end{aligned}$$

Proof

The marginal function of \({\tilde{\pi }}(x)\) can be expressed by

$$\begin{aligned} {\tilde{\pi }}_{\le k}(x_{\le k}) = \int _{{\mathcal {X}}_{>k}} {\mathsf {F}}_{1}(x_{1}) \cdots {\mathsf {F}}_{d}(x_{d}) \,\left( \textstyle {\prod _{i = k+1}^{d}} \lambda _i(x_i)\right) \,\mathrm{d}x_{>k}. \end{aligned}$$

Using the separable form of the tensor–train, the marginal density then has the form

$$\begin{aligned}&{\tilde{\pi }}_{\le k}(x_{\le k}) \\&\quad = {\mathsf {F}}_{1}(x_{1}) \cdots {\mathsf {F}}_{k}(x_{k}) \bigg (\int _{{\mathcal {X}}_{k+1}} \!\!\!\!\!\!{\mathsf {F}}_{k+1}(x_{k+1}) \lambda _{k+1}(x_{k+1})\mathrm{d}x_{k+1}\bigg ) \cdots \left( \int _{{\mathcal {X}}_d} \!\!\!\! {\mathsf {F}}_{d}(x_{d}) \lambda _{d}(x_d)\mathrm{d}x_{d} \right) \\&\quad = {\mathsf {F}}_{1}(x_{1}) \cdots {\mathsf {F}}_{k}(x_{k}) {\bar{{\mathsf {F}}}}_{k+1} \cdots {\bar{{\mathsf {F}}}}_{d}. \end{aligned}$$

Since \({{{\tilde{c}}}} = \int _{{\mathcal {X}}} {\tilde{\pi }}(x) \lambda (x)\mathrm{d}x\), we have \({{{\tilde{c}}}} = {\bar{{\mathsf {F}}}}_{1} \cdots {\bar{{\mathsf {F}}}}_{d}\) using a similar argument. \(\square \)

The above proposition leads to the marginal PDF \({\tilde{f}}_{X_1}(x_1) = \frac{1}{{{\tilde{c}}}}\, {\tilde{\pi }}_{\le 1}(x_{1})\,\lambda _1(x_1)\) and the sequence of conditional probability densities

$$\begin{aligned} {\tilde{f}}_{X_k | X_{< k}}(x_k | x_{<k}) = \frac{{\tilde{\pi }}_{\le k}( x_{<k}, x_k)}{{\tilde{\pi }}_{< k}( x_{<k})}\,\lambda _k(x_k), \quad k=2, ..., d. \end{aligned}$$
(16)

This leads to the CDF and the sequence of conditional CDFs

$$\begin{aligned} {\tilde{F}}_{X_1}(x_1 ) = \int _{-\infty }^{x_1} {\tilde{f}}_{X_1}(x_1^\prime )\,\mathrm{d}x_1^\prime \quad \text {and}\quad {\tilde{F}}_{X_k | X_{<k} }(x_k | x_{<k}) = \int _{-\infty }^{x_k} {\tilde{f}}_{X_k | X_{< k}}(x_k^\prime | x_{<k})\, \mathrm{d}x_k^\prime , \end{aligned}$$
(17)

for \(k=2, ..., d\), and hence, the Rosenblatt transport \(U= {\tilde{F}}(X)\). This equivalently defines the inverse Rosenblatt transport \({\tilde{T}} = {\tilde{F}}^{-1}\). This way, by drawing a reference random variable \(U\sim \mu _\mathrm{uni}\) and evaluating \(X= {\tilde{T}}(U)\), we obtain an approximate target random variable \(X\sim {\tilde{T}}_\sharp \mu _\mathrm{uni}\). Note that the pushforward measure \({\tilde{T}}_\sharp \mu _\mathrm{uni}\) has the density \({\tilde{f}}_{X}(x)\).

To estimate the numerical complexity, let us introduce the maximal number of basis functions \(n=\max _{k=1,...,d} n_k\) and the maximal TT rank \(r = \max _{k=0,...,d} r_k\), and suppose we need to draw N samples from \({{\tilde{\pi }}}(x)\). Note that we can precompute \({\bar{{\mathsf {F}}}}_{k+1} \cdots {\bar{{\mathsf {F}}}}_{d}\) with a total cost of \({\mathcal {O}}(dnr^2)\) operations, before any sampling starts. Similarly, the conditioning requires the interpolation of \({\mathsf {F}}_1(x_1) \cdots {\mathsf {F}}_{k-1}(x_{k-1})\) at the current sample coordinates, which can be built up sequentially. Each univariate interpolation needs \({\mathcal {O}}(nr^2)\) operations in general, but for a piecewise interpolation this can be reduced to \({\mathcal {O}}(r^2)\) operations per sample per coordinate. Finally, the assembling of the conditional density requires the multiplication of N vectors \({\mathsf {F}}_1(x_1) \cdots {\mathsf {F}}_{k-1}(x_{k-1}) \in \mathbb {R}^{r_{k-1}}\) with a vector-valued function \({\mathsf {F}}_k(x_k){\bar{{\mathsf {F}}}}_{k+1} \cdots {\bar{{\mathsf {F}}}}_{d} \in \mathbb {R}^{r_{k-1}}\). The total complexity is therefore \({\mathcal {O}}(dnr^2 + Ndr^2 + Ndnr)\) [17].
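The following sketch mimics Proposition 1 for a hypothetical functional TT with a monomial basis on \([-1, 1]^d\) and \(\lambda _i \equiv 1\): each TT core is integrated analytically into \({\bar{{\mathsf {F}}}}_{k}\), and the marginal functions are obtained by replacing the trailing cores with these matrices. With generic cores the resulting values need not be non-negative, which anticipates the issue discussed next.

```python
import numpy as np

# Hypothetical functional TT with a monomial basis on [-1, 1] and lambda_i = 1.
d, n, r = 4, 4, 3
ranks = [1] + [r] * (d - 1) + [1]
rng = np.random.default_rng(2)
A = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]
core = lambda k, xk: np.einsum("i,aib->ab", xk ** np.arange(n), A[k])

# Integrated cores F_bar_k = int_{-1}^{1} F_k(x_k) dx_k: for the monomial basis,
# int x^i dx equals 2/(i+1) for even i and 0 for odd i.
moments = np.array([2.0 / (i + 1) if i % 2 == 0 else 0.0 for i in range(n)])
F_bar = [np.einsum("i,aib->ab", moments, A[k]) for k in range(d)]

# Normalising constant c_tilde and marginal functions pi_tilde_{<=k}, cf. Proposition 1.
c_tilde = np.linalg.multi_dot(F_bar).item()

def marginal(x_leq_k):
    k = len(x_leq_k)
    mats = [core(j, x_leq_k[j]) for j in range(k)] + F_bar[k:]
    return np.linalg.multi_dot(mats).item()

# With generic (non-sign-constrained) cores these values can be negative.
print(c_tilde, marginal([0.3]), marginal([0.3, -0.2]))
```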

Constructing the inverse Rosenblatt transport using the TT decomposition of the target density faces several challenges. First, the density function \(\pi (x)\) is non-negative; however, its truncated TT decomposition \({\tilde{\pi }}(x)\) can have negative values (a discrete analogue: the truncated SVD of a matrix with non-negative entries can contain negative entries). This leads to a critical issue: if the set \(\{ x\in {\mathcal {X}} \,|\, {\tilde{\pi }}(x) < 0\}\) has nonzero measure under \(\nu _{\pi }\), then the Rosenblatt transport constructed from \({\tilde{\pi }}(x)\) loses monotonicity. A simple way to circumvent this is to take the modulus of each univariate conditional density \({\tilde{f}}_{X_k|X_{<k}}(x_k | x_{<k})\) and then renormalise the modulus before computing the CDF [17]. However, the use of moduli and renormalisations may degrade the smoothness of the marginal and conditional PDFs. This way, the resulting inverse Rosenblatt transport and its induced PDF can lose accuracy and smoothness. More importantly, the construction of the TT decomposition (see Appendix 1 for details) requires evaluating the target density at parameter points where the target density is significant. In practice, the high probability region of a high-dimensional target density, e.g. the posterior in the Bayesian inference context, can be hard to characterise. Thus, it can be challenging to construct the TT decomposition for approximating the target density directly. In the next section, we generalise the TT-based construction of the inverse Rosenblatt transport by tackling the aforementioned challenges.

3 Squared Inverse Rosenblatt Transport

We first introduce the SIRT to overcome the negativity issue outlined above. Instead of directly decomposing the unnormalised target density \(\pi (x)\), we first obtain the (approximate) functional TT decomposition \({\tilde{g}}(x)\) of the square root of \(\pi (x)\) in the form of

$$\begin{aligned} \sqrt{\pi }(x) \approx {\tilde{g}}(x) = {\mathsf {G}}_{1}(x_1) \cdots {\mathsf {G}}_{k}(x_k) \cdots {\mathsf {G}}_{d}(x_d), \end{aligned}$$
(18)

where \({\mathsf {G}}_{k}(x_k) : {\mathcal {X}}_k \mapsto \mathbb {R}^{r_{k-1} \times r_k}\) is the kth TT core. This leads to an alternative approximation to the target PDF:

$$\begin{aligned} {\hat{f}}_{\hat{X}}(x) = \frac{1}{{{\hat{z}}}} \, \hat{\pi }(x) \,\lambda (x), \;\; \text {with}\;\; {\hat{\pi }}(x) = \gamma + {{\tilde{g}}}(x)^2\;\; \text {and}\;\; {{\hat{z}}} = \gamma \,\lambda ({\mathcal {X}}) + \int _{{\mathcal {X}}} {{\tilde{g}}}(x)^2 \,\lambda (x)\,\mathrm{d}x, \end{aligned}$$
(19)

where \(\gamma > 0\) is a constant chosen according to the \(L^2\) error of \({{\tilde{g}}}(x)\). Similar to the process discussed in Sect. 2.4, we can obtain the SIRT, \(\hat{X} = {\hat{T}}(U)\), by constructing the sequence of marginal functions \(\hat{\pi }_{ \le k}(x_{\le k}) = \int _{{\mathcal {X}}_{>k}} \hat{\pi }(x) \prod _{i=k+1}^{d}\lambda _i(x_i) \mathrm{d}x_{> k}\) for \(k=1, ..., d-1\) and computing the normalising constant \({{\hat{z}}}\). Given a reference random variable \(U\sim \mu _\mathrm{uni}\), we can evaluate \(\hat{X} = {\hat{T}}(U)\) to obtain an approximate target random variable \(\hat{X} \sim {\hat{T}}_\sharp \mu _\mathrm{uni}\), which has exactly the PDF \({\hat{f}}_{\hat{X}}(x)\). Since the function \(\hat{\pi }(x)\) is positive by construction, the resulting SIRT \({\hat{T}}\) preserves smoothness and monotonicity.
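A one-dimensional sketch of the squared construction (19) is given below; a low-order Chebyshev fit stands in for the TT approximation \({\tilde{g}}\) of \(\sqrt{\pi }\), and the constant \(\gamma \) is set from the \(L^2\) error of the fit (cf. the heuristic in (35) of Sect. 3.3). The bimodal target is a hypothetical example.

```python
import numpy as np
from numpy.polynomial import chebyshev
from scipy import integrate

# Hypothetical bimodal unnormalised target on [-5, 5] with lambda(x) = 1.
pi_fun = lambda x: np.exp(-0.5 * (x - 1.5) ** 2) + 0.3 * np.exp(-0.5 * (x + 1.5) ** 2)

# g_tilde: a degree-8 Chebyshev fit to sqrt(pi), standing in for the TT step (18).
xs = np.linspace(-5, 5, 200)
g_tilde = chebyshev.Chebyshev.fit(xs, np.sqrt(pi_fun(xs)), deg=8, domain=[-5, 5])

# gamma is set from the L^2 error of g_tilde, cf. the heuristic in (35).
eps2, _ = integrate.quad(lambda x: (g_tilde(x) - np.sqrt(pi_fun(x))) ** 2, -5, 5)
gamma = eps2 / 10.0                                   # lambda(X) = |[-5, 5]| = 10

pi_hat = lambda x: gamma + g_tilde(x) ** 2            # positive by construction, cf. (19)
z_hat, _ = integrate.quad(pi_hat, -5, 5)
print(gamma, z_hat, pi_hat(xs).min() > 0)
```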

Remark 2

For a target density \(\pi (x)\) satisfying \(\sup _{x\in {\mathcal {X}}} \pi (x) < \infty \), the ratio between \(\pi (x)\) and the approximate density \({\hat{\pi }}(x)\) satisfies

$$\begin{aligned} \sup _{x\in {\mathcal {X}}}\,\frac{\pi (x)}{{\hat{\pi }}(x)} = \sup _{x\in {\mathcal {X}}}\,\frac{\pi (x)}{\gamma + {\tilde{g}}(x)^2} = {\hat{c}} < \infty . \end{aligned}$$
(20)

This bound is essential to ensure the uniform ergodicity of the Metropolis independence sampler and the rate of convergence of importance sampling schemes defined by SIRT. See Sect. 5 for further details.

3.1 Marginal Functions and Conditional PDFs

We represent each TT core of the decomposition in (18) as

$$\begin{aligned} {{\mathsf {G}}}^{(\alpha _{k-1}, \alpha _{k})}_{k} (x_{k}) = \sum _{i =1}^{n_k}\phi _{k}^{(i)}(x_k) {{{\mathbf {\mathsf{{A}}}}}}_k [\alpha _{k-1}, i, \alpha _k], \quad k = 1, ..., d, \end{aligned}$$
(21)

where \(\{\phi _{k}^{(i)}(x_k)\}_{i = 1}^{n_k}\) is the set of basis functions for the kth coordinate and \({{{\mathbf {\mathsf{{A}}}}}}_k \in \mathbb {R}^{r_{k-1}\times n_k \times r_k}\) is the associated kth coefficient tensor. For the kth set of basis functions, we define the mass matrix \({\mathsf {M}}_k \in \mathbb {R}^{n_k \times n_k}\) by

$$\begin{aligned} {\mathsf {M}}_k[i,j] = \int _{{\mathcal {X}}_k} \phi _{k}^{(i)}(x_k)\phi _{k}^{(j)}(x_k) \,\lambda _k(x_k)\,\mathrm{d}x_k, \quad \text {for}~~ i, j = 1,..., n_k. \end{aligned}$$
(22)

Then, we can represent the marginal functions by

$$\begin{aligned} \hat{\pi }_{\le 1}(x_{1})&= \gamma \prod _{i = 2}^d \lambda _i({\mathcal {X}}_i) + \sum _{\ell _{1}=1}^{r_{1}} \Big ({\mathsf {P}}_{1}^{(\,\alpha _0\,,\ell _{1})}(x_{1}) \Big )^2, \end{aligned}$$
(23)
$$\begin{aligned} \hat{\pi }_{ \le k}(x_{\le k})&= \gamma \!\!\! \prod _{i = k+1}^d \!\!\! \lambda _i({\mathcal {X}}_i) + \sum _{\ell _{k}=1}^{r_{k}} \Big (\sum _{\alpha _{k-1}=1}^{r_{k-1}} {\mathsf {G}}^{(\alpha _{k-1}) }_{<k}(x_{<k}) \, {\mathsf {P}}_{k}^{(\alpha _{k-1},\ell _{k})}(x_{k}) \Big )^2, \quad k = 2, ..., d, \end{aligned}$$
(24)

where \(\alpha _0 = 1\) and

$$\begin{aligned} {\mathsf {G}}^{(\alpha _{k-1})}_{<k}(x_{<k})&= {\mathsf {G}}_{1}(x_{1}) \cdots {\mathsf {G}}^{(\,:\,,\alpha _{k-1})}_{k-1}(x_{k-1}): {\mathcal {X}}_{<k} \mapsto \mathbb {R}, \end{aligned}$$
(25)
$$\begin{aligned} {\mathsf {P}}_{k}^{(\alpha _{k-1}, \ell _{k})}(x_{k})&= \sum _{i =1}^{n_k}\phi _{k}^{(i)}(x_k) \, {{{\mathbf {\mathsf{{B}}}}}}_k [\alpha _{k-1}, i, \ell _k]: {\mathcal {X}}_{k} \mapsto \mathbb {R}, \end{aligned}$$
(26)

for a coefficient tensor \({{{\mathbf {\mathsf{{B}}}}}}_k \in \mathbb {R}^{r_{k-1}\times n_k \times r_k}\) that is recursively defined as follows.

Proposition 2

Starting with the last coordinate \(k = d\), we set \({{\mathbf {\mathsf{{B}}}} }_d = {{\mathbf {\mathsf{{A}}}} }_d\). Suppose for the first k dimensions (\(k>1\)), we have a coefficient tensor \({{\mathbf {\mathsf{{B}}}} }_k \in \mathbb {R}^{r_{k-1}\times n_k \times r_k}\) that defines a marginal function \({\hat{\pi }}_{\le k}(x_{\le k})\) as in (24). The following procedure can be used to obtain the coefficient tensor \({{\mathbf {\mathsf{{B}}}} }_{k-1}\in \mathbb {R}^{r_{k-2}\times n_{k-1} \times r_{k-1}}\) for defining the next marginal function \(\hat{\pi }_{< k}(x_{< k})\):

  1.

    Use the Cholesky decomposition of the mass matrix, \({\textsf {L} }_k {\textsf {L} }_k^\top = {\textsf {M} }_k \in \mathbb {R}^{n_k \times n_k}\), to construct a tensor \({{\mathbf {\mathsf{{C}}}} }_k \in \mathbb {R}^{r_{k-1}\times n_k \times r_k}\):

    $$\begin{aligned} {{\mathbf {\mathsf{{C}}}} }_k[\alpha _{k-1}, \tau , \ell _{k}] = \sum _{i = 1}^{n_k} {{\mathbf {\mathsf{{B}}}} }_k[\alpha _{k-1}, i, \ell _{k}] \, {\textsf {L} }_k[i, \tau ]. \end{aligned}$$
    (27)
  2.

    Unfold \({{\mathbf {\mathsf{{C}}}} }_k\) along the first coordinate [34] to obtain a matrix \({\textsf {C} }_k^{(\mathrm R)} \in \mathbb {R}^{r_{k-1} \times (n_k r_k)} \) and compute the thin QR decomposition

    $$\begin{aligned} {\textsf {Q} }_k {\textsf {R} }_k = \big ({\textsf {C} }_k^{(\mathrm R)} \big )^\top , \end{aligned}$$
    (28)

    where \({\textsf {Q} }_k \in \mathbb {R}^{(n_k r_k) \times r_{k-1}} \) is semi-orthogonal and \({\textsf {R} }_k \in \mathbb {R}^{r_{k-1} \times r_{k-1}}\) is upper-triangular.

  3.

    Compute the new coefficient tensor

    $$\begin{aligned} {{\mathbf {\mathsf{{B}}}} }_{k-1}[\alpha _{k-2},i, \ell _{k-1}] = \sum _{\alpha _{k-1} = 1}^{r_{k-1}} {{\mathbf {\mathsf{{A}}}} }_{k-1}[\alpha _{k-2},i, \alpha _{k-1}]\, {\textsf {R} }_k[\ell _{k-1},\alpha _{k-1}]. \end{aligned}$$
    (29)

Furthermore, at index \(k = 1\), the unfolded \({{\mathbf {\mathsf{{C}}}} }_1\) along the first coordinate is a row vector \({\textsf {C} }_1^{(\mathrm R)} \in \mathbb {R}^{1 \times (n_1 r_1)} \). Thus, the thin QR decomposition \({\textsf {Q} }_1 {\textsf {R} }_1 = \big ( {\textsf {C} }_1^{(\mathrm R)} \big )^\top \) produces a scalar \({\textsf {R} }_1\) such that \({\textsf {R} }_1^2 = \Vert {\textsf {C} }_1^{(\mathrm R)}\Vert ^2\), and then the normalising constant \({\hat{z}} = \int _{{\mathcal {X}}_1} \hat{\pi }_{ \le 1}(x_{1}) \lambda _1(x_1) \mathrm{d}x_1\) can be obtained by \( {\hat{z}} = \gamma \prod _{i = 1}^d \lambda _i({\mathcal {X}}_i) + {\textsf {R} }_1^2\).

Proof

See Appendix 2. \(\square \)
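For concreteness, the following sketch walks through the recursion of Proposition 2 for hypothetical random coefficient tensors and a generic symmetric positive definite mass matrix shared by all coordinates; it is a minimal illustration of steps 1-3, not an implementation of the paper's software.

```python
import numpy as np

# Hypothetical cores A_k of g_tilde and a generic SPD mass matrix M_k (22).
d, n, r = 4, 5, 3
ranks = [1] + [r] * (d - 1) + [1]
rng = np.random.default_rng(3)
A = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]
W = rng.standard_normal((n, n))
M = [W @ W.T + n * np.eye(n)] * d                    # mass matrices, SPD by construction

B = [None] * d
B[-1] = A[-1]                                        # B_d = A_d
for k in range(d - 1, 0, -1):                        # array index k holds the math B_{k+1}
    L = np.linalg.cholesky(M[k])                     # M_k = L_k L_k^T
    C = np.einsum("ail,it->atl", B[k], L)            # step 1, Eq. (27)
    C_R = C.reshape(ranks[k], -1)                    # unfold along the first coordinate
    _, R = np.linalg.qr(C_R.T)                       # step 2, thin QR (28)
    B[k - 1] = np.einsum("aik,lk->ail", A[k - 1], R) # step 3, Eq. (29)

# At k = 1 the unfolding is a row vector and R_1^2 = ||C_1^{(R)}||^2, so that
# z_hat = gamma * lambda(X) + R_1^2, cf. the last part of Proposition 2.
L1 = np.linalg.cholesky(M[0])
C1 = np.einsum("ail,it->atl", B[0], L1).reshape(1, -1)
print((C1 @ C1.T).item())                            # R_1^2
```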

Proposition 3

The marginal PDF of \(\hat{X}_1\) can be expressed as

$$\begin{aligned} {\hat{f}}_{\hat{X}_1}(x_1) = \frac{1}{{\hat{z}}} \bigg ( \gamma \prod _{i = 2}^d \lambda _i({\mathcal {X}}_i) + \sum _{\ell _{1}=1}^{r_{1}} \Big (\sum _{i =1}^{n_1}\phi _{1}^{(i)}(x_1) \, {\textsf {D} }_1 [i, \ell _1] \Big )^2 \bigg ) \lambda _1(x_1), \end{aligned}$$
(30)

where \({\textsf {D} }_1[i, \ell _1]={{\mathbf {\mathsf{{B}}}} }_1 [\alpha _{0}, i, \ell _1]\) for \(i = 1, ..., n_1\) and \(\alpha _0 = 1\). For \(k>1\) and a given \(x_{<k}\), the conditional PDF of \(\hat{X}_k | \hat{X}_{<k}\) can be expressed as

$$\begin{aligned} {\hat{f}}_{\hat{X}_k | \hat{X}_{< k}}(x_k | x_{<k}) = \frac{1}{\hat{\pi }_{< k}(x_{< k})} \bigg ( \gamma \!\!\! \prod _{i = k+1}^d \!\!\! \lambda _i({\mathcal {X}}_i) + \sum _{\ell _{k}=1}^{r_{k}} \Big (\sum _{i =1}^{n_k}\phi _{k}^{(i)}(x_k) \, {\textsf {D} }_k [i, \ell _k] \Big )^2 \bigg ) \lambda _k(x_k), \end{aligned}$$
(31)

where \({\textsf {D} }_k \in \mathbb {R}^{n_k \times r_k}\) is given by

$$\begin{aligned} {\textsf {D} }_k[i, \ell _k] = \sum _{\alpha _{k-1} = 1}^{r_{k-1}} {\textsf {G} }^{(\alpha _{k-1})}_{<k}(x_{<k}){{\mathbf {\mathsf{{B}}}} }_k [\alpha _{k-1}, i, \ell _k]. \end{aligned}$$

Proof

The above results directly follow from the definition of conditional PDF and the marginal functions in (23) and (24). \(\square \)

Note that the product \({\mathsf {G}}_{1}(x_{1}) \cdots {\mathsf {G}}_{k-1}(x_{k-1})\) requires \(k-1\) univariate interpolations and \(k-2\) matrix products per sample, which are the same operations as in the standard inverse Rosenblatt transport. The QR decomposition (28) and the construction of the coefficient tensors (29) need \({\mathcal {O}}(dnr^3)\) operations, but these are pre-processing steps that are independent of the number of samples. However, in contrast to the vector-valued function \({\mathsf {F}}_k(x_k){\bar{{\mathsf {F}}}}_{k+1} \cdots {\bar{{\mathsf {F}}}}_{d} \in \mathbb {R}^{r_{k-1}}\), in evaluating the PDF \({\hat{f}}_{\hat{X}}\) we need to multiply the matrix-valued function \({\mathsf {P}}_{k}(x_{k}) \in \mathbb {R}^{ r_{k-1} \times r_{k} }\) for each sample. Thus, the leading term of the complexity becomes \({\mathcal {O}}(Ndnr^2)\), one order of r or n higher than the complexity of the standard inverse Rosenblatt transport. However, for small r and n this is well compensated by a smoother map, which will be crucial in Sect. 4.

3.2 Implementation of CDFs

To evaluate the SIRT, one has to first construct the marginal CDF of \(\hat{X}_1\) and the conditional CDFs of \(\hat{X}_k | \hat{X}_{<k}\) for \(k > 1\), and then invert these CDFs (see (8)). Here, we discuss the computation and the inversion of the CDFs for problems with bounded domains, as well as extensions to problems with unbounded domains. We refer the reader to [3, 57, 62] and references therein for more details.

3.2.1 Bounded Domain with Polynomial Basis

For a bounded parameter space \({\mathcal {X}}\subset \mathbb {R}^d\), we consider the weighting function \(\lambda (x) = 1\). Since \({\mathcal {X}}\) can be expressed as a Cartesian product, without loss of generality, here we discuss the CDF of a one-dimensional random variable Z with the PDF

$$\begin{aligned} {\hat{f}}_{Z}(\zeta ) = C + \sum _{\ell =1}^{r} \Big (\sum _{i =1}^{n}\phi ^{(i)}(\zeta ) \, {\mathsf {D}}[i, \ell ] \Big )^2, \end{aligned}$$
(32)

where \(C>0\) is some constant, \(\{\phi ^{(i)}(\zeta )\}_{i = 1}^{n}\) are the basis functions, \({\mathsf {D}}\in \mathbb {R}^{n \times r}\) is a coefficient matrix, and \(\zeta \in [-1, 1]\). Here, \({\hat{f}}_{Z}(\zeta )\) can be either the marginal PDF or one of the conditional PDFs defined in Proposition 3, up to a suitable linear change of coordinates.

We first consider a polynomial basis, \(\phi ^{(i)}(\zeta ) \in \mathbb {P}_{n-1}\) for \(i = 1, ..., n\), where \(\mathbb {P}_{n-1}\) is the vector space of polynomials of degree at most \(n-1\) defined on \( [-1, 1]\). Thus, the PDF \({\hat{f}}_{Z}(\zeta )\) can be represented exactly in \(\mathbb {P}_{2n-2}\). To enable fast computation of the CDF, we choose the Chebyshev polynomials of the second kind

$$\begin{aligned} p_m(\zeta ) = \frac{\sin \big ( (m+1)\cos ^{-1}(\zeta ) \big )}{\sin \big (\cos ^{-1}(\zeta )\big )}, \quad m = 0, 1, ..., 2n-2, \end{aligned}$$

as the basis of \(\mathbb {P}_{2n-2}\). Using the roots of \(p_{2n-1}(\zeta )\), we can define the set of collocation points

$$\begin{aligned} \big \{\zeta _m \big \}_{m=1}^{2n-1}, \quad \text {where} \quad \zeta _m = \cos \Big (\frac{m \pi }{2n}\Big ). \end{aligned}$$

This way, by evaluating \({\hat{f}}_{Z}(\zeta )\) on the collocation points, which needs \({\mathcal {O}}(nr)\) operations, one can apply the collocation method ([3], Chapter 4) to represent \({\hat{f}}_{Z}(\zeta )\) using the Chebyshev basis:

$$\begin{aligned} {\hat{f}}_{Z}(\zeta ) = \sum _{m=0}^{2n-2} a_m \, p_m(\zeta ), \end{aligned}$$
(33)

where the coefficients \(\{a_m\}_{m=0}^{2n-2}\) can be computed by the fast Fourier transform with \({\mathcal {O}}(n\log (n))\) operations. Then, one can express the CDF of Z as

$$\begin{aligned} F_Z(\zeta ) = \int _{-1}^\zeta {\hat{f}}_{Z}(\zeta ^\prime ) \mathrm{d}\zeta ^\prime = \sum _{m=0}^{2n-2} \frac{a_m}{m+1} \big (t_{m+1}(\zeta ) - t_{m+1}(-1) \big ), \end{aligned}$$
(34)

where \(t_m(\zeta ) = \cos \big ( m \cos ^{-1}(\zeta ) \big )\) is the Chebyshev polynomial of the first kind of degree m. A random variable Z can be generated by drawing a uniform random variable U and evaluating \(Z = F_Z^{-1}(U)\) by solving the root finding problem \(F_Z(Z) = U\).

Remark 3

The PDF in (32) is positive for all \(\zeta \in [-1, 1]\) by construction and can be represented exactly in \(\mathbb {P}_{2n-2}\) with the polynomial basis. Thus, its Chebyshev representation in (33) is also positive. This way, the resulting CDF in (34) is strictly monotone, and thus the inverse CDF equation, \(F_Z(Z) = U\), admits a unique solution.

Remark 4

One can also employ piecewise Lagrange polynomials as a basis to enable hp-adaptivity. With piecewise Lagrange polynomials, the above-mentioned technique can also be used to obtain the piecewise definition of the CDF.

Since \(F_Z(Z) = U\) has a unique solution and \(F_Z\) is monotone and bounded between 0 and 1, root finding methods, such as the regula falsi method and Newton's method, usually require only a few iterations to solve \(F_Z(Z) = U\) to an accuracy close to machine precision. Overall, the construction of the CDF needs \({\mathcal {O}}(nr + n\log (n))\) operations, and the inversion of the CDF needs \({\mathcal {O}}(c n )\) operations, where \({\mathcal {O}}(n)\) is the cost of evaluating the CDF and c is the number of iterations required by the root finding method. In comparison, building the matrix \({\mathsf {D}}\) requires \({\mathcal {O}}(nr^2)\) operations (cf. Proposition 3).

3.2.2 Bounded Domain with Fourier Basis

If the Fourier transform of the PDF of Z, which is the characteristic function, is band-limited in the frequency domain, then one may choose the sine and cosine Fourier series as the basis for representing the PDF in (32). In this case, the above strategy can also be applied. Recall the Fourier basis with an even cardinality n,

$$\begin{aligned} \big \{ 1, ..., \sin (m \pi \zeta ), \cos (m \pi \zeta ), ..., \cos (n \pi \zeta / 2) \big \}, \quad m = 1, ..., n/2-1, \end{aligned}$$

which consists of \(n/2-1\) sine functions and \(n/2+1\) cosine functions. The PDF \({\hat{f}}_{Z}(\zeta )\) defined in (32) admits an exact representation in the Fourier basis of cardinality 2n. This way, one can represent \({\hat{f}}_{Z}(\zeta )\) as

$$\begin{aligned} {\hat{f}}_{Z}(\zeta ) = a_0 + \sum _{m=1}^{n} a_m \cos (m \pi \zeta ) + \sum _{m=1}^{n-1} b_m \sin (m \pi \zeta ), \end{aligned}$$

where the coefficients, \(a_m\) and \(b_m\), are obtained by evaluating \({\hat{f}}_{Z}(\zeta )\) on the collocation points

$$\begin{aligned} \big \{\zeta _m \big \}_{m=1}^{2n}, \quad \text {where} \quad \zeta _m = \frac{m}{n} - 1, \end{aligned}$$

and applying the rectangular rule. This leads to the CDF

$$\begin{aligned} F_Z(\zeta )&= \int _{-1}^\zeta {\hat{f}}_{Z}(\zeta ^\prime ) \mathrm{d}\zeta ^\prime \\&= a_0(\zeta + 1) + \sum _{m=1}^{n} \frac{a_m}{m\pi } \sin (m \pi \zeta ) - \sum _{m=1}^{n-1} \frac{b_m}{m\pi } \big ( \cos (m \pi \zeta ) - \cos (m \pi ) \big ). \end{aligned}$$

The construction and the inversion of the CDF using the Fourier basis require a similar number of operations to those using the polynomial basis.
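The sketch below repeats the previous construction with the Fourier basis for a hypothetical squared-form PDF that is exactly representable in the basis; the coefficients are obtained here from an interpolation solve on the collocation points \(\zeta _m = m/n - 1\) (standing in for the rectangular rule), and the CDF follows the closed-form expression above.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical positive PDF on [-1, 1] in squared form, band-limited within the basis.
n = 8
pdf = lambda z: 0.05 + (0.7 + 0.5 * np.sin(np.pi * z) + 0.3 * np.cos(2 * np.pi * z)) ** 2

zc = np.arange(1, 2 * n + 1) / n - 1.0                       # 2n collocation points
cols = [np.ones_like(zc)]
cols += [np.cos(m * np.pi * zc) for m in range(1, n + 1)]
cols += [np.sin(m * np.pi * zc) for m in range(1, n)]
coef = np.linalg.solve(np.column_stack(cols), pdf(zc))
a, b = coef[: n + 1], coef[n + 1:]                           # a_0..a_n and b_1..b_{n-1}

def cdf(z):
    out = a[0] * (z + 1.0)
    for m in range(1, n + 1):
        out += a[m] / (m * np.pi) * np.sin(m * np.pi * z)
    for m in range(1, n):
        out -= b[m - 1] / (m * np.pi) * (np.cos(m * np.pi * z) - np.cos(m * np.pi))
    return out

u = 0.8                                                      # a uniform sample
z_total = cdf(1.0)
print(brentq(lambda z: cdf(z) - u * z_total, -1.0, 1.0))     # inverse CDF sample
```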

3.2.3 Unbounded Domain

Given an unbounded domain, the simplest approach is to truncate the domain at the tail of the PDF. With the domain truncation, the above-mentioned implementations based on Chebyshev and Fourier basis can be applied directly. Although the function approximation error induced by the domain truncation can be bounded, using the resulting SIRT for computing expectations may lead to a biased estimator.

One can also consider basis functions that are intrinsic to an unbounded domain. For the domain \({\mathcal {X}}_k = (0, \infty )\), one can employ the Laguerre polynomials as the basis. This equips \({\mathcal {X}}_k\) with a natural exponential weighting function \(\lambda _k(x_k) = \exp (-x_k)\). The collocation method using higher order Laguerre polynomials can be applied again to obtain the exact representation of the CDF. Similarly, for \({\mathcal {X}}_k = (-\infty , \infty )\), the Hermite polynomials can be used as a basis, which equips \({\mathcal {X}}_k\) with a Gaussian weighting function \(\lambda _k(x_k) = \exp (-\frac{1}{2} x_k^2)\). Although one can apply the collocation method to obtain an algebraically exact representation of the CDF, the resulting CDF involves error functions, complementary error functions, and imaginary error functions. Those functions have to be approximated numerically. Thus, the computational cost of computing the CDF can be high, and it may be hard to guarantee the monotonicity and uniqueness of the inverse CDF solution at the tails. Using other bases such as the Whittaker cardinal functions for \({\mathcal {X}}_k = (-\infty , \infty )\) may face a similar challenge.

Remark 5

In a situation where the squared form of the PDF in (32) can be computed, but it is challenging to invert the CDF, one can employ rejection sampling [53] to generate random variables. In this situation, our TT approximation can still be used to draw conditional samples. However, this approach may not lead to a deterministic inverse Rosenblatt transport.

3.2.4 Change of Coordinate

One can also apply a diffeomorphic mapping to change the coordinates of an unbounded domain \({\mathcal {X}}\) to a bounded one, e.g. \({\mathcal {Z}} = [-1, 1]^d\), followed by the application of the Chebyshev polynomials or the Fourier series. Given a PDF \(f_{X}(x)\) of a random variable \(X\in {\mathcal {X}}\), suppose we have a diffeomorphic mapping \(R: {\mathcal {X}} \mapsto {\mathcal {Z}}\) and let \(q(x) = |\nabla R(x)| > 0\). For any Borel set \({\mathcal {B}}_X \subseteq {\mathcal {X}}\), we have

$$\begin{aligned} \mathbb {P} [X\in {\mathcal {B}}_X]&= \int _{{\mathcal {B}}_X} f_{X}(x) \mathrm{d}x\\&= \int _{{\mathcal {B}}_Z} (f_{X}\circ R^{-1})(\zeta )\,\big |\nabla R^{-1}\!(\zeta )\big | \mathrm{d}\zeta = \int _{{\mathcal {B}}_Z} \frac{ (f_{X}\circ R^{-1})(\zeta ) }{(q \circ R^{-1})(\zeta )} \mathrm{d}\zeta , \end{aligned}$$

where \({\mathcal {B}}_Z = R({\mathcal {B}}_X)\). Thus, one can draw a random variable \(Z\) with PDF

$$\begin{aligned} f_{Z}(\zeta ) = \frac{ (f_{X}\circ R^{-1})(\zeta ) }{(q \circ R^{-1})(\zeta )}, \end{aligned}$$

and apply the inverse mapping \(X= R^{-1}(Z)\) to obtain a random variable X with PDF \(f_{X}(x)\). With the change of coordinates \(\zeta = R(x)\), one needs to build a TT to approximate \(\sqrt{f_{Z}(\zeta )}\) and construct the corresponding SIRT to simulate the random variable \(Z\). To avoid singularities at the boundary of \({\mathcal {Z}}\), one can choose a mapping R such that the function \(q(x)\) decays more slowly than \(f_{X}(x)\).
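The following sketch illustrates this change of coordinates for a hypothetical map \(R(x) = x/\sqrt{1+x^2}\), whose Jacobian decays only polynomially and hence more slowly than the Gaussian density used as \(f_{X}\); the transformed density \(f_{Z}\) is then bounded on \([-1,1]\) and integrates to one.

```python
import numpy as np
from scipy import stats, integrate

# Hypothetical map R: R -> (-1, 1), its inverse and Jacobian; f_X is standard Gaussian.
f_X = stats.norm.pdf
R = lambda x: x / np.sqrt(1.0 + x**2)
R_inv = lambda z: z / np.sqrt(1.0 - z**2)
q = lambda x: (1.0 + x**2) ** (-1.5)                 # |nabla R(x)|, polynomial decay

# Transformed density f_Z(zeta) = f_X(R^{-1}(zeta)) / q(R^{-1}(zeta)); one would
# approximate sqrt(f_Z) in TT, build the SIRT for Z, and map samples back via R^{-1}.
f_Z = lambda z: f_X(R_inv(z)) / q(R_inv(z))

mass, _ = integrate.quad(f_Z, -1.0, 1.0)             # ~ 1: change of variables preserves mass
print(mass, f_Z(np.array([-0.999, 0.0, 0.999])))     # vanishes smoothly at the boundary
```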

3.3 SIRT Error

Since SIRT enables us to generate i.i.d. samples from the probability measure \({\hat{T}}_\sharp \mu _\mathrm{uni}\), it can be used to define either Metropolis independence samplers or importance sampling schemes. Based on certain assumptions on the TT approximation \({\tilde{g}}\), here we establish error bounds for the total variation (TV) distance, the Hellinger distance, and the \(\chi ^ { {\scriptstyle 2} } \)-divergence of the target measure \(\nu _{\pi }\) from \({\hat{T}}_\sharp \mu _\mathrm{uni}\). These divergences play a vital role in analysing the convergence of Metropolis–Hastings methods and the efficiency of importance sampling. The error analysis also provides a heuristic for choosing the constant \(\gamma \) in the approximate target density (19).

Proposition 4

Recall the approximate target density \({\hat{\pi }}(x) = \gamma + {{\tilde{g}}}(x)^2\), where \({{\tilde{g}}}(x)\) is a TT approximation to the square root of the unnormalised target density \(\sqrt{\pi }\). Suppose the error of \({\tilde{g}}\) and the constant \(\gamma \) satisfy

$$\begin{aligned} \big \Vert {{\tilde{g}} - \sqrt{\pi }} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le \epsilon \quad \mathrm{and} \quad \gamma \le \frac{1}{\lambda ({\mathcal {X}})} \big \Vert {{\tilde{g}} - \sqrt{\pi }} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} ^2, \end{aligned}$$
(35)

respectively. Then, the error of \(\sqrt{{\hat{\pi }}}\) satisfies \( \big \Vert {\sqrt{\pi } - \sqrt{\hat{\pi }}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le \sqrt{2}\epsilon \).

Proof

Applying the inequality \( \big ( \sqrt{\pi } - \sqrt{\gamma + {\tilde{g}}^2} \big )^2 \le \big ( \sqrt{\pi } - {\tilde{g}} \big )^2 + \gamma , \) we have

$$\begin{aligned} \big \Vert {\sqrt{\pi } - \textstyle \sqrt{\hat{\pi }}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})}&\le \bigg (\int _{\mathcal {X}} \Big ( \sqrt{\pi (x)} - {\tilde{g}}(x) \Big )^2 \lambda (x)\,\mathrm{d}x+ \gamma \,\lambda ({\mathcal {X}}) \bigg )^\frac{1}{2} \le \sqrt{2} \epsilon . \end{aligned}$$

\(\square \)

Proposition 5

Suppose the conditions in (35) hold. Then, the approximate normalising constant \({\hat{z}}\) satisfies \(\big | \sqrt{z} - \sqrt{{{\hat{z}}}} \big | \le \sqrt{2} \epsilon \).

Proof

The normalising constants z and \({\hat{z}}\) satisfy

$$\begin{aligned} \big | z - {\hat{z}} \big | = \Big | \int _{\mathcal {X}}\! \big ( \pi (x) - \hat{\pi }(x) \big ) \,\lambda (x)\,\mathrm{d}x\Big | \le \big \Vert {\pi -\hat{\pi } } \big \Vert _{L_\lambda ^{1}({\mathcal {X}})} . \end{aligned}$$
(36)

Applying the Hölder’s inequality (with \(p = q = 2\)) and the Minkowski inequality, the right hand side of the above inequality also satisfies

$$\begin{aligned} \big \Vert {\pi - \hat{\pi } } \big \Vert _{L_\lambda ^{1}({\mathcal {X}})}&= \big \Vert { \big ( \sqrt{\pi } - \textstyle \sqrt{\hat{\pi }}\big ) \big ( \sqrt{\pi } + \textstyle \sqrt{\hat{\pi }}\big )} \big \Vert _{L_\lambda ^{1}({\mathcal {X}})} \nonumber \\&\le \big \Vert { \sqrt{\pi } - \textstyle \sqrt{\hat{\pi }}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \big \Vert {\sqrt{\pi } + \textstyle \sqrt{\hat{\pi }}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \nonumber \\&\le \big \Vert {\sqrt{\pi } - \textstyle \sqrt{\hat{\pi }}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \; \Big ( \big \Vert {\sqrt{\pi } } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} + \big \Vert {\textstyle \sqrt{\hat{\pi }}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \Big ) \nonumber \\&= \big \Vert {\sqrt{\pi } - \textstyle \sqrt{\hat{\pi }}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \; \big ( \sqrt{z} + \sqrt{{{\hat{z}}}} \big ) . \end{aligned}$$
(37)

Since both z and \({\hat{z}}\) are positive, we have \(\big | z - {\hat{z}} \big | = \big | \sqrt{{{\hat{z}}}} -\sqrt{z} \big |\; \big ( \sqrt{z} + \sqrt{{{\hat{z}}}} \big )\). Substituting this identity and the inequality in (37) into (36), we have \( \big | \sqrt{z} - \sqrt{{{\hat{z}}}} \big | \le \big \Vert {\sqrt{\pi } - \sqrt{\hat{\pi }}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} . \) Thus, the result follows from Proposition 4. \(\square \)

Theorem 1

Suppose the conditions in (35) hold. The Hellinger distance between \(\nu _{\pi }\) and \({\hat{T}}_\sharp \mu _\mathrm{uni}\) satisfies \(D_\mathrm{H}(\nu _\pi \Vert {\hat{T}}_\sharp \mu _\mathrm{uni}) \le 2\epsilon \big / \sqrt{z}\).

Proof

Since the target measure \(\nu _\pi \) and the approximate measure \({\hat{T}}_\sharp \mu _\mathrm{uni}\), respectively, have the densities \(\frac{1}{z} \, \pi (x)\,\lambda (x)\) and \(\frac{1}{{{\hat{z}}}}\,\hat{\pi }(x)\,\lambda (x)\), the squared Hellinger distance satisfies

$$\begin{aligned} D_\mathrm{H}^2\big (\nu _\pi \Vert {\hat{T}}_\sharp \mu _\mathrm{uni}\big ) = \frac{1}{2} \int _{\mathcal {X}} \bigg (\sqrt{ \frac{\pi (x) }{z} } - \sqrt{\frac{{\hat{\pi }}(x)}{{{\hat{z}}}}} \bigg )^2 \, \lambda (x)\,\mathrm{d}x. \end{aligned}$$

This leads to the inequality

$$\begin{aligned} D_\mathrm{H}^2\big (\nu _\pi \Vert {\hat{T}}_\sharp \mu _\mathrm{uni}\big )&= \frac{1}{2z} \big \Vert { \sqrt{\pi } - \textstyle \sqrt{{\hat{\pi }}} + \sqrt{{\hat{\pi }}} - \textstyle \sqrt{{\hat{\pi }}} \, \sqrt{z/{\hat{z}}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} ^2 \\&\le \frac{1}{2z} \Big ( \big \Vert { \sqrt{\pi } - \textstyle \sqrt{{\hat{\pi }}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} + \big \Vert { \textstyle \sqrt{{\hat{\pi }}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \,\big |1 - \sqrt{z/{\hat{z}}}\,\big | \Big )^2 . \end{aligned}$$

Applying \( \big \Vert { \sqrt{{\hat{\pi }}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} ^2 = {\hat{z}}\) and Propositions 4 and 5 , the above inequality can be further reduced to

$$\begin{aligned} D_\mathrm{H}^2\big (\nu _\pi \Vert {\hat{T}}_\sharp \mu _\mathrm{uni}\big ) \le \frac{1}{2z} \Big ( \sqrt{2} \epsilon + \big |\sqrt{{{\hat{z}}}} - \sqrt{z}\big | \Big )^2 \le \frac{4\epsilon ^2}{z} . \end{aligned}$$

Thus, we have \(D_\mathrm{H}\big (\nu _\pi \Vert {\hat{T}}_\sharp \mu _\mathrm{uni}\big ) \le 2\epsilon \big / \sqrt{z}\). \(\square \)

Corollary 1

Suppose the conditions in (35) hold. The total variation distance between \(\nu _{\pi }\) and \({\hat{T}}_\sharp \mu _\mathrm{uni}\) satisfies \(D_\mathrm{TV}(\nu _{\pi }\Vert {\hat{T}}_\sharp \mu _\mathrm{uni}) \le 2 \sqrt{2} \epsilon \big / \sqrt{z}\).

Proof

The result directly follows from the inequality \(D_\mathrm{TV} \le \sqrt{2} D_\mathrm{H}\) and Theorem 1. \(\square \)

Proposition 6

Let \(\nu _{\pi }\) and \(\hat{\nu }_{\hat{\pi }}\) be two probability measures and let h be a function with finite second moments with respect to both \(\nu _{\pi }\) and \(\hat{\nu }_{\hat{\pi }}\). Then,

$$\begin{aligned} \big | \nu _{\pi }(h) - \hat{\nu }_{\hat{\pi }}(h) \big | \le \sqrt{2}\, \big ( \nu _{\pi }(h^2)^\frac{1}{2} + \hat{\nu }_{\hat{\pi }} (h^2)^\frac{1}{2} \big ) \, D_\mathrm{H}\big (\nu _\pi \Vert \hat{\nu }_{\hat{\pi }} \big ). \end{aligned}$$

Proof

Suppose \(\nu _{\pi }\) and \(\hat{\nu }_{\hat{\pi }}\), respectively, have density functions \(f_{X}(x)\) and \({\hat{f}}_{{\hat{X}}}(x)\) with respect to the Lebesgue measure. We have the following inequality

$$\begin{aligned}&\big | \nu _{\pi }(h) - \hat{\nu }_{\hat{\pi }} (h) \big | \\&\quad = \bigg | \int _{\mathcal {X}} h(x) \big ( f_{X}(x) -{\hat{f}}_{{\hat{X}}}(x) \big ) \mathrm{d}x\, \bigg | \\&\quad = \bigg | \int _{\mathcal {X}} h(x) \Big ( f_{X}(x)^\frac{1}{2} + {\hat{f}}_{{\hat{X}}}(x)^\frac{1}{2} \Big ) \Big (f_{X}(x)^\frac{1}{2} - {\hat{f}}_{{\hat{X}}}(x)^\frac{1}{2} \Big ) \mathrm{d}x\, \bigg | \\&\quad \le \bigg ( \!\int _{\mathcal {X}}\! \Big (h(x)f_{X}(x)^\frac{1}{2} + h(x){\hat{f}}_{{\hat{X}}}(x)^\frac{1}{2} \Big )^2 \mathrm{d}x\bigg )^\frac{1}{2} \bigg ( \!\int _\mathcal {X\!} \Big ( f_{X}(x)^\frac{1}{2} - {\hat{f}}_{{\hat{X}}}(x)^\frac{1}{2} \Big )^2 \mathrm{d}x\bigg )^\frac{1}{2} \\&\quad \le \sqrt{2}\,\bigg (\Big ( \int _{\mathcal {X}} h(x)^2 \, f_{X}(x)\, \mathrm{d}x\Big )^\frac{1}{2} + \Big (\int _{\mathcal {X}} h(x)^2 \, {\hat{f}}_{{\hat{X}}}(x) \, \mathrm{d}x\Big )^\frac{1}{2}\bigg ) D_\mathrm{H}\big (\nu _\pi \Vert \hat{\nu }_{\hat{\pi }} \big ). \end{aligned}$$

Thus, the result follows. \(\square \)

Corollary 2

Suppose the conditions in (35) hold. Suppose further the bound in (20) holds. Then, the \(\chi ^ { {\scriptstyle 2} } \)-divergence of \(\nu _\pi \) from \({\hat{T}}_\sharp \mu _\mathrm{uni}\) satisfies

$$\begin{aligned} D_{\chi ^ { {\scriptstyle 2} } }\big (\nu _\pi \Vert {\hat{T}}_\sharp \mu _\mathrm{uni}\big ) \le \Big (\nu _{\pi }\big (\pi ^2\big / {{\hat{\pi }}}^2\big )^\frac{1}{2} + \big ({\hat{T}}_\sharp \mu _\mathrm{uni}\big )\big (\pi ^2 \big /{{\hat{\pi }}}^2\big )^\frac{1}{2} \Big ) \, \frac{2 \sqrt{2} {\hat{z}}}{z\sqrt{z}}\, \epsilon . \end{aligned}$$

Proof

Given the bound in (20), the ratio between \(f_{X}(x)\) and \({\hat{f}}_{\hat{X}}(x)\) satisfies

$$\begin{aligned} \sup _{x\in {\mathcal {X}}} \frac{f_{X}(x)}{{\hat{f}}_{\hat{X}}(x)} = \frac{{\hat{z}}}{z} \sup _{x\in {\mathcal {X}}} \frac{\pi (x)}{{\hat{\pi }}(x)} = \frac{{\hat{z}}}{z} {\hat{c}} < \infty . \end{aligned}$$

This way, \(\nu _\pi \) is absolutely continuous with respect to \({\hat{T}}_\sharp \mu _\mathrm{uni}\). Thus, the \(\chi ^ { {\scriptstyle 2} } \)-divergence of \(\nu _\pi \) from \({\hat{T}}_\sharp \mu _\mathrm{uni}\) can be expressed as

$$\begin{aligned} D_{\chi ^ { {\scriptstyle 2} } }\big (\nu _\pi \Vert {\hat{T}}_\sharp \mu _\mathrm{uni}\big )&= \int _{\mathcal {X}} \bigg ( \frac{f_{X}(x)}{{\hat{f}}_{\hat{X}}(x)} \bigg )^2 {\hat{f}}_{\hat{X}}(x) \mathrm{d}x- 1 \\&= \nu _{\pi }\big (f_{X}\big /{\hat{f}}_{\hat{X}} \big ) - \big ({\hat{T}}_\sharp \mu _\mathrm{uni}\big )\big (f_{X} \big / {\hat{f}}_{\hat{X}}\big ) \\&= \frac{{\hat{z}}}{z}\;\Big ( \nu _{\pi }\big (\pi \big / {{\hat{\pi }}} \big ) - \big ({\hat{T}}_\sharp \mu _\mathrm{uni}\big )\big (\pi \big / {{\hat{\pi }}} \big ) \Big ). \end{aligned}$$

The bound in (20) also implies that the expectations \(\nu _{\pi }({\pi ^2}/{{\hat{\pi }}}^2)\) and \(({\hat{T}}_\sharp \mu _\mathrm{uni})({\pi ^2}/{{\hat{\pi }}}^2)\) are finite. Therefore, the result follows from Proposition 6. \(\square \)

Remark 6

The TV distance, Hellinger distance, and \(\chi ^ { {\scriptstyle 2} } \)-divergence of the target measure \(\nu _{\pi }\) from the pushforward of the reference \(\mu _\mathrm{uni}\) under the SIRT \({\hat{T}}\) are linear in the approximation error of the TT. Note that \(\sqrt{z} = \big \Vert {\sqrt{\pi }} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \). Suppose the relative error of \({\tilde{g}}\) satisfies

$$\begin{aligned} \big \Vert {{\tilde{g}} - \sqrt{\pi }} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \Big / \big \Vert {\sqrt{\pi }} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le \tau , \end{aligned}$$

then the TV distance, Hellinger distance, and \(\chi ^ { {\scriptstyle 2} } \)-divergence of \(\nu _{\pi }\) from \({\hat{T}}_\sharp \mu _\mathrm{uni}\) are all bounded by \({\mathcal {O}}(\tau )\).

4 Deep Inverse Rosenblatt Transport

In many practical applications, probability densities can be concentrated in a small region of the parameter space or have complicated correlation structures. For example, posterior densities in Bayesian inference problems with informative data often occupy a relatively small region of the parameter space and exhibit complicated nonlinear interactions in some sub-manifold, see [11, 12, 21, 50] for detailed examples. In this situation, a straightforward approximation of a complicated density function in a TT decomposition may require rather large ranks. As the number of function evaluations needed in constructing TT decompositions grows quadratically with the ranks, such direct factorisation of target densities with complicated structures may become infeasible.

Example 1

Consider a d-dimensional multivariate normal distribution with the unnormalised density \(\pi (x) = \exp (-\frac{1}{2}x^\top {\mathsf {C}}^{-1} x)\). If the covariance matrix \({\mathsf {C}}\in \mathbb {R}^{d \times d}\) is diagonal, the joint density factorises into a product of marginal densities, that is, a TT decomposition with all ranks equal to 1. This corresponds to rank-zero off-diagonal blocks \({\mathsf {C}}[1:k,~(k+1):d]\), \(k=1,...,d-1\). However, the TT ranks of a correlated normal density \(\pi (x)\) may grow exponentially in the rank of the off-diagonal blocks of \({\mathsf {C}}\) [54].

We design a DIRT framework to construct a composition of order-preserving mappings in the SIRT format that can characterise concentrated probability densities with complicated correlation structures. The construction of DIRT is guided by a sequence of bridging probability measures \({\nu }_{0}\), \({\nu }_{1}, ...\), \({\nu }_{L}\), where \(\nu _L = \nu _{\pi }\) is the target measure. Each bridging measure \(\nu _k\) has the corresponding PDF

$$\begin{aligned} f_{X^k}(x) = \frac{1}{z_k} \,\pi _k(x)\,\lambda (x), \quad \text {where} \quad z_k = \int _{\mathcal {X}} \pi _k(x) \,\lambda (x)\,\mathrm{d}x. \end{aligned}$$
(38)

Here, \(\pi _0(x)\) is the unnormalised initial density such that \(\sup _{x\in {\mathcal {X}}} \pi _0(x) < \infty \), \(\pi _L(x) = \pi (x)\) is the unnormalised target density, and the superscript k indexes the random variable \(X^k \in {\mathcal {X}}\), \(X^k \sim \nu _k\). Our goal is to construct a composition of mappings \(T_0 \circ T_1 \circ \cdots \circ T_k\) such that the pushforward of the reference measure under this composition matches the kth bridging measure, i.e. \((T_0 \circ T_1 \circ \cdots \circ T_k)_\sharp \, \mu = \nu _k\). This way, by gradually increasing the complexity in the geometry and/or the computational cost of the densities of the bridging measures, it becomes computationally feasible to construct TT and the corresponding SIRT at each layer of the composition.

Assumption 1

Denoting the ratio between two unnormalised densities by

$$\begin{aligned} r_{k,j}(x)=\frac{\pi _k( x)}{\pi _j( x)}, \end{aligned}$$
(39)

we assume that for each pair of \(j < k\), the ratio \(r_{k,j}(x)\) is finite such that

$$\begin{aligned} \sup _{x\in {\mathcal {X}}}r_{k,j}(x) = c_{k,j}< \infty , \quad \forall j < k. \end{aligned}$$
(40)

In practice, there are many ways to choose the bridging measures. For example, one can consider tempered distributions [20, 31, 40, 44, 60] where \(\pi _k(x) = \pi (x)^{\beta _k}\) for a suitably chosen set of powers (reciprocal temperatures) \(0 \le \beta _0< \cdots < \beta _L=1\) (see the illustration below); and for problems involving computationally expensive PDE models, one can employ a hierarchy of models with different grid resolutions to reduce the computational cost of building the DIRT. In the rest of this section, we present the recursive construction of DIRT and provide an error analysis.
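As a brief illustration, consider the tempered bridging densities and assume additionally that \(\sup _{x\in {\mathcal {X}}} \pi (x) < \infty \). Then, Assumption 1 is satisfied automatically, since for every pair \(j < k\)

$$\begin{aligned} r_{k,j}(x) = \pi (x)^{\beta _k - \beta _j} \quad \text {and} \quad c_{k,j} = \sup _{x\in {\mathcal {X}}} \pi (x)^{\beta _k - \beta _j} = \Big ( \sup _{x\in {\mathcal {X}}} \pi (x) \Big )^{\beta _k - \beta _j} < \infty . \end{aligned}$$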

4.1 Recursive Construction

In the initial step (\(k = 0\)), we compute a TT \({\tilde{g}}_{0}(x)\) that approximates \(\sqrt{\pi _0}\) and construct the corresponding SIRT \(\hat{X}^ { {\scriptstyle 0} } = {\hat{T}}_{0}(U)\) so that the reference random variable \(U\sim \mu _\mathrm{uni}\) and \(\hat{X}^ { {\scriptstyle 0} } \sim ({\hat{T}}_{0})_\sharp \mu _\mathrm{uni}\) with the PDF

$$\begin{aligned} {\hat{f}}_{\hat{X}^0}(x) = \frac{1}{{\hat{z}}_0} \big (\gamma _0 + {\tilde{g}}_{0}(x)^2\big )\,\lambda (x), \quad \text {with} \quad {\hat{z}}_0 = \gamma _0 \,\lambda ({\mathcal {X}}) + \int _{\mathcal {X}} {\tilde{g}}_{0}(x)^2 \, \lambda (x)\, \mathrm{d}x, \end{aligned}$$
(41)

where the constant \(\gamma _0\) is chosen such that \(0< \gamma _0 \le \lambda ({\mathcal {X}})^{-1} \big \Vert {\sqrt{\pi _0} - {\tilde{g}}_0} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} ^2\). Note that \({\hat{f}}_{\hat{X}^0}(x)\) is an approximation to \(f_{X^0}(x)\).

Remark 7

We can replace the uniform reference measure \(\mu _\mathrm{uni}\) with a general product-form probability measure \(\mu \) that has the PDF \(f_{U}(u) = \prod _{k = 1}^d f_{U_k}(u_k)\) with support in \({\mathcal {U}} = {\mathcal {U}}_1 \times {\mathcal {U}}_2 \times \cdots \times {\mathcal {U}}_d\). One can construct a mapping

$$\begin{aligned} R(u) = \big [F_{U_1}(u_1), ..., F_{U_k}(u_k),..., F_{U_d}(u_d)\big ]^\top , \end{aligned}$$

where \(F_{U_k}(u_k)\) is the CDF of \(U_k\), such that \(R_\sharp \,\mu = \mu _\mathrm{uni}\). Then, the composition of mappings \({\hat{T}}_{0} \circ R: {\mathcal {U}} \mapsto {\mathcal {X}}\) is lower-triangular and \(({\hat{T}}_{0} \circ R)_\sharp \, \mu \) has the density \({\hat{f}}_{\hat{X}^0}(x)\). We initialise the DIRT by \({\bar{T}}_{0} = {\hat{T}}_{0}\circ R\), where R is the identity map if \(\mu = \mu _\mathrm{uni}\).

After step \(k \ge 0\), suppose we have the kth DIRT given as a composition of mappings

$$\begin{aligned} {\bar{T}}_{k} = ({\hat{T}}_{0} \circ R) \circ ({\hat{T}}_{1}\circ R) \circ \cdots \circ ({\hat{T}}_{k} \circ R), \end{aligned}$$

where each \({\hat{T}}_{j}\) is a SIRT. Denoting the pushforward of the reference probability measure \(\mu \) under \({\bar{T}}_{k}\) by \(\hat{\nu }_k\), i.e. \(\hat{\nu }_k \equiv ({\bar{T}}_{k})_\sharp \, \mu \), and the density function of \(\hat{\nu }_k\) by \({\hat{f}}_{\hat{X}^k}(x)\), the pullback density of \(\hat{\nu }_k\) under \({\bar{T}}_{k}\) satisfies

$$\begin{aligned} {\bar{T}}_{k}^\sharp {\hat{f}}_{\hat{X}^k}(u)&= ({\hat{f}}_{\hat{X}^k} \circ {\bar{T}}_{k} ) (u) \, \big |\nabla _{u} {\bar{T}}_{k}(u) \big | = f_{U}(u). \end{aligned}$$
(42)

The density function of the pullback probability measure \({\bar{T}}_{k}^\sharp \hat{\nu }_k\) is the reference product density \(f_{U}(u)\). If the corresponding approximate PDF \({\hat{f}}_{\hat{X}^k}(x)\) can capture the range of variation and the correlation structure of the next PDF \(f_{X^{k+1}}(x)\), then the density function of the pullback probability measure \({\bar{T}}_{k}^\sharp \nu _{k+1}\),

$$\begin{aligned} {\bar{T}}_{k}^\sharp f_{X^{k+1}}(u) = (f_{X^{k+1}} \circ {\bar{T}}_{k} ) (u)\, \big |\nabla _{u} {\bar{T}}_{k}(u) \big |, \end{aligned}$$
(43)

may become easier to factorise in the TT format compared to the direct factorisation of the original target density function \(f_{X^{k+1}}(x)\). This way, for step \(k+1\), the existing composition \({\bar{T}}_{k}\) can be used to precondition the construction of the coupling between \(\mu \) and \(\nu _{k+1}\): by building a coupling between the pullback measure \({\bar{T}}_{k}^\sharp \nu _{k+1}\) and the reference measure

$$\begin{aligned} U^{k+1} = (T_{k+1}\circ R)(U), \quad \text {where} \quad U\sim \mu \quad \text {and} \quad U^{k+1} \sim {\bar{T}}_{k}^\sharp \nu _{k+1}, \end{aligned}$$

one can obtain a new composition of maps \(( {\bar{T}}_{k} \circ T_{k+1} \circ R)\) such that

$$\begin{aligned} ( {\bar{T}}_{k} \circ T_{k+1} \circ R )^\sharp \nu _{k+1} = \mu \quad \text {or} \quad ( {\bar{T}}_{k} \circ T_{k+1} \circ R)_\sharp \mu = \nu _{k+1}. \end{aligned}$$

We use SIRT to approximate \(T_{k+1}\). Using (42), we have \(\big |\nabla _{u} {\bar{T}}_{k}(u) \big | = \big ( ({\hat{f}}_{\hat{X}^k} \circ {\bar{T}}_{k} ) (u) \big )^{-1} \, f_{U}(u)\). Thus, the pullback density in (43) can be expressed as a ratio function

$$\begin{aligned} ({\bar{T}}_{k})^\sharp f_{X^{k+1}}(u) = \frac{(f_{X^{k+1}} \circ {\bar{T}}_{k}) (u)}{({{{\hat{f}}}}_{\hat{X}^{k}} \circ {\bar{T}}_{k}) (u)} f_{U}(u). \end{aligned}$$
(44)

This way, we can compute a TT \({\tilde{g}}_{k+1}(u)\) to approximate the function

$$\begin{aligned} q_{k+1}(u) \propto \bigg (\frac{(f_{X^{k+1}} \circ {\bar{T}}_{k}) (u)}{({{{\hat{f}}}}_{\hat{X}^{k}} \circ {\bar{T}}_{k}) (u) } \frac{f_{U}(u)}{\omega (u)} \bigg )^{\frac{1}{2}}, \end{aligned}$$
(45)

where \(\omega (u)\) is the weighting function associated with the reference domain \({\mathcal {U}}\) and \(a \propto b\) denotes that a is proportional to b. Since \(z_{k+1}\), the normalising constant of \(f_{X^{k+1}}\), is unknown, here we only need to factorise an unnormalised version of \(q_{k+1}(u)\) into a TT. The normalising constant is computed automatically during the marginalisation process of SIRT (see Proposition 2).

The SIRT \(U^{k+1} = {\hat{T}}_{k+1}(U')\) built from the TT \({\tilde{g}}_{k+1}(u)\) couples the uniform reference random variable \(U' \sim \mu _\mathrm{uni}\) with \(U^{k+1} \sim ({\hat{T}}_{k+1})_\sharp \, \mu _\mathrm{uni}\). Thus, the composition of transformations \(U^{k+1} = ({\hat{T}}_{k+1}\circ R)(U)\) couples the general reference random variable \(U\sim \mu \) with \(U^{k+1} \sim ({\hat{T}}_{k+1}\circ R)_\sharp \, \mu \), where \(({\hat{T}}_{k+1}\circ R)_\sharp \, \mu \) is an approximation to the pullback measure \({\bar{T}}_{k}^\sharp \,\nu _{k+1}\). Thus, we have

$$\begin{aligned} ({\hat{T}}_{k+1}\circ R)^\sharp ({\bar{T}}_{k}^\sharp \, \nu _{k+1} ) = ( {\bar{T}}_{k} \circ {\hat{T}}_{k+1} \circ R )^\sharp \, \nu _{k+1} \approx \mu \quad \text {or} \quad ( {\bar{T}}_{k} \circ {\hat{T}}_{k+1} \circ R )_\sharp \, \mu \approx \nu _{k+1} . \end{aligned}$$

The next DIRT is therefore defined by the new composition of mappings

$$\begin{aligned} {\bar{T}}_{k+1} := {\bar{T}}_{k} \circ ({\hat{T}}_{k+1}\circ R). \end{aligned}$$

The recursion is completed by obtaining \({\hat{T}}_{L}\) and \({\bar{T}}_{L}\).
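To make the recursion concrete, the following minimal sketch (in Python-style pseudocode) outlines one possible organisation of the construction, using the approximate ratio function introduced in Sect. 4.2.1 below. The routine build_sirt stands for the SIRT construction of Sect. 3, and all argument names are hypothetical placeholders rather than the interface of the accompanying implementation.

```python
import numpy as np

def build_dirt(pi, L, build_sirt, R, f_U, omega):
    """Sketch of the DIRT recursion with the approximate ratio of Sect. 4.2.1.

    pi          : list of unnormalised bridging densities pi_0, ..., pi_L
    build_sirt  : constructs a SIRT map T_hat from a target function of u
    R           : componentwise CDF map of the product-form reference measure
    f_U, omega  : reference density and weight function on the reference domain
    All arguments are hypothetical placeholders, not the TT-IRT interface.
    """
    # layer 0: approximate sqrt(pi_0) directly in the original coordinates x
    T_hat = [build_sirt(lambda x: np.sqrt(pi[0](x)))]

    def T_bar(u, k):
        # evaluate (T_hat_0 o R) o (T_hat_1 o R) o ... o (T_hat_k o R) at u;
        # the innermost map (T_hat_k o R) acts on u first
        for j in range(k, -1, -1):
            u = T_hat[j](R(u))
        return u

    for k in range(L):
        # pull the ratio r_{k+1,k} = pi_{k+1}/pi_k back through T_bar_k, cf. (54)
        def q_tilde(u, k=k):
            x = T_bar(u, k)
            return np.sqrt(pi[k + 1](x) / pi[k](x) * f_U(u) / omega(u))
        T_hat.append(build_sirt(q_tilde))   # SIRT for layer k + 1

    return lambda u: T_bar(u, L)            # the final DIRT map T_bar_L
```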

Proposition 7

At the jth DIRT step, the Jacobian determinant of the inverse of the incremental mapping \({\hat{T}}_{j}\) is given by

$$\begin{aligned} \big |\nabla _{u} {\hat{T}}_{j}^{-1}(u) \big | = {\hat{p}}_{U^j} (u) = \frac{1}{{\hat{z}}_j} \big (\gamma _j + {\tilde{g}}_{j}(u)^2 \big ) \omega (u) , \end{aligned}$$
(46)

with

$$\begin{aligned} {\hat{z}}_j = \gamma _j\, \omega ({\mathcal {U}})+ \int _{\mathcal {U}} {\tilde{g}}_{j}(u)^2 \omega (u) \mathrm{d}u, \end{aligned}$$
(47)

where the constant \(\gamma _j > 0\) is chosen according to the \(L^2\) error of \({\tilde{g}}_{j}\).

Proof

The SIRT \(U^{j} = {\hat{T}}_{j}(U^\prime )\), which is constructed by integrating \(\big (\gamma _j + {\tilde{g}}_{j}(u)^2 \big ) \omega (u)\), maps the uniform random variable \(U^\prime \sim \mu _\mathrm{uni}\) to \(U^{j} \sim ({\hat{T}}_{j})_\sharp \mu _\mathrm{uni}\). Thus, the pushforward measure \(({\hat{T}}_{j})_\sharp \,\mu _\mathrm{uni}\) has the density function \({\hat{p}}_{U^{j}}(u)\) defined in (46), which yields

$$\begin{aligned} {\hat{p}}_{U^{j}} (u) \equiv ({\hat{T}}_{j})_\sharp f_\mathrm{uni}(u) = \big |\nabla _{u} {\hat{T}}_{j}^{-1}(u) \big |. \end{aligned}$$

\(\square \)

Lemma 1

At step k of the DIRT construction, suppose we have the DIRTs

$$\begin{aligned} {\bar{T}}_{j} = ({\hat{T}}_{0} \circ R) \circ ({\hat{T}}_{1} \circ R) \circ \cdots \circ ({\hat{T}}_{j} \circ R), \quad \text {for} \quad j \le k. \end{aligned}$$

Suppose further we have a normalised density function over the domain \({\mathcal {X}}\) defined in (41) for \(j = 0\), and normalised density functions over the reference domain \({\mathcal {U}}\) defined in (46) for \(j = 1, ..., k\). Then, the pushforward measure \(({\bar{T}}_{k})_\sharp \,\mu \) has the PDF

$$\begin{aligned} {\hat{f}}_{\hat{X}^k}(x) \equiv ({\bar{T}}_{k})_\sharp f_{U}(x) = {\hat{f}}_{\hat{X}^0} (x) \prod _{j=1}^{k}\frac{({\hat{p}}_{U^j}\circ {\bar{T}}_{j-1}^{-1})(x)}{(f_{U}\circ {\bar{T}}_{j-1}^{-1})(x)}. \end{aligned}$$
(48)

Proof

The result can be shown using induction. For the case \(k=0\), the result follows directly from (41). Suppose (48) holds for some \(k \ge 0\). We define the composition of mappings

$$\begin{aligned} {\tilde{T}}_{k} = {\bar{T}}_{k} \circ R^{-1} = {\hat{T}}_{0}\circ ( R \circ {\hat{T}}_{1}) \circ \cdots \circ ( R \circ {\hat{T}}_{k}). \end{aligned}$$

Since \(R_\sharp \,\mu = \mu _\mathrm{uni}\), we have the identity \(({\bar{T}}_{k})_\sharp \,\mu = ({\tilde{T}}_{k})_\sharp \,\mu _\mathrm{uni}\), which leads to

$$\begin{aligned} {\hat{f}}_{\hat{X}^k}(x) = ({\bar{T}}_{k})_\sharp f_{U}(x)= ({\tilde{T}}_{k})_\sharp f_\mathrm{uni}(x). \end{aligned}$$

At step \(k+1\), we have the new composition of mappings \({\tilde{T}}_{k+1} = {\tilde{T}}_{k}\circ R\circ {\hat{T}}_{k+1}\). The pushforward measures \(({\tilde{T}}_{k+1})_\sharp \, \mu _\mathrm{uni}\) and \(({\tilde{T}}_{k})_\sharp \, \mu _\mathrm{uni}\) have the density functions

$$\begin{aligned} {\hat{f}}_{\hat{X}^{k+1}}(x)&= ({\tilde{T}}_{k+1})_\sharp f_\mathrm{uni}(x) = (f_\mathrm{uni}\circ {\tilde{T}}_{k+1}^{-1}) (x) \, \big |\nabla _{x} {\tilde{T}}_{k+1}^{-1}(x) \big | = \big |\nabla _{x} {\tilde{T}}_{k+1}^{-1}(x) \big |,\\ {\hat{f}}_{\hat{X}^k}(x)&= ({\tilde{T}}_{k})_\sharp f_\mathrm{uni}(x) = (f_\mathrm{uni}\circ {\tilde{T}}_{k}^{-1}) (x) \, \big |\nabla _{x} {\tilde{T}}_{k}^{-1}(x) \big | = \big |\nabla _{x} {\tilde{T}}_{k}^{-1}(x) \big |, \end{aligned}$$

respectively. Applying the change of variables \(u^\prime = {\tilde{T}}_{k}^{-1}(x)\) and \(u= R^{-1}(u^\prime )\), the determinant of \(\nabla _{x} {\tilde{T}}_{k+1}^{-1}(x)\) can be expressed as

$$\begin{aligned} \big |\nabla _{x} {\tilde{T}}_{k+1}^{-1}(x) \big | = \big |\nabla _{u} {\hat{T}}_{k+1}^{-1}(u) \big | \, \big |\nabla _{u'} R^{-1}(u') \big | \, \big |\nabla _{x} {\tilde{T}}_{k}^{-1}(x) \big | . \end{aligned}$$
(49)

Note that the above change of variables implies also that \(\big |\nabla _{u'} R^{-1}(u') \big | = f_{U}( u)^{-1}\) and \(u= ({\tilde{T}}_{k} \circ R)^{-1}(x) = {\bar{T}}_{k}^{-1}(x)\). Together with Proposition 7, we have

$$\begin{aligned} \big |\nabla _{x} {\tilde{T}}_{k+1}^{-1}(x) \big |&= \frac{( {\hat{p}}_{U^{k+1}} \circ {\bar{T}}_{k}^{-1}) (x)}{ ( f_{U}\circ {\bar{T}}_{k}^{-1} ) ( x)}\, \big |\nabla _{x} {\tilde{T}}_{k}^{-1}(x) \big |. \end{aligned}$$

Thus, the result follows. \(\square \)

Corollary 3

At step k of the DIRT construction, the composition of mappings, \({\bar{T}}_{k}\), satisfies

$$\begin{aligned} \big |\nabla _{x} {\bar{T}}_{k}^{-1}(x)\big | = \frac{{\hat{f}}_{\hat{X}^k}(x)}{( f_{U}\circ {\bar{T}}_{k}^{-1} ) (x)}. \end{aligned}$$

Proof

The result directly follows from \({\bar{T}}_{k}^{-1} = R^{-1} \circ {\tilde{T}}_{k}^{-1}\) and the proof of Lemma 1. \(\square \)

Remark 8

The normalised PDF at the kth DIRT step can be expressed as

$$\begin{aligned} {\hat{f}}_{\hat{X}^k}(x) = \frac{1}{ {\bar{z}}_k } \hat{\pi }_k(x) \lambda (x), \end{aligned}$$
(50)

where \({\bar{z}}_k = \prod _{j=0}^{k} {\hat{z}}_j\) is an approximation to the normalising constant \(z_k\), and

$$\begin{aligned} \hat{\pi }_k(x) = \big (\gamma _0 + {\tilde{g}}_{0}(x)^2 \big ) \prod _{j=1}^{k} \frac{\big (\gamma _j + ({\tilde{g}}_{j}\circ {\bar{T}}_{j-1}^{-1})(x)^2\big ) \, (\omega \circ {\bar{T}}_{j-1}^{-1}) (x)}{(f_{U}\circ {\bar{T}}_{j-1}^{-1}) (x)} \end{aligned}$$
(51)

is an approximation to the unnormalised bridging density \(\pi _k(x)\).
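In particular, (51) can be evaluated at any point x by traversing the layers, which is needed, for instance, for the debiasing strategies of Sect. 5. A minimal sketch (with hypothetical handles g, gamma, T_bar_inv, f_U and omega for the layer-wise TT approximations, constants, inverse maps, reference density and weight function) could read:

```python
import numpy as np

def pi_hat(x, k, g, gamma, T_bar_inv, f_U, omega):
    """Sketch of evaluating the unnormalised DIRT density (51) at a point x.
    g[j] and gamma[j] are the layer-j TT approximation and its constant,
    T_bar_inv(x, j) evaluates T_bar_j^{-1}(x); all names are placeholders."""
    val = gamma[0] + g[0](x) ** 2
    for j in range(1, k + 1):
        u = T_bar_inv(x, j - 1)
        val *= (gamma[j] + g[j](u) ** 2) * omega(u) / f_U(u)
    return val
```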

4.2 Ratio Functions and Error Analysis

We will first discuss the ratio function (45) and its approximation and then present the corresponding error analysis.

4.2.1 Ratio Functions

Given the unnormalised PDF in (51), the pullback density in (44) can be expressed as

$$\begin{aligned} ({\bar{T}}_{k})^\sharp f_{X^{k+1}}(u) = \frac{(f_{X^{k+1}} \circ {\bar{T}}_{k}) (u)}{({{{\hat{f}}}}_{\hat{X}^{k}} \circ {\bar{T}}_{k}) (u)} \, f_{U}(u) \propto \frac{(\pi _{k+1}\circ {\bar{T}}_{k})(u) }{(\hat{\pi }_k\circ {\bar{T}}_{k})(u)}\, f_{U}(u). \end{aligned}$$

This way, we need to compute a TT \({\tilde{g}}_{k+1}(u)\) to approximate the function

$$\begin{aligned} q_{k+1}(u) = \bigg (\frac{ (\pi _{k+1}\circ {\bar{T}}_{k})(u) }{(\hat{\pi }_k\circ {\bar{T}}_{k})(u)} \frac{ f_{U}(u)}{\omega (u)} \bigg )^\frac{1}{2}, \end{aligned}$$
(52)

to build the SIRT \({\hat{T}}_{k+1}\). We call this strategy the exact ratio approach.

Alternatively, the pullback density in (44) can be expressed as

$$\begin{aligned} ({\bar{T}}_{k})^\sharp f_{X^{k+1}}(u) \propto \frac{(\pi _{k+1}\circ {\bar{T}}_{k})(u) }{(\pi _{k}\circ {\bar{T}}_{k})(u) } \frac{(\pi _{k}\circ {\bar{T}}_{k})(u) }{(\hat{\pi }_k\circ {\bar{T}}_{k})(u)}\, f_{U}(u) \end{aligned}$$
(53)

Since the DIRT density function \(\hat{\pi }_k\) approximates the kth unnormalised bridging density function \(\pi _k\), the pullback density in (53) can be approximated as

$$\begin{aligned} ({\bar{T}}_{k})^\sharp f_{X^{k+1}}(u) \propto (r_{k+1,k} \circ {\bar{T}}_{k}) (u) \frac{(\pi _{k}\circ {\bar{T}}_{k})(u)}{(\hat{\pi }_k\circ {\bar{T}}_{k})(u)} \,f_{U}(u) \approx (r_{k+1,k} \circ {\bar{T}}_{k}) (u)\, f_{U}(u). \end{aligned}$$

This way, we need to compute a TT \({\tilde{g}}_{k+1}(u)\) that approximates the function

$$\begin{aligned} {\tilde{q}}_{k+1}(u) = \bigg (\frac{(r_{k+1,k} \circ {\bar{T}}_{k}) (u)\, f_{U}(u) }{\omega (u)} \bigg )^\frac{1}{2} , \end{aligned}$$
(54)

to build an alternative SIRT \({\hat{T}}_{k+1}\). We call this strategy the approximate ratio approach.

Remark 9

For all \(k \ge 0\), we want the ratio \(\pi _k / \hat{\pi }_k\) to be bounded over \({\mathcal {X}}\). Otherwise, it may cause large errors in the TT decomposition and may deteriorate the convergence of the resulting sampling schemes for characterising \(\nu _{\pi }\). Given Assumption 1 and \(\gamma _j > 0\) for all \(j = 0, ..., k\), we have

$$\begin{aligned} \sup _{x\in {\mathcal {X}}} \frac{\pi _0(x)}{\gamma _0 + {\tilde{g}}_{0}(x)^2}< \infty \quad \mathrm{and} \quad \sup _{u\in {\mathcal {U}}} \frac{(r_{k,k-1} \circ {\bar{T}}_{k-1}) (u) \, f_{U}(u)}{\big (\gamma _k + {\tilde{g}}_{k}(u)^2\big ) \, \omega (u)} < \infty , \end{aligned}$$
(55)

and thus, it can be shown (using induction) that the ratio \(\pi _k / \hat{\pi }_k\) is bounded.

Remark 10

In some situations, the ratio function \((r_{k+1,k} \circ {\bar{T}}_{k}) (u)\) may exhibit sharp boundary layers if the uniform reference measure \(\mu _\mathrm{uni}\) (with \(R = I\)) is used. This can increase the complexity of the resulting TT decompositions. Apart from carefully choosing the bridging measures, a partial remedy to the boundary layer is to use a reference measure whose density \(f_{U}(u)\) decays towards the boundary, such as the normal density truncated on a sufficiently large hypercube \([-\sigma , \sigma ]^d\). The function \(f_{U}(u)\) in (52) and (54) smooths the previous approximation errors, which can improve the accuracy of the TT approximations. With a reference measure defined on a hypercube, the collocation techniques based on Chebyshev and Fourier bases (see Sect. 3.2) can be applied to construct and evaluate functional TT decompositions in DIRT.

4.2.2 DIRT Error

Based on assumptions on the TT error at each layer of the DIRT construction, here we first establish bounds on the approximation error of the DIRT-induced approximate density \({\hat{\pi _k}}\) for both the exact ratio approach and the approximate ratio approach. The error analysis also provides heuristics for choosing the constants \(\gamma _j\). Then, the error bounds on \(\sqrt{{\hat{\pi _k}}}\) lead to bounds on the TV distance, the Hellinger distance, and the \(\chi ^ { {\scriptstyle 2} } \)-divergence of the kth bridging measure \(\nu _k\) from the pushforward measure \(({\bar{T}}_{k})_\sharp \mu \).

Theorem 2

(Exact ratio approach) At the kth DIRT step (\(k>0\)), suppose the TT decomposition \({\tilde{g}}_{k} \approx q_{k}\) and the constant \(\gamma _k\), respectively, satisfy

$$\begin{aligned} \big \Vert {{\tilde{g}}_{k} - q_k} \big \Vert _{L_\omega ^{2}({\mathcal {U}})} \le \epsilon _k \quad \mathrm{and} \quad \gamma _k \le \frac{1}{\omega ({\mathcal {U}})} \big \Vert {{\tilde{g}}_{k} - q_k} \big \Vert _{L_\omega ^{2}({\mathcal {U}})} ^2, \end{aligned}$$

where \(q_{k}\) is defined in (52). Then, the unnormalised PDF \(\hat{\pi }_k\) (51) approximates the unnormalised density function of the kth bridging measure \(\pi _k\) with the error

$$\begin{aligned} \big \Vert {\sqrt{\pi _k} - \sqrt{{\hat{\pi _k}}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le \sqrt{2\,{\bar{z}}_{k-1}} \epsilon _k, \end{aligned}$$

where \({\bar{z}}_{k-1} = \prod _{j=0}^{k-1} {\hat{z}}_j\) is the normalising constant of the unnormalised PDF \(\hat{\pi }_{k-1}\).

Proof

We start with the identity

$$\begin{aligned} \big \Vert {\sqrt{\pi _k} - \sqrt{{\hat{\pi _k}}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})}&= \bigg (\int _{\mathcal {X}} \bigg ( \Big (\frac{\pi _{k}(x)}{{\hat{\pi }}_{k-1}(x)}\Big )^\frac{1}{2} - \Big (\frac{{\hat{\pi }}_{k}(x)}{{\hat{\pi }}_{k-1}(x)}\Big )^\frac{1}{2} \bigg )^2 {\hat{\pi }}_{k-1}(x) \,\lambda (x)\,\mathrm{d}x\bigg )^\frac{1}{2} \\&= \sqrt{{\bar{z}}_{k-1} } \bigg ( \int _{\mathcal {X}} \bigg ( \Big (\frac{\pi _{k}(x)}{{\hat{\pi }}_{k-1}(x)}\Big )^\frac{1}{2} - \Big (\frac{{\hat{\pi }}_{k}(x)}{{\hat{\pi }}_{k-1}(x)}\Big )^\frac{1}{2} \bigg )^2 {\hat{f}}_{\hat{X}^{k-1}}(x)\,\mathrm{d}x\bigg )^\frac{1}{2} . \end{aligned}$$

Applying the change of variables \(u= {\bar{T}}_{k-1}^{-1}(x)\) and Equations (51) and (52), we have

$$\begin{aligned} \frac{\pi _{k}(x)}{{\hat{\pi }}_{k-1}(x)}&= \frac{(\pi _{k}\circ {\bar{T}}_{k-1})(u)}{({\hat{\pi }}_{k-1}\circ {\bar{T}}_{k-1})(u)} = \frac{ q_k (u)^2 \,\omega (u)}{f_{U}(u)},\\ \frac{{\hat{\pi }}_{k}(x)}{{\hat{\pi }}_{k-1}(x)}&= \frac{({\hat{\pi }}_{k}\circ {\bar{T}}_{k-1})(u)}{({\hat{\pi }}_{k-1}\circ {\bar{T}}_{k-1})(u)} =\frac{\big (\gamma _k + {\tilde{g}}_{k}(u)^2\big )\, \omega (u)}{f_{U}(u)}. \end{aligned}$$

In addition, Corollary 3 implies \( {\hat{f}}_{\hat{X}^{k-1}}(x)\,dx= f_{U}(u)\,du. \) Thus, we have

$$\begin{aligned} \big \Vert {\sqrt{\pi _k} - \sqrt{{\hat{\pi _k}}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})}&= \sqrt{{\bar{z}}_{k-1}} \Big ( \int _{\mathcal {U}} \Big ( q_k (u) - \big (\gamma _k + {\tilde{g}}_{k}(u)^2\big )^\frac{1}{2} \Big )^2 \omega (u) \mathrm{d}u\Big )^\frac{1}{2} \\&\le \sqrt{{\bar{z}}_{k-1}} \Big ( \int _{\mathcal {U}} \Big ( q_k (u) - {\tilde{g}}_{k}(u) \Big )^2 \omega (u) \mathrm{d}u+ \gamma _k \,\omega ({\mathcal {U}}) \Big )^\frac{1}{2} \\&\le \sqrt{2\,{\bar{z}}_{k-1}} \epsilon _k, \end{aligned}$$

where the second-to-last inequality follows from the same argument as in the proof of Proposition 4. \(\square \)

Theorem 3

(Approximate ratio approach) Suppose the initial TT decomposition \({\tilde{g}}_{0} \approx \sqrt{\pi _0}\) and the constant \(\gamma _0\) satisfy

$$\begin{aligned} \big \Vert {{\tilde{g}}_{0} - \sqrt{\pi _0}} \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le \epsilon _0 \quad \mathrm{and} \quad \gamma _0 \le \frac{1}{\lambda ({\mathcal {X}})} \big \Vert {{\tilde{g}}_{0} - \sqrt{\pi _0} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} ^2, \end{aligned}$$

respectively. At the kth DIRT step (\(k>0\)), suppose further the TT decompositions \({\tilde{g}}_{j} \approx {\tilde{q}}_{j}\) and the constants \(\gamma _j\) satisfy

$$\begin{aligned} \big \Vert {{\tilde{g}}_{j} - {\tilde{q}}_j} \big \Vert _{L_\omega ^{2}({\mathcal {U}})} \le \epsilon _j \quad \text {and} \quad \gamma _j \le \frac{1}{\omega ({\mathcal {U}})} \big \Vert {{\tilde{g}}_{j} - {\tilde{q}}_j } \big \Vert _{L_\omega ^{2}({\mathcal {U}})} ^2, \;\; \text {for} \;\; j = 1, ..., k, \end{aligned}$$
(56)

respectively, where \({\tilde{q}}_{j}\) is defined in (54). Then, the unnormalised PDF of DIRT defined by (51) approximates the kth unnormalised bridging density function with the error

$$\begin{aligned} \big \Vert {\sqrt{\pi _k} - \textstyle \sqrt{{\hat{\pi _k}}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le \sqrt{2\,c_{k,0}} \epsilon _0 + \sum _{j=1}^{k} \sqrt{2\,c_{k,j}{\bar{z}}_{j-1}} \epsilon _j, \end{aligned}$$

where \(c_{k,j} = \sup _{x\in {\mathcal {X}}} r_{k,j}(x)\) is given in Assumption 1, and \({\bar{z}}_{k} = \prod _{j=0}^{k} {\hat{z}}_j\).

Proof

The difference between \(\sqrt{\pi _k}\) and \(\sqrt{{\hat{\pi _k}}}\) can be written as

$$\begin{aligned} \sqrt{\pi _k} - \textstyle \sqrt{{\hat{\pi _k}}} =&\textstyle \left( \sqrt{\frac{\pi _k}{\pi _0}\pi _0} - \sqrt{\frac{\pi _k}{\pi _0} {\hat{\pi }}_0} \right) + \sum _{j=1}^{k} \Big ( \sqrt{\frac{\pi _k}{\pi _j} \frac{\pi _j}{\pi _{j-1}} {\hat{\pi }}_{j-1} } - \sqrt{\frac{\pi _k}{\pi _j} \frac{{\hat{\pi }}_{j}}{{\hat{\pi }}_{j-1}} {\hat{\pi }}_{j-1} } \Big ) \\ =&\textstyle \sqrt{r_{k,0}}\, (\sqrt{\pi _0} - \textstyle \sqrt{{\hat{\pi }}_0} ) + \sum _{j=1}^{k}\sqrt{r_{k,j}} \Big (\sqrt{r_{j,j-1}} - \sqrt{\frac{{\hat{\pi }}_{j}}{{\hat{\pi }}_{j-1}}} \Big ) \sqrt{{\hat{\pi }}_{j-1}} . \end{aligned}$$

Then, we have \( \big \Vert {\sqrt{\pi _k} - \sqrt{{\hat{\pi _k}}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le I_0 + \sum _{j=1}^{k} I_j\), where

$$\begin{aligned} I_0 = \big \Vert {\sqrt{r_{k,0}} \, \big (\sqrt{\pi _0} - \textstyle \sqrt{{\hat{\pi }}_0} \big ) } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \;\; \mathrm{and} \;\; I_j = \big \Vert {\sqrt{r_{k,j}} \Big (\sqrt{r_{j,j-1}} - \textstyle \sqrt{\frac{{\hat{\pi }}_{j}}{{\hat{\pi }}_{j-1}}} \Big ) \sqrt{{\hat{\pi }}_{j-1}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} . \end{aligned}$$

Recalling that \(c_{k,j} = \sup _{x\in {\mathcal {X}}} r_{k,j}(x)\) and applying Proposition 4, we have

$$\begin{aligned} I_0 \le \sqrt{c_{k,0}} \, \big \Vert {\sqrt{\pi _0} - \textstyle \sqrt{ \gamma _0 + {\tilde{g}}_{0}^2} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le \sqrt{2\,c_{k,0}} \epsilon _0 . \end{aligned}$$

Applying the change of variables \(u= {\bar{T}}_{j-1}^{-1}(x)\) and Corollary 3 for each \(j>0\), we have

$$\begin{aligned} r_{j,j-1}(x) = (r_{j,j-1} \circ {\bar{T}}_{j-1}\big )(u) = \frac{{\tilde{q}}_j(u)^2\, \omega (u) }{f_{U}(u) } ,\quad \frac{{\hat{\pi }}_{j}(x)}{{\hat{\pi }}_{j-1}(x)} = \frac{ \big ( \gamma _j + {\tilde{g}}_{j}(u) ^2\big ) \, \omega (u) }{f_{U}(u) } \end{aligned}$$

and \({\hat{f}}_{\hat{X}^{j-1}}(x)\,dx= f_{U}(u)\,du\). This leads to

$$\begin{aligned} I_j&\le \sqrt{c_{k,j}} \bigg ( \!\int _{\mathcal {X}}\! \bigg (r_{j,j-1}(x)^\frac{1}{2} - \Big (\frac{{\hat{\pi }}_{j}(x)}{{\hat{\pi }}_{j-1}(x)} \Big )^\frac{1}{2} \bigg )^2 \hat{\pi }_{j-1}(x)\,\lambda (x)\,\mathrm{d}x\bigg )^\frac{1}{2}\\&= \sqrt{c_{k,j} {\bar{z}}_{j-1}} \bigg ( \!\int _{\mathcal {X}}\! \bigg (r_{j,j-1}(x)^\frac{1}{2} - \Big (\frac{{\hat{\pi }}_{j}(x)}{{\hat{\pi }}_{j-1}(x)} \Big )^\frac{1}{2} \bigg )^2 {\hat{f}}_{\hat{X}^{j-1}}(x)\,\mathrm{d}x\bigg )^\frac{1}{2} \\&= \sqrt{c_{k,j}{\bar{z}}_{j-1}} \bigg ( \int _{\mathcal {U}} \Big ( {\tilde{q}}_j(u) - \big ( \gamma _j + {\tilde{g}}_{j}(u) ^2\big )^\frac{1}{2} \Big )^2 \omega (u) \mathrm{d}u\bigg )^\frac{1}{2} \\&\le \sqrt{c_{k,j}{\bar{z}}_{j-1}} \bigg ( \int _{\mathcal {U}} \Big ( {\tilde{q}}_j(u) - {\tilde{g}}_{j}(u) \Big )^2 \omega (u) \mathrm{d}u+ \gamma _j \,\omega ({\mathcal {U}}) \bigg )^\frac{1}{2} \\&\le \sqrt{2\,c_{k,j}{\bar{z}}_{j-1}} \epsilon _j. \end{aligned}$$

Thus, the result follows. \(\square \)

Remark 11

At first glance, it appears that Theorem 2 gives smaller errors than Theorem 3. However, this assumes that the two ratio functions in (52) and (54) are approximated with the same TT error \(\epsilon _k\), ideally with the same number of degrees of freedom in the TT cores. In practice, this may not be the case: the exact ratio (52), as decomposed in (53), carries the previous approximation errors in the term \((\pi _k \circ {{\bar{T}}}_k)(u) \big / ({{\hat{\pi }}}_k \circ {\bar{T}}_k)(u)\), which can have a complicated structure that is difficult to approximate in TT. In contrast, the approximate ratio involves only the target densities. For example, if the bridging densities \(\pi _k\) are introduced by tempering, the ratio \(r_{k+1,k} = \pi ^{\beta _{k+1}-\beta _{k}}\) is just another tempered density. For this reason, a DIRT built by the approximate ratio approach may be more accurate in practice.

Corollary 4

Given \(\pi _k\) and \({\hat{\pi _k}}\) constructed using either the exact or the approximate ratio functions, we suppose the error of \(\sqrt{{\hat{\pi _k}}}\) satisfies

$$\begin{aligned} \big \Vert {\sqrt{\pi _k} - \textstyle \sqrt{{\hat{\pi _k}}} } \big \Vert _{L_\lambda ^{2}({\mathcal {X}})} \le e_k. \end{aligned}$$

Then, the Hellinger distance and the total variation distance between the kth bridging measure \(\nu _k\) and the pushforward measure \(({\bar{T}}_{k})_\sharp \mu \) satisfy

$$\begin{aligned} D_\mathrm{H}\big (\nu _k\Vert ({\bar{T}}_{k})_\sharp \mu \big ) \le \frac{\sqrt{2} e_k}{\sqrt{z_k} }\quad \mathrm{and} \quad D_\mathrm{TV}\big (\nu _k\Vert ({\bar{T}}_{k})_\sharp \mu \big ) \le \frac{2 \, e_k}{ \sqrt{z_k} }, \end{aligned}$$

respectively. The \(\chi ^ { {\scriptstyle 2} } \)-divergence of \(\nu _k\) from \(({\bar{T}}_{k})_\sharp \mu \) satisfies

$$\begin{aligned} D_{\chi ^ { {\scriptstyle 2} } }\big (\nu _k\Vert ({\bar{T}}_{k})_\sharp \mu \big ) \le \Big ( \nu _k \big (\pi _k^2 \big / {\hat{\pi _k}}^2\big )^\frac{1}{2} + \big (({\bar{T}}_{k})_\sharp \mu \big )\big (\pi _k^2 \big / {\hat{\pi _k}}^2 \big )^\frac{1}{2} \Big ) \, \frac{2 {\bar{z}}_k}{z_k\sqrt{z_k}}\, e_k. \end{aligned}$$

Proof

The results can be obtained by applying the same proofs of Theorem 1, Corollary 1, and Corollary 2, respectively. Note that for the \(\chi ^ { {\scriptstyle 2} } \)-divergence, the condition \(\sup _{x\in {\mathcal {X}}} \pi _k(x)/{\hat{\pi _k}}(x) < \infty \) holds as discussed in Remark 9. \(\square \)

5 Debiasing

Applying either SIRT or DIRT, one can obtain an approximate map \(T: {\mathcal {U}} \mapsto {\mathcal {X}}\) that enables the simulation of a random variable \(\hat{X} \sim T_\sharp \, \mu \) approximating the target random variable \(X\sim \nu _{\pi }\). In a situation where SIRT and DIRT approximate the target measure with high accuracy, one can approximate the expectation \(\nu _{\pi }(h)\) of a function of interest \(h(x)\) directly by the expectation of \(h(x)\) over \(T_\sharp \,\mu \), i.e. \((T_\sharp \mu )(h) \equiv \mu (h\circ T)\). The bias of this approximation is bounded in terms of the Hellinger distance \(D_\mathrm{H}( \nu _{\pi }\Vert T_\sharp \mu )\) (see Proposition 6). In addition, we can apply the approximate inverse Rosenblatt transport T within the Metropolis–Hastings method and importance sampling to reduce the bias in computing \(\nu _{\pi }(h)\). For the sake of completeness, here we discuss some debiasing strategies based on existing work.

Algorithm 1 IRT-MCMC: Metropolised independent sampler using the approximate IRT as the proposal

We first consider the IRT-MCMC (Algorithm 1), in which the approximate IRT is used as a proposal mechanism in the Metropolised independent sampler for constructing a Markov chain of random variables that converges to the target measure. In the acceptance probability (57), \(f_{X}(\cdot )\) is the PDF of the target measure \(\nu _{\pi }\) and \({\hat{f}}_{\hat{X}}(\cdot )\) is the PDF of \(T_\sharp \mu \) that is defined by either the SIRT (19) or the DIRT (50). Following the result of Mengersen and Tweedie [41], the bounds discussed in Remarks 2 and 9 guarantee the uniform ergodicity of the Markov chain constructed by Algorithm 1. In addition, the average rejection probability is bounded by \(2\,D_\mathrm{TV}(\nu _{\pi }\Vert T_\sharp \mu )\), see Lemma 1 of [17]. This provides an indicator of the performance of the Metropolised independent sampler. However, our bound on \(D_\mathrm{TV}(\nu _{\pi }\Vert T_\sharp \mu )\) does not directly translate into a bound on the convergence rate of the Metropolised independent sampler, in which the use of acceptance/rejection may require a more precise control of the pointwise error, e.g. \( \big \Vert {{\tilde{g}} - \sqrt{\pi }} \big \Vert _{L_\lambda ^{\infty }({\mathcal {X}})} \).
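As an illustration, a minimal sketch of such a Metropolised independent sampler is given below; the acceptance rule is the standard one for an independence sampler (the weight function and normalising constants cancel in the ratio, so the unnormalised densities suffice) and stands in for (57). The function handles T, pi, pi_hat and sample_ref are hypothetical placeholders.

```python
import numpy as np

def irt_mcmc(T, pi, pi_hat, sample_ref, n_steps, rng=None):
    """Sketch of IRT-MCMC: a Metropolised independent sampler whose proposal
    is the pushforward of the reference measure under the approximate IRT T.
    pi and pi_hat are the unnormalised target and IRT densities; sample_ref
    draws a reference sample U ~ mu.  All handles are placeholders."""
    rng = rng or np.random.default_rng()
    x = T(sample_ref())                      # initial state from the proposal
    chain = [x]
    for _ in range(n_steps):
        x_new = T(sample_ref())              # independent proposal X' ~ T_sharp mu
        log_alpha = (np.log(pi(x_new)) + np.log(pi_hat(x)) -
                     np.log(pi(x)) - np.log(pi_hat(x_new)))
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            x = x_new                        # accept; otherwise keep the current state
        chain.append(x)
    return np.asarray(chain)
```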

Algorithm 2 IRT-IS: importance sampling using the approximate IRT as the biasing distribution

One can also employ the approximate IRT built by either the SIRT or the DIRT as the biasing distribution in importance sampling, which leads to the IRT-IS algorithm (Algorithm 2). Compared to IRT-MCMC, IRT-IS generates random variables \(\hat{X}^ { {\scriptstyle (j)} } \) from the approximate IRT and corrects the bias using the weights \(W_j\). By avoiding the Markov chain, importance sampling offers several advantages over the Metropolised independent sampler: (i) it can be easily parallelised; and (ii) variance reduction techniques such as antithetic variables and control variates (see [53, Chapter 4] and references therein) and efficient high-dimensional quadrature methods such as quasi-Monte Carlo [16] can be naturally applied within importance sampling.

The error bounds established in Sects. 3 and 4 offer insights into the efficiency of IRT-IS. As discussed in [48, Chapter 9], for N approximate target random variables, one can use the effective sample size (ESS)

$$\begin{aligned} \mathrm{ESS}(N) = N\,\frac{(T_\sharp \mu )(w)^2 }{(T_\sharp \mu )(w^2)}, \end{aligned}$$

where \(w(x) = \pi (x)/{\hat{\pi }}(x)\) for SIRT and \(w(x) = \pi (x)/{\hat{\pi _L}}(x)\) for DIRT, to measure the efficiency of importance sampling for representing the target measure \(\nu _{\pi }\).
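A corresponding sketch of the self-normalised importance sampling estimator and of the empirical ESS (with the same hypothetical handles as in the IRT-MCMC sketch above) could be:

```python
import numpy as np

def irt_is(T, pi, pi_hat, h, sample_ref, N):
    """Sketch of IRT-IS: self-normalised importance sampling with the
    approximate IRT as the biasing distribution.  pi and pi_hat are the
    unnormalised target and IRT densities; constant factors cancel in the
    ratio estimator and in the ESS estimate.  All handles are placeholders."""
    xs = [T(sample_ref()) for _ in range(N)]          # X_hat^(j) ~ T_sharp mu
    w = np.array([pi(x) / pi_hat(x) for x in xs])     # importance weights W_j
    hs = np.array([h(x) for x in xs])
    I_N = np.sum(w * hs) / np.sum(w)                  # ratio estimator of nu_pi(h)
    ess = N * np.mean(w) ** 2 / np.mean(w ** 2)       # empirical ESS(N)
    return I_N, ess
```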

Lemma 2

Given the \(\chi ^ { {\scriptstyle 2} } \)-divergence of \(\nu _{\pi }\) from \(T_\sharp \,\mu \), the ESS of Algorithm 2 satisfies

$$\begin{aligned} \mathrm{ESS}(N) = \frac{N}{1 + D_{\chi ^ { {\scriptstyle 2} } }(\nu _{\pi }\Vert T_\sharp \mu ) }. \end{aligned}$$

Proof

Since we have \((T_\sharp \mu ) ( \pi \lambda \big / {\hat{f}}_{\hat{X}} ) = z\), where z is the normalising constant of the target density, the \(\chi ^ { {\scriptstyle 2} } \)-divergence of \(\nu _{\pi }\) from \(T_\sharp \,\mu \) satisfies

$$\begin{aligned} D_{\chi ^ { {\scriptstyle 2} } }(\nu _{\pi }\Vert T_\sharp \mu ) = \frac{(T_\sharp \mu )\big (( \pi \lambda \big / {\hat{f}}_{\hat{X}})^2\big )}{\big ((T_\sharp \mu ) ( \pi \lambda \big / {\hat{f}}_{\hat{X}}) \big )^2 } -1 = \frac{(T_\sharp \mu ) (w^2)}{(T_\sharp \mu )(w)^2} - 1, \end{aligned}$$

in which one can choose w to be the ratio \(\pi \lambda \big / {\hat{f}}_{\hat{X}}\) multiplied by any nonzero constant. Thus, the result follows. \(\square \)

The bounds discussed in Remarks 2 and 9 imply that \({\bar{z}}_N\) and \({\bar{h}}_N\) computed by Algorithm 2 are unbiased estimators for the normalising constant z and the expectation \((T_\sharp \mu )(w\,h)\), respectively. However, the ratio estimator \(I_N\) is only asymptotically unbiased such that

$$\begin{aligned} \mathbb {P}\big [ \lim _{N\rightarrow \infty } I_N = \nu _{\pi }(h)\big ] = 1. \end{aligned}$$

For a finite sample size \(N < \infty \), the ratio estimator \(I_N\) is therefore a biased estimator of \(\nu _{\pi }(h)\). For a sufficiently large sample size N, however, one can apply the Delta method (see [48, Chapter 2] and references therein) to show that the mean square error (MSE) of \(I_N\) admits the approximation

$$\begin{aligned} \mathrm{MSE}(I_N) \equiv \mathbb {E}\big [\big (I_N - \nu _{\pi }(h)\big )^2\big ] \approx \frac{1}{N} \frac{ (T_\sharp \mu ) \big ( w^2 \big ( h - \nu _{\pi }(h) \big )^2\big )}{(T_\sharp \mu )(w)^2} . \end{aligned}$$
(58)

Thus, for a sufficiently regular function h, the MSE of the ratio estimator \(I_N\) can also be controlled by the \(\chi ^ { {\scriptstyle 2} } \)-divergence \(D_{\chi ^ { {\scriptstyle 2} } }(\nu _{\pi }\Vert T_\sharp \mu )\).

6 Numerical Examples

We demonstrate the efficiency and various aspects of DIRT, which employs SIRT within each layer, using four Bayesian inference problems arising in dynamical systems and PDEs. In all numerical examples, as efficiency measures of DIRT (or SIRT in the single-layer case), we employ the estimated integrated autocorrelation time (IACT) for IRT-MCMC (Algorithm 1) and the ratio between the total number of samples and the estimated ESS, \(N / {\mathrm{ESS}}(N)\), for IRT-IS (Algorithm 2). For both IACT and \(N / {\mathrm{ESS}}(N)\), a lower value indicates a better sampling efficiency, and the minimum value of both is 1. Since \(N / {\mathrm{ESS}}(N)= 1 + D_{\chi ^ { {\scriptstyle 2} } }(\nu _k\Vert T_\sharp \mu )\), it also directly measures the accuracy of DIRT in approximating the posteriors. A Matlab implementation of the IRT methods and numerical examples is available at https://github.com/dolgov/TT-IRT.

6.1 Predator and Prey

The predator–prey model is a system of coupled ODEs frequently used to describe the dynamics of biological systems. The populations of predator (denoted by Q) and prey (denoted by P) change over time according to a pair of ODEs

$$\begin{aligned} \left\{ \begin{array}{ll} \displaystyle \frac{d P}{dt} &{} = \displaystyle r P \Big (1 - \frac{P}{K}\Big ) - s \Big (\frac{PQ }{\alpha + P}\Big ), \\ \displaystyle \frac{d Q}{dt} &{} = \displaystyle u \Big (\frac{PQ }{\alpha + P}\Big ) - v Q, \end{array}\right. \end{aligned}$$
(59)

with initial conditions \(P(t=0) = P_0\) and \(Q(t=0) = Q_0\). The dynamical system is controlled by several parameters. In the absence of the predator, the population of the prey evolves according to the logistic equation characterised by r and K. In the absence of the prey, the population of the predator decreases exponentially with rate v. In addition, the two populations have a nonlinear interaction characterised by \(\alpha \), s, and u. We often do not know the initial populations and the parameters r, K, \(\alpha \), s, u, and v. Thus, we need to estimate the unknowns

$$\begin{aligned} x= \left[ P_0, Q_0, r, K, \alpha , s, u, v \right] ^\top , \end{aligned}$$

from observed populations of the predator and prey at time instances \(t_i\) for \(i = 1, ..., n_T\).
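For reference, a direct transcription of the right-hand side of (59), e.g. for use with a standard ODE integrator such as scipy.integrate.solve_ivp, could be:

```python
import numpy as np

def predator_prey_rhs(t, PQ, r, K, alpha, s, u, v):
    """Right-hand side of the predator-prey ODEs (59); PQ = [P, Q]."""
    P, Q = PQ
    interaction = P * Q / (alpha + P)
    dP = r * P * (1.0 - P / K) - s * interaction
    dQ = u * interaction - v * Q
    return np.array([dP, dQ])
```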

6.1.1 Posterior Density

Let \(y\in \mathbb {R}^{2n_T}\) denote the observed populations of the predator and prey. We define a forward model \(G: {\mathcal {X}}\mapsto \mathbb {R}^{2n_T}\) in the form of \(G(x) = [P(t_i), Q(t_i)]_{i=1}^{n_T}\) to represent the populations of the predator and prey computed at \(\{t_i\}_{i =1}^{n_T}\) for a given parameter vector \(x\). Assuming independent and identically distributed (i.i.d.) normal noise in the observed data and assigning a prior density \(\pi _0(x)\) to the unknown parameter, one can define the unnormalised posterior density

$$\begin{aligned} \pi (x) \propto \exp \Big (-\frac{1}{2\sigma ^2}\Vert G(x) - y\Vert ^2_2\Big )\,\pi _0(x), \end{aligned}$$

where \(\sigma \) is the standard deviation of the normally distributed noise. Synthetic observed data are used in this example. With \(n_T = 13\) time instances \(t_{i} = (i-1)\times 25/6\) and a given parameter \(x_\mathrm{true} = [50, 5, 0.6, 100, 1.2, 25, 0.5, 0.3]^\top \), we generate synthetic noisy data \(y= y_\mathrm{true} + \eta \), where \(\eta \) is a realisation of the i.i.d. zero mean normally distributed noise with the standard deviation \(\sigma = \sqrt{2}\). A uniform prior density \( \pi _0(x) = \prod _{k = 1}^{8} \mathbb {I}_{[a_k,b_k]}(x_k) \) is specified to restrict the support of \(x_k\) to the interval \([a_k, b_k]\), where \( a = [30, 3, 0.36, 60, 0.72, 15, 0.3, 0.18]^\top \) and \( b = [80, 8, 0.96, 160, 1.92, 40, 0.8, 0.48]^\top . \) To illustrate the behaviour of the posterior density, we plot the kernel density estimates of the marginal posterior densities in Fig. 2. Note that some of the parameters are significantly correlated, which makes the posterior density function difficult to explore by both MCMC and a straightforward TT approximation.
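A minimal sketch of the corresponding unnormalised log-posterior (Gaussian likelihood with noise standard deviation \(\sigma \) and uniform prior on \([a_k, b_k]\)) could be written as follows, where forward_model is a placeholder for an ODE solve of (59) returning \(G(x)\):

```python
import numpy as np

def log_posterior(x, y, forward_model, a, b, sigma=np.sqrt(2.0)):
    """Unnormalised log-posterior: Gaussian misfit plus uniform prior support."""
    if np.any(x < a) or np.any(x > b):
        return -np.inf                      # outside the uniform prior support
    misfit = forward_model(x) - y
    return -0.5 * np.dot(misfit, misfit) / sigma ** 2
```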

Fig. 2 Marginal posterior densities of the predator–prey model estimated from \(10^6\) posterior samples

Fig. 3 IACT ± standard deviation over 10 runs with the uniform reference measure (a). IACT ± standard deviation with the truncated normal reference measure (b). Number of density evaluations in TT-cross at each layer (c). \(N / {\mathrm{ESS}}(N)\pm \) standard deviation over 10 runs with the uniform reference measure (d). \(N / {\mathrm{ESS}}(N)\pm \) standard deviation with the truncated normal reference measure (e). Initial TT rank is adjusted such that the maximal TT rank is 13 in all tests, whereas enrichment ranks Rho and numbers of TT-cross iterations MaxIt are varied

Fig. 4 IACT ± standard deviation over 10 runs (a), \(N / {\mathrm{ESS}}(N)\pm \) standard deviation over 10 runs (b), and number of density evaluations in TT-cross at each layer (c) with varying maximum TT ranks \(\mathtt {R}_{\max }\) and different reference measures. For the uniform reference, \(\mathtt {MaxIt}=3\) and \(\mathtt {Rho}=3\) are used. For the truncated normal reference, \(\mathtt {MaxIt}=1\) and \(\mathtt {Rho}=0\) are used

Fig. 5 IACT ± standard deviation over 10 runs (a), \(N / {\mathrm{ESS}}(N)\pm \) standard deviation over 10 runs (b), and number of density evaluations in TT-cross at each layer (c) for varying numbers of collocation points n and different reference measures. For the uniform reference, \(\mathtt {MaxIt}=3\) and \(\mathtt {Rho}=3\) are used. For the truncated normal reference, \(\mathtt {MaxIt}=1\) and \(\mathtt {Rho}=0\) are used

6.1.2 Numerical Results

We use \(L = 8\) bridging measures in the construction of DIRT by tempering the unnormalised posterior density with \(\pi _k(x) = \pi (x)^{\beta _k}\), starting from \(\beta _0=10^{-4}\) and setting \(\beta _{k+1} = \sqrt{10}\cdot \beta _k\), so that \(\beta _L=1\) recovers the target probability density. We consider two reference measures: the uniform reference measure \(\mu _\mathrm{uni}\) and the truncated normal reference measure \(\mu _\mathrm{TG}\) with the density \(f_{U}(u) \propto \prod _{k=1}^{8} \mathbb {I}_{[-4, 4]} (u_k) \exp (-u_k^2/2)\). Note that at layer 0, the ratio function is just the tempered density \(\pi _0(x)\) in the original domain \(x_k \in [a_k, b_k]\). We employ piecewise-linear basis functions with n equally spaced interior collocation points for both reference measures. In addition, we tune TT-cross (Algorithm 4) using three parameters: the initial TT rank R0, enrichment TT ranks \(\rho _1=\cdots =\rho _{d-1}=\mathtt {Rho}\), and the maximum number of TT-cross iterations MaxIt. These uniquely determine the maximum TT rank \(\mathtt {R}_{\max } = \mathtt {R0} + \mathtt {Rho}\cdot \mathtt {MaxIt}\).
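For concreteness, the tempering schedule and the rank bookkeeping used here can be reproduced as follows; the value of R0 shown is only one choice consistent with the reported \(\mathtt {R}_{\max } = 13\).

```python
import numpy as np

# tempering schedule: beta_0 = 1e-4 and beta_{k+1} = sqrt(10) * beta_k,
# which reaches beta_L = 1 after L = 8 steps
betas = 1e-4 * np.sqrt(10.0) ** np.arange(9)   # beta_0, ..., beta_8 = 1

# TT-cross tuning: the maximum TT rank is R0 + Rho * MaxIt;
# e.g. Rho = 3 and MaxIt = 3 require R0 = 4 to reach R_max = 13
R0, Rho, MaxIt = 4, 3, 3
R_max = R0 + Rho * MaxIt                       # = 13
```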

Firstly, we vary one tuning variable at a time and investigate its impact on the efficiency and computational cost of the DIRT. We take the number of posterior density function evaluations in TT-cross in each DIRT layer to measure the computational cost for building DIRT.

In Fig. 3, we vary the enrichment rank Rho and the number of TT-cross iterations MaxIt. The initial TT rank R0 is adjusted such that the maximum TT rank is 13 in all cases. We set the number of collocation points to be \(n = 16\). All the DIRTs are constructed using the approximate ratio (54). With each Rho and MaxIt, we repeat the IRT-MCMC and IRT-IS for 10 experiments and report the estimated mean and standard deviation of the efficiency indicators. For the uniform reference (Fig. 3a, d), carrying out \(\texttt {MaxIt}=1\) iteration gives very inaccurate results with \(\text{ IACT }>10\) and \(N / {\mathrm{ESS}}(N)> 10\). Increasing the number of TT-cross iterations for the uniform reference measure significantly improves the DIRT accuracy. Since the ratio function varies considerably from layer to layer, TT-cross needs at least 3 iterations and a nontrivial enrichment to adapt the approximation to the new function. This comes at the expense of tripling the number of density evaluations, in addition to those needed to compute the enrichment, as shown in Fig. 3c. In contrast, using the truncated normal reference measure (Fig. 3b, e) can significantly improve the efficiency in this example. With only one TT-cross iteration, it can reduce the final IACT to below 4 and \(N / {\mathrm{ESS}}(N)\) to below 3.

Remark 12

At levels \(k > 0\) of the DIRT construction, the ratio functions may have a similar shape (see Fig. 1). Thus, one can take the TT of the ratio function at level k as the initial guess for building the TT at level \(k+1\). This initialisation provides good index sets in TT-cross, such that only one TT-cross iteration is sufficient with the truncated normal reference measure.

In Fig. 4, we vary the maximum TT rank \(\mathtt {R}_{\max }\). With the uniform reference, we set \(\texttt {Rho}=3\) and \(\texttt {MaxIt}=3\). With the truncated normal reference, we set \(\texttt {Rho}=0\) and \(\texttt {MaxIt}=1\), which makes the number of density evaluations equal to the number of degrees of freedom in the TT decomposition, \(n_1 r_1 + \sum _{k=2}^{d-1} n_k r_{k-1} r_k + r_{d-1} n_d = (d-2)n\mathtt {R}_{\max }^2 + 2n\mathtt {R}_{\max }\). We observe that the two reference measures eventually give comparable IACTs and ESSs with increasing \(\mathtt {R}_{\max }\). However, the truncated normal reference achieves this with far fewer density evaluations.

In Fig. 4, we also compare the approximate ratio (54), used in all other experiments, with the exact ratio (52). The diamond-shaped markers in Fig. 4a, b show the IACTs and ESSs obtained by the exact ratio approach. In this example, it gives worse results, with a larger IACT and \(N / {\mathrm{ESS}}(N)\) for the truncated normal reference measure at lower \(\mathtt {R}_{\max }\) values, and does not lead to any meaningful results for the uniform reference measure.

In Fig. 5, we vary the number of collocation points n used in each dimension. The truncated normal reference starts with a larger error since \(n=10\) points cannot resolve the rather large reference domain \([-4,4]\). With increasing n, the IACT obtained using the truncated normal reference decays rapidly. In comparison, the IACT obtained using the uniform reference exhibits a spike and does not show rapid decay with increasing n. This may be caused by the boundary layers in the ratio function. Similar trends are observed in the reported \(N / {\mathrm{ESS}}(N)\). Again, the truncated normal reference requires significantly fewer density evaluations to achieve the same level of accuracy compared to the uniform reference in this experiment.

Next, we benchmark DIRT with the truncated normal reference, \(\mathtt {MaxIt}=1\), \(\mathtt {Rho}=0\), \(n=16\), and \(\mathtt {R0}=\mathtt {R}_{\max }=13\) against other sampling algorithms, including the Delayed Rejection Adaptive Metropolis (DRAM) [27], the Stein variational Newton (SVN) [15], and the Hierarchical Invertible Neural Transport (HINT) [35].

DRAM is initialised with the covariance matrix \(5{\mathsf {I}}\), adaptation scale \(2.4/\sqrt{d}\), adaptation interval 10, and delayed rejection scale 2. These parameter values are commonly recommended in the general case.

For this example, SVN is sensitive to the choice of the step size and to the initial distribution of particles. We choose the step size to be \(2\cdot 10^{-2}\) and generate the initial particle set from the normal distribution \({\mathcal {N}}(x_\mathrm{true}, (2\cdot 10^{-2}x_\mathrm{true})^2)\), which gives a reasonable balance between the stability and the rate of convergence. We carry out 23 Newton iterations in SVN to approach stationarity.

HINT is an autoregressive normalising flow estimator for the joint probability density \(\pi (y,x)\). Since both the prior random variable \(X\) with the density \(\pi _0(x)\) and the noise random variable \(\eta \) can be directly simulated, drawing samples from the joint distribution is easy: one can first draw a sample \(X\) from the prior and then simulate the corresponding data sample \(Y= G(X) + \eta \) by generating a noise realisation \(\eta \). Given a set of independent and identically distributed samples \((Y^{(i)},X^{(i)})_{i=1}^{N}\) drawn from the joint measure, HINT computes a triangular invertible map \((u_Y, u_X ) = S_{\theta }(y,x)\) from the (joint) target measure to the reference measure by minimising the negative log-likelihood

$$\begin{aligned} L(\theta ) = \sum _{i=1}^{N} \frac{1}{2}\Vert S_{\theta }(Y^{(i)},X^{(i)}) \Vert _2^2 - \log |\nabla S_{\theta }(Y^{(i)},X^{(i)})| \end{aligned}$$

over the parameter \(\theta \) that defines the neural networks \(S_\theta \). Given observed data \(y\), this allows one to define a conditional map \(x= T_{\theta }(u_X; y) := \big (S_{\theta }^{X}\big )^{-1}(y, u_X)\) that maps from the reference measure of \(U_X\) to the posterior measure conditioned on data \(y\).

We simulate each method \(M = 10\) times with N samples produced in each simulation, denoted by \(\{x^{(\ell ,j)}\}_{j=1}^{N}\), where \(\ell =1,...,M\) indexes the simulations. For each simulation, we compute the empirical posterior covariance matrix \({\mathsf {C}}^\ell = \frac{1}{N}\sum _{j=1}^{N} (x^{(\ell ,j)} - {\bar{x}}^{\ell })(x^{(\ell ,j)} - {\bar{x}}^{\ell })^\top \), where \({\bar{x}}^{\ell } = \frac{1}{N}\sum _{j=1}^{N} x^{(\ell ,j)}\) is the empirical posterior mean. Then, we use the average deviation of covariance matrices to benchmark the sampling performance of different sampling algorithms. Here, we employ the Förstner–Moonen distance [19] over the cone of symmetric and positive definite (SPD) matrices,

$$\begin{aligned} d_\mathrm{FM}({\mathsf {A}}, {\mathsf {B}}) = \sum _{i=1}^{d} \ln ^2\big (\lambda _i({\mathsf {A}}, {\mathsf {B}})\big ), \end{aligned}$$

where \(\lambda _i({\mathsf {A}}, {\mathsf {B}})\) denotes the ith generalised eigenvalue of the pair of SPD matrices \(({\mathsf {A}}, {\mathsf {B}})\), to measure the deviation. This way, averaging the Förstner–Moonen distance between the \(\ell \)th empirical covariance matrix and the average covariance matrix over all M simulations,

$$\begin{aligned} {\mathcal {E}}_C = \frac{1}{M} \sum _{\ell =1}^{M} d_\mathrm{FM}({\mathsf {C}}^\ell , {\bar{{\mathsf {C}}}}), \quad \text {where} \quad {\bar{{\mathsf {C}}}} = \frac{1}{M}\sum _{\ell =1}^{M} {\mathsf {C}}^\ell , \end{aligned}$$
(60)

provides an estimated deviation of empirical covariance matrices computed by a given algorithm.
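A short sketch of the deviation measure (60), using the generalised eigenvalues of a pair of SPD matrices, might look as follows; scipy's generalised symmetric eigensolver is assumed here.

```python
import numpy as np
from scipy.linalg import eigh

def forstner_moonen(A, B):
    """Foerstner-Moonen distance: sum of squared logs of the generalised
    eigenvalues of the SPD pair (A, B)."""
    lam = eigh(A, B, eigvals_only=True)
    return np.sum(np.log(lam) ** 2)

def covariance_deviation(C_list):
    """Average deviation (60) of empirical covariances from their mean."""
    C_bar = np.mean(C_list, axis=0)
    return np.mean([forstner_moonen(C, C_bar) for C in C_list])
```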

In Fig. 6, we plot the covariance deviations (60) obtained by IRT-MCMC, DRAM, SVN and HINT versus the total number of target density function evaluations and the total CPU time needed by each algorithm. Here, the reported total numbers of density evaluations and CPU times include the construction of DIRT in each simulation experiment. The 10 independent simulations are run in parallel on a workstation with an Intel Xeon E5-2640v4 CPU at 2.4GHz. We notice that DIRT produces estimated covariance matrices with the smallest deviations in almost all tests. Moreover, DIRT is computationally more efficient in terms of CPU time, because the evaluation of DIRT can take advantage of vector instructions.

Fig. 6 Estimated deviation of empirical covariance matrices (60) computed by IRT-MCMC, DRAM, SVN and HINT for different total numbers of density evaluations (left) and CPU times (right) (\(^{*}\) the marker in the figure is only indicative, and the actual results are larger)

In this example, HINT gives the worst results, since the joint density estimation from samples is a much higher-dimensional problem than the posterior density approximation. In particular, the dimension of \(y\) in the predator-prey model is \(2n_T=26\), so the total dimensionality of the problem increases to 34. This required us to construct HINT networks with 8 blocks containing \(200 \times 200\) weights each, amounting to 2691308 trainable parameters in the entire HINT. We trained the networks using \(5 \cdot 10^6\) training samples for 50 epochs, each epoch consisting of 500 batches of \(10^4\) samples. The other (e.g. ADAM) parameters are left unchanged from [35]. The training took 5.8 h on a NVidia GeForce GTX 1650 Max-Q GPU card. To avoid disproportionate scaling of axes in Fig. 6 compared to other methods, we put just an indicative marker for HINT. The actual error (60) with \(10^5\) test samples taken directly from HINT was 16.7. Using the test samples as proposals in an MCMC rejection step against the exact posterior gives a rejection rate of 97% and an IACT of 127, and the covariance matrix computed from the accepted samples gives a Förstner–Moonen distance of 0.1. This indicates that the data-driven joint density estimation is more applicable to lower-dimensional data, whereas if one is only interested in the posterior, the function approximation methods seem to be a better choice.

6.2 Lorenz-96

This is a widely used benchmark model of atmospheric circulation. We consider a Lorenz-96 model specified by the system of ODEs

$$\begin{aligned} \frac{dP_i(t)}{dt}&= (P_{i+1} - P_{i-2})P_{i-1} - P_i + 8,\quad \text {for} \quad i=1,...,d, \end{aligned}$$
(61)

with periodic boundary conditions and an unknown initial condition \(P_i(0) = x_i\) for \(i=1,...,d\). The state dimension is set to \(d=40\). Observing noisy states with even indices at the final time \(T=0.1\), we aim to infer the initial state \(x\) in this example. This way, we have observed data \(y\in \mathbb {R}^{\frac{d}{2}}\) and can define a forward model \(G: {\mathcal {X}} \mapsto \mathbb {R}^{d/2}\) in the form of \(G(x) = [P_{2k}(T)]_{k=1}^{d/2}\) to represent simulated observables for a given initial condition \(x\).
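
For concreteness, a minimal Python sketch of the Lorenz-96 forward model (61) might look as follows; the ODE is integrated here with SciPy's general-purpose `solve_ivp`, whereas any sufficiently accurate time integrator could be used instead.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz96_rhs(t, P):
    """Right-hand side of (61) with periodic boundary conditions."""
    return (np.roll(P, -1) - np.roll(P, 2)) * np.roll(P, 1) - P + 8.0

def forward_model(x, T=0.1):
    """G(x): integrate the ODE from the initial state x and return the
    states with even indices at the final time T."""
    sol = solve_ivp(lorenz96_rhs, (0.0, T), np.asarray(x, dtype=float))
    return sol.y[1::2, -1]    # P_2, P_4, ..., P_d in 0-based indexing
```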

Fig. 7 Lorenz-96 model: true initial state (solid red) and posterior mean (blue dashed) ± 2 standard deviations (shaded area) (Color figure online)

Assuming i.i.d. normal noise in the observed data and assigning a truncated normal prior density to the initial condition, we have the unnormalized posterior density

$$\begin{aligned} \pi (x) = \exp \Big (-\frac{1}{2\sigma ^2}\Vert G(x) - y\Vert _2^2\Big )\, \prod _{k=1}^{d} \Big ( \mathbb {I}_{[-10, 10]}(x_k) \exp \big (-\frac{1}{2} (x_k-1)^2\big )\Big ). \end{aligned}$$

We use a synthetic data set \(y= G(x_\mathrm{true}) + \eta \), where \(x_\mathrm{true}\) is drawn from \({\mathcal {N}}(1, 10^{-4}{\mathsf {I}}_{d})\), and \(\eta \) is a realisation of i.i.d. zero mean normal noise with standard deviation \(\sigma = 10^{-1}\).

For the TT-cross approximations, we use the truncated normal reference measure on \([-3,3]^d\), piecewise linear basis functions with \(n=15\) interior collocation points, \(\texttt {MaxIt}=1\) TT-cross iteration, and TT ranks \(\mathtt {R}_{\max }=15\). DIRT is built with the tempered density

$$\begin{aligned} \pi _k(x) = \exp \Big (-\frac{\beta _k}{2\sigma ^2}\Vert G(x) - y\Vert _2^2\Big ) \cdot \prod _{j=1}^{d} \Big ( \mathbb {I}_{[-10, 10]}(x_j) \exp \big (-\frac{\beta _k^{0.25}}{2} (x_j-1)^2\big )\Big ), \end{aligned}$$

with \(\beta _0=10^{-4}\) and \(\beta _{k+1}=\sqrt{10}\cdot \beta _k\). This way, we need \(L=8\) layers to reach the posterior density. A weaker tempering of the prior is used to reduce its impact on the intermediate levels. This allows most of the intermediate DIRT levels to be used to bridge the more complicated likelihood. This setup requires a total of \(1.2 \times 10^6\) density evaluations in TT-cross at all layers and provides an average ESS of N/1.55 in IRT-IS and an average IACT of 2.6 in IRT-MCMC.
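
A minimal Python sketch of the temperature ladder and the unnormalised bridging densities used here, with `forward_model` a placeholder for the Lorenz-96 solver above, is:

```python
import numpy as np

# temperature ladder: beta_0 = 1e-4, beta_{k+1} = sqrt(10) * beta_k,
# reaching beta_8 = 1 after L = 8 layers
betas = 1e-4 * np.sqrt(10.0) ** np.arange(9)

def log_bridging_density(x, y, beta, forward_model, sigma=0.1):
    """Unnormalised log pi_k(x): likelihood tempered by beta, truncated
    normal prior tempered by beta**0.25."""
    x = np.asarray(x, dtype=float)
    if np.any(np.abs(x) > 10.0):       # indicator on [-10, 10]^d
        return -np.inf
    misfit = forward_model(x) - y
    return (-0.5 * beta / sigma**2 * np.dot(misfit, misfit)
            - 0.5 * beta**0.25 * np.sum((x - 1.0) ** 2))
```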

Using the posterior density, we can quantify the uncertainty of the inferred initial state and make predictions of the terminal state. The inferred initial state is shown in Fig. 7. Note that the chaotic regime of Lorenz-96 makes it difficult to infer the unobserved odd coordinates. Nevertheless, DIRT demonstrates high numerical and sampling efficiency in approximating this complicated posterior.

6.3 Elliptic PDE

In the third example, we apply both SIRT and DIRT to the classical inverse problem governed by the stochastic diffusion equation

$$\begin{aligned} -\nabla \cdot \big (\kappa _d(s; x) \nabla u(s)\big ) = 0 \quad \text{ on } \quad s \in D:= (0,1)^2, \end{aligned}$$
(62)

with Dirichlet boundary conditions \(u|_{s_1=0}=1\) and \(u|_{s_1=1}=0\) on the left and right boundaries, and homogeneous Neumann conditions on other boundaries. The goal is to infer the unknown diffusion coefficient \(\kappa _d(s; x)\) from incomplete observations of the potential function u(s). Here, we adopt the same setup used in [17, 56].

6.3.1 Posterior Density

The unknown diffusion coefficient \(\kappa _d(s; x)\) is parametrized by a d-dimensional random variable \(x\). We take each of the parameters \(x_k\), \(k=1,...,d,\) to be uniformly distributed on \([-\sqrt{3},\sqrt{3}]\). Then, for any \(x\in [-\sqrt{3},\sqrt{3}]^d\) and \(s = (s_1,s_2) \in D\), the logarithm of the diffusion coefficient at s is defined by the following expansion

$$\begin{aligned} \ln \kappa _d(s; x) = \sum _{k=1}^{d} x_k \, \sqrt{\eta _k} \, \cos (2\pi \rho _1(k) s_1) \cos (2\pi \rho _2(k) s_2), \end{aligned}$$
(63)

where

$$\begin{aligned} \eta _k = \frac{k^{-(\nu +1)}}{\sum _{j=1}^{d} j^{-(\nu +1)}}, \quad \rho _1(k) = k - \frac{\tau (k)^2 + \tau (k)}{2}, \quad \mathrm{and} \quad \rho _2(k) = \tau (k)-\rho _1(k), \end{aligned}$$

with \(\tau (k) = \lfloor \frac{1}{2} (\sqrt{1+8k} - 1 )\rfloor \). To discretise the PDE in (62), we tessellate the spatial domain D with a uniform Cartesian grid with mesh size h. Then, we replace the infinite dimensional solution \(u \in V \equiv H^1(D)\) by the continuous, piecewise bilinear finite element (FE) approximation \(u_h \in V_h\) associated with the discretisation grid. To find \(u_h\), we solve the resulting Galerkin system using a sparse direct solver. A fixed discretisation with \(d=11\), \(h=2^{-6}\), and \(\nu =2\) is used in this example.
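
The expansion (63) can be evaluated at a single spatial point with a few lines of Python; the following sketch simply transcribes the formulas above.

```python
import numpy as np

def log_kappa(s, x, nu=2.0):
    """Evaluate ln kappa_d(s; x) in (63) at a point s = (s1, s2)."""
    d = len(x)
    k = np.arange(1, d + 1)
    eta = k ** (-(nu + 1.0))
    eta /= eta.sum()
    tau = np.floor(0.5 * (np.sqrt(1.0 + 8.0 * k) - 1.0))
    rho1 = k - (tau**2 + tau) / 2.0
    rho2 = tau - rho1
    modes = np.cos(2 * np.pi * rho1 * s[0]) * np.cos(2 * np.pi * rho2 * s[1])
    return np.dot(x, np.sqrt(eta) * modes)
```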

The observed data \(y\in \mathbb {R}^{m}\) consist of m local averages of the potential function u(s) over subdomains \(D_{i} \subset D\), \(i=1,...,m\). To simulate the observable model outputs, we define the forward model \(G^h:{\mathcal {X}}\mapsto \mathbb {R}^{m}\) with

$$\begin{aligned} G^h_i(x) = \frac{1}{|D_{i}|}\int _{D_i} u_h(s;x) \mathrm{d}s, \quad i=1,...,m. \end{aligned}$$

The subdomains \(D_{i}\) are squares with side length \(2/(\sqrt{m}+1)\) centred at the interior vertices of a uniform Cartesian grid on \(D=[0,1]^2\) with grid size \(1/(\sqrt{m}+1)\), which form an overlapping partition of D. Synthetic data for these m local averages are produced from the “true” parameter \(x_\mathrm{true} = (1.5,...,1.5)\) by adding i.i.d. zero mean normally distributed noise with the standard deviation \(\sigma \). This way, we have the unnormalized posterior density

$$\begin{aligned} \pi (x) = \exp \Big (-\frac{1}{{2\sigma ^2}}\big \Vert G^h(x) - y\big \Vert ^2_2\Big ) \, \prod _{k=1}^{d} \Big ( \mathbb {I}_{[-\sqrt{3}, \sqrt{3}]}(x_k)\Big ). \end{aligned}$$

6.3.2 Numerical Results

In this example, we compare the impact of different tempering schemes, different numbers of measurements, and different measurement noise levels on DIRT. We also compare different basis functions used in the DIRT construction. In all experiments, we feed \(N=2^{16}\) independent samples generated by DIRT to both IRT-MCMC and IRT-IS.

In Fig. 8, we compare DIRT with three different tempering sequences \(\beta = \left[ \beta _0,...,\beta _L\right] \), varying the grid size n and the TT ranks \(\mathtt {R}_{\max }\). Note that with \(L=0\), we have the single-layer SIRT. The reported number of density function evaluations is a sum of the numbers of evaluations in TT-cross at all layers. We use the truncated normal reference measure on \([-4,4]^d\) with both piecewise linear and Fourier bases for the multilayer DIRT.

Fig. 8 Elliptic PDE with \(\sigma ^2=10^{-2}\) and \(m=3^2\). IACT vs. number of density evaluations in DIRT for different numbers of layers: \(L=0\) (\(\beta =1\)), \(L=1\) (\(\beta =\{0.1,1\}\)) and \(L=2\) (\(\beta =\{0.1,\sqrt{0.1},1\}\)). Note that \(\mathtt {R}_{\max }\) varies from 8 to 32 for \(L=0\), but only from 4 to 8 for \(L=2\)

With the noise variance \(\sigma ^2=10^{-2}\) and a rather small data size \(m=3^2\), the posterior density is relatively simple to characterise and hence can be tackled directly using the single-layer SIRT (see the case \(L=0\) in Fig. 8 and [17]). However, the multilayer DIRT uses a much smaller number of collocation points and much lower TT ranks to produce an approximate posterior density with the same accuracy. Here, the three-layer DIRT needs only 10% of the density evaluations required by the single-layer counterpart.

Fig. 9 Elliptic PDE with varying numbers of measurements m. Total number of density evaluations in all layers (a), reciprocal sample size (b), and IACT (c). Tempering is carried out with \(\beta _0 = 4^{-\lceil \log _4 m \rceil }\), \(\beta _{k+1} = 4 \cdot \beta _k\). TT-cross parameters: \(n=16\), \(\mathtt {R}_{\max }=\mathtt {R}_0=12\), and \(\mathtt {MaxIt}=1\)

Next, we test the multilayer DIRT on more difficult posterior densities, with larger numbers of measurements and smaller observation noise. We set the number of collocation points to \(n=16\), the maximum number of TT-cross iterations to \(\mathtt {MaxIt}=1\), and the maximum TT rank to \(\mathtt {R}_{\max }=12\). In Fig. 9, we fix \(\sigma ^2 = 10^{-2}\) and vary the number of measurements. Since halving the measurement grid size \(2/(\sqrt{m}+1)\) corresponds to multiplying m by approximately a factor of 4, we use a different tempering strategy, starting with \(\beta _0 = 4^{-\lceil \log _4 m \rceil }\) and setting \(\beta _{k+1}=4\cdot \beta _k\) for subsequent layers. This way, the number of layers grows proportionally to \(\log m\), and the number of density evaluations in TT-cross for fixed TT ranks is also proportional to \(\log m\), as confirmed in Fig. 9a. Here, we can see that the Fourier basis is significantly more accurate than the piecewise-linear basis for the same grid size. With the linear basis, both IACT and \(N / {\mathrm{ESS}}(N)\) grow logarithmically in the number of measurements. With the Fourier basis, the IACT stays almost constant below 1.5 and \(N / {\mathrm{ESS}}(N)\) stays almost constant below 1.1, increasing slightly only for the most difficult case with \(m = 31^2\) (Fig. 9b, c). With an increasing number of measurements, the likelihood becomes more concentrated. This makes it more challenging to characterise the posterior using prior-based approaches such as QMC [56] or the single-layer TT approximation. For example, even with a much larger number of collocation points \(n=65\) and 5 iterations of TT-cross (giving a maximal TT rank of 41), we still cannot produce reasonable results for \(m=15^2\) with the single-layer SIRT.
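
The m-dependent temperature ladder used in this experiment can be written down explicitly; the following short Python sketch reproduces the schedule described above.

```python
import numpy as np

def beta_ladder(m):
    """Tempering used in Fig. 9: beta_0 = 4**(-ceil(log4 m)) and
    beta_{k+1} = 4 * beta_k, so the number of layers L = ceil(log4 m)
    grows logarithmically with the number of measurements m."""
    L = int(np.ceil(np.log(m) / np.log(4.0)))
    return 4.0 ** np.arange(-L, 1)   # [4**-L, ..., 1/4, 1]
```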

We carry out an additional test with decreasing noise variance \(\sigma ^2\). In Fig. 10, we fix \(m=15^2\) and vary \(\sigma ^2\) from \(10^{-1}\) to \(10^{-5}\). In this experiment, fixed TT ranks become insufficient for representing posterior densities with low observation noise. In particular, the piecewise linear basis does not have sufficient accuracy for the case of the smallest noise variance. In contrast, the Fourier basis still retains low IACT and \(N / {\mathrm{ESS}}(N)\) for the low noise variance cases, where IACT and \(N / {\mathrm{ESS}}(N)\) grow proportionally to \(\log \sigma ^{-1}\). Together with the logarithmic scaling of the number of evaluations, the effective complexity of the entire IRT-MCMC and IRT-IS schemes becomes poly-logarithmic in the noise variance. Although the Fourier basis is computationally more expensive to evaluate than the piecewise-linear basis, by a factor of 2.5 in the worst case in this experiment, this additional computational effort is well compensated by the much higher accuracy. This makes DIRT a viable approach for a range of concentrated distributions.

Fig. 10 Elliptic PDE with varying noise variances \(\sigma ^2\). Total numbers of density evaluations in all layers (a), reciprocal sample size (b), and IACT (c). Tempering is carried out with \(\beta _0 = 0.1\sigma ^2\), \(\beta _{k+1} = \sqrt{10} \cdot \beta _k\). TT parameters: \(n=16\), TT rank 20, one TT-cross iteration

6.4 Parabolic PDE

In the fourth example, we consider an inverse problem of identifying the diffusion coefficient of a two-dimensional parabolic PDE from point observations of its solution. In the problem domain \(D = [0, 3]\times [0, 1]\), with boundary \(\partial D\), we model the time-varying potential function p(s,t) for a given diffusion coefficient field \(\kappa _d(s; x)\) and forcing function f(s,t) using the heat equation

$$\begin{aligned} \frac{\partial p(s,t)}{\partial t} = \nabla \cdot \left( \kappa _d(s; x) \nabla p(s,t) \right) + f(s,t), \quad s \in D, \; t \in [0, T], \end{aligned}$$
(64)

where \(T = 10\). Parabolic PDEs of this type are widely used in modelling groundwater flow, optical diffusion tomography, the diffusion of thermal energy, and numerous other common scenarios for inverse problems. Let \(\partial D_{\text{ n }} = \{ s \in \partial D \,|\, s_2 = 0\} \cup \{ s \in \partial D \,|\, s_2 = 1\}\) denote the top and bottom boundaries, and \(\partial D_{\text{ d }} = \{ s \in \partial D \,|\, s_1 = 0\} \cup \{ s \in \partial D \,|\, s_1 = 3\}\) denote the left and right boundaries. For \(t \ge 0\), we impose the mixed boundary condition:

$$\begin{aligned} p(s,t) = 0, \;\forall s \in \partial D_{\text{ d }}, \quad \text {and} \quad (\kappa _d(s; x) \nabla p(s,t) ) \cdot \mathbf {n}(s) = 0, \;\forall s \in \partial D_{\text{ n }}, \end{aligned}$$

where \(\mathbf {n}(s)\) is the outward normal vector on the boundary. We also impose a zero initial condition, i.e. \(p(s,0) = 0, \forall s \in D\), and let the potential field be driven by a time-invariant forcing function

$$\begin{aligned} f(s, t) = c\,\Big ( \exp \big (-\frac{1}{2 r^2} \Vert s - a\Vert ^2 \big ) - \exp \big (-\frac{1}{2 r^2} \Vert s - b\Vert ^2 \big ) \Big ), \forall t \ge 0, \end{aligned}$$

with \(r = 0.05\), which is the superposition of two normal-shaped sink/source terms centered at \(a = (0.5, 0.5)\) and \(b = (2.5, 0.5)\), scaled by a constant \(c = 5\pi \times 10^{-5}\).
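
For reference, the forcing term can be evaluated directly from its definition; the following Python sketch transcribes the formula above.

```python
import numpy as np

def forcing(s, r=0.05, c=5e-5 * np.pi,
            a=(0.5, 0.5), b=(2.5, 0.5)):
    """Time-invariant source/sink pair driving (64)."""
    s, a, b = map(np.asarray, (s, a, b))
    source = np.exp(-0.5 / r**2 * np.sum((s - a) ** 2))
    sink = np.exp(-0.5 / r**2 * np.sum((s - b) ** 2))
    return c * (source - sink)
```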

6.4.1 Posterior Density

The logarithm of the diffusion coefficient, \(\ln \kappa _d(s; x)\), is endowed with the process convolution prior [29],

$$\begin{aligned} \ln \kappa _d(s; x) = \ln {\bar{\kappa }} + \sum _{k=1}^{d} x_k \exp \Big (-\frac{1}{2} \Vert s - s^{(k)}\Vert ^2\Big ), \end{aligned}$$
(65)

where \(d = 27\), \(\ln {\bar{\kappa }} = -5\), each coefficient \(x_k\) follows a standard normal prior \({\mathcal {N}}(0, 1)\) (which can be truncated to \([-5,5]\) with sufficient accuracy), and \(s^{(k)}, k = 1, ..., d\) are centers of the kernel functions (shown as blue crosses in Fig. 11a). Similarly to the previous example, the potential function p(st) in (64) is approximated by \(p_h(s,t)\) using the finite element method with piecewise bilinear basis functions and implicit Euler time integration.
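
A minimal sketch of evaluating the process convolution field (65) at a point s is given below; the array of kernel centres is a placeholder for the actual centres shown in Fig. 11a.

```python
import numpy as np

def log_kappa_conv(s, x, centers, log_kappa_bar=-5.0):
    """Process convolution field (65): ln kappa(s; x) = ln kappa_bar +
    sum_k x_k exp(-0.5 ||s - s^(k)||^2). `centers` is a (d, 2) array of
    kernel centres s^(k)."""
    sq_dist = np.sum((np.asarray(centers) - np.asarray(s)) ** 2, axis=1)
    return log_kappa_bar + np.dot(x, np.exp(-0.5 * sq_dist))
```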

Fig. 11 Setup of the parabolic example. Logarithm of the “true” diffusion coefficient and the d = 27 centres of the process convolution prior (blue crosses) (a). The potential function \(p_h(s,t; x_\mathrm {true})\) at t = 0.1, computed with h = 1/80 (b). The potential function \(p_h(s,t; x_\mathrm {true})\) at t = 10 (c). Black dots in (b) and (c) are locations of measurements

The observed data \(y\in \mathbb {R}^{m\times n_T}\) consist of the time-varying potential function p(st) measured at \(m = 13\) locations (shown as black dots in Fig. 11b, c) at \(n_T = 10\) discrete time points equally spaced between \(t=1\) and \(t=10\). To simulate the observable model outputs, we define the forward model \(G^h:{\mathcal {X}}\mapsto \mathbb {R}^{m\times n_T}\) with

$$\begin{aligned} G^h_{i,j}(x) = p_h(s_i, t_j;x), \quad i=1,...,m, \quad j = 1, ..., n_T. \end{aligned}$$

Using a “true” parameter \(x_\mathrm{true}\) drawn from the prior distribution and a forward model with \(h = 1/80\), synthetic data \(y\in \mathbb {R}^{m\times n_T}\) are produced by adding i.i.d. normal noise with zero mean and the standard deviation \(\sigma = 1.65\times 10^{-2}\) to \(G^h(x_\mathrm{true})\). The corresponding \(\ln \kappa _d(s; x_\mathrm{true})\) and the simulated potential function at several time snapshots are shown in Fig. 11. The standard deviation \(\sigma = 1.65\times 10^{-2}\) corresponds to a signal-to-noise ratio of 10. This way, we have the unnormalized posterior density

$$\begin{aligned} \pi (x) = \exp \Big (-\frac{1}{{2\sigma ^2}}\big \Vert G^h(x) - y\big \Vert _F^2\Big ) \, \prod _{k=1}^{d}\Bigl (\mathbb {I}_{[-5,5]}(x_k) \exp \big ( - \frac{1}{2} x_k^2\big ) \Bigr ). \end{aligned}$$

6.4.2 Numerical Results

To construct DIRT, we employ a geometric grading in \( \beta \), refining towards 1,

$$\begin{aligned} \log _{10}\beta _k \in \{-5, -4, -3, -2.5, -2, -1.5, -1, -0.75, -0.5, -0.25, 0\}. \end{aligned}$$

The posterior is very concentrated in this example, so we employ separate tempering of prior and likelihood in the bridging densities,

$$\begin{aligned} \pi _k(x) = \exp \Big (-\frac{\beta _k}{{2\sigma ^2}}\left\| G^h(x) - y\right\| _F^2\Big ) \, \prod _{j=1}^{d}\Bigl (\mathbb {I}_{[-5,5]}(x_j) \exp \big ( - \frac{\beta _k^{0.01}}{2} x_j^2 \big ) \Bigr ), \end{aligned}$$

in which a weakly tempered prior is used. We use a truncated normal reference measure on the domain \([-4,4]^d\) with the Fourier basis to build DIRT. In TT-cross, a maximum of \(\mathtt {MaxIt}=1\) iteration without enrichment (\(\mathtt {Rho}=0\)) is used. The number of collocation points in each dimension is set to \(n=16\), and the TT ranks are chosen as \(\mathtt {R0}=\mathtt {R}_{\max }=\mathtt {R}_k\), where

$$\begin{aligned} \mathtt {R}_k \in \{15, 15, 15, 15, 15, 15, 13, 9, 9, 8, 7\} \end{aligned}$$

at the kth layer of DIRT.
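
For reference, the layer-by-layer settings above can be collected as plain data; the variable names in the following snippet are illustrative and not tied to any particular DIRT implementation.

```python
# Layer-by-layer DIRT settings used in this example: the temperature
# ladder beta_k and TT ranks R_k, with n = 16 collocation points,
# MaxIt = 1 and no enrichment (Rho = 0) in TT-cross at every layer.
log10_betas = [-5, -4, -3, -2.5, -2, -1.5, -1, -0.75, -0.5, -0.25, 0]
betas = [10.0**e for e in log10_betas]
ranks = [15, 15, 15, 15, 15, 15, 13, 9, 9, 8, 7]
assert len(betas) == len(ranks) == 11
```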

The PDE in (64) is computationally expensive to solve. Here, our goal is to explore the posterior density defined by a forward model, \(G^{h_f}\), with refined grid size \(h_f = 1/80\). A coarse forward model, \(G^{h_c}\) with \(h_c = 1/20\), and an intermediate forward model, \(G^{h_m}\) with \(h_m = 1/40\), are used in defining the bridging densities to speed up the DIRT construction. This multilevel construction shares similarities with the multi-fidelity preconditioning strategy of [51], except that DIRT is based on TT rather than optimisation and our multilevel models are blended into the bridging densities. In the numerical experiments, we consider the CPU time of one coarse model evaluation as one work unit. The CPU times for evaluating the intermediate model and the fine model are about 12.5 and 160 work units, respectively.

In the first experiment, we employ the coarse forward model, \(G^{h_c}\), to compare the sampling performance of DIRT with that of DRAM. The results are reported in Fig. 12a, where the number of independent samples is calculated as the length of the Markov chain divided by the estimated IACT. The estimated IACTs for DRAM and DIRT are about 132 and 3.04, respectively, and importance sampling with DIRT produces \(\text{ ESS }=N/1.5\). For DRAM, we exclude the burn-in samples from the number of work units, whereas the number of work units for DIRT includes its construction cost (993392 density evaluations). In this experiment, despite the high construction cost, DIRT generates a Markov chain with almost independent samples and is significantly more efficient than DRAM. Furthermore, the construction cost of DIRT becomes less significant as more posterior samples are generated, as shown in Fig. 12a.

Fig. 12 Number of independent samples computed by IRT-MCMC, IRT-IS, and DRAM versus the total computational cost. a Comparison using a single-level coarse model. b Comparison using the multilevel model in the DIRT construction. In both plots, the construction costs of IRT-MCMC and IRT-IS are included, whereas the burn-in cost of DRAM is not included. The work unit is the computational cost of one coarse model evaluation

In the second experiment, we demonstrate the construction of DIRT using not only the bridging densities with different temperatures, but also forward models with different grid resolutions. For initial temperatures with \(\beta _k<10^{-0.5}\), we use the coarse forward model \(G^{h_c}\). For \(\beta _k=10^{-0.5}\) and \(\beta _k=10^{-0.25}\), we use the intermediate forward model \(G^{h_m}\). For \(\beta _k=1\), we use the fine forward model \(G^{h_f}\), so that the fine model defines the target posterior density. We need 915024, 58544, and 19824 evaluations of the coarse, intermediate, and fine models, respectively, to construct DIRT. Once the DIRT is constructed, Algorithm 1 generates a Markov chain with IACT 2.87 that samples the posterior defined by the fine model. Again, importance sampling is more efficient, with \(\text{ ESS }=N/1.78\). The number of independent samples versus the number of work units is reported in Fig. 12b. In this experiment, it is computationally infeasible to apply DRAM (or any MCMC in general) directly to sample the posterior defined by the fine model. In contrast, the evaluation of DIRT and of the corresponding posterior densities can be embarrassingly parallelised, which can further accelerate the posterior inference on high-performance computers. The IRT-IS algorithm can bypass the construction of Markov chains, which makes it suitable for integration into multilevel Monte Carlo or multilevel quasi-Monte Carlo estimators to improve the convergence rate of computing posterior expectations. We leave this as a future research question.
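
A minimal sketch of this grid assignment, with the three forward models passed in as placeholder callables, is:

```python
def select_forward_model(beta, G_coarse, G_intermediate, G_fine):
    """Grid assignment used when blending model resolutions into the
    bridging densities: coarse (h = 1/20) for beta < 10**-0.5,
    intermediate (h = 1/40) for beta in {10**-0.5, 10**-0.25}, and
    fine (h = 1/80) for the target density at beta = 1."""
    if beta < 10.0 ** -0.5:
        return G_coarse
    if beta < 1.0:
        return G_intermediate
    return G_fine
```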

7 Conclusion

We have enabled functional tensor decompositions of complicated and concentrated continuous probability density functions that suffer from impractically large tensor ranks when approximated directly. Instead, we build an adaptive sequential change of coordinates that drives the target density towards a product function. This change of variables is realised by the composition of order-preserving SIRTs computed from functional TT decompositions of ratios of bridging densities. Each of the ratio functions recovers one scale of correlations of the target density, and hence, it can be approximated with fixed TT ranks. Together with the triangular structure of the Rosenblatt transport, this makes the total complexity linear in the number of variables.

This deep composition of the inverse Rosenblatt transports shares similarities with deep neural networks with nonlinear activation functions. However, DIRT has several advantages.

  • Each DIRT layer, defined by the bridging densities, can be associated with a scale of the noise or of the observation function. Any prior knowledge of model hierarchies can improve the selection of bridging densities. In contrast, the influence of a particular fully connected layer in a neural network is difficult to predict or understand.

  • DIRT layers can be computed independently. As soon as a layer is approximated up to the desired accuracy, it can be saved and never recomputed again. This enables a simple interactive construction, where the tuning parameters can be set layer by layer. Neural networks, in contrast, require simultaneous optimisation of all layers.

  • The construction of each DIRT layer is powered by efficient TT-cross algorithms, which in many cases converge much faster than the stochastic gradient descent used to train neural networks. The dense linear algebra operations used by TT decompositions can take full advantage of modern CPU and GPU vectorisation, while the embarrassing parallelism with respect to target density evaluations scales well on modern high-performance computers.

This work opens many potential applications and further enhancements of DIRT. For example, the transport maps defined by DIRT can be naturally extended to approximate the optimal biasing density in importance sampling, which can be valuable for rare event simulation. In Sect. 6.4, we offered a preliminary investigation of constructing DIRT using multilevel models. The multilevel idea can be further integrated with DIRT to improve the convergence rate of the importance sampling estimator. For problems involving extremely high-dimensional or infinite-dimensional random variables, DIRT can be combined with the likelihood-informed subspace (LIS) [13, 14, 59, 66] to characterise the highly non-Gaussian random variables in the effective dimensions identified by LIS. In addition, for sequential Bayesian inference, we can apply DIRT to iteratively characterise the filtered posterior measures changing over time, where the evolution of the random states and the time-dependent observations naturally define a sequence of bridging measures.