1 Introduction

Can one learn a differential operator from pairs of solutions and right-hand sides? If so, how many pairs are required? These two questions have received significant research attention [17, 31, 34, 43]. From data, one hopes to eventually learn physical laws of nature or conservation laws that elude scientists in the biological sciences [63], computational fluid dynamics [49], and computational physics [45]. The literature contains many highly successful practical schemes based on deep learning techniques [38, 48]. However, a theoretical understanding of when and why deep learning is effective remains a challenge. This paper describes the first theoretically justified scheme for discovering scalar-valued elliptic partial differential equations (PDEs) in three variables from input–output data and provides a rigorous learning rate. While our novelties are mainly theoretical, we hope to motivate future practical choices in PDE learning.

We suppose that there is an unknown second-order uniformly elliptic linear PDE operator \(\mathcal {L}:\mathcal {H}^2(D)\cap \mathcal {H}_0^1(D) \rightarrow L^2(D)\), defined on a bounded domain \(D\subset \mathbb {R}^3\) with Lipschitz smooth boundary [16], which takes the form

$$\begin{aligned} \mathcal {L}u(x) = -\nabla \cdot \left( A(x) \nabla u\right) +c(x)\cdot \nabla u+d(x)u, \quad x\in D, \quad u|_{\partial D} = 0. \end{aligned}$$
(1)

Here, for every \(x\in D\), we have that \(A(x)\in \mathbb {R}^{3\times 3}\) is a symmetric positive definite matrix with bounded coefficient functions so that \(A_{ij}\in L^{\infty }(D)\), \(c\in L^r(D)\) with \(r\ge 3\), \(d\in L^s(D)\) for \(s\ge 3/2\), and \(d(x)\ge 0\) [28]. We emphasize that the regularity requirements on the variable coefficients are quite weak.

The goal of PDE learning is to discover the operator \(\mathcal {L}\) from \(N\ge 1\) input–output pairs, i.e., \(\{(f_j,u_j)\}_{j=1}^N\), where \(\mathcal {L}u_j = f_j\) and \(u_j|_{\partial D} = 0\) for \(1\le j\le N\). There are two main types of PDE learning tasks: (1) Experimentally determined input–output pairs, where one must do the best one can with the predetermined information and (2) algorithmically determined input–output pairs, where the data-driven learning algorithm can select \(f_1,\ldots ,f_N\) for itself. In this paper, we focus on the PDE learning task where we have algorithmically determined input–output pairs. In particular, we suppose that the functions \(f_1,\ldots ,f_N\) are generated at random and are drawn from a Gaussian process (GP) (see Sect. 2.3). To keep our theoretical statements manageable, we restrict our attention to PDEs of the form:

$$\begin{aligned} \mathcal {L}u = -\nabla \cdot \left( A(x) \nabla u\right) , \quad x\in D, \quad u|_{\partial D} = 0. \end{aligned}$$
(2)

Lower-order terms in Eq. (1) should cause few theoretical problems [3], though they make our algorithm and bounds considerably more complicated.

The approach that dominates the PDE learning literature is to directly learn \(\mathcal {L}\) by either (1) learning parameters in the PDE [4, 64], (2) using neural networks to approximate the action of the PDE on functions [45,46,47,48,49], or (3) deriving a model by composing a library of operators with sparsity considerations  [9, 35, 52, 53, 59, 60]. Instead of trying to learn the unbounded, closed operator \(\mathcal {L}\) directly, we follow [6, 17, 18] and discover the Green’s function associated with \(\mathcal {L}\). That is, we attempt to learn the function \(G:D\times D\rightarrow \mathbb {R}^+\cup \{\infty \}\) such that [16]

$$\begin{aligned} u_j(x) = \int _{D} G(x,y)f_j(y) d y,\qquad x\in D, \qquad 1\le j\le N. \end{aligned}$$
(3)

Seeking G, as opposed to \(\mathcal {L}\), has several theoretical benefits:

  1.

    The integral operator in Eq. (3) is compact [15], while \(\mathcal {L}\) is only closed [14]. This allows G to be rigorously learned from input–output pairs \(\{(f_j,u_j)\}_{j=1}^N\), as its range can be approximated by finite-dimensional spaces (see Theorem 3).

  2.

    It is known that G has a hierarchical low-rank structure: for \(0<\epsilon <1\), there exists a function \(G_k(x,y) = \sum _{j=1}^k g_j(x)h_j(y)\) with \(k = \mathcal {O}(\log ^4(1/\epsilon ))\) such that [3, Theorem 2.8]

    $$\begin{aligned} \left\| G - G_k\right\| _{L^2(X\times Y)}\le \epsilon \left\| G\right\| _{L^2(X\times {\hat{Y}})}, \end{aligned}$$

    where \(X,Y\subseteq D\) are sufficiently separated domains, and \(Y\subseteq {\hat{Y}}\subseteq D\) denotes a larger domain than Y (see Theorem 4 for the definition). The further apart X and Y, the faster the singular values of G decay. Moreover, G also has an off-diagonal decay property [19, 25]:

    $$\begin{aligned} G(x,y) \le \frac{c}{\Vert x - y\Vert _2}\Vert G\Vert _{L^2(D\times D)}, \qquad x\ne y\in D, \end{aligned}$$

    where c is a constant independent of x and y. Exploiting these structures of G leads to a rigorous algorithm for constructing a global approximant to G (see Sect. 4).

  3.

    The function G is smooth away from its diagonal, allowing one to efficiently approximate it [19].

Once a global approximation \({\tilde{G}}\) has been constructed for G using input–output pairs, given a new right-hand side f one can directly compute the integral in Eq. (3) to obtain the corresponding solution u to Eq. (1). Usually, numerically computing the integral in Eq. (3) must be done with sufficient care as G possesses a singularity when \(x = y\). However, our global approximation \({\tilde{G}}\) has a hierarchical structure and is constructed to be zero near the diagonal. Therefore, for each fixed \(x\in D\), we simply recommend partitioning \(\int _{D} {\tilde{G}}(x,y) f_j(y)d y\) into the panels that correspond to the hierarchical decomposition and then discretizing each panel with a quadrature rule.
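
As a concrete illustration of this recommendation, the sketch below (a one-dimensional stand-in with a hypothetical approximate kernel and right-hand side, not the algorithm of Sect. 4) evaluates such an integral by summing a Gauss–Legendre rule over a collection of panels, with the approximant set to zero near the diagonal.

```python
import numpy as np

def integrate_over_panels(G_tilde, f, x, panels, n_quad=20):
    """Approximate u(x) = int G_tilde(x, y) f(y) dy with a Gauss-Legendre
    rule applied on each panel of a partition of [0, 1]."""
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)  # rule on [-1, 1]
    total = 0.0
    for a, b in panels:
        y = 0.5 * (b - a) * nodes + 0.5 * (a + b)  # map the nodes to [a, b]
        w = 0.5 * (b - a) * weights
        total += np.sum(w * G_tilde(x, y) * f(y))
    return total

# Hypothetical stand-ins: an approximate kernel set to zero near the diagonal
# (mimicking the hierarchical approximant) and a smooth right-hand side.
G_tilde = lambda x, y: np.where(np.abs(x - y) < 0.1, 0.0, np.exp(-np.abs(x - y)))
f = lambda y: np.sin(np.pi * y)
panels = [(j / 8, (j + 1) / 8) for j in range(8)]  # uniform stand-in for the hierarchical panels

print(integrate_over_panels(G_tilde, f, x=0.3, panels=panels))
```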

1.1 Main Contributions

There are two main contributions in this paper: (1) the generalization of the randomized singular value decomposition (SVD) algorithm for learning matrices from matrix-vector products to Hilbert–Schmidt (HS) operators and (2) a theoretical learning rate for discovering Green’s functions associated with PDEs of the form Eq. (2). These contributions are summarized in Theorems 1 and 3.

Theorem 1 says that, with high probability, one can recover a near-best rank k HS operator using \(k+p\) operator-function products, for a small integer p. In the bound of the theorem, a quantity, denoted by \(0< \gamma _k \le 1\), measures the quality of the input–output training pairs (see Sects. 3.1 and 3.4). We then combine Theorem 1 with the theory of Green’s functions for elliptic PDEs to derive a theoretical learning rate for PDEs.

In Theorem 3, we show that Green’s functions associated with uniformly elliptic PDEs in three dimensions can be recovered using \(N=\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) input–output pairs \(\{(f_j,u_j)\}_{j=1}^N\) to within an accuracy of \(\mathcal {O}(\varGamma _\epsilon ^{-1/2}\log ^3(1/\epsilon )\epsilon )\) with high probability, for \(0<\epsilon <1\). Our learning rate associated with uniformly elliptic PDEs in three variables is therefore \(\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\). The quantity \(0< \varGamma _\epsilon \le 1\) (defined in Sect. 4.4.2) measures the quality of the GP used to generate the random functions \(\{f_j\}_{j=1}^N\) for learning G. We emphasize that the number of training pairs is small only if the GP’s quality is high. The probability bound in Theorem 3 implies that the constructed approximation is close to G with high probability and converges almost surely to the Green’s function as \(\epsilon \rightarrow 0\).

1.2 Organization of Paper

The paper is structured as follows. In Sect. 2, we briefly review HS operators and GPs. We then generalize the randomized SVD algorithm to HS operators in Sect. 3. Next, in Sect. 4, we characterize the learning rate for PDEs of the form of Eq. (2) (see Theorem 3). Finally, we conclude and discuss potential further directions in Sect. 5.

2 Background Material

We begin by reviewing quasimatrices (see Sect. 2.1), HS operators (see Sect. 2.2), and GPs (see Sect. 2.3).

2.1 Quasimatrices

Quasimatrices are an infinite-dimensional analogue of tall-skinny matrices [57]. Let \(D_1,D_2\subseteq \mathbb {R}^d\) be two domains with \(d\ge 1\) and denote by \(L^2(D_{1})\) the space of square-integrable functions defined on \(D_{1}\). Many of the results in this paper are easier to state using quasimatrices. We say that \(\varvec{\Omega }\) is a \(D_1\times k\) quasimatrix if \(\varvec{\Omega }\) is a matrix with k columns where each column is a function in \(L^2(D_1)\). That is,

$$\begin{aligned} \varvec{\Omega } = \begin{bmatrix} \omega _1 \, |&\! \cdots \!&| \, \omega _k \end{bmatrix}, \qquad \omega _j\in L^2(D_1). \end{aligned}$$

Quasimatrices are useful to define analogues of matrix operations for HS operators [11, 56,57,58]. For example, if \(\mathscr {F}:L^2(D_1)\rightarrow L^2(D_2)\) is a HS operator, then we write \(\mathscr {F}\varvec{\Omega }\) to denote the quasimatrix obtained by applying \(\mathscr {F}\) to each column of \(\varvec{\Omega }\). Moreover, we write \(\varvec{\Omega }^*\varvec{\Omega }\) and \(\varvec{\Omega }\varvec{\Omega }^*\) to mean the following:

$$\begin{aligned} \varvec{\Omega }^*\varvec{\Omega } = \begin{bmatrix}\langle \omega _1,\omega _1 \rangle &{} \cdots &{} \langle \omega _1,\omega _k \rangle \\ \vdots &{} \ddots &{} \vdots \\ \langle \omega _k,\omega _1 \rangle &{} \cdots &{} \langle \omega _k,\omega _k \rangle \end{bmatrix}, \qquad \varvec{\Omega }\varvec{\Omega }^* = \sum _{j=1}^k \omega _j(x)\omega _j(y), \end{aligned}$$

where \(\langle \cdot , \cdot \rangle \) is the \(L^2(D_1)\) inner-product. Many operations for rectangular matrices in linear algebra can be generalized to quasimatrices [57].
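
To make these operations concrete, the sketch below (our own illustration; the grid and columns are arbitrary choices) represents a quasimatrix by sampling its columns on a uniform grid and approximates \(\varvec{\Omega }^*\varvec{\Omega }\) and \(\varvec{\Omega }\varvec{\Omega }^*\) with a simple quadrature rule.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1001)   # grid on the domain D_1 = [0, 1]
h = x[1] - x[0]                   # uniform quadrature weight

# Columns of the quasimatrix Omega sampled on the grid (placeholder functions).
Omega = np.column_stack([np.sin(np.pi * (j + 1) * x) for j in range(3)])

# Omega^* Omega: the k x k matrix of L^2 inner products <omega_i, omega_j>.
gram = h * (Omega.T @ Omega)

# Omega Omega^*: the rank-k bivariate function sum_j omega_j(x) omega_j(y),
# tabulated on the grid so that entry (i, j) is its value at (x_i, x_j).
outer = Omega @ Omega.T

print(np.round(gram, 3))          # approximately 0.5 * identity for these sine columns
```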

2.2 Hilbert–Schmidt Operators

HS operators are an infinite-dimensional analogue of matrices acting on vectors. Since \(L^2(D_{1})\) is a separable Hilbert space, there is a complete orthonormal basis \(\{e_j\}_{j=1}^\infty \) for \(L^2(D_{1})\). We call \(\mathscr {F}: L^2(D_1)\rightarrow L^2(D_2)\) a HS operator [23, Ch. 4] with HS norm \(\Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}\) if \(\mathscr {F}\) is linear and

$$\begin{aligned} \Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}} :=\left( \sum _{j=1}^\infty \Vert \mathscr {F}e_j\Vert _{L^2(D_2)}^2\right) ^{1/2}<\infty . \end{aligned}$$

The archetypical example of an HS operator is an HS integral operator \(\mathscr {F}:L^2(D_1)\rightarrow L^2(D_2)\) defined by

$$\begin{aligned} (\mathscr {F}f)(x)=\int _{D_1} G(x,y)f(y)d y, \qquad f\in L^2(D_1),\quad x\in D_2, \end{aligned}$$

where \(G\in L^2(D_2\times D_1)\) is the kernel of \(\mathscr {F}\) and \(\Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}=\Vert G\Vert _{L^2(D_2\times D_1)}\). Since HS operators are compact operators, they have an SVD [23, Theorem 4.3.1]. That is, there exists a nonnegative sequence \(\sigma _1\ge \sigma _2\ge \cdots \ge 0\) and an orthonormal basis \(\{q_{j}\}_{j=1}^\infty \) for \(L^2(D_2)\) such that for any \(f\in L^2(D_1)\) we have

$$\begin{aligned} \mathscr {F}f = \sum _{\begin{array}{c} j=1\\ \sigma _j>0 \end{array}}^{\infty }\sigma _j\langle e_j, f\rangle q_{j}, \end{aligned}$$
(4)

where the equality holds in the \(L^2(D_2)\) sense. Note that we use the complete SVD, which includes singular functions associated with the kernel of \(\mathscr {F}\). Moreover, one finds that \(\Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}^2 = \sum _{j=1}^\infty \sigma _j^2\), which shows that the HS norm is an infinite-dimensional analogue of the Frobenius matrix norm \(\Vert \cdot \Vert _{F }\). In the same way that truncating the SVD after k terms gives a best rank k matrix approximation, truncating Eq. (4) gives a best approximation in the HS norm. That is, [23, Theorem 4.4.7]

$$\begin{aligned} \Vert \mathscr {F}-\mathscr {F}_k\Vert _{{{\,\mathrm{HS}\,}}}^2=\sum _{j=k+1}^\infty \sigma _j^2, \qquad \mathscr {F}_kf = \sum _{j=1}^{k}\sigma _j\langle e_j,f\rangle q_{j}, \quad f\in L^2(D_1). \end{aligned}$$

In this paper, we are interested in constructing an approximation to G in Eq. (3) from input–output pairs \(\{(f_j,u_j)\}_{j=1}^N\) such that \(u_j = \mathscr {F}f_j\).

Throughout this paper, the HS operator denoted by \(\varvec{\Omega }\varvec{\Omega }^*\mathscr {F}: L^2(D_1)\rightarrow L^2(D_2)\) is given by \(\varvec{\Omega }\varvec{\Omega }^*\mathscr {F}f = \sum _{j=1}^k \langle \omega _j,\mathscr {F}f\rangle \omega _j\). If we consider the operator \(\varvec{\Omega }^*\mathscr {F}: L^2(D_1)\rightarrow \mathbb {R}^k\), then \(\Vert \varvec{\Omega }^*\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}^2 = \sum _{j=1}^\infty \Vert \varvec{\Omega }^*\mathscr {F}e_j\Vert _{2}^2\). Similarly, for \(\mathscr {F}\varvec{\Omega }: \mathbb {R}^k\rightarrow L^2(D_2)\) we have \(\Vert \mathscr {F}\varvec{\Omega }\Vert _{{{\,\mathrm{HS}\,}}}^2 = \sum _{j=1}^k \Vert \mathscr {F}\varvec{\Omega }{\tilde{e}}_j\Vert _{L^2(D_2)}^2\), where \(\{{\tilde{e}}_j\}_{j=1}^k\) is an orthonormal basis of \(\mathbb {R}^k\). Moreover, if \(\varvec{\Omega }\) has full column rank then \(\mathbf {P}_{\varvec{\Omega }}\mathscr {F}= \varvec{\Omega }(\varvec{\Omega }^*\varvec{\Omega })^{\dagger }\varvec{\Omega }^*\mathscr {F}\) is the orthogonal projection of the range of \(\mathscr {F}\) onto the column space of \(\varvec{\Omega }\). Here, \((\varvec{\Omega }^*\varvec{\Omega })^{\dagger }\) is the pseudo-inverse of \(\varvec{\Omega }^*\varvec{\Omega }\).

2.3 Gaussian Processes

A GP is an infinite-dimensional analogue of a multivariate Gaussian distribution and a function drawn from a GP is analogous to a randomly generated vector. If \(K: D\times D\rightarrow \mathbb {R}\) is a continuous symmetric positive semidefinite kernel, where \(D\subseteq \mathbb {R}^d\) is a domain, then a GP is a stochastic process \(\{X_t,\,t\in D\}\) such that for every finite set of indices \(t_1,\ldots ,t_n\in D\) the vector of random variables \((X_{t_1},\ldots ,X_{t_n})\) has a multivariate Gaussian distribution with mean \((0,\ldots ,0)\) and covariance \(K_{ij} = K(t_i,t_j)\) for \(1\le i,j\le n\). We denote a GP with mean \((0,\ldots ,0)\) and covariance K by \(\mathcal {GP}(0,K)\).

Since K is a continuous symmetric positive semidefinite kernel, it has nonnegative eigenvalues \(\lambda _1\ge \lambda _2\ge \cdots \ge 0\) and there is an orthonormal basis of eigenfunctions \(\{\psi _j\}_{j=1}^\infty \) of \(L^2(D)\) such that [23, Theorem 4.6.5]:

$$\begin{aligned} K(x,y)=\sum _{j=1}^\infty \lambda _j\psi _j(x)\psi _j(y),\quad \int _{D} K(x,y)\psi _j(y)d y= \lambda _j \psi _j(x),\quad x,y\in D, \end{aligned}$$
(5)

where the infinite sum is absolutely and uniformly convergent [39]. In addition, we define the trace of the covariance kernel K by \(\smash {{{\,\mathrm{Tr}\,}}(K):=\sum _{j=1}^\infty \lambda _j}<\infty \). The eigendecomposition of K gives an algorithm for generating functions from \(\smash {\mathcal {GP}(0,K)}\). In particular, if \(\omega \sim \sum _{j=1}^{\infty } \sqrt{\lambda _j} c_j\psi _j\), where the coefficients \(\{c_j\}_{j=1}^\infty \) are independent and identically distributed standard Gaussian random variables, then \(\omega \sim \mathcal {GP}(0,K)\) [26, 33]. We also have

$$\begin{aligned} \mathbb {E}\!\left[ \Vert \omega \Vert _{L^2(D)}^2\right] =\sum _{j=1}^\infty \lambda _j\mathbb {E}\!\left[ c_j^2\right] \Vert \psi _j\Vert _{L^2(D)}^2=\sum _{j=1}^\infty \lambda _j=\int _{D}K(y,y)\,d y<\infty , \end{aligned}$$

where the last equality is analogous to the fact that the trace of a matrix is equal to the sum of its eigenvalues. In this paper, we restrict our attention to GPs with positive definite covariance kernels so that the eigenvalues of K are strictly positive.

In Fig. 1, we display the squared-exponential kernel defined as \(K_{\text {SE}}(x,y)=\exp (-|x-y|^2/(2\ell ^2))\) for \(x,y\in [-1,1]\) [50, Chapt. 4] with parameters \(\ell = 1,0.1,0.01\) together with sampled functions from \(\mathcal {GP}(0,K_{\text {SE}})\). We observe that the functions become more oscillatory as the length-scale parameter \(\ell \) decreases and hence the numerical rank of the kernel increases or, equivalently, the associated eigenvalues \(\{\lambda _j\}\) decay more slowly to zero.
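
The construction \(\omega \sim \sum _{j=1}^{\infty } \sqrt{\lambda _j} c_j\psi _j\) can be mimicked numerically, as in the sketch below (ours; the grid, length-scale, and random seed are arbitrary), by eigendecomposing the squared-exponential covariance on a grid and forming random combinations of the discrete eigenfunctions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 400)
h = x[1] - x[0]                                   # uniform quadrature weight

def sample_gp(ell, n_samples=5):
    """Draw functions from GP(0, K_SE) via a discretized Karhunen-Loeve expansion."""
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))
    lam, psi = np.linalg.eigh(h * K)              # eigenpairs of the discretized integral operator
    lam = np.clip(lam, 0.0, None)                 # remove tiny negative round-off values
    c = rng.standard_normal((x.size, n_samples))  # i.i.d. standard Gaussian coefficients
    return (psi / np.sqrt(h)) @ (np.sqrt(lam)[:, None] * c)   # omega = sum_j sqrt(lam_j) c_j psi_j

samples = sample_gp(ell=0.1)
print(samples.shape)   # (400, 5): five draws; smaller ell gives more oscillatory samples
```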

3 Low-Rank Approximation of Hilbert–Schmidt Operators

In a landmark paper, Halko, Martinsson, and Tropp proved that one could learn the column space of a finite matrix—to high accuracy and with a high probability of success—by using matrix-vector products with standard Gaussian random vectors [22]. We now set out to generalize this from matrices to HS operators. Alternative randomized low-rank approximation techniques such as the generalized Nyström method [42] might also be generalized in a similar manner. Since the proof is relatively long, we state our final generalization now.

Fig. 1

Squared-exponential covariance kernel \(K_{\text {SE}}\) with parameters \(\ell =1,0.1,0.01\) (top row) and five functions sampled from \(\mathcal {GP}(0, K_{\text {SE}})\) (bottom row)

Theorem 1

Let \(D_1,D_2\subseteq \mathbb {R}^d\) be domains with \(d\ge 1\) and \(\mathscr {F}:L^2(D_1)\rightarrow L^2(D_2)\) be a HS operator. Select a target rank \(k\ge 1\), an oversampling parameter \(p\ge 2\), and a \(D_1\times (k+p)\) quasimatrix \(\varvec{\Omega }\) such that each column is drawn from \(\mathcal {GP}(0,K)\), where \(K:D_1\times D_1\rightarrow \mathbb {R}\) is a continuous symmetric positive definite kernel with eigenvalues \(\lambda _1\ge \lambda _2\ge \cdots >0\). If \(\mathbf {Y}=\mathscr {F}\varvec{\Omega }\), then

$$\begin{aligned} \mathbb {E}\!\left[ \Vert \mathscr {F}- \mathbf {P}_\mathbf {Y}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}\right] \le \left( 1+\sqrt{\frac{1}{\gamma _k}\frac{k(k+p)}{p-1}}\,\right) \left( \sum _{j=k+1}^{\infty }\sigma _j^2\right) ^{1/2}, \end{aligned}$$
(6)

where \(\gamma _k = k/(\lambda _1{{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1}))\) with \(\mathbf {C}_{ij}=\int _{D_1\times D_1}v_i(x)K(x,y)v_j(y)d xd y\) for \(1\le i,j\le k\). Here, \(\mathbf {P}_\mathbf {Y}\) is the orthogonal projection onto the vector space spanned by the columns of \(\mathbf {Y}\), \(\sigma _j\) is the jth singular value of \(\mathscr {F}\), and \(v_j\) is the jth right singular vector of \(\mathscr {F}\).

Assume further that \(p\ge 4\). Then, for any \(s,t\ge 1\), we have

$$\begin{aligned} \Vert \mathscr {F}-\mathbf {P}_{\mathbf {Y}}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}\le \sqrt{1+ t^2s^2 \frac{3}{\gamma _k}\frac{k(k+p)}{p+1}\sum _{j=1}^\infty \frac{\lambda _j}{\lambda _1}}\,\left( \sum _{j=k+1}^\infty \sigma _j^2\right) ^{1/2}, \end{aligned}$$
(7)

with probability \(\ge 1 - t^{-p}-[s e^{-(s^2-1)/2}]^{k+p}\).

We remark that the term \([s e^{-(s^2-1)/2}]^{k+p}\) in the statement of Theorem 1 is bounded by \(e^{-s^2}\) for \(s\ge 2\) and \(k+p\ge 5\). In the rest of the section, we prove this theorem.
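
Before doing so, we give a small numerical sketch (a discretized illustration with placeholder kernels and parameters, not part of the proof) of the algorithm behind Theorem 1: draw the columns of \(\varvec{\Omega }\) from a GP, form \(\mathbf {Y}=\mathscr {F}\varvec{\Omega }\), and project \(\mathscr {F}\) onto the span of the columns of \(\mathbf {Y}\).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 500)
h = x[1] - x[0]

# Discretized HS integral operator: A f approximates (F f)(x) = int G(x, y) f(y) dy.
G = np.exp(-np.abs(x[:, None] - x[None, :]))   # placeholder kernel
A = h * G

# Columns of Omega drawn from GP(0, K_SE) via the construction of Sect. 2.3.
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))
lam, psi = np.linalg.eigh(h * K)
lam = np.clip(lam, 0.0, None)
k, p = 10, 5
Omega = (psi / np.sqrt(h)) @ (np.sqrt(lam)[:, None] * rng.standard_normal((x.size, k + p)))

# Randomized range finder: Y = F Omega, then P_Y F with P_Y the projector onto span(Y).
Y = A @ Omega
Q, _ = np.linalg.qr(Y)
err = np.linalg.norm(A - Q @ (Q.T @ A))        # discrete analogue of ||F - P_Y F||_HS
tail = np.sqrt(np.sum(np.linalg.svd(A, compute_uv=False)[k:] ** 2))
print(err, tail)                               # compare with the optimal tail sum in Eq. (6)
```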

3.1 Three Caveats that Make the Generalization Non-Trivial

One might imagine that the generalization of the randomized SVD algorithm from matrices to HS operators is trivial, but this is not the case due to three caveats:

  1.

    The randomized SVD on finite matrices always uses matrix-vector products with standard Gaussian random vectors [22]. However, for GPs, one must always have a continuous kernel K in \(\mathcal {GP}(0,K)\), which discretizes to a non-standard multivariate Gaussian distribution. Therefore, we must extend [22, Theorem 10.5] to allow for non-standard multivariate Gaussian distributions. The discrete version of our extension is the following:

Corollary 1

Let \(\mathbf {A}\) be a real \(n_2\times n_1\) matrix with singular values \(\sigma _1\ge \cdots \ge \sigma _{\min \{n_1,n_2\}}\). Choose a target rank \(k\ge 1\) and an oversampling parameter \(p\ge 2\). Draw an \(n_1\times (k+p)\) Gaussian matrix, \(\varvec{\Omega }\), with independent columns where each column is from a multivariate Gaussian distribution with mean \((0,\ldots ,0)^\top \) and positive definite covariance matrix \(\mathbf {K}\). If \(\mathbf {Y}=\mathbf {A}\varvec{\Omega }\), then the expected approximation error is bounded by

$$\begin{aligned} \mathbb {E}\left[ \Vert \mathbf {A}-\mathbf {P}_\mathbf {Y}\mathbf {A}\Vert _{F }\right] \le \left( 1+\sqrt{\frac{k+p}{p-1}\sum _{j=n_1-k+1}^{n_1}\frac{\lambda _1}{\lambda _j}}\,\right) \left( \sum _{j=k+1}^\infty \sigma _j^2\right) ^{1/2}, \end{aligned}$$
(8)

where \(\lambda _1 \ge \cdots \ge \lambda _{n_1} > 0\) are the eigenvalues of \(\mathbf {K}\) and \(\mathbf {P}_\mathbf {Y}\) is the orthogonal projection onto the vector space spanned by the columns of \(\mathbf {Y}\). Assume further that \(p\ge 4\), then for any \(s,t\ge 1\), we have

$$\begin{aligned} \Vert \mathbf {A}-\mathbf {P}_\mathbf {Y}\mathbf {A}\Vert _{F }\le \left( \!1+ ts\cdot \sqrt{\frac{3(k+p)}{p+1}\left( \sum _{j=1}^{n_1}\lambda _j\right) \sum _{j=n_1-k+1}^{n_1}\frac{1}{\lambda _j}}\,\right) \!\left( \sum _{j=k+1}^\infty \sigma _j^2\right) ^{1/2}, \end{aligned}$$

with probability \(\ge 1 - t^{-p}-[s e^{-(s^2-1)/2}]^{k+p}\).

Choosing a covariance matrix \(\mathbf {K}\) with sufficient eigenvalue decay so that \(\lim _{n_1\rightarrow \infty }\sum _{j=1}^{n_1} \lambda _j <\infty \) allows \(\mathbb {E}[\Vert \varvec{\Omega }\Vert _{F }^2]\) to remain bounded as \(n_1\rightarrow \infty \). This is of interest when applying the randomized SVD algorithm to extremely large matrices and is critical for HS operators. A stronger statement of this result [8, Theorem 2] shows that prior information on \(\mathbf {A}\) can be incorporated into the covariance matrix to achieve lower approximation error than the randomized SVD with standard Gaussian vectors.

  2.

    We need an additional essential assumption: the kernel in \(\mathcal {GP}(0,K)\) must be “reasonable” for learning \(\mathscr {F}\), where reasonableness is measured by the quantity \(\gamma _k\) in Theorem 1. If the first k right singular functions \(v_1,\ldots ,v_k\) of the HS operator lie in the span of the first \(k+m\) eigenfunctions \(\psi _1,\ldots ,\psi _{k+m}\) of K, for some \(m\in \mathbb {N}\), then (see Eq. (11) and Lemma 2)

    $$\begin{aligned} \frac{1}{k}\sum _{j=1}^k\frac{\lambda _1}{\lambda _j}\le \frac{1}{\gamma _k} \le \frac{1}{k}\sum _{j=m+1}^{k+m}\frac{\lambda _1}{\lambda _j}. \end{aligned}$$

    In the matrix setting, this assumption always holds with \(m=n_1-k\) (see Corollary 1) and one can have \(\gamma _k = 1\) when \(\lambda _1=\cdots =\lambda _{n_1}\) [22, Theorem 10.5].

  3.

    Probabilistic error bounds for the randomized SVD in [22] are derived using tail bounds for functions of standard Gaussian matrices [30, Sect. 5.1]. Unfortunately, we are not aware of tail bounds for non-standard Gaussian quasimatrices. This results in a slightly weaker probability bound than [22, Theorem 10.7].

3.2 Deterministic Error Bound

Apart from the three caveats, the proof of Theorem 1 follows the outline of the argument in [22, Theorem 10.5]. We define two quasimatrices \(\mathbf {U}\) and \(\mathbf {V}\) containing the left and right singular functions of \(\mathscr {F}\) so that the jth column of \(\mathbf {V}\) is \(v_j\). We also denote by \(\varvec{\Sigma }\) the infinite diagonal matrix with the singular values of \(\mathscr {F}\), i.e., \(\sigma _1\ge \sigma _2\ge \cdots \ge 0\), on the diagonal. Finally, for a fixed \(k\ge 1\), we define the \(D_1\times k\) quasimatrix \(\mathbf {V}_1\) as the truncation of \(\mathbf {V}\) after the first k columns and \(\mathbf {V}_2\) as the remainder. Similarly, we split \(\varvec{\Sigma }\) into two parts:

$$\begin{aligned} \varvec{\Sigma } = \begin{pmatrix} \varvec{\Sigma }_1 &{} 0 \\ 0 &{} \varvec{\Sigma }_2 \end{pmatrix}, \end{aligned}$$

where \(\varvec{\Sigma }_1\) is the leading \(k\times k\) diagonal block and \(\varvec{\Sigma }_2\) is the remaining infinite diagonal block.

We are ready to prove an infinite-dimensional analogue of [22, Theorem 9.1] for HS operators.

Theorem 2

(Deterministic error bound) Let \(\mathscr {F}:L^2(D_1)\rightarrow L^2(D_2)\) be a HS operator with SVD given in Eq. (4). Let \(\varvec{\Omega }\) be a \(D_1\times \ell \) quasimatrix and \(\mathbf {Y}=\mathscr {F}\varvec{\Omega }\). If \(\varvec{\Omega }_1=\mathbf {V}_1^*\varvec{\Omega }\) and \(\varvec{\Omega }_2=\mathbf {V}_2^*\varvec{\Omega }\), then assuming \(\varvec{\Omega }_1\) has full rank, we have

$$\begin{aligned} \Vert \mathscr {F}-\mathbf {P}_\mathbf {Y}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}^2\le \Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2+\Vert \varvec{\Sigma }_2\varvec{\Omega }_2\varvec{\Omega }_1^{\dagger }\Vert _{{{\,\mathrm{HS}\,}}}^2, \end{aligned}$$

where \(\mathbf {P}_\mathbf {Y}=\mathbf {Y}(\mathbf {Y}^*\mathbf {Y})^\dagger \mathbf {Y}^*\) is the orthogonal projection onto the space spanned by the columns of \(\mathbf {Y}\) and \(\smash {\varvec{\Omega }_1^{\dagger } = (\varvec{\Omega }_1^*\varvec{\Omega }_1)^{-1}\varvec{\Omega }_1^*}\).

Proof

First, note that because \(\mathbf {U}\mathbf {U}^*\) is the orthogonal projection onto the range of \(\mathscr {F}\) and \(\mathbf {U}\) is a basis for the range, we have

$$\begin{aligned} \Vert \mathscr {F}-\mathbf {P}_\mathbf {Y}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}} = \Vert \mathbf {U}\mathbf {U}^*\mathscr {F}-\mathbf {P}_{\mathbf {Y}}\mathbf {U}\mathbf {U}^*\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}. \end{aligned}$$

By Parseval’s theorem [51, Theorem 4.18], we have

$$\begin{aligned} \Vert \mathbf {U}\mathbf {U}^* \mathscr {F}-\mathbf {P}_{\mathbf {Y}}\mathbf {U}\mathbf {U}^* \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}= \Vert \mathbf {U}^*\mathbf {U}\mathbf {U}^* \mathscr {F}\mathbf {V}-\mathbf {U}^*\mathbf {P}_{\mathbf {Y}}\mathbf {U}\mathbf {U}^* \mathscr {F}\mathbf {V}\Vert _{{{\,\mathrm{HS}\,}}}. \end{aligned}$$

Moreover, we have the equality \(\Vert \mathscr {F}-\mathbf {P}_{\mathbf {Y}}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}=\Vert (\mathbf {I}-\mathbf {P}_{\mathbf {U}^*\mathbf {Y}})\mathbf {U}^*\mathscr {F}\mathbf {V}\Vert _{{{\,\mathrm{HS}\,}}}\) because the inner product \(\langle \sum _{j=1}^\infty \alpha _j u_j,\sum _{j=1}^\infty \beta _j u_j\rangle =0\) if and only if \(\sum _{j=1}^\infty \alpha _j \beta _j=0\). We now take \(\mathbf {A}=\mathbf {U}^*\mathscr {F}\mathbf {V}\), which is a bounded infinite matrix such that \(\Vert \mathbf {A}\Vert _{F }=\Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}<\infty \). The statement of the theorem immediately follows from the proof of [22, Theorem 9.1]. \(\square \)

This theorem shows that the bound on the approximation error \(\Vert \mathscr {F}-\mathbf {P}_\mathbf {Y}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}\) depends on the singular values of the HS operator and the test matrix \(\varvec{\Omega }\).

3.3 Probability Distribution of \(\varvec{\Omega }_1\)

If the columns of \(\varvec{\Omega }\) are independent and identically distributed as \(\mathcal {GP}(0,K)\), then the matrix \(\varvec{\Omega }_1\) in Theorem 2 is of size \(k\times \ell \) with entries that follow a Gaussian distribution. To see this, note that

$$\begin{aligned} \varvec{\Omega }_1 = \mathbf {V}_1^*\varvec{\Omega } = \left( \begin{array}{ccc} \langle v_1,\omega _1\rangle &{} \cdots &{} \langle v_1, \omega _{\ell }\rangle \\ \vdots &{} \ddots &{} \vdots \\ \langle v_k,\omega _1\rangle &{} \cdots &{} \langle v_k,\omega _{\ell }\rangle \end{array} \right) , \qquad \omega _j\sim \mathcal {GP}(0,K). \end{aligned}$$

If \(\omega \sim \mathcal {GP}(0,K)\) with K given in Eq. (5), then we find that \(\langle v,\omega \rangle \sim \mathcal {N}(0,\sum _{j=1}^\infty \lambda _j \langle v,\psi _j\rangle ^2)\) so we conclude that \(\varvec{\Omega }_1\) has Gaussian entries with zero mean. Finding the covariances between the entries is more involved.

Lemma 1

With the same setup as Theorem 2, suppose that the columns of \(\varvec{\Omega }\) are independent and identically distributed as \(\mathcal {GP}(0,K)\). Then, the matrix \(\varvec{\Omega }_1=\mathbf {V}_1^*\varvec{\Omega }\) in Theorem 2 has independent columns and each column is identically distributed as a multivariate Gaussian with positive definite covariance matrix \(\mathbf {C}\) given by

$$\begin{aligned} \mathbf {C}_{ij} = \int _{D_1\times D_1} v_i(x)K(x,y)v_{j}(y)d xd y, \qquad 1\le i,j\le k, \end{aligned}$$
(9)

where \(v_i\) is the ith column of \(\mathbf {V}_1\).

Proof

We already know that the entries are Gaussian with mean 0. Moreover, the columns are independent because \(\omega _1,\ldots ,\omega _\ell \) are independent. Therefore, we focus on the covariance matrix. Let \(1\le i,i'\le k\), \(1\le j,j'\le \ell \), then since \(\mathbb {E}\!\left[ \langle v_i,\omega _j\rangle \right] = 0\) we have

$$\begin{aligned} {{\,\mathrm{cov}\,}}(\langle v_i,\omega _j\rangle , \langle v_{i'},\omega _{j'}\rangle ) = \mathbb {E}\left[ \langle v_i,\omega _j\rangle \, \langle v_{i'},\omega _{j'}\rangle \right] = \mathbb {E}\left[ X_{ij}X_{i'j'}\right] , \end{aligned}$$

where \(X_{ij} = \langle v_i,\omega _j\rangle \). Since \(\langle v_i,\omega _j\rangle \sim \sum _{n=1}^\infty \sqrt{\lambda _n}c_n^{(j)} \langle v_i,\psi _n\rangle \), where \(c_n^{(j)}\sim \mathcal {N}(0,1)\), we have

$$\begin{aligned} {{\,\mathrm{cov}\,}}(\langle v_i,\omega _j\rangle , \langle v_{i'},\omega _{j'}\rangle ) = \mathbb {E}\left[ \lim _{m_1,m_2\rightarrow \infty }X_{ij}^{m_1} X_{i'j'}^{m_2}\right] ,\qquad X_{ij}^{m_1}:=\sum _{n=1}^{m_1} \sqrt{\lambda _n}c_n^{(j)} \langle v_i,\psi _n\rangle . \end{aligned}$$

We first show that \(\lim _{m_1,m_2\rightarrow \infty }\left| \mathbb {E}\!\left[ \!X_{ij}^{m_1}X_{i'j'}^{m_2}\right] -\mathbb {E}\!\left[ X_{ij}X_{i'j'}\right] \right| =0\). For any \(m_1,m_2\ge 1\), we have by the triangle inequality,

$$\begin{aligned}&\left| \mathbb {E}\!\left[ \!X_{ij}^{m_1}X_{i'j'}^{m_2}\!\right] -\mathbb {E}\!\left[ X_{ij}X_{i'j'}\right] \right| \\&\qquad \le \mathbb {E}\!\left[ \left| X_{ij}^{m_1}X_{i'j'}^{m_2}-X_{ij}X_{i'j'}\right| \right] \\&\qquad \le \mathbb {E}\!\left[ \left| (X_{ij}^{m_1}-X_{ij})X_{i'j'}^{m_2}\right| \right] \!\!+\mathbb {E}\!\left[ \left| X_{ij}(X_{i'j'}^{m_2}-X_{i'j'})\right| \right] \\&\qquad \le \mathbb {E}\!\left[ \left| X_{ij}^{m_1}-X_{ij}\right| ^2\right] ^{\tfrac{1}{2}}\!\mathbb {E}\!\left[ \left| X_{i'j'}^{m_2}\right| ^2\right] ^{\tfrac{1}{2}}\!\! + \mathbb {E}\!\left[ \left| X_{i'j'}-X_{i'j'}^{m_2}\right| ^2\right] ^{\tfrac{1}{2}}\!\mathbb {E}\!\left[ \left| X_{ij}\right| ^2\right] ^{\tfrac{1}{2}}\!, \end{aligned}$$

where the last inequality follows from the Cauchy–Schwarz inequality. We now set out to show that both terms in the last inequality converge to zero as \(m_1,m_2\rightarrow \infty \). The terms \(\smash {\mathbb {E}[|X_{i'j'}^{m_2}|^2]}\) and \(\smash {\mathbb {E}[|X_{ij}|^2]}\) are bounded by \(\sum _{n=1}^\infty \lambda _n<\infty \), using the Cauchy–Schwarz inequality. Moreover, we have

$$\begin{aligned} \mathbb {E}\left[ \left| X_{ij}^{m_1}-X_{ij}\right| ^2\right] = \mathbb {E}\left[ \left| \sum _{n=m_1+1}^\infty \sqrt{\lambda _n}c_n^{(j)} \langle v_i,\psi _n\rangle \right| ^2\right] \le \sum _{n=m_1+1}^\infty \lambda _n \xrightarrow [m_1 \rightarrow \infty ]{} 0, \end{aligned}$$

because \(X_{ij} - X_{ij}^{m_1}\sim \mathcal {N}(0,\sum _{n=m_1+1}^\infty \lambda _n\langle v_i,\psi _n\rangle ^2)\). Therefore, we find that \({{\,\mathrm{cov}\,}}(X_{ij}, X_{i'j'}) = \lim _{m_1,m_2\rightarrow \infty }\mathbb {E}[X_{ij}^{m_1}X_{i'j'}^{m_2}]\) and we obtain

$$\begin{aligned} {{\,\mathrm{cov}\,}}(X_{ij},X_{i'j'})&=\lim _{m_1,m_2\rightarrow \infty }\mathbb {E}\left[ \sum _{n=1}^{m_1}\sum _{n'=1}^{m_2} \sqrt{\lambda _n \lambda _{n'}}c_n^{(j)}c_{n'}^{(j')} \langle v_i,\psi _n\rangle \langle v_{i'},\psi _{n'}\rangle \right] \\&=\lim _{m_1,m_2\rightarrow \infty }\sum _{n=1}^{m_1}\sum _{n'=1}^{m_2} \sqrt{\lambda _n \lambda _{n'}}\mathbb {E}[c_n^{(j)}c_{n'}^{(j')}] \langle v_i,\psi _n\rangle \langle v_{i'},\psi _{n'}\rangle . \end{aligned}$$

The latter expression is zero if \(n\ne n'\) or \(j\ne j'\) because then \(c_n^{(j)}\) and \(c_{n'}^{(j')}\) are independent random variables with mean 0. Since \(\mathbb {E}[(c_n^{(j)})^2] = 1\), we have

$$\begin{aligned} {{\,\mathrm{cov}\,}}(X_{ij},X_{i'j'})= {\left\{ \begin{array}{ll} \sum _{n=1}^\infty \lambda _n \langle v_i,\psi _n\rangle \langle v_{i'},\psi _{n}\rangle , &{} j=j',\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

The result follows as the infinite sum is equal to the integral in Eq. (9). To see that \(\mathbf {C}\) is positive definite, let \(a\in \mathbb {R}^k\), then \(a^* \mathbf {C} a=\mathbb {E}[Z_a^2]\ge 0\), where \(Z_a\sim \mathcal {N}(0,\sum _{n=1}^\infty \lambda _n \langle a_1 v_1+\cdots +a_k v_k,\psi _n\rangle ^2)\). Moreover, \(a^* \mathbf {C} a = 0\) implies that \(a=0\) because \(v_1,\ldots ,v_k\) are orthonormal and \(\{\psi _n\}\) is an orthonormal basis of \(L^2(D_1)\). \(\square \)

Lemma 1 gives the distribution of the matrix \(\varvec{\Omega }_1\), which is essential to prove Theorem 1 in Sect. 3.6. In particular, \(\varvec{\Omega }_1\) has independent columns that are each distributed as a multivariate Gaussian with covariance matrix given in Eq. (9).

3.4 Quality of the Covariance Kernel

To investigate the quality of the kernel, we introduce the Wishart distribution, which is a family of probability distributions over symmetric and nonnegative-definite matrices that often appear in the context of covariance matrices [61]. If \(\varvec{\Omega }_1\) is a \(k\times \ell \) random matrix with independent columns, where each column is a multivariate Gaussian distribution with mean \((0,\ldots ,0)^\top \) and covariance \(\mathbf {C}\), then \(\mathbf {A}=\varvec{\Omega }_1 \varvec{\Omega }_1^*\) has a Wishart distribution [61]. We write \(\mathbf {A}\sim W_k(\ell ,\mathbf {C})\). We note that \(\Vert \varvec{\Omega }_1^{\dagger }\Vert _{F }^2={{\,\mathrm{Tr}\,}}[(\varvec{\Omega }_1^{\dagger })^*\varvec{\Omega }_1^{\dagger }]={{\,\mathrm{Tr}\,}}(\mathbf {A}^{-1})\), where the second equality holds with probability one because the matrix \(\mathbf {A}=\varvec{\Omega }_1\varvec{\Omega }_1^*\) is invertible with probability one (see [41, Theorem 3.1.4]). By [41, Theorem 3.2.12] for \(\ell -k\ge 2\), we have \(\mathbb {E}[\mathbf {A}^{-1}]=\frac{1}{\ell -k-1}\mathbf {C}^{-1}\), \(\mathbb {E}[{{\,\mathrm{Tr}\,}}(\mathbf {A}^{-1})]={{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1})/(\ell -k-1)\), and conclude that

$$\begin{aligned} \mathbb {E}\left[ \Vert \varvec{\Omega }_1^{\dagger }\Vert _{F }^2\right] =\frac{1}{\gamma _k\lambda _1}\frac{k}{\ell -k-1},\qquad \gamma _k :=\frac{k}{\lambda _1 {{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1})}. \end{aligned}$$
(10)

The quantity \(\gamma _k\) can be viewed as measuring the quality of the covariance kernel K for learning the HS operator \(\mathscr {F}\) (see Theorem 1). First, \(0<\gamma _k<\infty \) as \(\mathbf {C}\) is symmetric positive definite. Moreover, for \(1\le j\le k\), the jth largest eigenvalue of \(\mathbf {C}\) is bounded by the jth largest eigenvalue of K as \(\mathbf {C}\) is a principal submatrix of \(\mathbf {V}^*K\mathbf {V}\) [27, Sect. III.5]. Therefore, the following inequality holds,

$$\begin{aligned} \frac{1}{k}\sum _{j=1}^k\frac{\lambda _1}{\lambda _j}\le \frac{1}{\gamma _k}<\infty , \end{aligned}$$
(11)

and \(\gamma _k\) is therefore bounded above by the harmonic mean of the first k scaled eigenvalues \(\lambda _j/\lambda _1\) of K. In the ideal situation, the eigenfunctions of K are the right singular functions of \(\mathscr {F}\), i.e., \(\psi _n=v_n\), \(\mathbf {C}\) is a diagonal matrix with entries \(\lambda _1,\ldots ,\lambda _k\), and \(\gamma _k=k/(\sum _{j=1}^k \lambda _1/\lambda _j)\) is as large as possible.
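
The quantity \(\gamma _k\) can also be computed numerically for a given operator and covariance kernel, as in the sketch below (our illustration; the kernels, grid, and value of k are arbitrary), which evaluates \(\mathbf {C}\) from Eq. (9) with a quadrature rule and checks it against the bound implied by Eq. (11).

```python
import numpy as np

x = np.linspace(0.0, 1.0, 400)
h = x[1] - x[0]

# Placeholder HS integral operator (Green's function of -u'' on [0, 1]) and covariance kernel.
G = np.minimum(x[:, None], x[None, :]) * (1.0 - np.maximum(x[:, None], x[None, :]))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.3 ** 2))

k = 5
_, _, Vt = np.linalg.svd(h * G)
V1 = Vt[:k].T / np.sqrt(h)                       # first k right singular functions, L^2-normalized

# C_ij = int v_i(x) K(x, y) v_j(y) dx dy, discretized with uniform quadrature weights.
C = h * h * (V1.T @ K @ V1)

lam = np.sort(np.linalg.eigvalsh(h * K))[::-1]   # eigenvalues of the covariance kernel
gamma_k = k / (lam[0] * np.trace(np.linalg.inv(C)))
ideal = k / np.sum(lam[0] / lam[:k])             # Eq. (11): gamma_k is at most this harmonic mean
print(gamma_k, ideal)
```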

We now provide a useful upper bound on \(\gamma _k\) in a more general setting.

Lemma 2

Let \(\mathbf {V}_1\) be a \(D_1\times k\) quasimatrix with orthonormal columns and assume that there exists \(m\in \mathbb {N}\) such that the columns of \(\mathbf {V}_1\) lie in the span of the first \(k+m\) eigenfunctions of the continuous positive definite kernel \(K:D_1\times D_1\rightarrow \mathbb {R}\). Then

$$\begin{aligned} \frac{1}{\gamma _k} \le \frac{1}{k}\sum _{j=m+1}^{k+m}\frac{\lambda _1}{\lambda _j}, \end{aligned}$$

where \(\lambda _1\ge \lambda _2\ge \cdots > 0\) are the eigenvalues of K. This bound is tight in the sense that the inequality can be attained as an equality.

Proof

Let \(\mathbf {Q} = \left[ v_1\,|\,\cdots \,|\,v_k\,|\,q_{k+1}\,|\cdots \,|\,q_{k+m}\right] \) be a quasimatrix whose columns form an orthonormal basis for \(\mathrm{Span}(\psi _1,\ldots ,\psi _{k+m})\). Then, the columns of \(\mathbf {Q}\) span an invariant subspace of K, and \(\mathbf {C}\) is a principal submatrix of \(\mathbf {Q}^*K\mathbf {Q}\), which has eigenvalues \(\lambda _1\ge \cdots \ge \lambda _{k+m}\). By the eigenvalue interlacing theorem [27, Theorem 6.46], the k eigenvalues of \(\mathbf {C}\), denoted by \(\mu _1\ge \cdots \ge \mu _{k}\), satisfy \(\mu _{j}\ge \lambda _{m+j}\) for \(1\le j\le k\), and the result follows as the trace of a matrix is the sum of its eigenvalues. \(\square \)

3.5 Probabilistic Error Bounds

As discussed in Sect. 3.1, we need to extend the probability bounds of the randomized SVD to allow for non-standard Gaussian random vectors. The following lemma is a generalization of [22, Theorem A.7].

Lemma 3

Let \(k,\ell \ge 1\) such that \(\ell -k\ge 4\) and \(\varvec{\Omega }_1\) be a \(k\times \ell \) random matrix with independent columns such that each column follows a multivariate Gaussian distribution with mean \((0,\ldots ,0)^\top \) and positive definite covariance \(\mathbf {C}\). For all \(t\ge 1\), we have

$$\begin{aligned} \mathbb {P}\left\{ \Vert \varvec{\Omega }_1^\dagger \Vert _{F }^2>\frac{3{{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1})}{\ell -k+1}\cdot t^2\right\} \le t^{-(\ell -k)}. \end{aligned}$$

Proof

Since \(\varvec{\Omega }_1\varvec{\Omega }_1^*\sim W_{k}(\ell ,\mathbf {C})\), the reciprocals of its diagonal elements follow a scaled chi-square distribution [41, Theorem 3.2.12], i.e.,

$$\begin{aligned} \frac{\left( (\varvec{\Omega }_1\varvec{\Omega }_1^*)^{-1}\right) _{jj}}{\left( \mathbf {C}^{-1}\right) _{jj}} \sim X_j^{-1},\qquad X_j\sim \chi _{\ell -k+1}^2,\qquad 1\le j\le k. \end{aligned}$$

Let \(Z=\Vert \varvec{\Omega }_1^\dagger \Vert _{F }^2={{\,\mathrm{Tr}\,}}[(\varvec{\Omega }_1\varvec{\Omega }_1^*)^{-1}]\) and \(q=(\ell -k)/2\). Following the proof of [22, Theorem A.7], we have the inequality

$$\begin{aligned} \mathbb {P}\left\{ |Z|\ge \frac{3{{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1})}{\ell -k+1}\cdot t^2\right\} \le \left[ \frac{3{{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1})}{\ell -k+1}\cdot t^2\right] ^{-q}\mathbb {E}\left[ |Z|^q\right] ,\quad t\ge 1. \end{aligned}$$

Moreover, by the Minkowski inequality, we have

$$\begin{aligned} \left( \mathbb {E}\left[ |Z^q|\right] \right) ^{1/q}=\left( \mathbb {E}\left[ \left| \sum _{j=1}^k[\mathbf {C}^{-1}]_{jj}X_j^{-1}\right| ^q\right] \right) ^{1/q}\!\!\le \sum _{j=1}^k [\mathbf {C}^{-1}]_{jj}\mathbb {E}\left[ |X_j^{-1}|^q\right] ^{1/q}\le \frac{3{{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1})}{\ell -k+1}, \end{aligned}$$

where the last inequality is from [22, Lemma A.9]. The result follows from the argument in the proof of [22, Theorem A.7]. \(\square \)

Under the assumption of Lemma 2, we find that Lemma 3 gives the following bound:

$$\begin{aligned} \mathbb {P}\left\{ \Vert \varvec{\Omega }_1^\dagger \Vert _{F }>t\cdot \sqrt{\frac{3}{\ell -k+1}\sum _{j=m+1}^{k+m}\lambda _j^{-1}}\right\} \le t^{-(\ell -k)}. \end{aligned}$$

In particular, in the finite-dimensional case when \(\lambda _1=\cdots =\lambda _n=1\), we recover the probabilistic bound found in [22, Theorem A.7].

To obtain the probability statement found in Eq. (13), we require control of the tail of the distribution of a Gaussian quasimatrix with non-standard covariance kernel (see Sect. 3.6). In the theory of the randomized SVD, one relies on concentration of measure results [22, Prop. 10.3]. However, we need to employ a different strategy and instead directly bound the HS norm of \(\varvec{\Omega }_2\). One difficulty is that the norm of this quasimatrix must be controlled in infinite dimensions, which leads to a weaker probability bound than [22]. While it is possible to apply Markov’s inequality to obtain deviation bounds, we highlight that Lemma 4 provides a Chernoff-type bound, i.e., exponential decay of the tail distribution of \(\Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}\), which is crucial to approximate Green’s functions (see Sect. 4.4.3).

Lemma 4

With the same notation as in Theorem 2, let \(\ell \ge k \ge 1\). For all \(s\ge 1\), we have

$$\begin{aligned} \mathbb {P}\left\{ \Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2>\ell s^2{{\,\mathrm{Tr}\,}}(K)\right\} \le \left[ s e^{-(s^2-1)/2}\right] ^{\ell }. \end{aligned}$$

Proof

We first remark that

$$\begin{aligned} \Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2 \le \Vert \varvec{\Omega }\Vert _{{{\,\mathrm{HS}\,}}}^2 = \sum _{j=1}^\ell Z_j,\qquad Z_j :=\Vert \omega _j\Vert _{L^2(D_1)}^2, \end{aligned}$$
(12)

where the \(Z_j\) are independent and identically distributed (i.i.d.) because the \(\omega _j\sim \mathcal {GP}(0,K)\) are i.i.d. For \(1\le j\le \ell \), we have (cf. Sect. 2.3),

$$\begin{aligned} \omega _j = \sum _{m=1}^\infty c_m^{(j)}\sqrt{\lambda _m}\psi _m, \end{aligned}$$

where \(c_m^{(j)}\sim \mathcal {N}(0,1)\) are i.i.d for \(m\ge 1\) and \(1\le j\le \ell \). First, since the series in Eq. (12) converges absolutely, we have

$$\begin{aligned} Z_j = \sum _{m=1}^\infty (c_m^{(j)})^2\lambda _m= \lim _{N\rightarrow \infty }\sum _{m=1}^N X_m, \qquad X_m = (c_m^{(j)})^2\lambda _m, \end{aligned}$$

where the \(X_m\) are independent random variables and \(X_m \sim \lambda _m\chi ^2\) for \(1\le m\le N\). Here, \(\chi ^2\) denotes the chi-squared distribution with one degree of freedom [40, Chapt. 4.3].

Let \(N\ge 1\) and \(0<\theta <1/(2{{\,\mathrm{Tr}\,}}(K))\), we can bound the moment generating function of \(\sum _{m=1}^N X_m\) as

$$\begin{aligned} \mathbb {E}\left[ e^{\theta \sum _{m=1}^N X_m}\right]&= \prod _{m=1}^N\mathbb {E}\left[ e^{\theta X_m}\right] = \prod _{m=1}^N (1-2\theta \lambda _m)^{-1/2}\le \left( 1-2\theta \sum _{m=1}^N \lambda _m\right) ^{-1/2}\\&\le \left( 1-2\theta {{\,\mathrm{Tr}\,}}(K)\right) ^{-1/2}, \end{aligned}$$

because \(X_m/\lambda _m\) are independent random variables that follow a chi-squared distribution. Using the monotone convergence theorem, we have

$$\begin{aligned} \mathbb {E}\left[ e^{\theta Z_j}\right] \le (1-2\theta {{\,\mathrm{Tr}\,}}(K))^{-1/2}. \end{aligned}$$

Let \({\tilde{s}}\ge 0\) and \(0<\theta <1/(2{{\,\mathrm{Tr}\,}}(K))\). By the Chernoff bound [10, Theorem 1], we obtain

$$\begin{aligned} \mathbb {P} \left\{ \Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2 > \ell (1+{\tilde{s}}){{\,\mathrm{Tr}\,}}(K)\right\}&\le e^{-(1+{\tilde{s}}){{\,\mathrm{Tr}\,}}(K)\ell \theta }\mathbb {E}\left[ e^{\theta Z_j}\right] ^\ell \\&=e^{-(1+{\tilde{s}}){{\,\mathrm{Tr}\,}}(K)\ell \theta }(1-2\theta {{\,\mathrm{Tr}\,}}(K))^{-\ell /2}. \end{aligned}$$

We can minimize this upper bound over \(0<\theta <1/(2{{\,\mathrm{Tr}\,}}(K))\) by choosing \(\theta = {\tilde{s}}/(2(1+{\tilde{s}}){{\,\mathrm{Tr}\,}}(K))\), which gives

$$\begin{aligned}\mathbb {P} \left\{ \Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2 > \ell (1+{\tilde{s}}){{\,\mathrm{Tr}\,}}(K)\right\} \le (1+{\tilde{s}})^{\ell /2} e^{-\ell {\tilde{s}}/2}.\end{aligned}$$

Choosing \(s=\sqrt{1+{\tilde{s}}}\ge 1\) concludes the proof. \(\square \)

Lemma 4 can be refined further to take into account the interaction between the Hilbert–Schmidt operator \(\mathscr {F}\) and the covariance kernel K (see [8, Lemma 7]).
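
As a quick numerical sanity check (our own experiment with an arbitrary truncated eigenvalue sequence, not a result from the paper), the sketch below estimates the tail probability in Lemma 4 by Monte Carlo and compares it with the stated bound.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.0 / np.arange(1, 51) ** 2            # placeholder eigenvalues of K (finite trace)
trace_K = lam.sum()
ell, s, n_trials = 8, 1.5, 5000

# ||Omega||_HS^2 = sum_j ||omega_j||^2 with omega_j = sum_m sqrt(lam_m) c_m^(j) psi_m,
# so ||omega_j||^2 = sum_m lam_m (c_m^(j))^2 by orthonormality of the psi_m.
c = rng.standard_normal((n_trials, ell, lam.size))
norms_sq = np.sum(lam * c ** 2, axis=(1, 2))

empirical = np.mean(norms_sq > ell * s ** 2 * trace_K)
bound = (s * np.exp(-(s ** 2 - 1) / 2)) ** ell
print(empirical, bound)                      # the empirical tail lies below the bound
```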

3.6 Randomized SVD Algorithm for HS Operators

We first prove an intermediary result, which generalizes [22, Prop. 10.1] to HS operators. Note that one may obtain sharper bounds using a suitably chosen covariance kernel that yields a lower approximation error [8].

Lemma 5

Let \(\varvec{\Sigma }_2\), \(\mathbf {V}_2\), and \(\varvec{\Omega }\) be defined as in Theorem 2, and \(\mathbf {T}\) be an \(\ell \times k\) matrix, where \(\ell \ge k\ge 1\). Then,

$$\begin{aligned} \mathbb {E}\left[ \Vert \varvec{\Sigma }_2 \mathbf {V}_2^* \varvec{\Omega } \mathbf {T}\Vert _{{{\,\mathrm{HS}\,}}}^2\right] \le \lambda _1\Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2\Vert \mathbf {T}\Vert _{F }^2, \end{aligned}$$

where \(\lambda _1\) is the first eigenvalue of K.

Proof

Let \(\mathbf {T} = \mathbf {U}_\mathbf {T} \mathbf {D}_\mathbf {T} \mathbf {V}_\mathbf {T}^*\) be the SVD of \(\mathbf {T}\). If \(\{v_{\mathbf {T},i}\}_{i=1}^k\) are the columns of \(\mathbf {V}_\mathbf {T}\), then

$$\begin{aligned} \mathbb {E}\left[ \Vert \varvec{\Sigma }_2 \mathbf {V}_2^* \varvec{\Omega }\mathbf {T}\Vert _{{{\,\mathrm{HS}\,}}}^2\right] = \sum _{i=1}^k\mathbb {E}\left[ \Vert \varvec{\Sigma }_2 \varvec{\Omega }_2\mathbf {U}_{\mathbf {T}} \mathbf {D}_\mathbf {T} \mathbf {V}_\mathbf {T}^* v_{\mathbf {T},i}\Vert _2^2\right] , \end{aligned}$$

where \(\varvec{\Omega }_2=\mathbf {V}_2^*\varvec{\Omega }\). Therefore, we have

$$\begin{aligned} \mathbb {E}\left[ \Vert \varvec{\Sigma }_2 \varvec{\Omega }_2\mathbf {T}\Vert _{{{\,\mathrm{HS}\,}}}^2\right] = \sum _{i=1}^k ((\mathbf {D}_\mathbf {T})_{ii})^2 \mathbb {E}\left[ \Vert \varvec{\Sigma }_2 \varvec{\Omega }_2\mathbf {U}_{\mathbf {T}}(:,i)\Vert _{2}^2\right] . \end{aligned}$$

Moreover, using the monotone convergence theorem for non-negative random variables, we have

$$\begin{aligned} \mathbb {E}\left[ \Vert \varvec{\Sigma }_2 \varvec{\Omega }_2\mathbf {U}_{\mathbf {T}}(:,i)\Vert _{2}^2\right]&= \mathbb {E}\left[ \sum _{n=1}^\infty \sum _{j=1}^{\ell }\sigma _{k+n}^2 \left| \varvec{\Omega }_2(n,j)\right| ^2\mathbf {U}_{\mathbf {T}}(j,i)^2\right] \\&= \sum _{n=1}^\infty \sum _{j=1}^{\ell }\sigma _{k+n}^2 \mathbf {U}_{\mathbf {T}}(j,i)^2\mathbb {E}\left[ \left| \varvec{\Omega }_2(n,j)\right| ^2\right] , \end{aligned}$$

where \(\sigma _{k+1},\sigma _{k+2},\ldots \) are the diagonal elements of \(\varvec{\Sigma }_2\). Then, the quasimatrix \(\varvec{\Omega }_2\) has independent columns and, using Lemma 1, we have

$$\begin{aligned} \mathbb {E}\left[ |\varvec{\Omega }_2(n,j)|^2\right] = \int _{D_1\times D_1}v_{k+n}(x)K(x,y)v_{k+n}(y)d xd y, \end{aligned}$$

where \(v_{k+n}\) is the nth column of \(\mathbf {V}_2\). Moreover, \(\mathbb {E}\left[ |\varvec{\Omega }_2(n,j)|^2\right] \le \lambda _1\) because this expectation is a Rayleigh quotient of the integral operator with kernel K evaluated at the unit-norm function \(v_{k+n}\). Finally, we have

$$\begin{aligned} \mathbb {E}\left[ \Vert \varvec{\Sigma }_2 \mathbf {V}_2^* \varvec{\Omega } \mathbf {T}\Vert _{{{\,\mathrm{HS}\,}}}^2\right] \le \lambda _1\sum _{i=1}^k ((\mathbf {D}_{\mathbf {T}})_{ii})^2\sum _{j=1}^{\ell }\mathbf {U}_{\mathbf {T}}(j,i)^2\sum _{n=1}^\infty \sigma _{k+n}^2 = \lambda _1 \Vert \mathbf {T}\Vert _{F }^2\Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2, \end{aligned}$$

by orthonormality of the columns of \(\mathbf {U}_{\mathbf {T}}\). \(\square \)

We are now ready to prove Theorem 1, which shows that the randomized SVD can be generalized to HS operators.

Proof of Theorem 1

Let \(\varvec{\Omega }_1, \varvec{\Omega }_2\) be the quasimatrices defined in Theorem 2. The \(k\times (k+p)\) matrix \(\varvec{\Omega }_1\) has full rank with probability one and by Theorem 2, we have

$$\begin{aligned} \mathbb {E}\left[ \Vert (\mathbf {I}-\mathbf {P}_\mathbf {Y})\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}\right]&\le \mathbb {E}\left[ \left( \Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2+ \Vert \varvec{\Sigma }_2\varvec{\Omega }_2\varvec{\Omega }_1^{\dagger }\Vert _{{{\,\mathrm{HS}\,}}}^2\right) ^{1/2}\right] \le \Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}} + \mathbb {E}\Vert \varvec{\Sigma }_2\varvec{\Omega }_2\varvec{\Omega }_1^{\dagger }\Vert _{{{\,\mathrm{HS}\,}}}\\&\le \Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}} + \mathbb {E}\left[ \Vert \varvec{\Sigma }_2\varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2\right] ^{1/2}\mathbb {E}\left[ \Vert \varvec{\Omega }_1^\dagger \Vert _{F }^2\right] ^{1/2}, \end{aligned}$$

where the last inequality follows from Cauchy–Schwarz inequality. Then, using Lemma 5 and Eq. (10), we have

$$\begin{aligned} \mathbb {E}\left[ \Vert \varvec{\Sigma }_2\varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2\right] \le \lambda _1(k+p)\Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2,\qquad \text {and} \qquad \mathbb {E}\left[ \Vert \varvec{\Omega }_1^{\dagger }\Vert ^2_{F }\right] \le \frac{1}{\gamma _k\lambda _1}\frac{k}{p-1}, \end{aligned}$$

where \(\gamma _k\) is defined in Sect. 3.4. The observation that \(\Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2=\sum _{j=k+1}^\infty \sigma _j^2\) concludes the proof of Eq. (6).

For the probabilistic bound in Eq. (7), we note that by Theorem 2 we have,

$$\begin{aligned} \Vert \mathscr {F}-\mathbf {P}_\mathbf {Y}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}^2\le \Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2+\Vert \varvec{\Sigma }_2\varvec{\Omega }_2\varvec{\Omega }_1^{\dagger }\Vert _{{{\,\mathrm{HS}\,}}}^2\le (1+\Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2\Vert \varvec{\Omega }_1^{\dagger }\Vert _{F }^2)\Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2, \end{aligned}$$

where the second inequality uses the submultiplicativity of the HS norm. The bound follows from bounding \(\Vert \varvec{\Omega }_1^\dagger \Vert _{F }^2\) and \(\Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2\) using Lemmas 3 and 4, respectively. \(\square \)

4 Recovering the Green’s Function from Input–Output Pairs

It is known that the Green’s function associated with Eq. (2) always exists, is unique, and is a nonnegative function \(G : D \times D \rightarrow \mathbb {R}^+ \cup \{\infty \}\) such that

$$\begin{aligned} u(x) =\int _D G(x,y)f(y)d y, \qquad f \in \mathcal {C}_c^\infty (D), \end{aligned}$$

and for each \(y\in D\) and any \(r>0\), we have \(G(\cdot , y) \in \mathcal {H}^1(D \setminus B_r(y))\cap \mathcal {W}_0^{1,1}(D)\) [19]. Since the PDE in Eq. (2) is self-adjoint, we also know that for almost every \(x,y\in D\), we have \(G(x,y) = G(y,x)\) [19].

We now state Theorem 3, which shows that if \(N = \mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) and one has N input–output pairs \(\{(f_j,u_j)\}_{j=1}^N\) with algorithmically selected \(f_j\), then the Green’s function associated with \(\mathcal {L}\) in Eq. (2) can be recovered to within an accuracy of \(\mathcal {O}(\varGamma _\epsilon ^{-1/2}\log ^3(1/\epsilon )\epsilon )\) with high probability. Here, the quantity \(0<\varGamma _\epsilon \le 1\) measures the quality of the random input functions \(\{f_j\}_{j=1}^N\) (see Sect. 4.4.2).

Theorem 3

Let \(0<\epsilon <1\), \(D\subset \mathbb {R}^3\) be a bounded Lipschitz domain, and \(\mathcal {L}\) be given by Eq. (2). If G is the Green’s function associated with \(\mathcal {L}\), then there is a randomized algorithm that constructs an approximation \({\tilde{G}}\) of G using \(\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) input–output pairs such that, as \(\epsilon \rightarrow 0\), we have

$$\begin{aligned} \Vert G-{\tilde{G}}\Vert _{L^2(D\times D)} = \mathcal {O} \left( \varGamma _\epsilon ^{-1/2}\log ^{3}(1/\epsilon )\epsilon \right) \Vert G\Vert _{L^2(D\times D)}, \end{aligned}$$
(13)

with probability \(\ge 1 - \mathcal {O}(\epsilon ^{\log (1/\epsilon )-6})\). The term \(\varGamma _\epsilon \) is defined by Eq. (25).

Our algorithm that leads to the proof of Theorem 3 relies on the extension of the randomized SVD to HS operators (see Sect. 3) and a hierarchical partition of the domain of G into “well-separated” domains.

4.1 Recovering the Green’s Function on Admissible Domains

Roughly speaking, as \(\Vert x-y\Vert _2\) increases, G becomes smoother about \((x,y)\), which can be made precise using so-called admissible domains [1, 2, 21]. Let \({{\,\mathrm{diam}\,}}X:=\sup _{x,y\in X}\Vert x-y\Vert _2\) be the diameter of X, \({{\,\mathrm{dist}\,}}(X,Y):=\inf _{x\in X,y\in Y}\Vert x-y\Vert _2\) be the shortest distance between X and Y, and \(\rho >0\) be a fixed constant. If \(X,Y\subset \mathbb {R}^3\) are bounded domains, then we say that \(X\times Y\) is an admissible domain if \({{\,\mathrm{dist}\,}}(X,Y)\ge \rho \max \{{{\,\mathrm{diam}\,}}X, {{\,\mathrm{diam}\,}}Y\}\); otherwise, we say that \(X\times Y\) is non-admissible. There is a weaker definition of admissible domains as \({{\,\mathrm{dist}\,}}(X,Y)\ge \rho \min \{{{\,\mathrm{diam}\,}}X, {{\,\mathrm{diam}\,}}Y\}\) [21, p. 59], but we do not consider it.
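
A small helper (ours; the sampled boxes are arbitrary) makes this definition concrete by testing whether two point clouds sampled from X and Y form an admissible pair for a given \(\rho \).

```python
import numpy as np

def is_admissible(X, Y, rho):
    """X, Y: arrays of shape (n, 3) and (m, 3) of points sampled from two domains."""
    diam = lambda S: np.max(np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1))
    dist = np.min(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1))
    return dist >= rho * max(diam(X), diam(Y))

# Two unit cubes (represented by their corners) whose centers are three units apart.
corners = lambda c: np.array([c + np.array(d) for d in np.ndindex(2, 2, 2)], dtype=float)
print(is_admissible(corners(np.zeros(3)), corners(np.array([3.0, 0.0, 0.0])), rho=1.0))  # True
```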

4.1.1 Approximation Theory on Admissible Domains

It turns out that the Green’s function associated with Eq. (2) has rapidly decaying singular values when restricted to admissible domains. Roughly speaking, if \(X,Y\subset D\) are such that \(X\times Y\) is an admissible domain, then G is well-approximated by a function of the form [3]

$$\begin{aligned} G_k(x,y) = \sum _{j=1}^k g_j(x)h_j(y), \qquad (x,y)\in X\times Y, \end{aligned}$$
(14)

for some functions \(g_1,\ldots ,g_k\in L^2(X)\) and \(h_1,\ldots ,h_k\in L^2(Y)\). This is summarized in Theorem 4, which is a corollary of [3, Theorem 2.8].

Theorem 4

Let G be the Green’s function associated with Eq. (2) and \(\rho >0\). Let \(X,Y\subset D\) such that \({{\,\mathrm{dist}\,}}(X,Y)\ge \rho \max \{{{\,\mathrm{diam}\,}}X, {{\,\mathrm{diam}\,}}Y\}\). Then, for any \(0<\epsilon <1\), there exists \(k\le k_\epsilon :=\lceil c(\rho ,{{\,\mathrm{diam}\,}}D,\kappa _C)\rceil \lceil \log (1/\epsilon )\rceil ^{4}+\lceil \log (1/\epsilon )\rceil \) and an approximant, \(G_k\), of G in the form given in Eq. (14) such that

$$\begin{aligned} \Vert G-G_k\Vert _{L^2(X\times Y)}\le \epsilon \Vert G\Vert _{L^2(X\times {\hat{Y}})}, \qquad {\hat{Y}}:=\{y\in D,\, {{\,\mathrm{dist}\,}}(y,Y)\le \frac{\rho }{2}{{\,\mathrm{diam}\,}}Y\}, \end{aligned}$$

where \(\kappa _C = \lambda _{\max }/\lambda _{\min }\) is the spectral condition number of the coefficient matrix A(x) in Eq. (2) and c is a constant that only depends on \(\rho \), \({{\,\mathrm{diam}\,}}D\), and \(\kappa _C\).

Proof

In [3, Theorem 2.8], it is shown that if \(Y={\tilde{Y}}\cap D\) and \({\tilde{Y}}\) is convex, then there exists \(k\le c_{\rho /2}^3\lceil \log (1/\epsilon )\rceil ^{4}+\lceil \log (1/\epsilon )\rceil \) and an approximant, \(G_k\), of G such that

$$\begin{aligned} \Vert G(x,\cdot )-G_k(x,\cdot )\Vert _{L^2(Y)}\le \epsilon \Vert G(x,\cdot )\Vert _{L^2({\hat{Y}})}, \qquad x\in X, \end{aligned}$$
(15)

where \({\hat{Y}}:=\{y\in D,\, {{\,\mathrm{dist}\,}}(y,Y)\le \frac{\rho }{2}{{\,\mathrm{diam}\,}}Y\}\) and \(c_{\rho /2}\) is a constant that only depends on \(\rho \), \({{\,\mathrm{diam}\,}}Y\), and \(\kappa _C\). As remarked in [3], \({\tilde{Y}}\) can be enclosed in a convex set of diameter \({{\,\mathrm{diam}\,}}D\) that contains D, which yields the constant \(c(\rho ,{{\,\mathrm{diam}\,}}D,\kappa _C)\). The statement follows by integrating the error bound in Eq. (15) over X. \(\square \)

Since the truncated SVD of G on \(X\times Y\) after \(k_\epsilon \ge k\) terms gives the best rank \(k_\epsilon \) approximation to G, Theorem 4 also gives bounds on the singular values:

$$\begin{aligned} \left( \sum \nolimits _{j=k_\epsilon +1}^\infty \sigma _{j,X\times Y}^2\right) ^{1/2} \le \Vert G-G_{k}\Vert _{L^2(X\times Y)} \le \epsilon \Vert G\Vert _{L^2(X\times {\hat{Y}})}, \end{aligned}$$
(16)

where \(\sigma _{j,X\times Y}\) is the jth singular value of G restricted to \(X\times Y\). Since \(k_\epsilon = \mathcal {O}(\log ^4(1/\epsilon ))\), we conclude that the singular values of G restricted to admissible domains \(X\times Y\) rapidly decay to zero.
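
As a simple illustration of this decay (a one-dimensional analogue, using the fact that the Green's function of \(-u''=f\) with \(u(0)=u(1)=0\) is \(G(x,y)=\min (x,y)(1-\max (x,y))\)), the sketch below compares the singular values of G restricted to a well-separated pair \(X\times Y\) with those on the whole square.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 600)
h = x[1] - x[0]
G = np.minimum(x[:, None], x[None, :]) * (1.0 - np.maximum(x[:, None], x[None, :]))

full = np.linalg.svd(h * G, compute_uv=False)

# Restrict to X = [0, 0.25] and Y = [0.75, 1], so that dist(X, Y) >= diam(X) = diam(Y).
X, Y = x <= 0.25, x >= 0.75
restricted = np.linalg.svd(h * G[np.ix_(X, Y)], compute_uv=False)

print(full[:5] / full[0])              # slow algebraic decay on the whole square
print(restricted[:5] / restricted[0])  # in this 1D toy the separated block x(1-y) is exactly rank one
```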

4.1.2 Randomized SVD for Admissible Domains

Since G has rapidly decaying singular values on admissible domains \(X\times Y\), we use the randomized SVD for HS operators to learn G on \(X\times Y\) with high probability (see Sect. 3).

We start by defining a GP on the domain Y. Let \(\mathcal {R}_{Y\times Y}K\) be the restriction of the covariance kernel K to the domain \(Y\times Y\), which is a continuous symmetric positive definite kernel so that \(\mathcal {GP}(0,\mathcal {R}_{Y\times Y}K)\) defines a GP on Y. We choose a target rank \(k\ge 1\), an oversampling parameter \(p\ge 2\), and form a quasimatrix \(\varvec{\Omega } = \begin{bmatrix}f_1\,|\,\cdots \,|\, f_{k+p}\end{bmatrix}\) such that \(f_j\in L^2(Y)\) and \(f_j\sim \mathcal {GP}(0,\mathcal {R}_{Y\times Y} K)\) are identically distributed and independent. We then extend by zero each column of \(\varvec{\Omega }\) from \(L^2(Y)\) to \(L^2(D)\) by \(\mathcal {R}_Y^*\varvec{\Omega }=\begin{bmatrix}\mathcal {R}_Y^* f_1\,|\,\cdots \,|\, \mathcal {R}_Y^* f_{k+p}\end{bmatrix}\), where \(\mathcal {R}_Y^* f_j\sim \mathcal {GP}(0,\mathcal {R}_{Y\times Y}^*\mathcal {R}_{Y\times Y}K)\). The zero extension operator \(\mathcal {R}_Y^*:L^2(Y)\rightarrow L^2(D)\) is the adjoint of \(\mathcal {R}_Y:L^2(D)\rightarrow L^2(Y)\).

Given the training data, \(\mathbf {Y} = \begin{bmatrix}u_1\,|\,\cdots \,|\, u_{k+p} \end{bmatrix}\) such that \(\mathcal {L}u_j = \mathcal {R}_Y^* f_j\) and \(u_j|_{\partial D} = 0\), we now construct an approximation to G on \(X\times Y\) using the randomized SVD (see Sect. 3). Following Theorem 1, we have the following approximation error for \(t\ge 1\) and \(s\ge 2\):

$$\begin{aligned} \Vert G-{\tilde{G}}_{X\times Y}\Vert _{L^2(X\times Y)}^2 \le \left( 1+t^2s^2\frac{3}{\gamma _{k,X\times Y}}\frac{k(k+p)}{p+1}\sum _{j=1}^\infty \frac{\lambda _j}{\lambda _1}\,\right) \sum \nolimits _{j=k+1}^\infty \sigma _{j,X\times Y}^2, \end{aligned}$$
(17)

with probability greater than \(1-t^{-p}-e^{-s^2(k+p)}\). Here, \(\lambda _1\ge \lambda _2\ge \cdots >0\) are the eigenvalues of K, \({\tilde{G}}_{X\times Y} = \mathbf {P}_{\mathcal {R}_{X}\mathbf {Y}}\mathcal {R}_{X}\mathscr {F}\mathcal {R}_{Y}^*\), and \(\mathbf {P}_{\mathcal {R}_{X}\mathbf {Y}} = \mathcal {R}_{X}\mathbf {Y}((\mathcal {R}_{X}\mathbf {Y})^*\mathcal {R}_{X}\mathbf {Y})^{\dagger }(\mathcal {R}_{X}\mathbf {Y})^*\) is the orthogonal projection onto the space spanned by the columns of \(\mathcal {R}_{X}\mathbf {Y}\). Moreover, \(\gamma _{k,X\times Y}\) is a measure of the quality of the covariance kernel of \(\mathcal {GP}(0,\mathcal {R}_{Y\times Y}^*\mathcal {R}_{Y\times Y}K)\) (see Sect. 3.4) and is defined as \( \gamma _{k,X\times Y} = k/(\lambda _1{{\,\mathrm{Tr}\,}}(\mathbf {C}_{X\times Y}^{-1}))\), where the \(k\times k\) matrix \(\mathbf {C}_{X\times Y}\) has entries, for \(1\le i,j\le k\),

$$\begin{aligned}{}[\mathbf {C}_{X\times Y}]_{ij} = \int _{D\times D} \mathcal {R}_Y^*v_{i,X\times Y}(x)K(x,y) \mathcal {R}_Y^*v_{j,X\times Y}(y)d xd y, \end{aligned}$$

and \(v_{1,X\times Y},\ldots , v_{k,X\times Y}\in L^2(Y)\) are the first k right singular functions of G restricted to \(X\times Y\).

Unfortunately, the formula \({\tilde{G}}_{X\times Y} = \mathbf {P}_{\mathcal {R}_X\mathbf {Y}}\mathcal {R}_{X}\mathscr {F}\mathcal {R}_{Y}^*\) cannot be formed from input–output data alone: we have no mechanism for composing \(\mathbf {P}_{\mathcal {R}_X\mathbf {Y}}\) on the left of \(\mathcal {R}_{X}\mathscr {F}\mathcal {R}_{Y}^*\). Instead, we note that the partial differential operator in Eq. (2) is self-adjoint, so \(\mathscr {F}\) is self-adjoint and G is symmetric. This allows us to write down a formula for \({\tilde{G}}_{Y\times X}\) instead. That is,

$$\begin{aligned} {\tilde{G}}_{Y\times X} = {\tilde{G}}_{X\times Y}^* = \mathcal {R}_{Y}\mathscr {F}\mathcal {R}_{X}^*\mathbf {P}_{\mathcal {R}_X\mathbf {Y}}, \end{aligned}$$

where we used the fact that \(\mathbf {P}_{\mathcal {R}_X\mathbf {Y}}\) is also self-adjoint. This means we can construct \({\tilde{G}}_{Y\times X}\) by asking for additional input–output data to evaluate the quasimatrix \(\mathscr {F}(\mathcal {R}_X^*\mathcal {R}_X\mathbf {Y})\). Of course, to compute \({\tilde{G}}_{X\times Y}\), we can swap the roles of X and Y in the above argument.
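To make this construction concrete, here is a minimal discrete sketch of how \({\tilde{G}}_{Y\times X}\) could be assembled from input–output data alone. The black box `solve` (standing for \(f\mapsto u\) with \(\mathcal {L}u=f\), \(u|_{\partial D}=0\)), the index sets `idx_X` and `idx_Y`, and the plain index-based restriction (quadrature weights are ignored) are illustrative assumptions of this sketch, not objects defined in the text.

```python
# Sketch of assembling G~_{Y x X} = R_Y F R_X^* P_{R_X Y} from solves only.
# `solve`, `idx_X`, `idx_Y` are assumed: solve(f) returns the grid values of u
# with L u = f and u|_{dD} = 0; idx_X, idx_Y index the grid points in X and Y.
import numpy as np

def learn_G_YX(solve, Omega, idx_X, idx_Y):
    # First round of input-output pairs: u_j = F(R_Y^* f_j).
    U = np.column_stack([solve(Omega[:, j]) for j in range(Omega.shape[1])])
    # Orthonormal basis Q of the columns of R_X Y (restriction of the data to X).
    Q, _ = np.linalg.qr(U[idx_X, :])
    # Second round: evaluate F on the zero-extended basis vectors R_X^* q_i.
    Q_ext = np.zeros((Omega.shape[0], Q.shape[1]))
    Q_ext[idx_X, :] = Q
    W = np.column_stack([solve(Q_ext[:, i]) for i in range(Q.shape[1])])
    # Since P_{R_X Y} = Q Q^T, we have G~_{Y x X} ~ (R_Y W) Q^T.
    return W[idx_Y, :] @ Q.T
```

Swapping `idx_X` and `idx_Y` gives the corresponding sketch for \({\tilde{G}}_{X\times Y}\), and the two rounds of solves correspond to the \(2(k_\epsilon +p)\) input–output pairs counted below.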

With a target rank of \(k=k_\epsilon = \lceil c(\rho ,{{\,\mathrm{diam}\,}}D,\kappa _C)\rceil \lceil \log (1/\epsilon )\rceil ^{4}+\lceil \log (1/\epsilon )\rceil \) and an oversampling parameter of \(p = k_\epsilon \), we can combine Theorem 4 with Eqs. (16) and (17) to obtain the bound

$$\begin{aligned} \Vert G-{\tilde{G}}_{X\times Y}\Vert _{L^2(X\times Y)}^2 \le \left( 1+t^2s^2\frac{6k_\epsilon }{\gamma _{k_{\epsilon },X\times Y}}\sum _{j=1}^\infty \frac{\lambda _j}{\lambda _1}\,\right) \epsilon ^2\Vert G\Vert _{L^2(X\times {\hat{Y}})}^2 , \end{aligned}$$

with probability greater than \(1-t^{-k_{\epsilon }}-e^{-2s^2 k_\epsilon }\). A similar approximation error holds for \({\tilde{G}}_{Y\times X}\) without additional evaluations of \(\mathscr {F}\). We conclude that our algorithm requires \(N_{\epsilon , X\times Y} \!= 2(k_\epsilon +p) = \mathcal {O}\!\left( \log ^4(1/\epsilon )\right) \) input–output pairs to learn an approximant to G on \(X\times Y\) and \(Y\times X\).

4.2 Ignoring the Green’s Function on Non-Admissible Domains

When the Green’s function is restricted to non-admissible domains, its singular values may not decay. Instead, to learn G we take advantage of the off-diagonal decay property of G. It is known that for almost every \(x\ne y\in D\) then

$$\begin{aligned} G(x,y)\le \frac{c_{\kappa _C}}{\Vert x-y\Vert _2}\Vert G\Vert _{L^2(D\times D)}, \end{aligned}$$
(18)

where \(c_{\kappa _C}\) is an implicit constant that only depends on \(\kappa _C\) (see [19, Theorem 1.1]).Footnote 6

If \(X\times Y\) is a non-admissible domain, then for any \((x,y)\in X\times Y\), we find that

$$\begin{aligned} \Vert x-y\Vert _2\le {{\,\mathrm{dist}\,}}(X,Y)+{{\,\mathrm{diam}\,}}(X)+{{\,\mathrm{diam}\,}}(Y)< (2+\rho )\max \{{{\,\mathrm{diam}\,}}X,{{\,\mathrm{diam}\,}}Y\}, \end{aligned}$$

because \({{\,\mathrm{dist}\,}}(X,Y)<\rho \max \{{{\,\mathrm{diam}\,}}X,{{\,\mathrm{diam}\,}}Y\}\). This means that \(x\in B_r(y)\cap D\), where \(r=(2+\rho )\max \{{{\,\mathrm{diam}\,}}X,{{\,\mathrm{diam}\,}}Y\}\). Using Eq. (18), we have

$$\begin{aligned} \int _X G(x,y)^2 dx&\le \int _{B_r(y)\cap D}G(x,y)^2d x \le c_{\kappa _C}^2\Vert G\Vert _{L^2(D\times D)}^2 \int _{B_r(y)}\Vert x-y\Vert _2^{-2}d x\\&\le 4\pi c_{\kappa _C}^2 r\Vert G\Vert _{L^2(D\times D)}^2. \end{aligned}$$

Noting that \({{\,\mathrm{diam}\,}}(Y)\le r/(2+\rho )\) and \(\int _Y 1 d y\le 4\pi (\mathrm{diam}(Y)/2)^3/3\), we have the following inequality for non-admissible domains \(X\times Y\):

$$\begin{aligned} \Vert G\Vert _{L^2(X\times Y)}^2\le \frac{2\pi ^2}{3(2+\rho )^3} c_{\kappa _C}^2 r^4 \Vert G\Vert _{L^2(D\times D)}^2, \end{aligned}$$
(19)

where \(r=(2+\rho )\max \{{{\,\mathrm{diam}\,}}X,{{\,\mathrm{diam}\,}}Y\}\). We conclude that the Green’s function restricted to a non-admissible domain has a relatively small norm when the domain itself is small. Therefore, in our approximant \({\tilde{G}}\) for G, we ignore G on non-admissible domains by setting \({\tilde{G}}\) to be zero.

4.3 Hierarchical Admissible Partition of Domain

We now describe a hierarchical partitioning of \(D\times D\) so that many subdomains are admissible domains, and the non-admissible domains are all small. For ease of notation, we may assume—without loss of generality—that \({{\,\mathrm{diam}\,}}D = 1\) and \(D\subset [0,1]^3\); otherwise, one should shift and scale D. Moreover, partitioning \([0,1]^3\) and restricting the partition to D is easier than partitioning D directly (Fig. 2). For the definition of admissible domains, we find it convenient to select \(\rho = 1/\sqrt{3}\).

Fig. 2 Two levels of hierarchical partitioning of \([0,1]^3\). The blue and green domains are admissible, while the blue and red domains are non-admissible (Color figure online)

Let \(I = [0,1]^3\). The hierarchical partitioning for n levels is defined recursively as:

  • \(I_{1\times 1\times 1}:=I_1\times I_1\times I_1=[0,1]^3\) is the root for level \(L = 0\).

  • At a given level \(0\le L\le n-1\), if \(I_{j_1\times j_2\times j_3}:=I_{j_1}\times I_{j_2}\times I_{j_3}\) is a node of the tree, then it has 8 children defined as

    $$\begin{aligned} \{I_{2 j_1+n_j(1)}\times I_{2 j_2+n_j(2)} \times I_{2j_3+n_j(3)}\mid n_j\in \{0,1\}^3\}. \end{aligned}$$

    Here, if \(I_j=[a,b]\), \(0\le a<b\le 1\), then \(I_{2j} = \left[ a,\frac{a+b}{2}\right] \) and \(I_{2j+1} = \left[ \frac{a+b}{2},b\right] \).

The set of non-admissible domains at the finest hierarchical level n is given by

$$\begin{aligned} P_{\text {non-adm}} = \bigcup _{\begin{array}{c} \bigwedge _{i=1}^3|j_i-{\tilde{j}}_i|\le 1\\ 2^n\le j_1,j_2,j_3\le 2^{n+1}-1\\ 2^n\le {\tilde{j}}_1,{\tilde{j}}_2,{\tilde{j}}_3\le 2^{n+1}-1 \end{array}} I_{j_1\times j_2\times j_3}\times I_{{\tilde{j}}_1\times {\tilde{j}}_2\times {\tilde{j}}_3}, \end{aligned}$$
(20)

where \(\wedge \) is the logical “and” operator. The set of admissible domains is given by

$$\begin{aligned} P_{\text {adm}} = \bigcup _{L=1}^n \Lambda (P_{\text {non-adm}}(L-1))\backslash P_{\text {non-adm}}(L), \end{aligned}$$
(21)

where \(P_{\text {non-adm}}(L)\) is the set of non-admissible domains at hierarchical level L and \(\Lambda (P_{\text {non-adm}}(L-1))\) denotes the set of children of the elements of \(P_{\text {non-adm}}(L-1)\).

Using Eqs. (20)–(21), the numbers of non-admissible and admissible domains are precisely \(|P_{\text {non-adm}}| = (3\times 2^n-2)^3\) and \(|P_{\text {adm}}| = \sum _{\ell =1}^n 2^{6}(3\times 2^{\ell -1}-2)^3-(3\times 2^{\ell }-2)^3\). In particular, the size of the partition at hierarchical level \(0\le L\le n\) is equal to \(8^L\) and the tree has a total of \((8^{n+1}-1)/7\) nodes (see Fig. 3).
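The count of non-admissible domains is easy to verify by brute force at the finest level. The short sketch below, a verification aid only, enumerates the level-n boxes of \([0,1]^3\), applies the neighbor rule of Eq. (20), and checks that the number of non-admissible pairs equals \((3\times 2^n-2)^3\); it does not build the multilevel set \(P_{\text {adm}}\) of Eq. (21).

```python
# Brute-force check of |P_non-adm| = (3*2^n - 2)^3 at the finest level n,
# using the index rule of Eq. (20).
from itertools import product

def count_non_admissible(n):
    idx = range(2**n, 2**(n + 1))            # j_i in {2^n, ..., 2^(n+1) - 1}
    boxes = list(product(idx, repeat=3))     # level-n boxes I_{j1 x j2 x j3}
    return sum(
        1
        for j in boxes
        for jt in boxes
        if all(abs(j[i] - jt[i]) <= 1 for i in range(3))
    )

for n in (1, 2):
    print(count_non_admissible(n), (3 * 2**n - 2)**3)   # 64 64, then 1000 1000
```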

Finally, the hierarchical partition of \(D\times D\) is defined via the partition \(P=P_{\text {adm}}\cup P_{\text {non-adm}}\) of \([0,1]^3\times [0,1]^3\) as follows:

$$\begin{aligned} D\times D =\bigcup \limits _{\tau \times \sigma \in P} (\tau \cap D)\times (\sigma \cap D). \end{aligned}$$

The sets of admissible and non-admissible domains of \(D\times D\) are denoted by \(P_{\text {adm}}\) and \(P_{\text {non-adm}}\) in the next sections.

Fig. 3 For illustration purposes, we include the hierarchical structure of the Green's functions in 1D after 4 levels (left) and in 3D after 2 levels (right). The hierarchical structure in 3D is complicated as this is physically a 6-dimensional tensor that has been rearranged so it can be visualized (Color figure online)

4.4 Recovering the Green’s Function on the Entire Domain

We now show that we can recover G on the entire domain \(D\times D\).

4.4.1 Global Approximation on the Non-Admissible Set

Let \(n_\epsilon \) be the number of levels in the hierarchical partition of \(D\times D\) (see Sect. 4.3). We want to make sure that the norm of the Green's function on all non-admissible domains is small so that we can safely ignore that part of G (see Sect. 4.2). As one increases the number of hierarchical levels, the volume of the non-admissible domains gets smaller (see Fig. 4).

Fig. 4 For illustration purposes, we include the hierarchical structure of the Green's function in 1D. The green blocks are admissible domains at that level, the gray blocks are admissible at a higher level, and the red blocks are the non-admissible domains at that level. The area of the non-admissible domains decreases at deeper levels (Color figure online)

Let \(X\times Y\in P_{\text {non-adm}}\) be a non-admissible domain. The two domains X and Y have diameters bounded by \(\sqrt{3}/2^{n_\epsilon }\) because they are included in cubes of side length \(1/2^{n_\epsilon }\) (see Sect. 4.3). Combining this with Eq. (19) yields

$$\begin{aligned} \Vert G\Vert _{L^2(X\times Y)}^2\le 2\pi ^2(6+\sqrt{3})c_{\kappa _C}^2 2^{-4n_\epsilon }\Vert G\Vert _{L^2(D\times D)}^2. \end{aligned}$$

Therefore, the \(L^2\)-norm of G on the non-admissible domain \(P_{\text {non-adm}}\) satisfies

$$\begin{aligned} \Vert G\Vert _{L^2(P_{\text {non-adm}})}^2 = \sum _{X\times Y\in P_{\text {non-adm}}}\Vert G\Vert _{L^2(X\times Y)}^2\le 54\pi ^2(6+\sqrt{3})c_{\kappa _C}^2 2^{-n_\epsilon }\Vert G\Vert _{L^2(D\times D)}^2, \end{aligned}$$

where we used \(|P_{\text {non-adm}}| = (3\times 2^{n_\epsilon }-2)^3\le 27(2^{3n_\epsilon })\). This means that if we select \(n_\epsilon \) to be

$$\begin{aligned} n_\epsilon = \left\lceil \log _2(54\pi ^2(6+\sqrt{3})c_{\kappa _C}^2)+2\log _2(1/\epsilon )\right\rceil \sim 2\log _2(1/\epsilon ), \end{aligned}$$
(22)

then we guarantee that \(\Vert G\Vert _{L^2(P_{\text {non-adm}})}\le \epsilon \Vert G\Vert _{L^2(D\times D)}\). We can safely ignore G on non-admissible domains—by taking the zero approximant—while approximating G to within \(\epsilon \).

4.4.2 Learning Rate of the Green’s Function

Following Sect. 4.1.2, we can construct an approximant \({\tilde{G}}_{X\times Y}\) to the Green’s function on an admissible domain \(X\times Y\) of the hierarchical partitioning using the HS randomized SVD algorithm, which requires \(N_{\epsilon ,X\times Y}=\smash {\mathcal {O}(\log ^4(1/\epsilon ))}\) input–output training pairs (see Sect. 4.1.2). Therefore, the number of training input–output pairs needed to construct an approximant to G on all admissible domains is given by

$$\begin{aligned} N_\epsilon = \sum _{X\times Y\in P_{\text {adm}}} N_{\epsilon ,X\times Y} = \mathcal {O}\left( |P_{\text {adm}}|\log ^4(1/\epsilon )\right) , \end{aligned}$$

where \(|P_{\text {adm}}|\) denotes the total number of admissible domains in the hierarchical partition with \(n_\epsilon \) levels, where \(n_\epsilon \) is given by Eq. (22). Then, we have (see Sect. 4.3):

$$\begin{aligned} |P_{\text {adm}}| = \sum _{\ell =1}^{n_\epsilon } 2^{6}(3\times 2^{\ell -1}-2)^3-(3\times 2^{\ell }-2)^3 \le 6^3 2^{3n_\epsilon }, \end{aligned}$$
(23)

and, using Eq. (22), we obtain \(|P_{\text {adm}}| = \mathcal {O}(1/\epsilon ^6)\). This means that the total number of required input–output training pairs to learn G with high probability is bounded by

$$\begin{aligned} N_\epsilon = \mathcal {O}\left( \epsilon ^{-6}\log ^4(1/\epsilon )\right) . \end{aligned}$$
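This scaling can be made concrete with a back-of-the-envelope evaluation of \(n_\epsilon \), the bound on \(|P_{\text {adm}}|\), and the resulting pair count. In the sketch below, the constants `c_kappa` (from Eq. (22)) and `c_rank` (from Theorem 4) are unknown in practice and set to placeholder values; the point is the \(\epsilon ^{-6}\log ^4(1/\epsilon )\) growth, not the (very pessimistic) absolute numbers.

```python
# Back-of-the-envelope evaluation of Eqs. (22)-(23) and the pair count N_eps.
# The constants c_kappa and c_rank are placeholders (unknown in practice).
import math

def pair_count_estimate(eps, c_kappa=1.0, c_rank=1.0):
    n_eps = math.ceil(math.log2(54 * math.pi**2 * (6 + math.sqrt(3)) * c_kappa**2)
                      + 2 * math.log2(1 / eps))                 # Eq. (22)
    P_adm_bound = 6**3 * 2**(3 * n_eps)                         # Eq. (23)
    log_term = math.ceil(math.log(1 / eps))
    k_eps = math.ceil(c_rank) * log_term**4 + log_term          # Theorem 4
    N_per_domain = 2 * (k_eps + k_eps)                          # oversampling p = k_eps
    return n_eps, P_adm_bound, P_adm_bound * N_per_domain

for eps in (1e-1, 1e-2):
    print(eps, pair_count_estimate(eps))   # N_eps grows like eps^{-6} log^4(1/eps)
```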

4.4.3 Global Approximation Error

We know that with \(N_\epsilon = \mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) input–output training pairs, we can construct an accurate approximant to G on each admissible and non-admissible domain. Since the number of admissible and non-admissible domains depends on \(\epsilon \), we now check that this implies a globally accurate approximant that we denote by \({\tilde{G}}\).

Since \({\tilde{G}}\) is zero on non-admissible domains and \(P_{\text {adm}}\cap P_{\text {non-adm}}\) has measure zero, we have

$$\begin{aligned} \Vert G-{\tilde{G}}\Vert _{L^2(D\times D)}^2 \le \epsilon ^2\Vert G\Vert _{L^2(D\times D)}^2+\sum _{X\times Y\in P_{\text {adm}}}\Vert G-{\tilde{G}}\Vert _{L^2(X\times Y)}^2. \end{aligned}$$
(24)

Following Sect. 4.4.2, if \(X\times Y\) is admissible then the approximation error satisfies

$$\begin{aligned} \Vert G-{\tilde{G}}_{X\times Y}\Vert _{L^2(X\times Y)}^2 \le 12 t^2s^2\frac{k_\epsilon }{\gamma _{k_{\epsilon },X\times Y}}\sum _{j=1}^\infty \frac{\lambda _j}{\lambda _1}\epsilon ^2\Vert G\Vert _{L^2(X\times {\hat{Y}})}^2 , \end{aligned}$$

with probability greater than \(1-t^{-k_{\epsilon }}-e^{-2s^2 k_\epsilon }\). Here, \({\hat{Y}}=\{y\in D,\, {{\,\mathrm{dist}\,}}(y,Y)\le {{\,\mathrm{diam}\,}}Y/(2\sqrt{3})\}\) (see Theorem 4 with \(\rho =1/\sqrt{3}\)). To control the worst-case \(\gamma _{k_\epsilon , X\times Y}\), we define

$$\begin{aligned} \varGamma _\epsilon = \min \{\gamma _{k_\epsilon ,X\times Y}: X\times Y\in P_{\text {adm}}\}. \end{aligned}$$
(25)

From Eq. (11), we know that \(0<\varGamma _\epsilon \le 1\) and that \(\varGamma _\epsilon \) is bounded above by the harmonic mean of the first \(k_\epsilon \) scaled eigenvalues \(\lambda _j/\lambda _1\) of the covariance kernel K, i.e.,

$$\begin{aligned} \frac{1}{\varGamma _\epsilon } \ge \frac{1}{k_\epsilon }\sum _{j=1}^{k_\epsilon }\frac{\lambda _1}{\lambda _j}. \end{aligned}$$
(26)

Now, one can see that \(X\times {\hat{Y}}\) is included in the union of at most \(5^3=125\) neighboring domains, including itself. Assuming that all the probability bounds hold on the admissible domains, this implies that

$$\begin{aligned} \sum _{X\times Y\in P_{\text {adm}}}\!\!\!\!\Vert G-{\tilde{G}}\Vert _{L^2(X\times Y)}^2&\le 12 t^2s^2\frac{k_\epsilon }{\lambda _1\varGamma _{\epsilon }}{{\,\mathrm{Tr}\,}}(K)\epsilon ^2\!\!\!\!\sum _{X\times Y\in P_{\text {adm}}}\!\!\!\!\Vert G\Vert _{L^2(X\times {\hat{Y}})}^2\\&\le 1500 t^2s^2\frac{k_\epsilon }{\lambda _1\varGamma _{\epsilon }}{{\,\mathrm{Tr}\,}}(K)\epsilon ^2\Vert G\Vert ^2_{L^2(D\times D)}. \end{aligned}$$

We then choose \(t=e\) and \(s=k_\epsilon ^{1/4}\) so that the approximation bound on each admissible domain holds with probability of failure less than \(2e^{-\sqrt{k_\epsilon }}\). Finally, using Eq. (24) we conclude that as \(\epsilon \rightarrow 0\), the approximation error on \(D\times D\) satisfies

$$\begin{aligned} \Vert G-{\tilde{G}}\Vert _{L^2(D\times D)} = \mathcal {O}\left( \varGamma _\epsilon ^{-1/2}\log ^3(1/\epsilon )\epsilon \right) \Vert G\Vert _{L^2(D\times D)}, \end{aligned}$$

with probability \(\ge (1-2e^{-\sqrt{k_\epsilon }})^{6^3 2^{3n_\epsilon }}=1-\mathcal {O}(\epsilon ^{\log (1/\epsilon )-6})\), where \(n_\epsilon \) is given by Eq. (22). We conclude that the approximant \({\tilde{G}}\) is a good approximation to G with very high probability.

5 Conclusions and Discussion

This paper rigorously learns the Green’s function associated with a PDE rather than the partial differential operator (PDO). By extending the randomized SVD to HS operators, we can identify a learning rate associated with elliptic PDOs in three dimensions and bound the number of input–output training pairs required to recover a Green’s function approximately. One practical outcome of this work is a measure for the quality of covariance kernels, which may be used to design efficient kernels for PDE learning tasks.

There are several possible future extensions of these results related to the recovery of hierarchical matrices, the study of other partial differential operators, and practical deep learning applications, which we discuss further in this section.

5.1 Fast and Stable Reconstruction of Hierarchical Matrices

We described an algorithm for reconstructing the Green's function on admissible domains of a hierarchical partition of \(D\times D\); it requires performing the HS randomized SVD \(\mathcal {O}(\epsilon ^{-6})\) times. We would like to reduce this cost to \(\mathcal {O}(\text {polylog}(1/\epsilon ))\) applications.

For \(n\times n\) hierarchical matrices, there are several existing algorithms for recovering the matrix based on matrix-vector products [5, 32, 36, 37]. There are two main approaches: (1) The “bottom-up” approach: one begins at the lowest level of the hierarchy and moves up and (2) The “top-down” approach: one updates the approximant by peeling off the off-diagonal blocks and going down the hierarchy. The bottom-up approach requires \(\mathcal {O}(n)\) applications of the randomized SVD algorithm [36]. There are lower complexity alternatives that only require \(\mathcal {O}(\log (n))\) matrix-vector products with random vectors [32]. However, the algorithm in [32] is not yet proven to be theoretically stable as errors from low-rank approximations potentially accumulate exponentially, though this is not observed in practice. For symmetric positive semi-definite matrices, it may be possible to employ a sparse Cholesky factorization [54, 55]. This leads us to formulate the following challenge:

figure a (boxed challenge statement in the original)

If one can design such an algorithm and it can be extended to HS operators, then the \(\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) term in Theorem 3 may improve to \(\mathcal {O}(\text {polylog}(1/\epsilon ))\). This means that the learning rate of partial differential operators of the form of Eq. (2) will be a polynomial in \(\log (1/\epsilon )\) and grow sublinearly with respect to \(1/\epsilon \).

5.2 Extension to Other Partial Differential Operators

Our learning rate for elliptic PDOs in three variables (see Sect. 4) depends on the decay of the singular values of the Green’s function on admissible domains [3]. We expect that one can also find the learning rate for other PDOs.

It is known that the Green's functions associated with elliptic PDOs in two dimensions exist and satisfy the following pointwise estimate [12]:

$$\begin{aligned} |G(x,y)|\le C\left( \frac{1}{\gamma R^2}+\log \left( \frac{R}{\Vert x-y\Vert _2}\right) \right) , \quad \Vert x-y\Vert _2\le R:=\frac{1}{2}\max (d_x,d_y), \end{aligned}$$
(27)

where \(d_x={{\,\mathrm{dist}\,}}(x,\partial D)\), \(d_y={{\,\mathrm{dist}\,}}(y,\partial D)\), \(\gamma \) is a constant depending on the size of the domain D, and C is an implicit constant. One can conclude that \(G(x,\cdot )\) is locally integrable for all \(x\in D\) with \(\Vert G(x,\cdot )\Vert _{L^p(B_r(x)\cap D)}<\infty \) for \(r>0\) and \(1\le p<\infty \). We believe that the pointwise estimate in Eq. (27) implies the off-diagonal low-rank structure of G in two dimensions, as suggested in [3]. Therefore, we expect that the results in this paper can be extended to elliptic PDOs in two variables.

PDOs in four or more variables are far more challenging since we rely on the following bound on the Green’s function on non-admissible domains [19]:

$$\begin{aligned} G(x,y)\le \frac{c(d,\kappa _C)}{\lambda _{\min }}\Vert x-y\Vert _2^{2-d}, \qquad x\ne y\in D, \end{aligned}$$

where \(D\subset \mathbb {R}^d\), \(d\ge 3\) is the dimension, and c is a constant depending only on d and \(\kappa _C\). This inequality implies that the \(L^p\)-norm of G on non-admissible domains is finite when \(0\le p < d/(d-2)\). However, for a dimension \(d\ge 4\), we have \(p<2\) and one cannot ensure that the \(L^2\) norm of G is finite. Therefore, the Green’s function may not be compatible with the HS randomized SVD.

It should also be possible to characterize the learning rate for elliptic PDOs with lower-order terms (under reasonable conditions) [13, 24, 28] and many parabolic operators [29], as the associated Green's functions have similar regularity and pointwise estimates. The main task is to extend [3, Theorem 2.8] to construct separable approximations of the Green's functions on admissible domains. In contrast, we believe that deriving a theoretical learning rate for hyperbolic PDOs remains a significant research challenge for many reasons. The first roadblock is that the Green's functions associated with hyperbolic PDOs do not necessarily lie in \(L^2(D\times D)\). For example, the Green's function associated with the wave equation in three variables, i.e., \(\mathcal {L}=\partial _t^2-\nabla ^2\), is not square-integrable as

$$\begin{aligned} G(x,t,y,s) = \frac{\delta (t-s-\Vert x-y\Vert _2)}{4\pi \Vert x-y\Vert _2},\qquad (x,t),(y,s)\in \mathbb {R}^3\times [0,\infty ), \end{aligned}$$

where \(\delta (\cdot )\) is the Dirac delta function.

5.3 Connection with Neural Networks

There are many possible connections between this work and neural networks (NNs) from practical and theoretical viewpoints. The proof of Theorem 3 relies on the construction of a hierarchical partition of the domain \(D\times D\) and the HS randomized SVD algorithm applied on each admissible domain. This gives an algorithm for approximating Green’s functions with high probability. However, there are more practical approaches that currently do not have theoretical guarantees [17, 18].

A promising opportunity is to design a NN that can learn and approximate Green's functions using input–output training pairs \(\{(f_j,u_j)\}_{j=1}^N\) [6]. Once a neural network \(\mathcal {N}\) has been trained such that \(\Vert \mathcal {N}-G\Vert _{L^2}\le \epsilon \Vert G\Vert _{L^2}\), the solution to \(\mathcal {L}u = f\) can be obtained by computing the following integral:

$$\begin{aligned} u(x) = \int _D \mathcal {N}(x,y)f(y)d y. \end{aligned}$$

Therefore, this may give an efficient computational approach for discovering operators since a NN is only trained once. Incorporating a priori knowledge of the Green’s function into the network architecture design could be particularly beneficial. One could also wrap the selection of the kernel in the GP for generating random functions and training data into a Bayesian framework.
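As a concrete illustration of this reconstruction step, the sketch below evaluates the integral above by quadrature once a surrogate for the Green's function is available. The callable `green_surrogate`, the right-hand side `f`, and the quadrature rule are placeholders for this sketch, not objects defined in the text.

```python
# Quadrature sketch of u(x) = int_D N(x, y) f(y) dy for a trained surrogate N.
# `green_surrogate`, `f`, and the quadrature nodes/weights are placeholders.
import numpy as np

def solve_with_surrogate(green_surrogate, f, quad_pts, quad_wts, eval_pts):
    """Return u(x_i) ~ sum_j w_j * N(x_i, y_j) * f(y_j) for each x_i in eval_pts."""
    f_vals = np.array([f(y) for y in quad_pts])
    u = np.empty(len(eval_pts))
    for i, x in enumerate(eval_pts):
        N_vals = np.array([green_surrogate(x, y) for y in quad_pts])
        u[i] = np.sum(quad_wts * N_vals * f_vals)
    return u
```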

Finally, we wonder how many parameters in a NN are needed to approximate a Green's function associated with elliptic PDOs within a tolerance of \(0<\epsilon <1\). Can one exploit the off-diagonal low-rank structure of Green's functions to reduce the number of parameters? We expect the recent work on the characterization of the approximation power of ReLU NNs to be useful [20, 44, 62]. The use of NNs with high approximation power, such as rational NNs, might also be of interest for approximating the singularities of the Green's function near the diagonal [7].