1 Introduction

We consider a system of particles or agents interacting in a random environment, with their motion described by a first-order stochastic differential equation in the form

$$\begin{aligned} \mathrm{d}{\varvec{x}_{i,t}} =\frac{1}{N} \sum _{i'= 1}^N \phi (\Vert \varvec{x}_{i',t} - \varvec{x}_{i,t}\Vert )(\varvec{x}_{i',t} - \varvec{x}_{i,t} )\,\mathrm{d}t+ \sigma \,\mathrm{d}\varvec{B}_{i,t}\,, \quad \hbox {for}\ i = 1, \ldots , N, \end{aligned}$$
(1.1)

where \(\varvec{x}_{i,t} \in {\mathbb {R}}^{d}\) represents the position of particle i at time t, \(\phi :{\mathbb {R}}^+\rightarrow {\mathbb {R}}\) is an interaction kernel dependent on the pairwise distance between particles, and \(\varvec{B}_{i,t}\), for \(i=1,\ldots ,N\), are independent standard Brownian motions in \({\mathbb {R}}^{d}\) (so that \(\varvec{B}_t=(\varvec{B}_{i,t})_{i=1}^N\) is a standard Brownian motion in \({\mathbb {R}}^{Nd}\)), with \(\sigma >0\) representing the scale of the random noise. This is a gradient system, with the energy potential \(V_{\phi }:{\mathbb {R}}^{Nd} \rightarrow {\mathbb {R}}\)

$$\begin{aligned} V_{\phi }(\varvec{X}_t)=\frac{1}{2N}\sum _{i,i'}{\varPhi }(\Vert \varvec{x}_{i,t}-\varvec{x}_{i',t}\Vert )\quad \text { with }\ \ \ {\varPhi }'(r)=\phi (r)r\,, \end{aligned}$$
(1.2)

where \(\varvec{X}_t=(\varvec{x}_{i,t})_{i=1,\dots ,N}\in {\mathbb {R}}^{dN}\) is the state of the system. Letting

$$\begin{aligned} \varvec{f}_{\phi }:= -\nabla V_{\phi }\,, \end{aligned}$$
(1.3)

we can write Eq.(1.1) in vector form as

$$\begin{aligned} \mathrm{d}\varvec{X}_t&=\varvec{f}_{\phi }(\varvec{X}_t)\mathrm{d}t+\sigma \mathrm{d}\varvec{B}_t\,. \end{aligned}$$
(1.4)

The particles interact based on their pairwise distances; the interaction dissipates the total energy and drives the system toward a stable point of the energy potential, while the random noise injects energy into the system.
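For completeness, a direct check from (1.2)–(1.3), using \({\varPhi }'(r)=\phi (r)r\), recovers the drift of (1.1) componentwise:

$$\begin{aligned} -\frac{\partial V_{\phi }(\varvec{X}_t)}{\partial \varvec{x}_{i,t}} = -\frac{1}{N}\sum _{i'\ne i}\frac{{\varPhi }'(\Vert \varvec{x}_{i,t}-\varvec{x}_{i',t}\Vert )}{\Vert \varvec{x}_{i,t}-\varvec{x}_{i',t}\Vert }\,(\varvec{x}_{i,t}-\varvec{x}_{i',t}) = \frac{1}{N}\sum _{i'\ne i}\phi (\Vert \varvec{x}_{i',t}-\varvec{x}_{i,t}\Vert )(\varvec{x}_{i',t}-\varvec{x}_{i,t})\,, \end{aligned}$$

where the factor \(\frac{1}{2N}\) in (1.2) becomes \(\frac{1}{N}\) because each unordered pair appears twice in the double sum.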

Such systems of interacting particles arise in a wide variety of disciplines, including interacting physical particles [22, 49] or granular media [1,2,3, 8, 12, 13] in Physics, opinion aggregation on interacting networks in Social Science [24, 43, 46], and Monte Carlo sampling [36, 39], to name just a few.

Motivated by these applications, the inference of such systems from data has gained increasing attention. For deterministic multi-particle systems, various types of learning techniques have been developed (see, e.g., [9, 14, 40, 41, 50, 55] and the references therein). When it comes to stochastic multi-particle systems, only a few efforts have been made, e.g., learning reduced Langevin equations on manifolds in [19] (without, however, assuming or exploiting the structure of pairwise interactions), learning parametric potential functions in [10, 15] from single trajectory data, and estimating the diffusion parameter in [26].

Our goal is to estimate the interaction kernel \(\phi \) given discrete-time observation data from trajectories \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\), where the initial conditions \(\{\varvec{X}^{(m)}_{t_0}\}_{m=1}^M\) are independent samples drawn from a distribution \(\mu _0\) on \({\mathbb {R}}^{dN}\), and \(t_0:t_L\) indicates the times \(0=t_0<t_1<\cdots<t_l<\cdots <t_L=T\), with \(t_l = l{\varDelta }t\).

Since, in general, little information about the analytical form of the kernel is available, we infer it in a nonparametric fashion (e.g., [6, 20, 23]). We note that the problem we consider is to learn a latent function in the drift term given observations from multiple trajectories, which is different from the ample literature on the inference of stochastic differential equations (see, e.g., [29, 34]) focusing either on parameter estimation or on inference for ergodic systems. In particular, our learning approach is close in spirit to the nonparametric regression of the drift studied in [44] for ergodic systems and in [17] for i.i.d. paths. However, for systems of interacting particles one faces the curse of dimensionality when learning the high-dimensional drift directly as a general function on the high-dimensional state space \({\mathbb {R}}^{dN}\). Instead, we will exploit the structure of the system and learn the latent interaction kernel in the drift, which only depends on pairwise distances, and show that the curse of dimensionality may be avoided when this inverse problem is well-conditioned.

We introduce a maximum likelihood estimator (MLE), along with an efficient algorithm that can be implemented in parallel over trajectories, with a hypothesis space adapted to the data so as to reach optimal accuracy. Under a coercivity condition, we prove that the MLE is consistent and converges at the min–max rate for one-dimensional nonparametric regression. We also analyze the discretization errors due to discrete-time observations: we show that they lead to an error in the estimator of order \({\varDelta }t^{1/2}\) (with \({\varDelta }t=T/L=t_{l+1}-t_l\)), which prevents us from obtaining the min–max learning rate in the sample size. We demonstrate the effectiveness of our algorithm by numerical tests on prototype systems, including opinion dynamics and a stochastic Lennard-Jones model (see Sect. 5). The numerical results verify our learning theory in the sense that the min–max rate of convergence is achieved, and the bias due to the numerical error is close to the order \({\varDelta }t^{1/2}\).

1.1 Overview of the Main Results

We consider an approximate maximum likelihood estimator (MLE), which is the maximizer of the approximate likelihood of the observed trajectories, over a suitable hypothesis space \({\mathcal {H}}\):

$$\begin{aligned} {\widehat{\phi }}_{L,T,M,{\mathcal {H}}}= \underset{\varphi \in {\mathcal {H}}}{{\text {arg}}{\text {min}}}\;{\mathcal {E}}_{L,T,M}(\varphi ), \end{aligned}$$

where \({\mathcal {E}}_{L,T,M}(\varphi )\) is an approximation of the negative log-likelihood of the discrete data \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\). Using the fact that the drift term \(\varvec{f}_{\phi }\) is linear in \(\phi \) and hence \({\mathcal {E}}_{L,T,M}(\varphi )\) is a quadratic functional, we propose an algorithm (see Algorithm 1) that efficiently computes this MLE by least squares. With a data-adaptive choice of the basis functions \(\{\psi _p\}_{p=1}^n\) for the hypothesis space \({\mathcal {H}}\), we obtain the MLE

$$\begin{aligned} {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} = \sum _{p=1}^n {\widehat{a}}_{L,T,M, {\mathcal {H}}}(p) \psi _p \end{aligned}$$
(1.5)

by computing the coefficients \({\widehat{a}}_{L,T,M, {\mathcal {H}}}\in {\mathbb {R}}^n\) from normal equations. The algorithm may be implemented by building in parallel the equations for each trajectory.

We develop a systematic learning theory on the performance of this MLE. We propose first a coercivity condition that ensures the robust identifiability of the kernel \(\phi \), in the sense that the derivative of the pairwise potential defined in (1.2), \({\varPhi }'(r) = \phi (r)r\), can be uniquely identified in the function space \(L^2({\mathbb {R}}^+,\rho _T)\), where \(\rho _T\) is the measure of all pairwise distances between particles. Then, we consider the convergence of the estimator, from both continuous-time and discrete-time observations, under the norm

$$\begin{aligned} {\left| \left| \left| \varphi \right| \right| \right| }:= \Vert \varphi (\cdot )\cdot \Vert _{L^2(\rho _T)}= \left( \int _{{\mathbb {R}}^+} |\varphi (r)r|^2\rho _T(\mathrm{d}r) \right) ^{1/2}\,. \end{aligned}$$
(1.6)

The case of continuous-time observations (Sect. 3). We consider the MLE

$$\begin{aligned} {\widehat{\phi }}_{T,M,{\mathcal {H}}}= \underset{\varphi \in {\mathcal {H}}}{{\text {arg}}{\text {min}}}\;{\mathcal {E}}_{T,M}(\varphi ), \end{aligned}$$

where \({\mathcal {E}}_{T,M}(\varphi )\) is the exact negative log-likelihood of the continuous-time trajectories \(\{\varvec{X}^{(m)}_{[0,T]}\}_{m=1}^M\). We show that the MLE is consistent, that is, it converges in probability to the true kernel under the norm \({\left| \left| \left| \cdot \right| \right| \right| }\). Furthermore, we show that the MLE converges at a rate which is independent of the dimension of the state space of the system and corresponds to the min–max rate for one-dimensional nonparametric regression ([6, 16, 20, 23]), when the hypothesis space is chosen adaptively according to the data and the Hölder continuity s of the true kernel, with dimension increasing with the amount of observed data. With \(\mathrm {dim}({\mathcal {H}}) \asymp (\frac{M}{\log M})^{\frac{1}{2s+1}}\), and assuming that the coercivity condition holds on \({\mathcal {H}}\) with a constant \(c_{{\mathcal {H}}}>0\), we have, with high probability and in expectation,

$$ {\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}}-\phi \right| \right| \right| }^2 \lesssim \frac{1}{c^2_{{\mathcal {H}}}} \left( \frac{\log M}{ M}\right) ^{\frac{2s}{2s+1}} $$

The case of discrete-time observations (Sect. 4). In this case, derivatives and statistics of the trajectories in-between observations need to be approximated while keeping the estimator efficiently computable: this leads to further approximations of the likelihood and consequently of the MLE. The resulting discretization error, for the approximation of the likelihood based on the Euler–Maruyama integration scheme that we use, is of order 1/2 in the observation time gap \({\varDelta }t = T/L\). We show that for some \(C>0\) and any \(\epsilon >0\), with high probability

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} - \phi \right| \right| \right| } \le {\left| \left| \left| {\widehat{\phi }}_{T, \infty , {\mathcal {H}}} - \phi \right| \right| \right| } + C\left( \sqrt{\frac{n}{M}}\epsilon +{\varDelta }t ^{\frac{1}{2}}\right) \,, \end{aligned}$$

where \({\widehat{\phi }}_{T, \infty , {\mathcal {H}}} \) is the projection of the true kernel to \({\mathcal {H}}\) and n is the dimension of the hypothesis space \({\mathcal {H}}\). The discretization error flattens the learning curve when the sample size is large, overshadowing the sampling error and the approximation error caused by working within the hypothesis space.

Fig. 1

Diagram of the (regularized) MLE error analysis and convergence. For continuous-time observations, we refer to Proposition 3.4 for analysis of SE: \({\hat{\phi }}_{T,M,{\mathcal {H}}}-{\hat{\phi }}_{T,\infty ,{\mathcal {H}}}\) and Theorem 3.2 for bounding the total estimation error \({\hat{\phi }}_{T,M,{\mathcal {H}}}-\phi \). For discrete-time observations, we refer to Proposition 4.2 for SE-L: \({\hat{\phi }}_{L,T,M,{\mathcal {H}}}-{\hat{\phi }}_{L,T,\infty ,{\mathcal {H}}}\), Proposition 4.1 for DE: \({\hat{\phi }}_{L,T,\infty ,{\mathcal {H}}}-{\hat{\phi }}_{T,\infty ,{\mathcal {H}}}\), and Theorem 4.2 for \( {\hat{\phi }}_{L,T,M,{\mathcal {H}}}-\phi \)

In both cases, we decompose the error in the MLE into a sampling error from the trajectory data and an approximation error from the hypothesis space, as illustrated in the diagram in Fig. 1. In the case of continuous-time observations, the sampling error is the error between \({\widehat{\phi }}_{T,M,{\mathcal {H}}} \) and the MLE from infinitely many trajectories (denoted by \({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}}\)): this will be controlled with concentration inequalities. The approximation error \({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} -\phi \) is adaptively controlled by a proper choice of the hypothesis space. The analysis is carried out in the infinite-dimensional space \(L^2(\rho _T)\). In the case of discrete-time observations, we provide a finite-dimensional analysis to study directly the MLE in our proposed algorithm, that is, we analyze the error of \({\widehat{a}}_{L,T,M, {\mathcal {H}}}\) in (1.5) under proper conditions on the basis functions. The sampling error \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}} - {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \) is analyzed through \({\widehat{a}}_{L,T,M, {\mathcal {H}}} -{\widehat{a}}_{L,T,\infty , {\mathcal {H}}}\), and the discretization error between \({\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \) and \({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \) is analyzed through \({\widehat{a}}_{L,T,\infty , {\mathcal {H}}} -{\widehat{a}}_{T,\infty , {\mathcal {H}}}\). The discretization error comes from the discrete-time approximation of the likelihood, and it vanishes as the observation time gap \({\varDelta }t\) tends to zero, recovering the convergence of the MLE as in the case of continuous-time observations.

Table 1 Notation for the system of interacting particles driven by Eq. (1.1)

1.2 Notation and Outline

Throughout this paper, we use bold letters to denote vectors or vector-valued functions. We use the notation in Table 1 for variables in the system of interacting particles.

We refer to the system of relative positions with respect to a reference particle, say \((\varvec{r}_{1i} = \varvec{x}_{i,t}- \varvec{x}_{1,t})_{i=2}^N\), as the “relative position system”. The relative position system can be ergodic under suitable conditions on the potential [37], and these relative positions are the variables we need to learn the interaction kernel. We point out that the interacting particle system (1.1) itself is not ergodic, because the center \({\overline{\varvec{x}}}_{t} = \frac{1}{N}\sum _{i=1}^N \varvec{x}_{i,t} \) satisfies \(\mathrm{d}{\overline{\varvec{x}}}_{t} = \sigma \frac{1}{N}\sum _{i=1}^N \mathrm{d}\varvec{B}_{i,t} \).

We restrict our attention to interaction kernels \(\phi \) in the admissible set

$$\begin{aligned} {\mathcal {K}}_{R,S}:=\{\varphi \in C^1({\mathbb {R}}_+) : \text {Supp}(\varphi ) \subset [0,R], \Vert \varphi \Vert _{\infty }+\Vert \varphi '\Vert _{\infty } \le S\}. \end{aligned}$$
(1.7)

Let \({\varOmega }\) be an arbitrary compact (or precompact) set of a Euclidean space (which may be \({\mathbb {R}}^{+}\), \({\mathbb {R}}^{d}\) or \({\mathbb {R}}^{dN}\)), with the Lebesgue measure unless otherwise specified. We consider the following function spaces

  • \(L^{\infty }({\varOmega })\): the space of bounded functions on \({\varOmega }\) with norm \(\Vert g\Vert _{\infty }=\mathrm{ess\,sup}_{x\in {\varOmega }}|g(x)|\);

  • \(C({\varOmega }):\) the closed subspace of \(L^{\infty }({\varOmega })\) consisting of continuous functions;

  • \(C_c({\varOmega }):\) the set of functions in \(C({\varOmega })\) with compact support;

  • \(C^{k,\alpha }({\varOmega })\) with \(k \in {\mathbb {N}}, 0<\alpha \le 1\): the space of functions whose k-th derivative is Hölder continuous of order \(\alpha \). In the special case of \(k=0\) and \(\alpha =1\), \(g \in C^{0,1}({\varOmega })\) is called Lipschitz continuous on \({\varOmega }\); the Lipschitz constant of \(g \in \text {Lip}({\varOmega })\) is defined as \(\text {Lip}[g]:=\sup _{x\ne y} \frac{|g(x)-g(y)|}{\Vert x-y\Vert }.\)

We summarize the notation for the inference of the interaction kernel in Table 2. The function space in which we perform the estimation is the space of functions \(\varphi \) such that \(\varphi (\cdot )\cdot \in L^2({\mathbb {R}}^+, \rho _T)\), where \(\rho _T\) is the measure of pairwise distances between all particles on the time interval [0, T] (see (2.9)). We will focus on learning on a compact (finite- or infinite-dimensional) subset of \(L^{\infty }([0,R])\) (where [0, R] is the support of the functions in the admissible set \({\mathcal {K}}_{R,S}\)) in the theoretical analysis; however, in the numerical implementation we will use finite-dimensional linear subspaces of \(L^2([0,R], \rho _T)\) spanned by piecewise polynomial functions. While these linear subspaces are not compact, it can be shown that the minimizers over the whole linear space are bounded, and thus the compactness requirements are not essential (e.g., see Theorem 11.3 in [23]). We shall therefore assume the compactness of the hypothesis space in the theoretical analysis.

Table 2 Notation used for the estimator of the interaction kernel \(\phi \)

The remainder of the paper is organized as follows: We first provide an overview of our learning theory. In Sect. 2, we present a practical learning algorithm with theory-guided optimal settings for the choice of the hypothesis spaces and with a practical assessment of the learning results. We then demonstrate the effectiveness of the algorithm on prototype systems, including a stochastic model for opinion dynamics and a stochastic Lennard-Jones model, in Sect. 5. We establish a systematic learning theory analyzing the performance of the MLE, considering continuous-time observations in Sect. 3 and discrete-time observations in Sect. 4. Detailed proofs are presented in the Appendix.

2 Nonparametric Inference of the Interaction Kernel

We present in this section the nonparametric technique we study for the inference of the interaction kernel and the corresponding algorithms. We discuss the assessment of the performance of the estimator and its performance in trajectory prediction. The proposed estimator is based on maximum likelihood estimation on data-adaptive hypothesis spaces so as to achieve the optimal rate of convergence, guided by our learning theory in Sects. 3–4.

2.1 The Maximum Likelihood Estimator

As a variational approach, we set the error functional to be the negative log-likelihood of the data \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) and compute the maximum likelihood estimator (MLE).

The error functional. Recall that by the Girsanov theorem, for a continuous trajectory \(\varvec{X}_{[0,T]}\), its negative log-likelihood ratio between the measure induced by system (1.1), with an admissible kernel \(\phi \), and the Wiener measure is

$$\begin{aligned} {\mathcal {E}}_{\varvec{X}_{[0,T]}}(\phi )&=\frac{1}{2\sigma ^2TN}\int _{0}^T \left( \Vert \varvec{f}_{\phi }(\varvec{X}_t)\Vert ^2\, \mathrm{d}t - 2 \langle \varvec{f}_{\phi }(\varvec{X}_t), \mathrm{d}\varvec{X}_t \rangle \right) . \end{aligned}$$
(2.1)

As we do not know the interaction kernel \(\phi \) that generated the trajectory \(\varvec{X}_{[0,T]}\), we can let \(\varphi \) be any admissible interaction kernel and, upon replacing \(\phi \) by \(\varphi \) in (2.1), observe that \({\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )\) is the negative log-likelihood of seeing the trajectory \(\varvec{X}_{[0,T]}\) if system (1.1) were driven by the interaction kernel \(\varphi \). In this case, \({\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )\) may be interpreted as an error functional, which we wish to minimize over \(\varphi \) in order to obtain an estimator for \(\phi \).

Given only discrete-time observations \(\varvec{X}_{t_0:t_L}\), where \((t_l = l{\varDelta }t, l=0,\dots ,L)\) with \({\varDelta }t = T/L\) (the case of non-equispaced-in-time observation is a straightforward generalization), the error functional \({\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )\) may be approximated as

$$\begin{aligned} {\mathcal {E}}_{ \varvec{X}_{t_1:t_L}}(\varphi )&:= \frac{ 1}{2\sigma ^2 TN} \sum _{l=0}^{L-1} \left( \Vert \varvec{f}_{\varphi }(\varvec{X}_{t_l})\Vert ^2 {\varDelta }t-2 \langle \varvec{f}_{\varphi }(\varvec{X}_{t_l}), \varvec{X}_{t_{l+1}}- \varvec{X}_{t_{l}} \rangle \right) . \end{aligned}$$
(2.2)

The corresponding approximate likelihood is equivalent to the likelihood based on the Euler–Maruyama (EM) scheme (whose transition probability density is Gaussian):

$$\begin{aligned} \varvec{X}_{t_{l+1}} = \varvec{X}_{t_{l}} + \varvec{f}_\varphi (\varvec{X}_{t_{l}}) {\varDelta }t + \sigma \sqrt{{\varDelta }t} {\mathbf {W}}_l\,,\quad {\mathbf {W}}_l\sim {\mathcal {N}}(0, I_{Nd\times Nd}) \end{aligned}$$
(2.3)

Note that while higher-order approximations of the stochastic integral (or, equivalently, approximations based on higher-order numerical schemes) may be more accurate than the EM scheme, they lead to nonlinear optimization problems in the computation of the MLE defined below, and we shall therefore avoid them. The EM-based approximation preserves the quadratic form of the error functional and leads to an optimization problem that can be solved by least squares. As we show in Theorem 4.2, this discrete-time approximation leads to an error term of order \({\varDelta }t^{1/2}\) in the MLE, which will be small in the regime on which we focus in this work.
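To make the setting concrete, the following minimal sketch (in Python; the function names, the kernel `phi`, and the array shapes are our own illustrative choices, not notation from the paper) generates discrete-time observations of system (1.1) with the Euler–Maruyama scheme (2.3):

```python
import numpy as np

def drift(X, phi):
    """Drift f_phi of (1.1) for one snapshot X of shape (N, d)."""
    diff = X[None, :, :] - X[:, None, :]      # diff[i, j] = x_j - x_i
    r = np.linalg.norm(diff, axis=-1)         # pairwise distances, shape (N, N)
    w = np.zeros_like(r)
    mask = r > 0                              # the i' = i term vanishes since diff = 0 there
    w[mask] = phi(r[mask])
    return (w[:, :, None] * diff).sum(axis=1) / X.shape[0]

def euler_maruyama(X0, phi, sigma, dt, L, rng):
    """Simulate (2.3) for L steps; returns an array of shape (L+1, N, d)."""
    X = np.empty((L + 1,) + X0.shape)
    X[0] = X0
    for l in range(L):
        W = rng.standard_normal(X0.shape)
        X[l + 1] = X[l] + drift(X[l], phi) * dt + sigma * np.sqrt(dt) * W
    return X

# Example: M = 20 short trajectories with an illustrative kernel (not one from Sect. 5).
rng = np.random.default_rng(0)
phi = lambda r: np.exp(-r)
trajs = np.stack([euler_maruyama(rng.uniform(0, 5, size=(10, 2)), phi,
                                 sigma=0.1, dt=0.01, L=50, rng=rng) for _ in range(20)])
```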

Since the observed discrete-time trajectories \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) are independent, as the \(\varvec{X}_{t_0}^{(m)}\)’s are drawn i.i.d. from \(\mu _0\), the joint likelihood of the trajectories is the product of the likelihoods of each trajectory. Therefore, the corresponding empirical error functional is defined to be

$$\begin{aligned} {\mathcal {E}}_{L,T,M}(\varphi ):=\frac{1}{M}\sum _{m=1}^{M}{\mathcal {E}}_{ \varvec{X}_{t_1:t_L}^{(m)}}(\varphi ). \end{aligned}$$
(2.4)

A regularized Maximum Likelihood Estimator. The regularized MLE we consider is a minimizer of the above empirical error functional over a suitable hypothesis space \({\mathcal {H}}\):

$$\begin{aligned} {\widehat{\phi }}_{L,T,M,{\mathcal {H}}}= \underset{\varphi \in {\mathcal {H}}}{{\text {arg}}{\text {min}}}\;{\mathcal {E}}_{L,T,M}(\varphi ). \end{aligned}$$
(2.5)

This regularized MLE is well defined when the minimizer exists and is unique over \({\mathcal {H}}\). We call this MLE “regularized” to emphasize the constraint \(\varphi \in {\mathcal {H}}\) and the fact that \({\mathcal {H}}\) will change with M, as in nonparametric statistics, although this naming is somewhat nonstandard. We shall discuss the uniqueness of the minimizer in Sect. 3.1, where we show it is guaranteed by a coercivity condition. When the hypothesis space \({\mathcal {H}}\) is a finite-dimensional linear space, say \({\mathcal {H}}=\mathrm {span} \{\psi _i\}_{i=1}^n\) with basis functions \(\psi _i: {\mathbb {R}}^+\rightarrow {\mathbb {R}}\), the regularized MLE is the solution of a least squares problem. To see this, letting \(\varphi = \sum _{i=1}^{n} a(i) \psi _i\) and \(a :=(a(1),\dots ,a(n))\in {\mathbb {R}}^n\), we have \(\varvec{f}_{\varphi }(\varvec{X}) =\sum _{i= 1}^n a(i) \varvec{f}_{\psi _i}(\varvec{X}) \), due to the linear dependence of \(\varvec{f}_\varphi \) on \(\varphi \). Then, we can write the error functional in Eq.(2.2) for each trajectory as

$$\begin{aligned} {\mathcal {E}}_{\varvec{X}^{(m)}_{t_1:t_L}}(a) := {\mathcal {E}}_{\varvec{X}^{(m)}_{t_1:t_L}}(\varphi ) = a^T A^{(m)} a - 2\, a^T b^{(m)} , \end{aligned}$$

where the matrix \(A^{(m)} \in {\mathbb {R}}^{n\times n}\) and the vector \(b^{(m)}\in {\mathbb {R}}^{n}\) are given by

$$\begin{aligned} \begin{aligned} A^{(m)}(i,i')&= \frac{ 1}{2\sigma ^2LN} \sum _{l=0}^{L-1} \langle \varvec{f}_{\psi _i}(\varvec{X}_{t_{l}}^{(m)} ), \varvec{f}_{\psi _{i'}}(\varvec{X}_{t_{l}}^{(m)})\rangle , \quad \\ b^{(m)} (i)&= \frac{ 1}{2\sigma ^2 L{\varDelta }t N} \sum _{l=0}^{L-1} \langle \varvec{f}_{\psi _i}(\varvec{X}_{t_l}^{(m)}), \varvec{X}_{t_{l+1}}^{(m)}- \varvec{X}_{t_{l}}^{(m)} \rangle . \end{aligned} \end{aligned}$$
(2.6)

Hence, setting \(\nabla {\mathcal {E}}_{L,T,M}=0\) for the error functional in (2.4), we obtain the normal equations, whose solution is \({\widehat{a}}_{L,T,M,{\mathcal {H}}}\):

$$\begin{aligned}&A_{M,L} {\widehat{a}}_{L,T,M,{\mathcal {H}}} =b_{M,L}, \text { where } \nonumber \\&A_{M,L}:=\frac{1}{M} \sum _{m=1}^{M} A^{(m)},\, b_{M,L}:=\frac{1}{M} \sum _{m=1}^{M} b^{(m)} \end{aligned}$$
(2.7)

and the corresponding MLE for the interaction kernel:

$$\begin{aligned} {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} = \sum _{i=1}^{n} {\widehat{a}}_{L,T,M,{\mathcal {H}}}(i) \psi _i. \end{aligned}$$
(2.8)

The normal equations (2.7) are solved by least squares, so the solution always exists. We will show in Sect. 4 that assuming a coercivity condition, the matrix \(A_{M,L}\in {\mathbb {R}}^{n\times n}\) is invertible with high probability when M and L are large, so the least squares estimator is the unique solution to the normal equations, and the regularized MLE is the unique minimizer of the empirical error functional over \({\mathcal {H}}\).

2.2 Dynamics-Adapted Measures and Function Spaces

We will assess the estimation error in a suitable function space: \(L^2({\mathbb {R}}^+,\rho _T)\). Here \(\rho _T\) is the distribution of pairwise distances between all particles:

$$\begin{aligned} \rho _T (\mathrm{d}r)&:= \frac{1}{\left( {\begin{array}{c}N\\ 2\end{array}}\right) T}\int _{t = 0}^T\bigg [\sum _{i,i'=1, i< i' }^N {\mathbb {E}}[\delta _{r_{ii'}(t)}(\mathrm{d}r)] \, \mathrm{d}t\bigg ], \end{aligned}$$
(2.9)

where \(\delta \) is the Dirac \(\delta \)-distribution, so that \({\mathbb {E}}[\delta _{r_{ii'}(t)}(\mathrm{d}r)]\) is the distribution of the random variable \(r_{ii'}(t)= || \varvec{x}_{i,t} -\varvec{x}_{i',t} ||\), with \(\varvec{x}_{i,t}\) being the position of particle i at time t. Here the expectation is taken over the distribution \(\mu _0\) of initial conditions and realizations of the system. The probability measure \(\rho _T\) depends on both \(\mu _0\) and the measure determining the random noise on the path space, while it is independent of the observed data. The measure \(\rho _T\) encodes the information about the dynamics marginalized to pairwise distances; regions with large \(\rho _T\)-measure correspond to pairwise distances between particles that are often encountered during the dynamics.

With observations of M trajectories at L discrete times each, we introduce a corresponding empirical measure

$$\begin{aligned} \rho ^{L,M}_T(\mathrm{d}r)&:= \frac{1}{\left( {\begin{array}{c}N\\ 2\end{array}}\right) L M}\sum _{l = 0, m=1}^{L - 1, M} \bigg [\sum _{i,i'=1, i< i' }^N\delta _{r^{(m)}_{ii'}(t_l)}(\mathrm{d}r) \bigg ]\,, \end{aligned}$$
(2.10)

where \(r^{(m)}_{ii'}(t)= || \varvec{x}^{(m)}_{i,t} -\varvec{x}^{(m)}_{i',t} ||\) is computed from the m-th observed trajectory. We think of \(\rho ^{L,M}_T\) as an approximation to \(\rho _T\) in two distinct ways: in L, because as \(L\rightarrow +\infty \) the observations become continuous in time, and in M, because as \(M\rightarrow +\infty \) the empirical measure over the M independent trajectories approximates the expectation in (2.9).
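A sketch of how \(\rho ^{L,M}_T\) can be tabulated in practice (a simple histogram; the function names and binning are illustrative choices, not from the paper):

```python
import numpy as np

def pairwise_distances(X):
    """All N(N-1)/2 pairwise distances of one snapshot X of shape (N, d)."""
    diff = X[None, :, :] - X[:, None, :]
    r = np.linalg.norm(diff, axis=-1)
    return r[np.triu_indices(X.shape[0], k=1)]

def empirical_rho(trajs, n_bins=100):
    """Histogram approximation of rho_T^{L,M} in (2.10).

    trajs: array of shape (M, L+1, N, d); only the snapshots t_0, ..., t_{L-1} enter (2.10)."""
    r_all = np.concatenate([pairwise_distances(X) for traj in trajs for X in traj[:-1]])
    density, edges = np.histogram(r_all, bins=n_bins, density=True)
    return density, edges
```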

Accuracy of the estimator. We measure the accuracy of our estimator \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) by the quantity

$$ \Vert ({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}-\phi )(\cdot )\cdot \Vert _{L^2({\mathbb {R}}^+, \rho _T)}. $$

The function \(\phi (\cdot )\cdot \), which at \(r\in {\mathbb {R}}_+\) takes the value \(\phi (r)r\), appears naturally in our learning theory in Sect. 3 (instead of \(\phi \) itself), fundamentally because it is the derivative of the pairwise distance potential \({\varPhi }\) in (1.2). We define

$$\begin{aligned} {\left| \left| \left| \varphi \right| \right| \right| }:= \Vert \varphi (\cdot )\cdot \Vert _{L^2(\rho _T)}= \left( \int _{{\mathbb {R}}^+} |\varphi (r)r|^2\rho _T(\mathrm{d}r) \right) ^{1/2}. \end{aligned}$$
(2.11)

Then, the mean squared error (MSE) of the estimator is

$$\begin{aligned} {\mathbb {E}}{\left| \left| \left| {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} -\phi \right| \right| \right| }^2. \end{aligned}$$
(2.12)
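Since \(\rho _T\) is not known in closed form, in practice the norm (2.11) and the MSE (2.12) can be approximated by Monte Carlo using pairwise-distance samples collected as in (2.10); a minimal sketch (assuming vectorized callables for the kernels; the argument names are ours):

```python
import numpy as np

def weighted_L2_error(phi_hat, phi_true, r_samples):
    """Monte Carlo approximation of |||phi_hat - phi||| in (2.11): the L^2(rho_T) norm of
    (phi_hat - phi)(r) * r, with rho_T replaced by samples of pairwise distances
    collected from trajectories as in (2.10)."""
    diff = (phi_hat(r_samples) - phi_true(r_samples)) * r_samples
    return np.sqrt(np.mean(diff ** 2))
```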

2.3 Hypothesis Spaces and Nonparametric Estimators

As is standard in nonparametric estimation, we let the hypothesis space \({\mathcal {H}}\) grow in dimension with the number of observations, avoiding under- or over-parameterization and leading to consistent estimators that in fact reach an optimal min–max rate of convergence (see, e.g., [20, 21, 23]).

Similar to [40, 41], we set the basis functions \(\{\psi _i\}_{i=1}^n\) to be piecewise polynomials on a partition of the support of the density function of the empirical probability measure \(\rho ^{L,M}_T\).

Guided by the optimal rate convergence results in Sect. 3, we will set the dimension of the hypothesis space \({\mathcal {H}}\) to be

$$\begin{aligned} n=C ( M/\log {M})^{1/(2s+1)}, \end{aligned}$$
(2.13)

where the number s is the Hölder index of continuity of the basis functions, and it is to be chosen according to the regularity of the true kernel. When T is large and when the relative position system is ergodic, we set

$$\begin{aligned} n=C ( N_{ess}/\log {N_{ess}})^{1/(2s+1)}, \end{aligned}$$

where \(N_{ess}:= M \frac{T}{\tau }\), with \(\tau \) denoting the auto-correlation time of the system, is the effective sample size of the data. Here the auto-correlation time \(\tau \) is the equivalent of the mixing time for a reversible ergodic Markov chain [35].

We estimate the auto-correlation time by the sum of the temporal auto-correlation function of a pairwise distance \(r_{i,j}\). (We refer to [51] for a detailed discussion of the estimation of the auto-correlation time, which is a whole subject by itself.)
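As an illustration of these data-adaptive choices, a minimal sketch for the piecewise-constant case \(s=1\) (the constant `C`, the uniform partition, and the function names are illustrative assumptions, not prescriptions from the paper):

```python
import numpy as np

def hypothesis_dimension(M, s, C=1.0):
    """Dimension choice (2.13): n proportional to (M / log M)^(1/(2s+1))."""
    return max(1, int(np.ceil(C * (M / np.log(M)) ** (1.0 / (2 * s + 1)))))

def piecewise_constant_basis(r_min, r_max, n):
    """n piecewise-constant basis functions on a uniform partition of the (estimated)
    support [r_min, r_max] of the empirical measure rho_T^{L,M}."""
    edges = np.linspace(r_min, r_max, n + 1)
    def indicator(k):
        return lambda r: ((r >= edges[k]) & (r < edges[k + 1])).astype(float)
    return [indicator(k) for k in range(n)]
```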

We will prove bounds, holding with high probability, on the mean squared error (MSE) of the MLE \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}_{n_M}}\) in (2.8) for large M, for a fixed time T, and for suitable hypothesis spaces \({\mathcal {H}}_{n_M}\) with dimension \(n_M\) as in (2.13). When continuous-time trajectories are observed, the MSE is of the order \((\frac{\log M}{ M})^{\frac{2s}{2s+1}}\) with high probability, according to Theorem 3.2, and so is its expectation. In particular, this avoids the curse of dimensionality of the state space (of dimension dN). When the observations are discrete-time trajectories with observation gap \({\varDelta }t\), the error is of the order \((\frac{\log M}{ M})^{\frac{2s}{2s+1}} + {\varDelta }t^{1/2}\) with high probability, according to Theorem 4.2.

2.4 Algorithmic and Computational Considerations

Algorithm 1

The algorithm is summarized in Algorithm 1. Note that the normal matrices \(\{A^{(m)} \}\) and vectors \(\{b^{(m)} \}\) are defined trajectory-wise and therefore may be computed in parallel. When the size of the system is large (i.e., \(dN\) is large), this allows one to accelerate the computation of the estimator by assembling these normal matrices and vectors for each trajectory in parallel, and then updating the normal matrix \(A_{M,L}\) and vector \(b_{M,L}\). The total computational cost of constructing our estimator, given P CPUs, is \(O(L\frac{N^2d}{P}Mn^2+n^3)\). This becomes \(O(L\frac{N^2d}{P}M^{1+\frac{1}{2s+1}}+CM^{\frac{3}{2s+1}})\) when n is chosen optimally according to Theorem 3.2 and \(\phi \) is at least in \(C^{1,1}\) (corresponding to the index of regularity \(s\ge 2\) in the theorem).
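A minimal sketch of this procedure (reusing the `drift` helper from the earlier simulation sketch and the normalization in (2.6)–(2.7); all function and variable names are illustrative):

```python
import numpy as np

def assemble_normal_equations(traj, basis, sigma, dt):
    """Per-trajectory A^{(m)} and b^{(m)} as in (2.6).

    traj: array of shape (L+1, N, d); basis: list of n scalar functions psi_i;
    `drift` is the helper from the simulation sketch above."""
    L, N = traj.shape[0] - 1, traj.shape[1]
    n = len(basis)
    A, b = np.zeros((n, n)), np.zeros(n)
    for l in range(L):
        F = np.stack([drift(traj[l], psi) for psi in basis])   # f_{psi_i}(X_{t_l}), shape (n, N, d)
        dX = traj[l + 1] - traj[l]
        A += np.einsum('pij,qij->pq', F, F) / (2 * sigma**2 * L * N)
        b += np.einsum('pij,ij->p', F, dX) / (2 * sigma**2 * L * dt * N)
    return A, b

def fit_kernel(trajs, basis, sigma, dt):
    """Average the per-trajectory systems as in (2.7) and solve by least squares;
    the loop over trajectories below is embarrassingly parallel."""
    pairs = [assemble_normal_equations(traj, basis, sigma, dt) for traj in trajs]
    A_ML = np.mean([A for A, _ in pairs], axis=0)
    b_ML = np.mean([b for _, b in pairs], axis=0)
    a_hat, *_ = np.linalg.lstsq(A_ML, b_ML, rcond=None)
    return a_hat   # coefficients of the estimator (2.8) in the given basis
```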

2.5 Accuracy of Trajectory Prediction

One application of estimating the interaction kernel from data is to perform predictions of the dynamics. Given an estimator, the following proposition bounds its accuracy in predicting the trajectories of the system driven by the true interaction kernel:

Proposition 2.1

Let \({\widehat{\phi }} \in {\mathcal {K}}_{R,S}\) be an estimator of the true kernel \(\phi \) , where \( {\mathcal {K}}_{R,S}\) is the admissible set defined in (1.7). Denote by \(\widehat{\varvec{X}}_t\) and \(\varvec{X}_t\) the solutions of the systems with kernels \({\widehat{\phi }}\) and \(\phi \), respectively, starting from the same initial condition and with the same random noise. Then, we have

$$\begin{aligned} \sup _{t\in [0,T]} {\mathbb {E}}\left[ \frac{1}{N} \Vert \widehat{\varvec{X}}_t- \varvec{X}_t\Vert ^2\right] \le 2T^2 e^{8T^2 (R+1)^2S^2} {\left| \left| \left| \widehat{\phi }-\phi \right| \right| \right| }^2 \end{aligned}$$

where the measure \(\rho _T\) is defined by (2.9).

We postpone the proof to Sect. A.3.
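The left-hand side of Proposition 2.1 can be probed numerically by coupling the two systems through the same Brownian increments; a hedged sketch (Euler–Maruyama steps as in (2.3), reusing the `drift` helper from the earlier sketch; names are illustrative):

```python
import numpy as np

def prediction_gap(X0, phi_true, phi_hat, sigma, dt, L, rng):
    """One realization of (1/N)||X_hat_t - X_t||^2 at the L observation times,
    with both systems driven by the same Brownian increments."""
    X, Xh = X0.copy(), X0.copy()
    gaps = np.empty(L)
    for l in range(L):
        dW = np.sqrt(dt) * rng.standard_normal(X0.shape)
        X = X + drift(X, phi_true) * dt + sigma * dW
        Xh = Xh + drift(Xh, phi_hat) * dt + sigma * dW
        gaps[l] = np.sum((Xh - X) ** 2) / X0.shape[0]
    return gaps

# Averaging `gaps` over many initial conditions and noise realizations, and then taking
# the maximum over l, gives a Monte Carlo estimate of the left-hand side of
# Proposition 2.1 restricted to the observation times.
```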

3 Learning Theory: Continuous-Time Observations

We analyze first the regularized MLE in the case of continuous-time observations \(\{\varvec{X}^{(m)}_{[0,T]}\}_{m=1}^M\). We show that under a coercivity condition, the regularized MLE is consistent, and that with proper choice of the hypothesis spaces, we can achieve an optimal learning rate \((\frac{\log M}{ M})^{\frac{2s}{2s+1}}\).

Recall from Eq.(2.1) that

$$\begin{aligned} {\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )&:=\frac{1}{2\sigma ^2TN}\int _{0}^T \left( \Vert \varvec{f}_{\varphi }(\varvec{X}_t)\Vert ^2\, \mathrm{d}t - 2 \langle \varvec{f}_{\varphi }(\varvec{X}_t), \mathrm{d}\varvec{X}_t \rangle \right) \end{aligned}$$

is the negative log-likelihood of a trajectory \(\varvec{X}_{[0,T]}\), with respect to the measure induced by the system with interaction kernel \(\varphi \). Then, the negative log-likelihood of independent trajectories \(\{\varvec{X}^{(m)}_{[0,T]}\}_{m=1}^M\) is

$$\begin{aligned} {\mathcal {E}}_{T,M}(\varphi ):=\frac{1}{M}\sum _{m=1}^{M}{\mathcal {E}}_{ \varvec{X}_{[0,T]}^{(m)}}(\varphi )\,, \end{aligned}$$
(3.1)

and the regularized MLE over a hypothesis space \({\mathcal {H}}\) is

$$\begin{aligned} {\widehat{\phi }}_{T,M,{\mathcal {H}}} \in \underset{\varphi \in {\mathcal {H}}}{{\text {arg}}{\text {min}}}\;{\mathcal {E}}_{T,M}(\varphi )\,. \end{aligned}$$
(3.2)

The existence of the minimizer follows from the fact that the error functional \({\mathcal {E}}_{T,M}(\varphi )\) is quadratic in \(\varphi \), which in turn is a consequence of the linearity of \(\varvec{f}_\varphi \) in \(\varphi \). The uniqueness of the minimizer, however, requires a coercivity condition and is related to the learnability of the kernel, which we discuss in the next section.

3.1 Identifiability and Learnability: A Coercivity Condition

The uniqueness of the minimizer of the error functional \({\mathcal {E}}_{T,M}(\varphi )\) over the hypothesis space ensures that the kernel is identifiable. This is not guaranteed, even when the number of observed trajectories is infinite: denote

$$\begin{aligned} {\mathcal {E}}_{T,\infty }(\varphi ):={\mathbb {E}} {\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi ) = \lim _{M\rightarrow \infty } {\mathcal {E}}_{T,M}(\varphi )\quad a.s.~, \end{aligned}$$
(3.3)

where \({\mathbb {E}}\) here, and in all that follows unless otherwise indicated, is the expectation over initial conditions, independently sampled from \(\mu _0\), and over the Wiener measure underlying the random noise. Using \(\mathrm{d}\varvec{X}_t=\varvec{f}_{\phi }(\varvec{X}_t)\mathrm{d}t+\sigma \mathrm{d}\varvec{B}_t\) and the fact that the stochastic integral has zero expectation, we observe that

$$\begin{aligned} {\mathcal {E}}_{T,\infty }(\varphi )-{\mathcal {E}}_{T,\infty }(\phi )&={\mathbb {E}}\bigg [ {\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )-{\mathcal {E}}_{T,\infty }(\phi ) \bigg ] =\frac{1}{2\sigma ^2NT}{\mathbb {E}}\int _{0}^{T}\Vert {\mathbf {f}}_{\varphi -\phi }(\varvec{X}_t)\Vert ^2\mathrm{d}t\,. \end{aligned}$$
(3.4)

Only when \({\mathbb {E}}\int _{0}^{T} \Vert {\mathbf {f}}_{\varphi -\phi }(\varvec{X}_t)\Vert ^2 \mathrm{d}t > 0\) for any \(\varphi -\phi \ne 0\) can one ensure the uniqueness of the minimizer. This motivates us to propose the following coercivity condition, introduced in [9] in the case of non-stochastic systems:

Definition 3.1

(Coercivity condition) We say that the stochastic system defined in (1.1) satisfies a coercivity condition on a set \({\mathcal {H}}\) of functions on \({\mathbb {R}}_+\), with a constant \(c_{{\mathcal {H}}}>0\), if

$$\begin{aligned} c_{{\mathcal {H}}} {\left| \left| \left| \varphi \right| \right| \right| }^2 \le \frac{1}{2\sigma ^2NT}\int _0^T\sum _{i=1}^{N}{\mathbb {E}}\bigg [ \big \Vert \frac{1}{N}\sum _{i'= 1}^{N} \varphi (r_{ii'}(t))\varvec{r}_{ii'}(t) \big \Vert ^2\bigg ]\mathrm{d}t \end{aligned}$$
(3.5)

for all \(\varphi \in {\mathcal {H}}\) such that \(\varphi (\cdot )\cdot \in L^2(\rho _T)\). Here \({\left| \left| \left| \cdot \right| \right| \right| }\) denotes the norm defined in (2.11). We will denote by \(c_{{\mathcal {H}}}\) the largest constant for which the inequality holds and call it the coercivity constant.

The coercivity condition ensures identifiability of the kernel. We emphasize that the kernel is latent, in the sense that its values at \(\{r_{ii'}=\Vert \varvec{x}_{i'}-\varvec{x}_i\Vert \}\) cannot be determined from the data. In fact, to recover \((\phi (r_{ii'}) ) \in {\mathbb {R}}^{\frac{N(N-1)}{2}}\) from the observed trajectories, even if we ignore the stochastic noise in the system and assume to have access to \({\mathbf {f}}_\phi (\varvec{x})\in {\mathbb {R}}^{dN}\), which consists of linear combinations of \((\phi (r_{ii'}) )\) with coefficients \(\varvec{r}_{ii'}=\varvec{x}_{i'}-\varvec{x}_i\), we face a linear system that is underdetermined as soon as dN (=number of known quantities) \(< \frac{N(N-1)}{2}\) (=number of unknowns), i.e., for \(d<(N-1)/2\). Thus, in general the exact values of \(\phi \) at the locations \(\{r_{ii'}\}_{i,i'}\) cannot be determined. Furthermore, we have stochastic noise in the system. This suggests that the inverse problem of estimating the interaction kernel in a space of continuous functions is ill-posed. We will see that the coercivity condition ensures well-posedness in \(L^2(\rho _T)\), both in the sense of uniqueness and in the sense of stability.

The coercivity condition plays a key role in the learning of the kernel. Beyond ensuring learnability by guaranteeing the uniqueness of the minimizer over any compact convex set, it also enables us to control the error of the estimator by the discrepancy between the expected error functionals, as shown in Proposition 3.1. We will use this property to establish the convergence of the estimators in later sections.

To simplify notation, we define a bilinear functional product over \({\mathcal {H}}\) by

$$\begin{aligned} \langle \langle {\varphi _1, \varphi _2}\rangle \rangle :=\frac{1}{2\sigma ^2NT}{\mathbb {E}}\int _{0}^{T}\langle {\mathbf {f}}_{\varphi _1}(\varvec{X}_t), {\mathbf {f}}_{\varphi _2}(\varvec{X}_t) \rangle \mathrm{d}t, \quad \forall \varphi _1, \varphi _2\in {\mathcal {H}}. \end{aligned}$$
(3.6)

Proposition 3.1

Let \({\mathcal {H}}\) be a compact convex subset of \(L^{2}({\mathbb {R}}^+,\rho _T)\) and assume the coercivity condition (3.5) holds true on \({\mathcal {H}}\) with constant \(c_{{\mathcal {H}}}\). Then, the error functional \({\mathcal {E}}_{T,\infty }\) defined in (3.3) has a unique minimizer over \({\mathcal {H}}\) in \(L^2(\rho _T)\):

$$\begin{aligned} \widehat{\phi }_{T,\infty , {\mathcal {H}}}=\underset{\varphi \in {\mathcal {H}}}{{\text {arg}}{\text {min}}}\; {\mathcal {E}}_{T,\infty }(\varphi )\,. \end{aligned}$$
(3.7)

Moreover, for all \(\varphi \in {\mathcal {H}}\),

$$\begin{aligned} {\mathcal {E}}_{T,\infty }(\varphi )- {\mathcal {E}}_{T,\infty }(\widehat{\phi }_{T,\infty , {\mathcal {H}}}) \ge c_{{\mathcal {H}}} {\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2. \end{aligned}$$
(3.8)

Proof

From Eq. (3.4), we have \({\mathcal {E}}_{T,\infty }(\varphi )-{\mathcal {E}}_{T,\infty }(\phi )=\langle \langle {\varphi -\phi , \varphi -\phi }\rangle \rangle \). Then,

$$\begin{aligned}&{\mathcal {E}}_{T,\infty }(\varphi )- {\mathcal {E}}_{T,\infty }(\widehat{\phi }_{T,\infty ,{\mathcal {H}}}) \\&\quad = \langle \langle {\varphi -\phi , \varphi -\phi }\rangle \rangle -\langle \langle {\widehat{\phi }_{T,\infty ,{\mathcal {H}}}-\phi , \widehat{\phi }_{T,\infty ,{\mathcal {H}}}-\phi }\rangle \rangle \\&\quad = \langle \langle { \varphi -\widehat{\phi }_{T,\infty ,{\mathcal {H}}}, \varphi +\widehat{\phi }_{T,\infty ,{\mathcal {H}}}-2\phi }\rangle \rangle \\&\quad = \langle \langle {\varphi -\widehat{\phi }_{T,\infty ,{\mathcal {H}}}, \varphi -\widehat{\phi }_{T,\infty ,{\mathcal {H}}} }\rangle \rangle + 2\langle \langle {\varphi -\widehat{\phi }_{T,\infty ,{\mathcal {H}}}, \widehat{\phi }_{T,\infty ,{\mathcal {H}}}-\phi }\rangle \rangle \\&\quad \ge c_{{\mathcal {H}}} {\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2 + 2\langle \langle {\varphi -\widehat{\phi }_{T,\infty ,{\mathcal {H}}}, \widehat{\phi }_{T,\infty ,{\mathcal {H}}}-\phi }\rangle \rangle , \end{aligned}$$

where the inequality follows from the coercivity condition. Then, Eq.(3.8) follows once we notice that \(\langle \langle {\varphi -\widehat{\phi }_{T, \infty ,{\mathcal {H}}}, \widehat{\phi }_{T,\infty ,{\mathcal {H}}}-\phi }\rangle \rangle \ge 0\) by the convexity of \({\mathcal {H}}\). In fact, since \(t\varphi +(1-t) \widehat{\phi }_{T,\infty ,{\mathcal {H}}} \in {\mathcal {H}}\), \(\forall t \in [0, 1]\), we have \({\mathcal {E}}_{T,\infty }(t\varphi +(1-t) \widehat{\phi }_{T,\infty ,{\mathcal {H}}} )- {\mathcal {E}}_{T,\infty }(\widehat{\phi }_{T,\infty ,{\mathcal {H}}}) \ge 0\) since \(\widehat{\phi }_{T,\infty ,{\mathcal {H}}}\) is a minimizer, and so, equivalently,

$$\begin{aligned}&t\langle \langle { \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}, t\varphi +(2-t)\widehat{\phi }_{T,\infty ,{\mathcal {H}}}-2\phi }\rangle \rangle \ge 0 \\&\quad \Leftrightarrow \langle \langle {\varphi -\widehat{\phi }_{T,\infty ,{\mathcal {H}}}, t\varphi +(2-t)\widehat{\phi }_{T,\infty , {\mathcal {H}}}-2\phi }\rangle \rangle \ge 0. \end{aligned}$$

Sending \(t \rightarrow 0^+\), we obtain \(\langle \langle {\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}, 2\widehat{\phi }_{T,\infty , {\mathcal {H}}}-2\phi }\rangle \rangle \ge 0.\) \(\square \)

Well-conditioning from coercivity. When the hypothesis space \({\mathcal {H}}\) is a finite-dimensional linear space, the coercivity constant provides a lower bound for the smallest eigenvalue of the limit of the normal equations matrix \(A_{M,L}\) in Eq.(2.7) as \(M,L\rightarrow +\infty \). Therefore, when the sample size M is large and the observation frequency L is high, the matrix \(A_{M,L}\) is invertible with high probability (see Corollary 4.2 for details), and thus the coercivity condition ensures the uniqueness of the regularized MLE in Eq.(2.8):

Proposition 3.2

Suppose that the coercivity condition holds on \({\mathcal {H}}= \mathrm {span}\{\psi _1,\ldots , \psi _n\}\), where the basis functions satisfy \(\langle \psi _p(\cdot )\cdot ,\psi _{p'}(\cdot )\cdot \rangle _{L^2{(\rho _T)}}=\delta _{p,p'}\). Let \(A_{\infty }=\big (\langle \langle {\psi _{p},\psi _{p'}}\rangle \rangle \big )_{p, p'} \in {\mathbb {R}}^{n \times n}\) with the bilinear functional \(\langle \langle {\cdot ,\cdot }\rangle \rangle \) defined in (3.6). Then the smallest eigenvalue of \(A_{\infty }\) is \(\lambda _{\min }(A_{\infty }) = c_{{\mathcal {H}}}\,.\)

Proof

For an arbitrary \(a\in {\mathbb {R}}^n\), denoting \(\psi = \sum _{p=1}^n a_p \psi _p\), we have

$$\begin{aligned} a^T A_\infty a = \langle \langle {\psi ,\psi }\rangle \rangle \ge c_{{\mathcal {H}}}{\left| \left| \left| \psi \right| \right| \right| }^2 = c_{{\mathcal {H}}}\Vert a\Vert ^2 \end{aligned}$$
(3.9)

where the first equality follows from the bilinearity of \(\langle \langle {\cdot ,\cdot }\rangle \rangle \), and the inequality follows from the coercivity condition. Note that by the definition of the coercivity constant in (3.5), we have

$$\begin{aligned} c_{\mathcal {H}}= \inf _{\psi \in {\mathcal {H}}}\frac{\langle \langle {\psi ,\psi }\rangle \rangle }{{\left| \left| \left| \psi \right| \right| \right| }^2} = \inf _{\psi \in {\mathcal {H}}, {\left| \left| \left| \psi \right| \right| \right| } =1} \langle \langle {\psi ,\psi }\rangle \rangle , \end{aligned}$$

which is attained at some \(\psi _* \in {\mathcal {H}}\) since \({\mathcal {H}}\) is finite dimensional. Hence, the inequality in (3.9) becomes an equality for \(\psi _*\), and the smallest eigenvalue of \(A_{\infty }\) is \(c_{\mathcal {H}}\). \(\square \)

Proposition 3.2 suggests that, for the hypothesis space \({\mathcal {H}}\), it is important to choose a basis that is orthonormal in \(L^2(\rho _T)\), so as to make the matrix in the normal equations as well-conditioned as possible given the dynamics. In practice, the unknown \(\rho _T\) is approximated by the empirical measure \(\rho _T^{L,M}\). Therefore, when using local basis functions, it is natural to use a partition of the support of \(\rho _T^{L,M}\).
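In this spirit, a hedged sketch of a simple numerical diagnostic (our own, not an algorithm from the paper): once the basis has been (approximately) orthonormalized with respect to \(\rho _T^{L,M}\), the smallest eigenvalue of the empirical normal matrix \(A_{M,L}\) from (2.7), which converges to \(A_\infty \) as \(M,L\rightarrow \infty \), serves as a proxy for the coercivity constant.

```python
import numpy as np

def coercivity_proxy(A_ML):
    """Smallest eigenvalue of the empirical normal matrix A_{M,L} in (2.7); for a basis
    (approximately) orthonormal in L^2(rho_T), Proposition 3.2 suggests this approximates
    the coercivity constant c_H as M and L grow."""
    return np.linalg.eigvalsh(A_ML).min()
```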

The coercivity condition and positive integral operators. The coercivity condition introduces constraints on the hypothesis spaces and on the distribution of the solutions of the system, and it is therefore natural that it depends on the distribution \(\mu _0\) of the initial condition \(\varvec{X}_0\), the true interaction kernel \(\phi \), and the random noise. We briefly review below the recent developments in [37, 38], where the coercivity condition is proved to hold on compact sets of \(L^2(\rho _T)\) for special classes of systems, such as linear systems and nonlinear systems with a stationary distribution. As discussed in [9, 40, 41] for the deterministic case, we believe that the coercivity condition is “generally” satisfied for “relevant” hypothesis spaces, with a constant independent of the number of particles N, thanks to the exchangeability of both the distribution of the initial conditions and that of the particles at any time t.

The coercivity condition is ensured by the positivity of integral operators that arise from the expectation in Eq.(3.5). More precisely, recall that the drift of the SDE is equivariant under permutations of the particle indices; thus (for exchangeable \(\mu _0\)) the distribution of \(\varvec{X}_t\) is exchangeable and, for distinct indices \(i, i', i''\), one has

$$ {\mathbb {E}}\big [ \varphi (r_{ii'}^t)\varphi (r_{ii''}^t) \langle \varvec{r}_{ii'}^t, \varvec{r}_{ii''}^t\rangle \big ]={\mathbb {E}}[\varphi (\Vert \varvec{r}_{12}^t\Vert )\varphi (\Vert \varvec{r}_{13}^t\Vert )\langle \varvec{r}_{12}^t,\varvec{r}_{13}^t \rangle ]. $$

One can expand the square in Eq.(3.5), multiply both sides by \(2\sigma ^2\), and use exchangeability to rewrite it as

$$\begin{aligned}&\left( 2\sigma ^2 c_{{\mathcal {H}}}-\frac{N-1}{N^2}\right) \Vert \varphi (\cdot )\cdot \Vert _{L^2(\rho _T)}^2 \\&\quad \le \frac{(N-1)(N-2)}{N^2T}\int _0^T {\mathbb {E}}[\varphi (\Vert \varvec{r}_{12}^t\Vert )\varphi (\Vert \varvec{r}_{13}^t\Vert )\langle \varvec{r}_{12}^t,\varvec{r}_{13}^t \rangle ]\mathrm{d}t \\&\quad =\frac{(N-1)(N-2)}{N^2}\int _0^\infty \int _0^\infty \varphi (r)\varphi (s) {\overline{K}}_T(r,s)\mathrm{d}r\mathrm{d}s, \end{aligned}$$

where the integral kernel \({\overline{K}}_T:{\mathbb {R}}^+\times {\mathbb {R}}^+ \rightarrow {\mathbb {R}}\) is defined as

$$\begin{aligned} {\overline{K}}_T(r,s) := (rs)^{d} \int _{S^{d-1}}\int _{S^{d-1}}\langle \xi ,\eta \rangle \frac{1}{T}\int _0^T p_t(r\xi ,s\eta )\mathrm{d}t \mathrm{d}\xi \mathrm{d}\eta , \end{aligned}$$
(3.10)

with \(p_t(u,v)\) denoting the joint density function of the random vector \((\varvec{r}_{12}^t, \varvec{r}_{13}^t)\) and \(S^{d-1}\) denoting the unit sphere in \({\mathbb {R}}^d\). It is shown in [37, 38] that the associated integral operator defined by \({\overline{K}}_T\) is strictly positive definite, and therefore the coercivity condition holds, for a large class of systems with interaction kernels of the form \(\phi (r)= (a+r^\theta )^{\gamma -1}r^{\theta -2}\) with \(a\ge 0\) and \(\{ (\theta , \gamma ) \in (0,1] \times (1,2]: \theta \gamma >1 \}\).

3.2 Consistency and Rate of Convergence of the Estimator

In this section, we consider using a family of finite-dimensional linear spaces \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\} \subset C^{1,1}[0,R]\) as hypothesis spaces and establish the consistency and rate of convergence of our estimators; these are our main theorems for continuous-time observations. We assume that the spaces \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\} \subset C^{1,1}[0,R]\) satisfy a Markov–Bernstein-type inequality: there exist \(c_1, \gamma >0\) such that for all \(\varphi \in {\mathcal {L}}_n\)

$$\begin{aligned} {\Vert \varphi '\Vert _{\infty }}\le c_1\mathrm {dim}({\mathcal {L}}_n)^\gamma {\Vert \varphi \Vert _{\infty } }\,. \end{aligned}$$
(3.11)

This condition has a long history and a rich literature in classical approximation theory, where one studies which function spaces satisfy (3.11) (see, e.g., the survey paper [53]); such inequalities are an important step in establishing inverse approximation theorems. Inequalities of this kind hold on many function spaces that are commonly used as approximation spaces in practice, including:

  • \({\mathcal {L}}_n:\) trigonometric polynomials of degree n on \([0, 2\pi ]\) (similarly on [0, R]), for which \(\Vert \varphi '\Vert _{\infty }\le \frac{1}{2}({\mathrm {dim}({\mathcal {L}}_n)-1})\Vert \varphi \Vert _{\infty }\). This result dates back to Bernstein [5].

  • \({\mathcal {L}}_n\): the polynomial space consisting of all polynomials of degree less than \(n-1\) on [0, R] (see Theorem 3.3 in [48]), for which \(\Vert \varphi '\Vert _{\infty }\le \frac{2}{R}{(\mathrm {dim}({\mathcal {L}}_n)+1)^2}\Vert \varphi \Vert _{\infty }\). As a result, (3.11) also holds true for polynomial splines; other extensions include rational functions. We refer the reader to [30] for details. (A quick numerical sanity check of this type of inequality is sketched below.)
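The numerical sanity check mentioned above (a hedged sketch; the degree, interval, and random test polynomials are arbitrary illustrative choices) compares \(\Vert \varphi '\Vert _\infty /\Vert \varphi \Vert _\infty \) on a fine grid against the classical Markov bound \(\frac{2}{R}(\deg \varphi )^2\) on an interval of length R:

```python
import numpy as np

rng = np.random.default_rng(0)
R, deg = 5.0, 8
r = np.linspace(0.0, R, 4001)
worst_ratio = 0.0
for _ in range(1000):
    coeffs = rng.standard_normal(deg + 1)         # random polynomial of degree `deg` on [0, R]
    p = np.polynomial.Polynomial(coeffs)
    ratio = np.abs(p.deriv()(r)).max() / np.abs(p(r)).max()
    worst_ratio = max(worst_ratio, ratio)
# Markov's inequality: ||p'||_inf <= (2/R) * deg^2 * ||p||_inf on [0, R]
print(worst_ratio, "<=", 2.0 / R * deg**2)
```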

If we choose a compact convex hypothesis set \({\mathcal {H}}_M\) contained in some \({\mathcal {L}}_n\), with a suitable correspondence between n and M, such that the distance between \({\mathcal {H}}_M\) and the true kernel \(\phi \) vanishes as M increases, the following consistency result holds:

Theorem 3.1

(Strong consistency of estimators) Suppose \(\phi \in {\mathcal {K}}_{R,S}\), the admissible set defined in (1.7). Let \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\} \subset C^{1,1}[0,R]\) satisfy (3.11) and

$$ \inf _{\varphi \in {\mathcal {L}}_{n}}\Vert \varphi -\phi \Vert _{\infty } \xrightarrow {n\rightarrow \infty } 0. $$

Let \(S_0 \ge S\) and \({\mathcal {H}}_M={\mathcal {B}}_{2S_0}^{\infty }({\mathcal {L}}_{n_M})=\{\varphi \in {\mathcal {L}}_{n_M}\,:\,||\varphi ||_\infty <2S_0\}\) with \(\mathrm {dim}({\mathcal {L}}_{n_M})=n_M\) and \(\lim _{M\rightarrow \infty }\frac{n_M\log n_M}{M} =0\). Finally, suppose the coercivity condition holds true on \( \cup _{n}{\mathcal {L}}_{n}\). Then, we have

$$ \lim _{M\rightarrow \infty }{\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_M}-\phi \right| \right| \right| } =0 \quad \text { with probability one.} $$

If we know the explicit approximation rate of the family \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\}\), then by carefully choosing the dimension of hypothesis spaces as a function of M, we can obtain a near-optimal rate of convergence of our estimators.

Theorem 3.2

(Convergence rate of estimators) Suppose \(\phi \in {\mathcal {K}}_{R,S}\), the admissible set defined in (1.7). Assume that there exists a sequence of linear spaces \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\} \subset C^{1,1}[0,R]\) satisfying (3.11) with the properties

  1. (i)

    \(\mathrm {dim}({\mathcal {L}}_n)\le c_0 n\) for \(n \in {\mathbb {N}}^+\),

  2. (ii)

    \(\inf _{\varphi \in {\mathcal {L}}_n}\Vert \varphi -\phi \Vert _{\infty }\le c_2n^{-s}\).

For example, when \(\phi \in C^{k,\alpha }\) with \(s=k+\alpha \ge 2\), we may choose \({\mathcal {L}}_n\) to consist of polynomial splines of degree \(\lfloor s-1\rfloor \) with uniform knots on [0, R]. Let \({\mathcal {H}}_n={\mathcal {B}}_{S_0}^{\infty }({\mathcal {L}}_{n})\) with \(S_0=c_2+S\) and \(n \asymp \left( {M}/{\log M}\right) ^{{1}/{(2s+1)}}\), and assume that the coercivity condition holds on \({\mathcal {L}}:=\cup _{n}{\mathcal {L}}_n\) with a constant \(c_{{\mathcal {L}}}>0\). Then, we have

$$ {\mathbb {E}}\left[ {\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_{n}}-\phi \right| \right| \right| }^2\right] \le \frac{C}{c^2_{{{\mathcal {L}}}}} \left( \frac{\log M}{ M}\right) ^{\frac{2s}{2s+1}}\,, $$

where C is a constant depending only on \(\sigma ,N, T, R, S_0\).

It is fruitful to compare (up to log terms) the rate \({2s}/({2s+1})\) to that for nonparametric one-dimensional regression, where one observes directly noisy values of the target function \(\phi \) at sample points drawn i.i.d. from \(\rho _T\). For the function space \(C^{k,\alpha }\), this rate is min–max optimal. Our numerical examples in Sect. 5 empirically validate the desired convergence rate for \(s=1,2\), where we use piecewise constant and piecewise linear polynomials. Note that in our setting, learning \(\phi \) is an inverse problem, as we do not directly observe the values \(\{\phi (\Vert \varvec{x}_{i',t}^{(m)} - \varvec{x}_{i,t}^{(m)}\Vert )\}_{1\le i<i'\le N,\ 1\le m\le M}\). While our result is stated in such a way that knowledge of s is required, in fact an upper bound \({{\tilde{s}}}\) is sufficient, as choosing sufficiently regular splines, of degree \(\lfloor {\tilde{s}}-1\rfloor \), would give the optimal s-dependent rate, at the cost of possibly larger constants. We also remark that we do not require that the underlying stochastic process satisfies certain mixing properties, nor that it starts from a stationary distribution. Obtaining this optimal convergence rate in M for short-time trajectory observations is therefore satisfactory. For long trajectories and under ergodicity assumptions, rates in terms of MT are likely to be obtainable: in Sect. 5, we present numerical evidence suggesting that the error does decrease with MT at a near-optimal rate.

3.3 Proof of the Main Theorems

In what follows, we present the proof of Theorem 3.2, which also yields the proof of Theorem 3.1. The main techniques include the Itô formula, concentration inequalities for unbounded random variables, and a generalization of a novel covering argument in [52] that enables us to deal efficiently with the fluctuations in the data due to the stochastic noise in the dynamics of the system.

One major obstacle in the non-asymptotic analysis of our regularized MLE is the unboundedness of stochastic integrals of the form \(\frac{1}{T}\int _{0}^{T}\langle {\mathbf {f}}_{\varphi }(\varvec{X}_t), \mathrm{d}\varvec{X}_t\rangle \) appearing in the empirical error functional. Unlike the deterministic case \(\sigma =0\), our empirical error functional \({\mathcal {E}}_{T,M}(\cdot )\) is in general not continuous over \({\mathcal {H}}\) with respect to the \(\Vert \cdot \Vert _{\infty }\) norm. In the following, we first leverage the general Itô formula described in Theorem A.3 to obtain a form of the empirical error functional that does not involve a stochastic integral and is amenable to analysis; we then show that it is continuous on \(C^{1,1}([0,R])\) with respect to the \(\Vert \cdot \Vert _{1,1}\) norm. Therefore, in the following preliminary results for the proofs of the main theorems, we consider the following generic hypothesis space:

Assumption 3.3

The hypothesis space \({\mathcal {H}}\) is a convex subset of \(C^{1,1}([0,R])\), compact with respect to the uniform norm \(\Vert \cdot \Vert _\infty \), and bounded in that norm by \(S_0\ge S\).

Lemma 3.1

Suppose \(\varphi \in {\mathcal {K}}_{R,S}\), the admissible set defined in (1.7). Let

$$\begin{aligned} V_{\varphi }(\varvec{X}_t)=\frac{1}{2N}\sum _{i,i'}{\varPsi }(\Vert \varvec{x}_{i,t}-\varvec{x}_{i',t}\Vert )\, \text { with }\, {\varPsi }'(r)=\varphi (r)r\,; \end{aligned}$$
(3.12)

then, we have, almost surely

$$ -(\mathrm{d}V_{\varphi })(\varvec{X}_t)=\langle {\mathbf {f}}_{\varphi }(\varvec{X}_t), \mathrm{d}\varvec{X}_t\rangle -\frac{\sigma ^2}{2N}\sum _{i=1}^N\sum _{i'\ne i}\left( \varphi '(\Vert \varvec{x}_{ii'}\Vert )\Vert \varvec{x}_{ii'}\Vert +\varphi (\Vert \varvec{x}_{ii'}\Vert )d \right) \mathrm{d}t\,. $$

Proof

Let \(g(\varvec{X})=V_{\varphi }(\varvec{X})\). Note that g is \(C^2\), with derivatives

$$\begin{aligned} \frac{\partial {g(\varvec{X})}}{\partial \varvec{x}_i}&=\frac{\partial {V_{\varphi }(\varvec{X})}}{\partial \varvec{x}_i}=\frac{1}{N}\sum _{i'\ne i}\varphi (\Vert \varvec{x}_i-\varvec{x}_{i'}\Vert )(\varvec{x}_{i}-\varvec{x}_{i'}),\\ \frac{\partial ^2{g(\varvec{X})}}{\partial \varvec{x}_i\partial \varvec{x}_{k}}&=\delta _{ki}\frac{1}{N}\sum _{i'\ne i}\left( \varphi (\Vert \varvec{x}_i-\varvec{x}_{i'}\Vert ){\mathbf {I}}_d+\frac{\varphi '(\Vert \varvec{x}_i-\varvec{x}_{i'}\Vert )}{\Vert \varvec{x}_i-\varvec{x}_{i'}\Vert }(\varvec{x}_i-\varvec{x}_{i'})\otimes (\varvec{x}_i-\varvec{x}_{i'})\right) \\&\quad -\delta _{k\ne i}\frac{1}{N}\left( \varphi (\Vert \varvec{x}_i-\varvec{x}_{k}\Vert ){\mathbf {I}}_d+\frac{\varphi '(\Vert \varvec{x}_i-\varvec{x}_{k}\Vert )}{\Vert \varvec{x}_i-\varvec{x}_{k}\Vert }(\varvec{x}_i-\varvec{x}_{k})\otimes (\varvec{x}_i-\varvec{x}_{k})\right) \,. \end{aligned}$$

Using Itô’s formula (Theorem A.3) for the Itô process \(g(\varvec{X}_t)\), the conclusion follows. \(\square \)

Proposition 3.3

Suppose \(\varphi _1, \varphi _2 \in {\mathcal {H}}\). Then it holds almost surely that

$$\begin{aligned} |{\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi _1)-{\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi _2)| \le C_1 \Vert \varphi _1-\varphi _2\Vert _{\infty }+C_2\Vert \varphi _1'-\varphi _2'\Vert _{\infty }, \end{aligned}$$

where \(C_1=\frac{ R^2S_0}{\sigma ^2}+\frac{ R^2 }{2\sigma ^2T}+\frac{d}{2}\) and \(C_2=\frac{R}{2}\).

Proof

Note that

$$\begin{aligned}&{\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi _1)-{\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi _2)\\&\quad =\frac{1}{2\sigma ^2NT}\left( \int _{0}^{T}\underbrace{\langle {\mathbf {f}}_{\varphi _1-\varphi _2}(\varvec{X}_t), {\mathbf {f}}_{\varphi _1+\varphi _2}(\varvec{X}_t)\rangle }_{I_1}\,\mathrm{d}t-2\underbrace{\int _{0}^{T}\langle {\mathbf {f}}_{\varphi _1-\varphi _2}(\varvec{X}_t), \mathrm{d}\varvec{X}_t\rangle }_{I_2}\right) . \end{aligned}$$

\(I_1\) satisfies

$$\begin{aligned} |I_1|&\le \Vert {\mathbf {f}}_{\varphi _1+\varphi _2}(\varvec{X}_t)\Vert \Vert {\mathbf {f}}_{\varphi _1-\varphi _2}(\varvec{X}_t)\Vert \le N\Vert \varphi _1-\varphi _2\Vert _{\infty }\Vert \varphi _1+\varphi _2\Vert _{\infty }R^2, \end{aligned}$$

since \(\Vert {\mathbf {f}}_{\varphi }(\varvec{X}_t)\Vert \le \sqrt{N}R\Vert \varphi \Vert _{\infty }\). For \(I_2\), Lemma 3.1 yields

$$\begin{aligned} |I_2|&\le |V_{\varphi _1-\varphi _2}({\varvec{X}_0})-V_{\varphi _1-\varphi _2}(\varvec{X}_T)|\\&\quad +\frac{\sigma ^2}{2N}\left| \int _{0}^{T}\sum _{i=1}^N\sum _{i'\ne i}((\varphi _1-\varphi _2)'(r_{ii'})r_{ii'}+(\varphi _1-\varphi _2)(r_{ii'})d )\mathrm{d}t\right| \\&\le \frac{(N-1)}{2}\Vert \varphi _1-\varphi _2\Vert _{\infty }R^2+\frac{(N-1)\sigma ^2T}{2}(\Vert \varphi _1'-\varphi _2'\Vert _{\infty }R+\Vert \varphi _1-\varphi _2\Vert _{\infty }d), \end{aligned}$$

where \(r_{ii'}=r_{ii'}(t):=\Vert \varvec{x}_{i,t}-\varvec{x}_{i',t}\Vert \), and we used

$$\begin{aligned} |V_{\varphi }(\varvec{X}_T)-V_{\varphi }({\varvec{X}_0})|&\le \frac{1}{2N}\sum _{ii'}|\int _{r_{ii',0}}^{r_{ii',T}} \varphi (r)r \mathrm{d}r | \le \frac{(N-1)\Vert \varphi \Vert _{\infty }R^2}{2}, \end{aligned}$$

which follows from its definition in (3.12). Combining the estimates for \(I_1\) and \(I_2\), and using \(\Vert \varphi _1 +\varphi _2\Vert _{\infty }\le 2S_0\), we obtain

$$\begin{aligned}&|{\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi _1)-{\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi _2)|\nonumber \\&\quad \le \frac{ R^2}{2\sigma ^2}\Vert \varphi _1-\varphi _2\Vert _{\infty } \left( \Vert \varphi _1+\varphi _2\Vert _{\infty } +\frac{1}{T}\right) \nonumber \\&\qquad +\frac{1}{2}(d\Vert \varphi _1-\varphi _2\Vert _{\infty }+\Vert \varphi _1'-\varphi _2'\Vert _{\infty }R)\nonumber \\&\quad \le C_1\Vert \varphi _1-\varphi _2\Vert _{\infty }+C_2\Vert \varphi _1'-\varphi _2'\Vert _{\infty }, \end{aligned}$$
(3.13)

where \(C_1=\frac{ R^2S_0}{\sigma ^2}+\frac{ R^2 }{2\sigma ^2T}+\frac{d}{2}\) and \(C_2=\frac{R}{2}\). \(\square \)

When \(M=\infty \), i.e., when we observe infinitely many trajectories, the error functional \({\mathcal {E}}_{T,\infty }\) is an expectation, as in (3.3), and does not involve the stochastic integral term. From the proof of Proposition 3.3, we see that it is continuous over \({\mathcal {H}}\) with respect to the \(\Vert \cdot \Vert _{\infty }\) norm:

Corollary 3.1

Suppose \(\varphi _1, \varphi _2 \in {\mathcal {H}}\), then, with \(C_1=\frac{2R^2S_0}{\sigma ^2}\), we have

$$\begin{aligned} |{\mathcal {E}}_{T,\infty }(\varphi _1)-{\mathcal {E}}_{T,\infty }(\varphi _2)| \le C_1 \Vert \varphi _1-\varphi _2\Vert _{\infty }. \end{aligned}$$

Proof

Using (3.4), we obtain that

$$\begin{aligned} |{\mathcal {E}}_{T,\infty }(\varphi _1)-{\mathcal {E}}_{T,\infty }(\varphi _2)|&=\frac{1}{2\sigma ^2NT}{\mathbb {E}}\int _{0}^{T}\big | \langle {\mathbf {f}}_{\varphi _1-\varphi _2}(\varvec{X}_t), {\mathbf {f}}_{\varphi _1+\varphi _2-2\phi }(\varvec{X}_t) \rangle \big |\mathrm{d}t \\&\le \frac{R^2}{2\sigma ^2}\Vert \varphi _1-\varphi _2\Vert _{\infty } \Vert \varphi _1+\varphi _2-2\phi \Vert _{\infty } \\&\le \frac{2R^2S_0}{\sigma ^2}\Vert \varphi _1-\varphi _2\Vert _{\infty } \end{aligned}$$

\(\square \)

Recall the definition

$$ \widehat{\phi }_{T,\infty , {\mathcal {H}}}:=\underset{\varphi \in {\mathcal {H}}}{{\text {arg}}{\text {min}}}\; {\mathcal {E}}_{T,\infty }(\varphi ). $$

We now analyze the discrepancy between the empirical minimizer \(\widehat{\phi }_{T,M, {\mathcal {H}}}\) and \(\widehat{\phi }_{T,\infty , {\mathcal {H}}}\), which we called sampling error (SE) in the diagram in Fig. 1. We introduce a measurable function on the path space by

$$\begin{aligned} D_{\varphi }: ={\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )-{\mathcal {E}}_{\varvec{X}_{[0,T]}}(\widehat{\phi }_{T,\infty , {\mathcal {H}}}) \end{aligned}$$
(3.14)

for any \(\varphi \in {\mathcal {H}}\). From Proposition 3.1, we have

$$\begin{aligned} {\mathbb {E}}D_{\varphi }&={\mathcal {E}}_{T,\infty }(\varphi )-{\mathcal {E}}_{T,\infty }(\widehat{\phi }_{T,\infty , {\mathcal {H}}})\ge c_{{\mathcal {H}}}{\left| \left| \left| \varphi -{\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| }^2, \end{aligned}$$
(3.15)

so \(D_{\varphi }\) in fact controls (in expectation) the squared distance between \(\varphi \) and \(\widehat{\phi }_{ T,\infty , {\mathcal {H}}}\) with respect to the \(|||\cdot |||\) norm. We now perform a non-asymptotic analysis of \(D_{\varphi }\). We shall show that the random variable \(D_{\varphi }\) satisfies moment conditions sufficient to guarantee strong concentration about its expectation (Proposition 3.4). To do this, we decompose \(D_{\varphi }\) into a bounded component, involving only time integrals, and an unbounded component, involving stochastic integrals:

$$\begin{aligned} D_{\varphi }:&=D_{\varphi }^{\mathrm {bd}}-D_{\varphi }^{\mathrm {ubd}}\\ D_{\varphi }^{\mathrm {bd}} :&=\frac{1}{2\sigma ^2NT}\int _{0}^{T}\langle {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t),{\mathbf {f}}_{\varphi +\widehat{\phi }_{T,\infty , {\mathcal {H}}}-2\phi }(\varvec{X}_t)\rangle \mathrm{d}t\\ D_{\varphi }^{\mathrm {ubd}}:&=\frac{1}{\sigma ^2 NT}\int _{0}^{T}\langle {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t),\mathrm{d}\varvec{B}(t)\rangle \end{aligned}$$

We prove moment conditions independently for each of these components in the next two Lemmata.

Lemma 3.2

(Bounds on \(D_{\varphi }^{\mathrm {ubd}}\)) For \(\varphi \in {\mathcal {H}}\) and \(p=2,3,4,\dots \),

$$\begin{aligned} {\mathbb {E}}\big | D_{\varphi }^{\mathrm {ubd}}|^p \le C \big (\big \Vert \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}\big \Vert _{\infty }\big )^{p-2}{\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2 \end{aligned}$$

where \(C=(\frac{p(p-1)}{2})^{\frac{p}{2}}\frac{R^{p-2}}{\sigma ^{2p}(NT)^{\frac{p+2}{2}}}\).

Proof

First of all, note that

$$\begin{aligned} \big \Vert {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t)\big \Vert \le \sqrt{N}R \big \Vert \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}\big \Vert _{\infty }\,. \end{aligned}$$
(3.16)

Therefore \({\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t)\) is an \(L^2\)-integrable process. Applying Theorem A.5, we obtain

$$\begin{aligned}&{\mathbb {E}} \bigg | \int _{0}^{T}\langle {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t), \mathrm{d}\varvec{B}(t)\rangle \bigg |^p \\&\quad \le C_{p,T} {\mathbb {E}}\int _{0}^{T}\big \Vert {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t) \big \Vert ^p\mathrm{d}t\\&\quad \le C_{p,T}\big (\sqrt{N}R\Vert {\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}\Vert _{\infty } \big )^{p-2}{\mathbb {E}}\int _{0}^T\big \Vert {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t)\big \Vert ^2\mathrm{d}t\\&\quad \le C_{p,T}\big (\sqrt{N}R\Vert {\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}\Vert _{\infty } \big )^{p-2} NT{\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2\,, \end{aligned}$$

with \(C_{p,T}=\big (\frac{p(p-1)}{2}\big )^{\frac{p}{2}}T^{\frac{p-2}{2}}\). The conclusion then follows after multiplying by the \(p\)-th power of the prefactor \(\frac{1}{\sigma ^2 NT}\) in the definition of \(D_{\varphi }^{\mathrm {ubd}}\). \(\square \)

Lemma 3.3

(Bounds on \(D_{\varphi }^{\mathrm {bd}}\)) For \(\varphi \in {\mathcal {H}}\) and \(p = 2,3,4,\dots \),

$$\begin{aligned} {\mathbb {E}}\big | D_{\varphi }^{\mathrm {bd}} \big |^p \le \left( \frac{2S_0}{\sigma ^2}\right) ^pR^{2p-2}\big \Vert \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}\big \Vert _{\infty }^{p-2} {\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2. \end{aligned}$$

Proof

From inequality (3.16) and the linear dependence of \({\mathbf {f}}_{\varphi }\) on \(\varphi \), we have

$$\begin{aligned} \big | D_{\varphi }^{\mathrm {bd}}\big |&\le \frac{2S_0R}{\sigma ^2\sqrt{N}T}\int _0^T\big \Vert {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t)\big \Vert \mathrm{d}t \\&\le \frac{2S_0R}{\sigma ^2}\sqrt{\frac{1}{NT}\int _{0}^{T}\big \Vert {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t)\big \Vert ^2\mathrm{d}t}. \end{aligned}$$

Therefore, raising to the \(p\)-th power, bounding \(p-2\) of the factors \(\big (\frac{1}{NT}\int _{0}^{T}\big \Vert {\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t)\big \Vert ^2\mathrm{d}t\big )^{1/2}\) by \(R\big \Vert \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}\big \Vert _{\infty }\) via (3.16), and taking expectations, we obtain

$$ {\mathbb {E}}|D_{\varphi }^{\mathrm {bd}}|^p \le \left( \frac{2S_0}{\sigma ^2}\right) ^pR^{2p-2}\big \Vert \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}\big \Vert _{\infty }^{p-2}{\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2. $$

\(\square \)

Now we combine Lemmas 3.2 and 3.3 to prove the moment condition for \(D_{\varphi }\).

Lemma 3.4

(Moment conditions) For \(\varphi \in {\mathcal {H}}\), and every \(p=2,3,\ldots \), we have

$$\begin{aligned}&{\mathbb {E}}\bigg [ \big | D_{\varphi }-{\mathbb {E}}D_{\varphi } \big |^p \bigg ] \le \frac{1}{2} p! K_{\varphi ,{\mathcal {H}}}^{p-2} C_{\varphi ,{\mathcal {H}}}, \end{aligned}$$

where

$$\begin{aligned} K_{\varphi ,{\mathcal {H}}}:= C_0\big \Vert \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}\big \Vert _{\infty }\,,\quad C_{\varphi ,{\mathcal {H}}}:=C_0^2 C_1{\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2 \end{aligned}$$
(3.17)

with \(C_0=\sqrt{\frac{2 e^2R^2}{\sigma ^4NT}}, \) \(C_1 =\max _{p\ge 2} \frac{1}{\sqrt{2\pi p}NTR^2}\bigg (1+ \frac{c_{\sigma ,S_0,R}}{C_0^p} \bigg ),\) and \( c_{\sigma ,S_0,R}=\max _{p\ge 2} (\frac{8S_0}{\sigma ^2e})^p \frac{R^{2p-2}}{\sqrt{2\pi }p^{p+\frac{1}{2}}}. \)

Proof

The proof is based on Jensen’s inequality, Lemma 3.2 and Lemma 3.3.

$$\begin{aligned} {\mathbb {E}}\big | D_{\varphi }-{\mathbb {E}}D_{\varphi }\big |^p&\le 2^{p-1}{\mathbb {E}}\big | D_{\varphi }^{\mathrm {bd}}-{\mathbb {E}}D_{\varphi }^{\mathrm {bd}} \big |^p+2^{p-1}{\mathbb {E}}\big |D_{\varphi }^{\mathrm {ubd}} \big |^p \\&\le 2^{2p-1}{\mathbb {E}}\big |D_{\varphi }^{\mathrm {bd}} \big |^p +2^{p-1}{\mathbb {E}}\big |D_{\varphi }^{\mathrm {ubd}} \big |^p \\&\le \frac{1}{2}C_0 \big \Vert \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}\big \Vert _{\infty }^{p-2}{\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2 \\&\le \frac{1}{2} p! (C_1\Vert \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}\Vert _{\infty })^{p-2} C_1^2C_2{\left| \left| \left| \varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2, \end{aligned}$$

where the constants are

$$\begin{aligned} C_0&:=\left( \frac{8S_0}{\sigma ^2}\right) ^pR^{2p-2}+(2p(p-1))^{\frac{p}{2}}\frac{R^{p-2}}{\sigma ^{2p}(NT)^{\frac{p+2}{2}}},\\ C_1&:=\sqrt{\frac{2 e^2R^2}{\sigma ^4NT}}, \quad \quad C_2:=\max _{p\ge 2} \bigg (\frac{1}{\sqrt{2\pi p}NTR^2}+ \frac{c_{\sigma ,S_0,R}}{C_1^p} \bigg ),\\ c_{\sigma ,S_0,R}&:=\max _{p\ge 2} \left( \frac{8S_0}{\sigma ^2e}\right) ^p \frac{R^{2p-2}}{\sqrt{2\pi }p^{p+\frac{1}{2}}}, \end{aligned}$$

and the last inequality follows from Stirling’s formula. \(\square \)

We now tie the discrepancy functionals for finite and infinite M:

Proposition 3.4

(Sampling Error bound) Let \(0<\delta <1\) and let \(\{\varphi _j\}_{j=1}^{{\mathcal {N}}}\) be an \(\eta \)-net of functions in a compact convex hypothesis space \({\mathcal {H}} \subset Ball_{S_0}(L^{\infty }[0,R])\); that is, for any function \(\varphi \in {\mathcal {H}}\), there exists some j such that \(\Vert \varphi -\varphi _j\Vert _{\infty } \le \eta \). Denote

$$\begin{aligned} {\mathcal {D}}_{\varphi _j,\infty }:&= {\mathbb {E}}D_{\varphi _j}={\mathcal {E}}_{T,\infty }(\varphi _j)-{\mathcal {E}}_{T,\infty }({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}})\\ {\mathcal {D}}_{\varphi _j,M}:&={\mathcal {E}}_{T,M}(\varphi _j)-{\mathcal {E}}_{T,M}({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}})\,. \end{aligned}$$

Then with probability at least \(1-\frac{\delta }{2}\), we have

$$\begin{aligned} {\mathcal {D}}_{\varphi _j,\infty }- {\mathcal {D}}_{\varphi _j,M}\le \epsilon _{M,\delta , {\mathcal {N}}}+ \frac{1}{2}{\mathcal {D}}_{\varphi _j,\infty } \end{aligned}$$
(3.18)

for all j, where \(\epsilon _{M,\delta ,{\mathcal {N}}}=\frac{C}{M}\log (\frac{2{\mathcal {N}}}{\delta })\), \(C=2\frac{C_0^2C_1}{c_{\mathcal {H}}}+4C_0S_0\), and \(C_0,C_1\) as in (3.17), with \(c_{{\mathcal {H}}}\) the coercivity constant defined in (3.5).

Proof

For each \(\varphi _j \in {\mathcal {H}}\), recall from Eq. (3.15) that the coercivity condition on \({\mathcal {H}}\) implies

$$ {\mathcal {D}}_{\varphi _j,\infty } \ge c_{{\mathcal {H}}} {\left| \left| \left| \varphi _j- \widehat{\phi }_{T,\infty , {\mathcal {H}}} \right| \right| \right| }^2. $$

Then, Eq.(3.17) in Lemma 3.4 yields that

$$\begin{aligned}&{\mathbb {E}}|{D}_{\varphi _j}-{\mathbb {E}}{D}_{\varphi _j}|^p \le \frac{1}{2} p! K_{\varphi _j,{\mathcal {H}}}^{p-2}\frac{C_0^2C_1}{c_{{\mathcal {H}}}} {\mathcal {D}}_{\varphi _j,\infty }. \end{aligned}$$
(3.19)

Therefore, the random variable \({D}_{\varphi _j}\) satisfies the moment condition in Corollary A.2, and so \(\forall \epsilon >0\)

$$ {\mathbb {P}}\left\{ {{\mathcal {D}}_{\varphi _j,\infty } -{\mathcal {D}}_{\varphi _j,M} \ge \sqrt{\epsilon (\epsilon + {\mathcal {D}}_{\varphi _j,\infty })}}\right\} \le \exp \bigg (\frac{-M\epsilon }{\frac{2C_0^2C_1}{c_{{\mathcal {H}}}}+2K_{\varphi _j,{\mathcal {H}}}}\bigg ). $$

We have \(K_{\varphi _j,{\mathcal {H}}}\le 2C_0S_0\) where \(C_0\) is defined in (3.17). Taking a union bound on all these events, over \(j \in \{1,2,\ldots ,{\mathcal {N}}\}\), we obtain that

$$\begin{aligned} {\mathbb {P}}\left\{ { \max _{1\le j \le {\mathcal {N}}}\frac{{\mathcal {D}}_{\varphi _j,\infty } -{\mathcal {D}}_{\varphi _j,M}}{\sqrt{{\mathcal {D}}_{\varphi _j,\infty }+\epsilon }} \ge \sqrt{\epsilon }}\right\} \le {\mathcal {N}}\exp \bigg ( \frac{-M\epsilon }{\frac{2C_0^2C_1}{c_{\mathcal {H}}}+4C_0S_0}\bigg )\,. \end{aligned}$$
(3.20)

Setting the right-hand side to be \(\frac{\delta }{2}\), we get \(\epsilon _{M,\delta ,{\mathcal {N}}}=\frac{C}{M}\log (\frac{2{\mathcal {N}}}{\delta })\), where \( C:=2\frac{C_0^2C_1}{c_{\mathcal {H}}}+4C_0S_0 \). Using the inequality \( \sqrt{\epsilon _{M,\delta ,{\mathcal {N}}}( \epsilon _{M,\delta ,{\mathcal {N}}}+{\mathcal {D}}_{\varphi _j,\infty })}\le \epsilon _{M,\delta ,{\mathcal {N}}} +\frac{1}{2}{\mathcal {D}}_{\varphi _j,\infty }, \) we conclude that with probability at least \(1-\frac{\delta }{2}\)

$$ {\mathcal {D}}_{\varphi _j,\infty } -{\mathcal {D}}_{\varphi _j,M} \le \epsilon _{M,\delta ,{\mathcal {N}}}+ \frac{1}{2}{\mathcal {D}}_{\varphi _j,\infty }. $$

\(\square \)

Proof of Theorem 3.2

For \({\mathcal {H}}_n={\mathcal {B}}_{S_0}^{\infty }({\mathcal {L}}_n)\), let \(\{\varphi _j: j=1,\ldots , {\mathcal {N}}\}\) be an \(\eta \)-net of \({\mathcal {H}}_n\). Let

$$ {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}=\mathrm {argmin}_{\varphi \in {\mathcal {H}}_n}{\mathcal {E}}_{T,M}(\varphi ). $$

Then there exists \(\varphi _{j_M}\) in the net such that \(\Vert \varphi _{j_M}-{\widehat{\phi }}_{T,M,{\mathcal {H}}_n}\Vert _{\infty } \le \eta \); by Corollary 3.1

$$\begin{aligned} \big |{\mathcal {D}}_{\varphi _{j_M},\infty }- {\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},\infty }\big |&= \big |{\mathcal {E}}_{T,\infty }(\varphi _{j_M})- {\mathcal {E}}_{T,\infty }({\widehat{\phi }}_{T,M,{\mathcal {H}}_n})\big | \le \eta \frac{2S_0R^2}{\sigma ^2}. \end{aligned}$$
(3.21)

On the other hand, since \({\mathcal {H}}_n \subset {\mathcal {L}}_n \subset C^{1,1}([0,R])\), thanks to the almost sure bound in Proposition 3.3 and the uniform bound \(\sup _{\varphi \in {\mathcal {L}}_n}\frac{\Vert \varphi '\Vert _{\infty }}{\Vert \varphi \Vert _{\infty } }\le c_1(\mathrm {dim}({\mathcal {L}}_n))^\gamma \) from assumption (3.11), we have, almost surely,

$$\begin{aligned} \begin{aligned} \big |{\mathcal {D}}_{\varphi _{j_M},M}- {\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},M}\big |&=\big | {\mathcal {E}}_{T, M}(\varphi _{j_M})- {\mathcal {E}}_{T,M}({\widehat{\phi }}_{T,M,{\mathcal {H}}_n})\big |\\&\le \eta (C_1+c_1C_2 \mathrm {dim}({\mathcal {L}}_n)^\gamma ), \end{aligned} \end{aligned}$$
(3.22)

where \(C_1 = \frac{ R^2S_0}{\sigma ^2}+\frac{ R^2 }{2\sigma ^2T}+\frac{d}{2}\), \(C_2 =\frac{R}{2}\).

By Proposition 3.4, for each \(\eta >0\), with probability at least \(1-\frac{\delta }{2}\), (3.18) holds for this \(\eta \)-net \(\{\varphi _j: j=1,\ldots ,{\mathcal {N}}\}\). Combining (3.18) with (3.21) and (3.22), we conclude that, with probability at least \(1-\frac{\delta }{2}\),

$$\begin{aligned}&{\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},\infty }- {\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},M} \\&\quad \le | {\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},\infty }- {\mathcal {D}}_{\varphi _{j_M},\infty } | + |{\mathcal {D}}_{\varphi _{j_M},\infty }-{\mathcal {D}}_{\varphi _{j_M},M}| \\&\qquad + |{\mathcal {D}}_{\varphi _{j_M},M}- {\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},M} | \\&\quad \le \eta (C_0+ C_1+c_1C_2\mathrm {dim}({\mathcal {L}}_n)^{\gamma } )+{\mathcal {D}}_{\varphi _{j_M},\infty }- {\mathcal {D}}_{\varphi _{j_M},M}\\&\quad \le \eta (C_0+ C_1+c_1C_2\mathrm {dim}({\mathcal {L}}_n)^{\gamma } )+ \epsilon _{M,\delta ,{\mathcal {N}}}+ \frac{1}{2} {\mathcal {D}}_{ \varphi _{j_M}, \infty } \\&\quad \le \eta (\frac{3C_0}{2}+ C_1+c_1C_2\mathrm {dim}({\mathcal {L}}_n)^{\gamma } )+ \epsilon _{M,\delta ,{\mathcal {N}}}+ \frac{1}{2} {\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},\infty }. \end{aligned}$$

Notice that \({\mathcal {D}}_{\smash {{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},M}}\le 0\), so the above inequality implies that

$$\begin{aligned} {\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},\infty } \le \eta (3C_0+ 2C_1+2c_1C_2\mathrm {dim}({\mathcal {L}}_n)^{\gamma } )+ 2\epsilon _{M,\delta ,{\mathcal {N}}}. \end{aligned}$$
(3.23)

Let \({\mathcal {S}}\) be a metric space and \(\eta >0.\) We define the covering number \({\mathcal {N}}({\mathcal {S}},\eta )\) to be the minimal number of disks in \({\mathcal {S}}\) with radius \(\eta \) covering \({\mathcal {S}}\). The covering number of \({\mathcal {H}}_n\) satisfies \({\mathcal {N}}({\mathcal {H}}_n, \eta ) \le \left( \frac{4S_0}{\eta }\right) ^{c_0n}\) (e.g., Proposition 5 in [20]). By the triangle inequality, we split the error we want to control into sampling error (SE) and approximation error (AE) (see Fig. 1)

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}-\phi \right| \right| \right| }^2\le 2{\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}- {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}_n} \right| \right| \right| }^2+2{\left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}_n}- \phi \right| \right| \right| }^2. \end{aligned}$$
(3.24)

From (3.23) and the coercivity condition (3.15), we obtain that, with probability at least \(1-\frac{\delta }{2}\),

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}- {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}_n} \right| \right| \right| }^2&\le \frac{1}{c_{{\mathcal {H}}_n}}{\mathcal {D}}_{{\widehat{\phi }}_{T,M,{\mathcal {H}}_n}, \infty }\nonumber \\&\le \frac{\eta }{c_{{\mathcal {H}}_n}} (3C_0+ 2C_1+2c_1C_2\mathrm {dim}({\mathcal {L}}_n)^{\gamma } ) \nonumber \\&\quad + \frac{2}{c_{{\mathcal {H}}_n}}\epsilon _{M,\delta ,{\mathcal {N}}}. \end{aligned}$$
(3.25)

Let \(\phi _{{\mathcal {H}}_n}:=\mathrm {argmin}_{\psi \in {\mathcal {H}}_n}\Vert \psi -\phi \Vert _{\infty }\). By coercivity condition (3.15), we have

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}_n}- \phi _{{\mathcal {H}}_n} \right| \right| \right| }^2&\le \frac{1}{c_{{\mathcal {H}}_n}}({\mathcal {E}}_{T,\infty }(\phi _{{\mathcal {H}}_n})-{\mathcal {E}}_{T,\infty }({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}_n}))\\&\le \frac{1}{c_{{\mathcal {H}}_n}}({\mathcal {E}}_{T,\infty }(\phi _{{\mathcal {H}}_n})-{\mathcal {E}}_{T,\infty }(\phi )) \le \frac{1}{c_{{\mathcal {H}}_n}} {\left| \left| \left| \phi _{{\mathcal {H}}_n}-\phi \right| \right| \right| }^2, \end{aligned}$$

where we used

$$\begin{aligned} {\mathcal {E}}_{T,\infty }(\phi )&\le {\mathcal {E}}_{T,\infty }({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}_n})\,,\quad \big | {\mathcal {E}}_{T,\infty }(\phi )- {\mathcal {E}}_{T,\infty }(\varphi )\big |&\le {\left| \left| \left| \phi -\varphi \right| \right| \right| }^2\,, \quad \forall \varphi \in {\mathcal {H}}_n. \end{aligned}$$

Therefore, we have

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}_n}- \phi \right| \right| \right| }^2 \le (2+\frac{2}{c_{{\mathcal {H}}_n}}){\left| \left| \left| \phi _{{\mathcal {H}}_n}- \phi \right| \right| \right| }^2 \le \frac{4R^2}{c_{{\mathcal {H}}_n}}n^{-2s}. \end{aligned}$$
(3.26)

Now we combine the estimates (3.24), (3.25) and (3.26), choose \(\eta =n^{-2s-\gamma }\) and \(n=(\frac{M}{\log M})^{\frac{1}{2s+1}}\), and note that \(c_{{\mathcal {L}}}=c_{\cup _n{\mathcal {L}}_n}=c_{\cup _n{\mathcal {H}}_n}\le c_{{\mathcal {H}}_n}=c_{{\mathcal {L}}_n}\le 1\) for all n. We obtain that, with probability at least \(1-\frac{\delta }{2}\), the following estimate holds:

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}-\phi \right| \right| \right| }^2&\le \frac{2\eta }{c_{{\mathcal {L}}}}(3C_0+2C_1+2c_1C_2\mathrm {dim}({\mathcal {L}}_n)^{\gamma })+\frac{8R^2}{c_{{\mathcal {L}}}}n^{-2s}+\frac{4}{c_{{\mathcal {L}}}}\epsilon _{M,\delta ,N}\nonumber \\&\le \frac{C_3}{c_{{\mathcal {L}}}}n^{-2s}+\frac{C_4}{c_{{\mathcal {L}}}}\frac{n\log n}{M}+ \frac{4C}{c_{{\mathcal {L}}}M}\log (\frac{2}{\delta })\nonumber \\&\le \frac{C_5}{c_{{\mathcal {L}}}} \left( \frac{\log M}{M}\right) ^{\frac{2s}{2s+1}}+\frac{4C}{c_{{\mathcal {L}}}M}\log ({2}/{\delta }), \end{aligned}$$
(3.27)

where \(\epsilon _{M,\delta ,N}\) and C are as in Proposition 3.4 (see (3.18)), \(C_0\) denotes the constant in the bound (3.21), \(C_1\) and \(C_2\) are as in (3.22), and

$$\begin{aligned} C_3&=6C_0+4C_1+4c_0^{\gamma }c_1C_2+8R^2\\ C_4&=4c_0C|\log (4S_0)|+4c_0(2s+\gamma )C\\ C_5&=C_3+\frac{C_4}{2s+1}. \end{aligned}$$

The bound in expectation is obtained by standard techniques, writing

$$\begin{aligned} {\mathbb {E}}{\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}-\phi \right| \right| \right| }^2=\int _{0}^{\infty } {\mathbb {P}}\left\{ { {\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}-\phi \right| \right| \right| }^2 >\epsilon }\right\} \mathrm{d}\epsilon \,, \end{aligned}$$

and splitting the integration interval into \([0, \frac{C_5}{c_{{\mathcal {L}}}} (\frac{\log M}{M})^{\frac{2s}{2s+1}}]\) and \([\frac{C_5}{c_{{\mathcal {L}}}} (\frac{\log M}{M})^{\frac{2s}{2s+1}}, \infty )\). On the first interval, we use \( {\mathbb {P}}\left\{ { {\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}-\phi \right| \right| \right| }^2 >\epsilon }\right\} \le 1\). On the second one, we use a change of variables and the probability estimate (3.27). We obtain

$$\begin{aligned} {\mathbb {E}}{\left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}-\phi \right| \right| \right| }^2&\le \frac{C_5}{c_{\cup _n{\mathcal {H}}_n}}\left( \frac{\log M}{ M}\right) ^{\frac{2s}{2s+1}}+\frac{4C}{c_{\cup _n{\mathcal {H}}_n}}\frac{1}{M}\nonumber \\&\le \frac{C_6}{c_{\cup _n{\mathcal {H}}_n}^2}\left( \frac{\log M}{ M}\right) ^{\frac{2s}{2s+1}} \end{aligned}$$
(3.28)

where \(C_6\) is a constant depending only on \(\sigma , N, T, S_0\) and R. \(\square \)

Proof of Theorem 3.1

In this proof, \(a \lesssim b\) means that there exists a constant c such that \(a\le cb\). For any \(\epsilon >0\), we claim

$$ \sum _{M=1}^{\infty } {\mathbb {P}}\left\{ {{\left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\phi \right| \right| \right| }^2 \ge \epsilon }\right\} < \infty . $$

Strong consistency will then follow from the Borel–Cantelli lemma. Notice that

$$\begin{aligned} {\mathbb {P}}\left\{ { {\left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\phi \right| \right| \right| }^2 \ge \epsilon }\right\}&\le {\mathbb {P}}\left\{ { {\left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\widehat{\phi }_{T,\infty ,{\mathcal {H}}_M} \right| \right| \right| }^2 \ge \frac{\epsilon }{2} }\right\} \\&\quad + {\mathbb {P}}\left\{ {{\left| \left| \left| \widehat{\phi }_{T,\infty ,{\mathcal {H}}_M}-\phi \right| \right| \right| }^2 \ge \frac{\epsilon }{2} }\right\} \,, \end{aligned}$$

and \( {\mathbb {P}}\left\{ {{\left| \left| \left| \widehat{\phi }_{T,\infty ,{\mathcal {H}}_M}-\phi \right| \right| \right| }^2 \ge \frac{\epsilon }{2}}\right\} =0\) when M is large enough (see (3.26)). It suffices to prove

$$ \sum _{M=1}^{\infty } {\mathbb {P}}\left\{ {{\left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\widehat{\phi }_{T,\infty ,{\mathcal {H}}_M} \right| \right| \right| }^2 \ge \epsilon }\right\} < \infty \,. $$

Let \(\{\varphi _j\}_{j=1}^{{\mathcal {N}}}\) be an \(\eta \)-net for \({\mathcal {H}}_M\). Consider the event

$$ {\varLambda }_{\eta ,M,\epsilon }=\Big \{\max _{1\le j \le {\mathcal {N}}} \Big (\frac{1}{2}{\mathcal {D}}_{\varphi _j,\infty } -{\mathcal {D}}_{\varphi _j,M}\Big )\ge \epsilon \Big \}\,. $$

The bound (3.20) in Proposition 3.4 yields

$$\begin{aligned} {\mathbb {P}}\left\{ {{\varLambda }_{\eta ,M,\epsilon }}\right\} \lesssim {\mathcal {N}}\exp \big ( -c_{{\mathcal {H}}_M} M\epsilon \big ). \end{aligned}$$
(3.29)

Using the fact that there exists \(j_M \in \{1,2,\ldots ,{\mathcal {N}}\}\) such that \(\Vert \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\varphi _{j_M}\Vert _{\infty }\le \eta \), and following the same argument as in (3.21) and (3.22), we obtain

$$ {\mathbb {P}}\left\{ {\frac{1}{2} {\mathcal {D}}_{\widehat{\phi }_{T,M,{\mathcal {H}}_M},\infty } -{\mathcal {D}}_{\widehat{\phi }_{T,M,{\mathcal {H}}_M},M}\gtrsim \eta n_{M}^{\gamma }+\epsilon }\right\} \le {\mathbb {P}}\left\{ {{\varLambda }_{\eta ,M,\epsilon }}\right\} \lesssim {\mathcal {N}}\exp \big ( -c_{{\mathcal {H}}_M} M\epsilon \big )\,. $$

Notice that \({\mathcal {D}}_{\widehat{\phi }_{T,M,{\mathcal {H}}_M},M}\le 0\) and \( {\mathcal {D}}_{\widehat{\phi }_{T,M,{\mathcal {H}}_M},\infty } \ge c_{{\mathcal {H}}_M} {\left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\widehat{\phi }_{T,\infty ,{\mathcal {H}}_M} \right| \right| \right| }^2\), so that we have

$$ {\mathbb {P}}\left\{ {c_{{\mathcal {H}}_M} {\left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\widehat{\phi }_{T,\infty ,{\mathcal {H}}_M} \right| \right| \right| }^2 \gtrsim \eta n_{M}^{\gamma }+\epsilon }\right\} \lesssim {\mathcal {N}}\exp \big ( -c_{{\mathcal {H}}_M} M\epsilon \big )\,. $$

Let \(\eta n_{M}^{\gamma }=\epsilon \), i.e., \(\eta = n_{M}^{-\gamma }\epsilon \). By assumption, \(c_{{\mathcal {H}}_M}\ge c_{\cup _M {\mathcal {H}}_M}>0\) and \(\lim _{M\rightarrow \infty }\frac{n_M\log n_M}{M}= 0\); therefore,

$$\begin{aligned} {\mathbb {P}}\left\{ {{\left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\widehat{\phi }_{T,\infty ,{\mathcal {H}}_M} \right| \right| \right| }^2 \gtrsim \epsilon }\right\}&\lesssim \exp \big ( n_{M} \log \frac{n_M}{\epsilon }-c_{\cup _M{\mathcal {H}}_M} M\epsilon \big )\\&=\exp \bigg (-c_{\cup _M{\mathcal {H}}_M}M\epsilon \big (1- \frac{n_{M}}{M\epsilon } \log \frac{n_M}{\epsilon }\big )\bigg )\\&\lesssim \exp \bigg (-\frac{1}{2}c_{\cup _M{\mathcal {H}}_M}M\epsilon \bigg ) \end{aligned}$$

when M is large enough. By the comparison test, the series

$$ \sum _{M=1}^{\infty } {\mathbb {P}}\left\{ {{\left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\widehat{\phi }_{T,\infty ,{\mathcal {H}}_M} \right| \right| \right| }^2 \gtrsim \epsilon }\right\} $$

converges. The claim is proved. \(\square \)

4 Learning Theory: Discrete-Time Observations

In this section, we analyze the estimation error of the (regularized) MLE \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\), defined in (2.5) for a finite-dimensional linear space \({\mathcal {H}}\) and for discrete-time observations. We show that it is of order \(\sqrt{\frac{n}{M}} + {\varDelta }t^{1/2}\) with high probability, where n is the dimension of \({\mathcal {H}}\) and \({\varDelta }t \) is the observation gap. As a consequence, the MLE is consistent when \(M\rightarrow \infty \) and \({\varDelta }t\rightarrow 0\), and the MLE converges at the optimal rate when the dimension n is chosen optimally as in (2.13).

This section is organized as follows: we shall first prove the main theorem, Theorem 4.2, on the error bounds of the MLE in Sect. 4.1. We postpone the technical details, including concentration inequalities and discretization error bounds, to Sects. 4.2–4.3.

Recall that we denote \(\varvec{X}_{[0,T]}\) the solution to system (1.1) with the true interaction kernel \(\phi \) and denote \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) independent trajectories observed at discrete times \(t_l= l{\varDelta }t\) with \({\varDelta }t = T/L\). Recall that when \({\mathcal {H}}= \mathrm {span}\{\psi _p\}_{p=1}^n\), the MLE \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) is given in (2.8).

Throughout this section, we assume

Assumption 4.1

(Basis functions) Assume that the basis functions \(\{ \psi _p \}_{p=1}^n \subset C^1_b({\mathbb {R}}^+,{\mathbb {R}}) \) satisfy the following conditions:

  (a) \(\{\psi _p(\cdot )(\cdot )\}_{p=1}^n\) are orthonormal in \(L^2(\rho _T)\);

  (b) \(\max _p \Vert \psi _p\Vert _\infty \le b_0\) and \(\max _p \Vert \psi '_p(\cdot )(\cdot )\Vert _\infty \le b_1\);

  (c) there exists a constant \(c_{\rho _T}\) such that \(n\le c_{\rho _T} \min (b_0^2R,b_1R^{3/2})\).

Item (a) aims to make the normal-equations matrix nonsingular, as discussed in Proposition 3.2. In item (b), the uniform bound on the derivatives controls the discretization error due to discrete-time observations, while the uniform boundedness of the functions is used in the concentration inequalities. Item (c) states that the number of such orthonormal basis functions is controlled by the measure \(\rho _T\) and by the uniform bounds on the functions and their derivatives. Item (c) often follows from (a)–(b), as suggested by examples such as polynomials, trigonometric polynomials, and smoothed piecewise polynomials, whenever \(r^2\rho _T(\mathrm{d}r)\) is equivalent to the Lebesgue measure on an interval \([R_0,R]\subset {\mathbb {R}}^+\). Such an interval is one that the pairwise distances explore with noticeable probability (see, for example, Figs. 2 and 6). It exists in general when the initial distribution spreads out the pairwise distances or when the relative-position system is ergodic, since the density of \(\rho _T\) is smooth and nonnegative on \({\mathbb {R}}^+\).
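As a hedged illustration of how item (a) might be realized numerically, the sketch below orthonormalizes a raw basis so that the functions \(r\mapsto \psi _p(r)r\) become orthonormal with respect to an empirical approximation of \(\rho _T\) (for example, obtained from observed pairwise distances). The monomial raw basis, the synthetic distance sample, and all sizes are assumptions made only for this illustration.

```python
import numpy as np

# Sketch: build a basis whose functions r -> psi_p(r) * r are orthonormal in
# L^2(rho_T), with rho_T replaced by an empirical sample of pairwise distances.
rng = np.random.default_rng(1)
r_samples = rng.uniform(0.1, 3.0, size=50_000)   # stand-in for samples of rho_T
n, R = 6, 3.0

def raw_basis(r):
    # columns psi_1(r), ..., psi_n(r); simple scaled monomials on [0, R]
    return np.stack([(r / R) ** p for p in range(n)], axis=-1)

Phi = raw_basis(r_samples) * r_samples[:, None]  # values of psi_p(r) * r
G = Phi.T @ Phi / len(r_samples)                 # empirical Gram matrix in L^2(rho_T)
Lchol = np.linalg.cholesky(G)
coeff = np.linalg.inv(Lchol).T                   # new basis = raw basis @ coeff

def basis(r):
    return raw_basis(r) @ coeff                  # orthonormal w.r.t. empirical rho_T

# sanity check: empirical Gram of the new basis is (close to) the identity
Phi_new = basis(r_samples) * r_samples[:, None]
print(np.round(Phi_new.T @ Phi_new / len(r_samples), 3))
```

In practice the raw basis would typically consist of (piecewise) polynomials supported where \(\rho _T\) has mass, as in the numerical experiments; the orthonormalization step is the same.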

4.1 Error Bounds for the MLE

We show first that the \(L^2(\rho _T)\) error of the estimator \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) in (2.8) converges as \(M\rightarrow \infty \) and \({\varDelta }t : = T/L\rightarrow 0\), with high probability.

Theorem 4.2

(Error bounds for the MLE) Let the hypothesis space be \({\mathcal {H}}=\mathrm {span} \{\psi _i\}_{i=1}^n \), where the functions \(\{\psi _i\}_{i=1}^n\) satisfy Assumption 4.1 and are orthonormal in \(L^2(\rho _T)\) with respect to the norm \({\left| \left| \left| \cdot \right| \right| \right| }\) defined in Eq. (2.11). Suppose that the coercivity condition holds on \({\mathcal {H}}\) with a constant \(c_{\mathcal {H}}>0\). Then, with probability at least \(1-{(4n+2n^2) }\exp {\left( -\frac{\epsilon ^2 }{8c_1^2} \right) }\), the error of the estimator \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) in (2.8) satisfies

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} - \phi \right| \right| \right| } \le {\left| \left| \left| {\widehat{\phi }}_{T, \infty , {\mathcal {H}}} - \phi \right| \right| \right| } + c_2\left( \sqrt{\frac{n}{M}}\epsilon + c_3{\varDelta }t ^{\frac{1}{2}}\right) , \end{aligned}$$
(4.1)

where \({\widehat{\phi }}_{T, \infty , {\mathcal {H}}} \) is the projection of the true kernel onto \({\mathcal {H}}\), the observation gap \({\varDelta }t = T/L\) satisfies \(c_3{\varDelta }t^{1/2} \le c_{\mathcal {H}}/2\), and the constants are

$$\begin{aligned} \begin{aligned} c_1&= R b_0 (R b_0+2\sigma /\sqrt{T}),\quad \\ c_2&= 4c_{\mathcal {H}}^{-1} (1+ {\left| \left| \left| {\widehat{\phi }}_{T, \infty , {\mathcal {H}}} \right| \right| \right| } ),\\ c_3&= dN(b_1+ b_0) Rb_0 (Rb_0 {\varDelta }t ^{\frac{1}{2}} + \sqrt{d}\sigma )\sqrt{c_{\rho _T} \min (b_0^2R,b_1R^{3/2})}.\\ \end{aligned} \end{aligned}$$
(4.2)

Remark 1

(The discretization error may dominate the statistical error) When the observation gap \({\varDelta }t=0\), we recover the min–max learning rate \(M^{-\frac{s}{2s+1}}\) proved in the previous section, if we choose the optimal dimension \(n=C ( M/\log {M})^{1/(2s+1)}\) for the hypothesis space. However, when \({\varDelta }t>0\), once \(M^{-\frac{s}{2s+1}}(\log {M})^{-\frac{1}{2(2s+1)}} \lesssim {\varDelta }t ^{\frac{1}{2}}\), the discretization error will dominate the error of the estimator, preventing us from observing the min–max learning rate. This phenomenon is well-illustrated by the left plots in Figs. 5 and 9 in our numerical experiments.
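As a rough numerical illustration of this trade-off, the short sketch below tabulates the dimension choice and the corresponding predicted rate; the regularity \(s\), the proportionality constant, and the grid of sample sizes \(M\) are hypothetical.

```python
import numpy as np

# Illustrative trade-off from Remark 1: optimal dimension n, the predicted
# continuous-time rate, and the observation-gap scale below which the
# discretization error stays subdominant.  s, C and the M grid are hypothetical.
s, C = 1.0, 1.0
for M in (10**2, 10**3, 10**4, 10**5):
    n = max(1, int(np.ceil(C * (M / np.log(M)) ** (1.0 / (2 * s + 1)))))
    rate = (np.log(M) / M) ** (s / (2 * s + 1))   # scale of |||estimator - phi|||
    thr = M ** (-s / (2 * s + 1)) * np.log(M) ** (-1 / (2 * (2 * s + 1)))
    print(f"M = {M:>6d}   n = {n:>3d}   rate ~ {rate:.2e}   need dt^(1/2) <~ {thr:.2e}")
```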

Remark 2

We assumed \(C^1_b\) regularity for the basis functions \(\{\psi _p\}\) for the above numerical error analysis, stronger than that of piecewise polynomials (which may be discontinuous) used in the numerical tests. Such a difference between the regularity requirements stems from the numerical representation, and we can view the piecewise polynomials as numerical approximations of regular functions. This view is supported by the numerical experiments: the estimator has only small jumps at the discontinuities in the high probability region.

Remark 3

A smaller coercivity constant \(c_{\mathcal {H}}\) corresponds to a worse-conditioned problem (Proposition 3.2), so it is natural that the condition \(L\gtrsim 1/c_{\mathcal {H}}\) requires L to be larger when \(c_{\mathcal {H}}\) is small.

We shall prove Theorem 4.2 as follows: we first outline the main idea and introduce the key elements, such as the normal matrices and vectors and the empirical error functionals forming their entries; then, we provide a proof, with the key but technical estimates, including the concentration inequalities and discretization error bounds, postponed to Sects. 4.2–4.3.

The error of the MLE \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}} \) consists of three parts: approximation error, discretization error and sampling error:

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} - \phi \right| \right| \right| }&\le \underbrace{ {\left| \left| \left| {\widehat{\phi }}_{T, \infty , {\mathcal {H}}} - \phi \right| \right| \right| } }_{\text { Approximation Error} } + \underbrace{ {\left| \left| \left| {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} - {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| } }_{\text {Discretization Error}} \nonumber \\&\quad + \underbrace{ {\left| \left| \left| {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} - {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \right| \right| \right| } }_{\text {Sampling Error}}, \end{aligned}$$
(4.3)

where \( {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \) is the infinite-data estimator. We shall study the discretization error and the sampling error by analyzing the differences between their coefficient vectors.

All these coefficients are solutions to the corresponding normal equations (e.g., Eq. (2.7)). To facilitate the study of these normal matrices and vectors, we introduce the following notions. For any \(f, g\in C^1_b({\mathbb {R}}^{Nd},{\mathbb {R}}^{Nd})\), we define the following functionals of the observation paths:

$$\begin{aligned} \begin{aligned} \xi (f, \varvec{X}_{t_0:t_L})&= \frac{1}{TN} \sum _{l=1}^{L-1} \langle f(\varvec{X}_{t_l}), \varvec{X}_{t_{l+1} } -\varvec{X}_{t_{l} } \rangle , \\ \eta (f, g,X_{t_0:t_L})&= \frac{1}{LN} \sum _{l=1}^{L-1} \langle f(\varvec{X}_{t_l}),g(\varvec{X}_{t_l}) \rangle , \\ \xi (f, \varvec{X}_{[0,T]})&= \frac{1}{TN} \int _0^T \langle f(\varvec{X}_t),\mathrm{d}\varvec{X}_t \rangle , \\ \eta (f, g, \varvec{X}_{[0,T]})&= \frac{1}{TN} \int _0^T \langle f(\varvec{X}_t),g(\varvec{X}_t) \rangle \mathrm{d}t. \end{aligned} \end{aligned}$$
(4.4)

Correspondingly, we define the empirical functionals

$$\begin{aligned} \begin{aligned} \xi _{M,L}(f)&= \frac{1}{M}\sum _{m=1}^M\xi (f, \varvec{X}_{t_0:t_L}^{(m)}),&\eta _{M,L}(f, g)&= \frac{1}{M}\sum _{m=1}^M\eta (f,g, \varvec{X}_{t_0:t_L}^{(m)}), \\ \xi _{M,\infty }(f)&= \frac{1}{M}\sum _{m=1}^M\xi (f, \varvec{X}_{[0,T]}^{(m)}),&\eta _{M,\infty }(f, g)&= \frac{1}{M}\sum _{m=1}^M\eta (f,g, \varvec{X}_{[0,T]}^{(m)}). \end{aligned} \end{aligned}$$
(4.5)

Using the notation for the empirical functionals introduced in (4.4)–(4.5), we consider the following normal matrices and vectors:

$$\begin{aligned} \begin{aligned} b_{M,L}(k)&= \xi _{M,L}(f_{\psi _k}) ,&A_{M,L}(k,k')&= \eta _{M,L}(f_{\psi _k}, f_{\psi _{k'} }) , \\ b_{\infty ,L}(k)&= {\mathbb {E}}[ \xi (f_{\psi _k}, \varvec{X}_{t_0:t_L})] ,&A_{\infty ,L}(k,k')&= {\mathbb {E}}[\eta (f_{\psi _k}, f_{\psi _{k'} }, \varvec{X}_{t_0:t_L})], \\ b_{\infty }(k)&= {\mathbb {E}}[ \xi (f_{\psi _k}, \varvec{X}_{[0,T]})] ,&A_{\infty }(k,k')&= {\mathbb {E}}[\eta (f_{\psi _k}, f_{\psi _{k'} }, \varvec{X}_{[0,T]})]. \end{aligned} \end{aligned}$$
(4.6)

It is clear that (here, to ease the notation, we denote the coefficient \( {\widehat{a}}_{L,T,M,{\mathcal {H}}}\) in Fig. 1 as \(a_{M,L}\), and similarly for others)

$$\begin{aligned} \begin{aligned} {\widehat{\phi }}_{L,T,M,{\mathcal {H}}}&=\sum _{i=1}^n a_{M,L}(i)\psi _i,&\text { with } a_{M,L}&= A_{M,L}^{-1} b_{M,L} , \\ {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}}&=\sum _{i=1}^n a_{\infty ,L}(i)\psi _i,&\text { with } a_{\infty ,L}&= A_{\infty ,L}^{-1} b_{\infty ,L} ,\\ {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}}&=\sum _{i=1}^n a_\infty (i)\psi _i,&\text { with } a_{\infty }&= A_{\infty }^{-1}b_{\infty }. \end{aligned} \end{aligned}$$
(4.7)

Here the matrix \(A_\infty \) is invertible due to the coercivity condition: its smallest eigenvalue is bounded below by the coercivity constant \(c_{\mathcal {H}}\) (see Proposition 3.2). The matrix \(A_{\infty ,L}\) is invertible when \(L= T/({\varDelta }t)\) is large, with its smallest eigenvalue bounded below by \(c_{\mathcal {H}}- c_3 {\varDelta }t^{1/2}\); see Corollary 4.1. Furthermore, Corollary 4.2 in Sect. 4.3 shows that, with probability at least \(1-\delta \), the matrix \(A_{M,L}\) is invertible with its smallest eigenvalue bounded below by \(c_{\mathcal {H}}- \left( \sqrt{\frac{n}{M}}\epsilon + c_3{\varDelta }t ^{\frac{1}{2}}\right) \).
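To make the construction in (4.4)–(4.7) concrete, here is a minimal sketch that assembles the empirical normal matrix \(A_{M,L}\) and vector \(b_{M,L}\) from discrete-time trajectory data and solves for the coefficient vector \(a_{M,L}\). The array layout of the trajectories, the Python names, and the basis passed in are assumptions made for the illustration (the sketch sums over all L observation gaps and includes no regularization or invertibility safeguards).

```python
import numpy as np

# Sketch of the estimator (4.7) from discrete-time data: assemble the normal
# matrix A_{M,L} and vector b_{M,L} of (4.6) via the empirical functionals
# (4.4)-(4.5) and solve A a = b.  The trajectory layout (M, L+1, N, d) and
# the list of basis functions `psis` are assumptions for this illustration.
def f_kernel(psi, X):
    """Drift field f_psi(X) for a single state X of shape (N, d)."""
    N = X.shape[0]
    diff = X[None, :, :] - X[:, None, :]            # diff[i, j] = x_j - x_i
    r = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(r, 1.0)                        # dummy value on the diagonal
    w = psi(r)
    np.fill_diagonal(w, 0.0)                        # drop the i = j term
    return (w[:, :, None] * diff).sum(axis=1) / N   # shape (N, d)

def normal_system(trajs, psis, T):
    """trajs: array (M, L+1, N, d) of observations at t_l = l * T / L."""
    Mtraj, Lp1, N, d = trajs.shape
    L, n = Lp1 - 1, len(psis)
    A, b = np.zeros((n, n)), np.zeros(n)
    for X_path in trajs:                            # one trajectory at a time
        F = np.array([[f_kernel(p, X_path[l]) for l in range(L)] for p in psis])
        dX = np.diff(X_path, axis=0)                # X_{t_{l+1}} - X_{t_l}
        b += np.einsum('klid,lid->k', F, dX) / (T * N)      # xi in (4.4)
        A += np.einsum('klid,qlid->kq', F, F) / (L * N)     # eta in (4.4)
    A /= Mtraj
    b /= Mtraj
    a_hat = np.linalg.solve(A, b)                   # coefficients a_{M,L} in (4.7)
    return a_hat, A, b
```

The resulting estimator is \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}=\sum _{k} a_{M,L}(k)\,\psi _k\); monitoring the smallest eigenvalue of \(A_{M,L}\) gives a practical check of the invertibility discussed in Corollary 4.2.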

Proof of Theorem 4.2

By Eq. (4.3), it suffices to prove upper bounds for the discretization error and the sampling error separately:

$$\begin{aligned}&\text {discretization error:}&{\left| \left| \left| {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} - {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| }&\le c_2c_3{\varDelta }t^{1/2} , \\&\text {sampling error:}&{\left| \left| \left| {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} - {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \right| \right| \right| }&\le c_2\sqrt{\frac{n}{M}}\epsilon , \end{aligned}$$

where the second inequality holds with probability at least \(1-\delta \).

For the discretization error, since \(\{\psi _i(\cdot )(\cdot )\}\) are orthonormal in \(L^2(\rho _T)\), we have, by Eq. (4.7):

$$\begin{aligned} {\left| \left| \left| {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} - {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| }^2&= {\left| \left| \left| \sum _{i=1}^n [ a_{\infty ,L}(i) - a_\infty (i ) ] \psi _i \right| \right| \right| }^2 \\&= \left\| a_{\infty ,L} - a_\infty \right\| ^2 = \Vert A_{\infty ,L}^{-1} b_{\infty ,L} - A_\infty ^{-1}b_{\infty }\Vert ^2. \end{aligned}$$

Using the formula \(A_{\infty ,L}^{-1} -A_{\infty }^{-1} = A_{\infty ,L}^{-1}(A_\infty - A_{\infty ,L})A_\infty ^{-1} \) (see, e.g., [25, Appendix B9]), we have

$$\begin{aligned} \left\| A_{\infty ,L}^{-1}b_{\infty ,L} - A_\infty ^{-1} b_\infty \right\|&= \left\| A_{\infty ,L}^{-1}(b_{\infty ,L}-b_\infty ) + (A_{\infty ,L}^{-1} -A_\infty ^{-1}) b_\infty \right\| \\&\le \left\| A_{\infty ,L}^{-1} \right\| \left( \left\| b_{\infty ,L}-b_\infty \right\| + \left\| A_\infty - A_{\infty ,L} \right\| \left\| A_\infty ^{-1} b_\infty \right\| \right) . \end{aligned}$$

Note that (i) \(\left\| A_{\infty ,L}^{-1} \right\| \le 2 c_{\mathcal {H}}^{-1}\), since \(c_3{\varDelta }t ^{1/2} \le \frac{1}{2}c_{\mathcal {H}}\); (ii) by Proposition 4.1 (in Sect. 4.3) in combination with Assumption 4.1,

$$\begin{aligned} \left\| b_{\infty ,L}-b_\infty \right\| \le c_3 {\varDelta }t ^{1/2}, \quad \left\| A_{\infty ,L} - A_\infty \right\| \le c_3 {\varDelta }t ^{1/2}; \end{aligned}$$

and (iii) \(\left\| A_\infty ^{-1} b_\infty \right\| = {\left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| }\). Then,

$$\begin{aligned} \left\| A_{\infty ,L}^{-1}b_{\infty ,L} - A_\infty ^{-1} b_\infty \right\| \le 2c_{\mathcal {H}}^{-1}\left( 1+{\left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| }\right) c_3 {\varDelta }t ^{1/2}, \end{aligned}$$
(4.8)

and the inequality for the discretization error follows.

Similarly, for the sampling error, we have

$$\begin{aligned}&{\left| \left| \left| {\widehat{\phi }}_{L,T,M,{\mathcal {H}}} - {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \right| \right| \right| }\\&\quad = \left\| a_{M,L} - a_{\infty ,L} \right\| = \Vert A_{M,L}^{-1}b_{M,L} - A_{\infty ,L}^{-1} b_{\infty ,L} \Vert \\&\quad = \left\| A_{M,L}^{-1}(b_{M,L}-b_{\infty ,L}) + (A_{M,L}^{-1} -A_{\infty ,L}^{-1}) b_{\infty ,L} \right\| \\&\quad \le \left\| A_{M,L}^{-1} \right\| \left( \left\| b_{M,L}-b_{\infty ,L} \right\| + \left\| A_{\infty ,L} - A_{M,L} \right\| \left\| A_{\infty ,L}^{-1} b_{\infty ,L} \right\| \right) . \end{aligned}$$

Note that (i) \(\left\| A_{M,L}^{-1} \right\| \le 2 c_{\mathcal {H}}^{-1}\) when M and L are large enough such that \(\sqrt{\frac{n}{M}}\epsilon + c_3{\varDelta }t ^{\frac{1}{2}} < \frac{1}{2}c_{\mathcal {H}}\); (ii) we have, by Proposition 4.2 (in Sect. 4.3), that

$$\begin{aligned} \left\| b_{M,L}-b_{\infty ,L} \right\| \le \sqrt{\frac{n}{M}}\epsilon , \quad \left\| A_{M,L} - A_{\infty , L} \right\| \le \sqrt{\frac{n}{M}}\epsilon \end{aligned}$$

hold with probability at least \(1-\delta \); (iii) since \( c_3{\varDelta }t ^{\frac{1}{2}} < \frac{1}{2}c_{\mathcal {H}}\) and \(\left\| A_{\infty }^{-1} b_{\infty } \right\| = {\left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| }\), we have

$$\begin{aligned} \left\| A_{\infty ,L}^{-1} b_{\infty ,L} \right\| \le 2 \left( 1+{\left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| }\right) . \end{aligned}$$

Then, the inequality for the sampling error follows. \(\square \)
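Both error bounds above rest on the same elementary identity \(A_1^{-1}-A_2^{-1}=A_1^{-1}(A_2-A_1)A_2^{-1}\) for invertible matrices; a quick numeric illustration (with random matrices standing in for the normal matrices) is given below.

```python
import numpy as np

# Numeric illustration of A1^{-1} - A2^{-1} = A1^{-1} (A2 - A1) A2^{-1},
# used for both the discretization and the sampling error; the matrices
# below are arbitrary stand-ins for A_{infty,L} (A1) and A_{infty} (A2).
rng = np.random.default_rng(2)
n = 5
B = rng.normal(size=(n, n))
A2 = B @ B.T + n * np.eye(n)              # symmetric positive definite
A1 = A2 + 0.05 * rng.normal(size=(n, n))  # small perturbation, still invertible
lhs = np.linalg.inv(A1) - np.linalg.inv(A2)
rhs = np.linalg.inv(A1) @ (A2 - A1) @ np.linalg.inv(A2)
print(np.allclose(lhs, rhs))              # expected: True
```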

In the next two subsections, we prove Propositions 4.1–4.2, which were used in the above proof. Section 4.2 studies the concentration inequalities and discretization error bounds for the empirical functionals in (4.5), which are the entries of the normal matrices and vectors. Based on them, Sect. 4.3 provides the error bounds for the normal matrices and vectors in Propositions 4.1–4.2.

4.2 Concentration and Discretization Error of Empirical Functionals

We introduce concentration inequalities for the above empirical functionals on the path space of diffusion processes and a bound on the discretization error of the estimator due to discrete-time approximations. Our first lemma studies concentration of the discrete-time empirical functionals \(\xi _{M,L}\) and \(\eta _{M,L}\).

Lemma 4.1

(Concentration of empirical functionals) Let \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) be discrete-time observations, with \(t_l= l{\varDelta }t\) and \(L= T/{\varDelta }t\), of the system (1.1) with \(\phi \). Then for any \(f,g\in C_b({\mathbb {R}}^{dN}, {\mathbb {R}}^{dN} )\), the error functionals defined in (4.5) satisfy the concentration inequalities:

$$\begin{aligned} \begin{aligned} {\mathbb {P}}\left\{ {|\xi _{M,L}(f) - {\mathbb {E}}[\xi _{M,L}(f)]|> \epsilon }\right\}&\le 4 e^{-\frac{M\epsilon ^2 }{8C_1^2}} \\ {\mathbb {P}}\left\{ {|\eta _{M,L}(f,g) - {\mathbb {E}}[\eta _{M,L}(f,g)]|> \epsilon }\right\}&\le 2 e^{-\frac{M\epsilon ^2 }{2C_2^2}}, \end{aligned} \end{aligned}$$
(4.9)

for any \(\epsilon >0\), where \(C_1 = \frac{1}{N}\Vert f\Vert _\infty \max (\frac{2\sigma \sqrt{N}}{\sqrt{T}}, \, \Vert f_\phi \Vert _\infty )\), and \(C_2 =\frac{1}{N} \Vert f\Vert _\infty \Vert g\Vert _\infty \). Furthermore,

$$\begin{aligned} {\mathbb {P}}\left\{ {|\xi _{M,L}(f) - {\mathbb {E}}[\xi _{M,L}(f)]|< \epsilon ,\, |\eta _{M,L}(f,g) - {\mathbb {E}}[\eta _{M,L}(f,g)]|< \epsilon }\right\}&\ge 1- 8 e^{-\frac{M\epsilon ^2 }{8C^2}}, \end{aligned}$$
(4.10)

where \(C= \frac{1}{N}\Vert f\Vert _\infty \max (\frac{2\sigma }{\sqrt{T/N}}, \Vert f_\phi \Vert _\infty ,\, \Vert g\Vert _\infty )\).

Proof

Note that \({\left| \eta (f, g,\varvec{X}_{t_0:t_L})\right| } \le \frac{1}{N}\Vert f\Vert _\infty \Vert g\Vert _\infty = C_2\). Then, the exponential inequality for \(\eta _{M,L}\) follows from the Hoeffding inequality, which states that, for i.i.d. random variables \(\{Z_m\}\) bounded in absolute value by K, one has \( {\mathbb {P}}\left\{ {{\left| \frac{1}{M}\sum _{m=1}^M (Z_m-{\mathbb {E}}Z_m) \right| } > \epsilon }\right\} \le 2 \exp {(-\frac{M\epsilon ^2}{2K^2}) } \).

To study \(\xi _{M,L}\), we decompose \(\xi (f,\varvec{X}_{t_0:t_L})\) into two parts, a bounded part and a martingale part:

$$\begin{aligned} \xi (f,\varvec{X}_{t_0:t_L}) =&\frac{1}{TN} \sum _{l=0}^{L-1} \langle f(\varvec{X}_{t_l}),\varvec{X}_{t_{l+1}} - \varvec{X}_{t_l} \rangle \\ =&\frac{1}{TN} \int _0^T\langle f^L(s),f_\phi (\varvec{X}_s) \rangle \mathrm{d}s + \frac{1}{TN} \int _0^T\langle f^L(s),\sigma \mathrm{d}\varvec{B}_s \rangle =: Z_T + Y_T, \end{aligned}$$

where we denote \(f^L(s) :=\sum _{l=0}^{L-1} f(\varvec{X}_{t_l}) {\mathbf {1}}_{[t_l,t_{l+1}]}(s) \). We call \(Z_T\) a bounded part because

$$\begin{aligned} {\left| Z_T\right| }={\left| \frac{1}{TN} \int _0^T\langle f^L(s),f_\phi (\varvec{X}_s) \rangle \mathrm{d}s\right| }\le \frac{1}{N} \Vert f\Vert _\infty \Vert f_\phi \Vert _\infty . \end{aligned}$$

We call \(Y_T\) a martingale part since \(TNY_T= \int _0^T\langle f^L(s),\sigma \mathrm{d}\varvec{B}_s \rangle \) is a martingale. Correspondingly, we can write

$$\begin{aligned} \xi _{M,L} = \frac{1}{M}\sum _{m=1}^M \big ( Z_T^{(m)} + Y_T^{(m)}\big ). \end{aligned}$$

Then, denoting \(K_1:=\frac{1}{N} \Vert f\Vert _\infty \Vert f_\phi \Vert _\infty \) and \(K_2= 2\sigma \Vert f\Vert _{\infty }\), and noticing that \(C_1= \max (K_1, K_2 /\sqrt{TN})\), we can conclude the first concentration inequality in (4.9) from

$$\begin{aligned}&{\mathbb {P}}\left\{ {|\xi _{M,L}(f) - {\mathbb {E}}[\xi _{M,L}(f)]|> \epsilon }\right\} \\&\quad \le \, {\mathbb {P}}\left\{ {{\left| \frac{1}{M}\sum _{m=1}^M Z_T^{(m)}- {\mathbb {E}}Z_T \right| }\ge \frac{\epsilon }{2} }\right\} + {\mathbb {P}}\left\{ { {\left| \frac{1}{M}\sum _{m=1}^M Y_T^{(m)} \right| }\ge \frac{\epsilon }{2}}\right\} \\&\quad \le \, 2 e^{-\frac{M\epsilon ^2}{8K_1^2}} + e^{-\frac{TNM\epsilon ^2}{8K_2^2}} , \end{aligned}$$

where the first exponential bound follows directly from the Hoeffding inequality applied to \(\{Z_T^{(m)}\}\), and the second exponential bound \( {\mathbb {P}}\left\{ { {\left| \frac{1}{M}\sum _{m=1}^M Y_T^{(m)} \right| }\ge \frac{\epsilon }{2}}\right\} \le e^{-\frac{TNM\epsilon ^2}{8K_2^2}}\) is proved as follows.

Note that \({\mathbb {E}}Y_T=0\) and \(TNY_T= \int _0^T\langle f^L(s),\sigma \mathrm{d}\varvec{B}_s \rangle \) is a martingale satisfying \({\mathbb {E}}[e^{\sigma ^2\int _0^T{\left| f^L(s)\right| }^2\mathrm{d}s}] < \infty \) because \({\left| f^L(s)\right| }\le \Vert f\Vert _\infty \). By Novikov's theorem, the process \(( e^{\alpha TNY_T - \frac{\alpha ^2}{2} \sigma ^2 \int _0^T {\left| f^L(s)\right| }^2\mathrm{d}s }, T\ge 0)\) is a martingale for any \(\alpha \in {\mathbb {R}}\) (see, e.g., [31, Corollary 5.13]). Therefore, with \(\alpha = \frac{\lambda }{MTN}\), we have

$$\begin{aligned} {\mathbb {E}}\left[ e^{\frac{\lambda }{M}Y_T} \right] = {\mathbb {E}}\left[ e^{\alpha TNY_T - \frac{\alpha ^2}{2}\sigma ^2 \int _0^T {\left| f^L(s)\right| }^2\mathrm{d}s}\, e^{\frac{\alpha ^2}{2}\sigma ^2 \int _0^T {\left| f^L(s)\right| }^2\mathrm{d}s} \right] \le e^{\frac{\alpha ^2}{2}\sigma ^2 T \Vert f\Vert _\infty ^2} \le e^{\sigma ^2\frac{\lambda ^2}{M^2TN} \Vert f\Vert _\infty ^2} \end{aligned}$$

for any \(\lambda >0\), where we bounded \(\int _0^T{\left| f^L(s)\right| }^2\mathrm{d}s\le T\Vert f\Vert _\infty ^2\) and used that the exponential martingale has expectation one. As a consequence, we have

$$\begin{aligned} {\mathbb {P}}\left\{ { {\left| \frac{1}{M}\sum _{m=1}^M Y_T^{(m)} \right| }\ge \frac{\epsilon }{2} }\right\} \le \inf _{\lambda>0} e^{-\frac{\lambda \epsilon }{2}} \left( {\mathbb {E}}\left[ e^{\frac{\lambda }{M}Y_T} \right] \right) ^M \le \inf _{\lambda >0} e^{- \frac{\lambda \epsilon }{2} + \lambda ^2\frac{\sigma ^2\Vert f\Vert _\infty ^2}{TNM}} \le e^{-\frac{TNM\epsilon ^2}{8K_2^2}}. \end{aligned}$$

Lastly, Eq. (4.10) follows directly from Eq. (4.9) by a union bound. \(\square \)

We remark that here we focus on the case \(M\rightarrow \infty \) with finite time T. If the relative position system is ergodic, one can extend the concentration inequalities to the case when \(T\rightarrow \infty \).

The next lemma shows that the discretization error of the empirical functionals, viewed as discrete-time approximations of the integrals, is of order \({\varDelta }t^ \frac{1}{2}\).

Lemma 4.2

(Discretization error of empirical functionals) Let \(f,g\in C^1_b({\mathbb {R}}^{dN}, {\mathbb {R}}^{dN}) \). Let \(\varvec{X}_{t_0:t_L}\) be a discrete-time trajectory, with \(t_l=l{\varDelta }t\) and \(L= T/{\varDelta }t\), of the system (1.1) with \(\phi \). Then, the error functionals defined in (4.4) satisfy

$$\begin{aligned} \begin{aligned} {\left| {\mathbb {E}}[ \xi (f, \varvec{X}_{t_0:t_L})] - {\mathbb {E}}[\xi (f, \varvec{X}_{[0,T]})] \right| }&\le C_1 {\varDelta }t^\frac{1}{2}; \\ {\left| {\mathbb {E}}[\eta (f,g, \varvec{X}_{t_0:t_L})] - {\mathbb {E}}[\eta (f,g, \varvec{X}_{[0,T]})] \right| }&\le C_2 {\varDelta }t^\frac{1}{2}, \end{aligned} \end{aligned}$$
(4.11)

where the constants are

$$\begin{aligned} C_1&= \Vert \nabla f\Vert _\infty \Vert f_\phi \Vert _\infty \left( \Vert f_\phi \Vert _\infty {\varDelta }t^{\frac{1}{2}}/N + \sqrt{d/N}\sigma \right) , \\ C_2&=( \Vert \nabla f\Vert _\infty \Vert g\Vert _\infty + \Vert \nabla g\Vert _\infty \Vert f\Vert _\infty ) \left( \Vert f_\phi \Vert _\infty {\varDelta }t^{\frac{1}{2}}/N + \sqrt{d/N}\sigma \right) . \end{aligned}$$

Proof

Note that since \(\varvec{X}_{[0,T]}\) is a solution to system (1.1), we have for each l,

$$\begin{aligned} {\mathbb {E}}\int _{t_l}^{t_{l+1}}\langle f(\varvec{X}_{t_l}) - f(\varvec{X}_s), \mathrm{d}\varvec{X}_s \rangle =&\, {\mathbb {E}}\int _{t_l}^{t_{l+1}}\langle f(\varvec{X}_{t_l}) - f(\varvec{X}_s), f_\phi (\varvec{X}_s) \rangle \mathrm{d}s \\ \le&\,\Vert \nabla f\Vert _\infty \Vert f_\phi \Vert _\infty {\mathbb {E}}\int _{t_l}^{t_{l+1}} |\varvec{X}_{t_l} - \varvec{X}_s|\mathrm{d}s \\ \le&\, \Vert \nabla f\Vert _\infty \Vert f_\phi \Vert _\infty \left( \Vert f_\phi \Vert _\infty {\varDelta }t ^2 + \sqrt{dN}\sigma {\varDelta }t^{3/2} \right) , \end{aligned}$$

where in the first inequality we have applied the mean value theorem to bound \(f(\varvec{X}_{t_l}) - f(\varvec{X}_s)\):

$$\begin{aligned} |f(\varvec{X}_{t_l}) - f(\varvec{X}_s)|\le \Vert \nabla f\Vert _\infty |\varvec{X}_{t_l} - \varvec{X}_s|, \end{aligned}$$

and in the second inequality, we used the fact that

$$\begin{aligned} {\mathbb {E}}|\varvec{X}_{t_l} - \varvec{X}_s|&= {\mathbb {E}}\left| \int _{t_l}^s f_\phi (\varvec{X}_r)\mathrm{d}r +\sigma (\varvec{B}_s - \varvec{B}_{t_l}) \right| \\&\le \Vert f_\phi \Vert _\infty (s-t_{l}) + \sqrt{dN}\sigma (s-t_{l})^{1/2}. \end{aligned}$$

Thus, we obtain the bound in (4.11) by a summation over l:

$$\begin{aligned} {\mathbb {E}}[\xi (f, \varvec{X}_{t_0:t_L}) - \xi (f, \varvec{X}_{[0,T]}) ]&= \frac{1}{TN} \sum _{l=1}^{L-1} {\mathbb {E}}\int _{t_l}^{t_{l+1}}\langle f(\varvec{X}_{t_l}) - f(\varvec{X}_s), \mathrm{d}\varvec{X}_s \rangle , \\&\le \Vert \nabla f\Vert _\infty \Vert f_\phi \Vert _\infty \left( \Vert f_\phi \Vert _\infty {\varDelta }t /N + \sqrt{d/N}\sigma {\varDelta }t^{1/2} \right) . \end{aligned}$$

Similarly, we have

$$\begin{aligned} {\left| \langle f(\varvec{X}_{t_l}), g(\varvec{X}_{t_l}) \rangle - \langle f(\varvec{X}_s),g(\varvec{X}_s) \rangle \right| } \le (\Vert \nabla f\Vert _\infty \Vert g\Vert _\infty + \Vert \nabla g\Vert _\infty \Vert f\Vert _\infty ) {\left| \varvec{X}_s-\varvec{X}_{t_l}\right| }, \end{aligned}$$

and the bound for \(\eta \) follows from the fact that

$$\begin{aligned}&\big | {\mathbb {E}}[\eta (f,g, \varvec{X}_{t_0:t_L}) - \eta (f, g, \varvec{X}_{[0,T]}) ] \big | \\&\quad \le \, \frac{1}{TN} \sum _{l=1}^{L-1} {\mathbb {E}}\int _{t_l}^{t_{l+1}}{\left| \langle f(\varvec{X}_{t_l}), g(\varvec{X}_{t_l}) \rangle - \langle f(\varvec{X}_s),g(\varvec{X}_s) \rangle \right| } \mathrm{d}s. \end{aligned}$$

\(\square \)

4.3 Error Bounds for the Normal Matrix and Vector

Proposition 4.1

(Discretization error) For the normal matrix \(A_{\infty ,L}\) and vector \(b_{\infty ,L}\) defined in (4.6) with \(\{ \psi _p \}_{p=1}^n\) satisfying Assumption  4.1, we have

$$\begin{aligned} \Vert b_{\infty ,L} - b_{\infty }\Vert \le \sqrt{n} C {\varDelta }t ^{\frac{1}{2}}\,,\quad \Vert A_{\infty ,L} - A_{\infty }\Vert \le \sqrt{n} C {\varDelta }t ^{\frac{1}{2}}, \end{aligned}$$

where the constant C is \( C = dN(b_1+ b_0) Rb_0 (Rb_0 {\varDelta }t ^{\frac{1}{2}} + \sqrt{d}\sigma ) \).

Proof

Applying Lemma 4.2, in combination with the basic fact that \(\Vert b\Vert \le \sqrt{n} \max _{k=1,\dots ,n} |b(k)|\) for any \(b\in {\mathbb {R}}^n\), and \(\Vert A\Vert \le \sqrt{n} \max _{k,k'=1,\dots ,n} |A(k,k')|\) for any \(A\in {\mathbb {R}}^{n\times n}\), we obtain

$$\begin{aligned} \Vert b_{\infty ,L} - b_{\infty }\Vert \le \sqrt{n} C_1 {\varDelta }t ^{\frac{1}{2}}\,,\quad \Vert A_{\infty ,L} - A_{\infty }\Vert \le \sqrt{n} C_2 {\varDelta }t ^{\frac{1}{2}}, \end{aligned}$$

with constants \(C_1\) and \(C_2\) in the form of

$$\begin{aligned} C_1&=\Vert f_\phi \Vert _\infty \left( \Vert f_\phi \Vert _\infty {\varDelta }t^{\frac{1}{2}}/N + \sqrt{d/N}\sigma \right) \max _{p=1,\ldots , n} \Vert \nabla f_{\psi _p}\Vert _\infty , \\ C_2&= \left( \Vert f_\phi \Vert _\infty {\varDelta }t^{\frac{1}{2}}/N + \sqrt{d/N}\sigma \right) \max _{p,p'=1,\ldots , n} ( \Vert \nabla f_{\psi _p}\Vert _\infty \Vert f_{\psi _{p'}}\Vert _\infty + \Vert \nabla f_{\psi _{p'} }\Vert _\infty \Vert f_{\psi _p}\Vert _\infty ) . \end{aligned}$$

To complete the proof, we are left to estimate \(\left\| f_{\psi _p} \right\| _\infty \) and \( \left\| \nabla f_{\psi _p} \right\| _\infty \). From the definition of \(f_\cdot \), we have

$$\begin{aligned} \left\| f_{\psi _p} \right\| _\infty ^2 = \sup _{\varvec{X}}\sum _{i=1}^N {\left| \frac{1}{N}\sum _{j=1}^N \psi _p(\Vert \varvec{x}_j-\varvec{x}_i\Vert )(\varvec{x}_j-\varvec{x}_i)\right| }^2 \le R^2 b_0^2 N, \end{aligned}$$
(4.12)

and \(\left\| f_{\phi } \right\| _\infty \le R b_0\sqrt{N}\) as well. Note that for each \(i,i'\in \{1,\ldots , N\}\), with notation \(\varvec{r}_{ji}= \varvec{x}_j-\varvec{x}_i\) and \(r_{ji}= {\left| \varvec{r}_{ji} \right| }\), we have,

$$\begin{aligned} \nabla _{\varvec{x}_{i'}} \left( f_{\psi _p}(\varvec{x}) \right) _i&= -\delta _{ii'} \frac{1}{N} \sum _{j=1, j\ne i}^N \left( \psi _p(r_{ji}){\mathbf {I}}_d+\psi _p'(r_{ji}) \frac{\varvec{r}_{ji}\otimes \varvec{r}_{ji} }{r_{ji}}\right) \\&\quad + \delta _{i'\ne i} \frac{1}{N}\left( \psi _p(r_{ii'}){\mathbf {I}}_d+\psi _p'(r_{ii'}) \frac{\varvec{r}_{ii'}\otimes \varvec{r}_{ii'} }{r_{ii'}}\right) . \end{aligned}$$

Thus, the norm of this \(d\times d\) matrix is uniformly bounded,

$$\begin{aligned} \sup _x \left\| \nabla _{\varvec{x}_i'} \left( f_{\psi _p}(\varvec{x}) \right) _i \right\| \le d (b_1 + b_0), \end{aligned}$$

and as a result, the norm of the \(dN \times dN\) matrix is uniformly bounded,

$$\begin{aligned} \left\| \nabla f_{\psi _p} \right\| _\infty \le dN (b_1 + b_0). \end{aligned}$$

Combining the above estimates with \(\Vert f_\phi \Vert _\infty \le Rb_0\sqrt{N}\) (the same bound as for \(\Vert f_{\psi _p}\Vert _\infty \)), we obtain that \(C_1\) and \(C_2\) are both bounded by C. \(\square \)

It follows directly that the matrix \(A_{\infty ,L}\) is invertible:

Corollary 4.1

The smallest eigenvalue of the matrix \(A_{\infty ,L}\) defined in (4.6) is bounded below by \(c_{\mathcal {H}}- c_3{\varDelta }t^{1/2}\) when \(c_3 {\varDelta }t^{1/2}< c_{\mathcal {H}}\), with \(c_3\) defined in (4.2).

Proof

Recall that from Proposition 3.2, we have \(a^T A_{\infty } a \ge c_{{\mathcal {H}}} |a|^2\) for an arbitrary \(a\in {\mathbb {R}}^n\). Then,

$$\begin{aligned} a^T A_{\infty ,L} a = a^T (A_{\infty ,L}-A_\infty ) a + a^T A_{\infty } a \ge (c_{{\mathcal {H}}} - c_3{\varDelta }t^{1/2}) \Vert a\Vert ^2 \end{aligned}$$

by Proposition 4.1 with the bound of \(\sqrt{n}\) in Assumption 4.1. \(\square \)

We prove next that the matrix \(A_{M,L}\) is invertible and concentrates around \(A_{\infty , L}\).

Proposition 4.2

(Concentration of the normal matrix and vector) Suppose that the coercivity condition holds on \({\mathcal {H}}=\mathrm {span} \{\psi _p\}_{p=1}^n\) with a constant \(c_{{\mathcal {H}}}>0\), where \(\{ \psi _p \}_{p=1}^n\) satisfies Assumption  4.1. Then, the normal matrix \(A_{M,L}\) and vector \(b_{M,L}\) defined in (4.6) satisfy concentration inequalities in the sense that for any \(\epsilon >0\),

$$\begin{aligned} \begin{aligned} {\mathbb {P}}\left\{ {\left\| A_{M,L} - A_{\infty ,L} \right\|> \epsilon }\right\}&\le 2n^2 e^{-\frac{M\epsilon ^2 }{2nC^2}} \\ {\mathbb {P}}\left\{ {\left\| b_{M,L} - b_{\infty ,L} \right\| > \epsilon }\right\}&\le 4ne^{-\frac{M\epsilon ^2 }{8nC^2}}, \\ {\mathbb {P}}\left\{ {\left\| A_{M,L} - A_{\infty ,L} \right\|< \epsilon , \, \left\| b_{M,L} - b_{\infty ,L} \right\| < \epsilon }\right\}&\ge 1- (4n+2n^2) e^{-\frac{M\epsilon ^2 }{8nC^2}}\,, \end{aligned} \end{aligned}$$
(4.13)

where the constant C is \(C= R b_0 (R S_0+2\sigma /\sqrt{T})\).

Proof

Recall that by definition in (4.6), \(b_{M,L}(k) = \xi _{M,L}(f_{\psi _k}) \) with \( {\mathbb {E}}[b_{M,L}] =b_{\infty ,L}\) and \(A_{M,L}(k,k') = \eta _{M,L}(f_{\psi _k}, f_{\psi _{k'} })\) with \( {\mathbb {E}}[A_{M,L}]=A_{\infty ,L} \). Lemma 4.1 implies that each of these entries concentrates around its mean:

$$\begin{aligned} {\mathbb {P}}\left\{ {|\xi _{M,L}(f_{\psi _k}) - b_{\infty ,L}(k)|> \frac{\epsilon }{\sqrt{n}}}\right\}&\le 4 e^{-\frac{M\epsilon ^2 }{8nC^2}} , \\ {\mathbb {P}}\left\{ {|\eta _{M,L}(f_{\psi _k},f_{\psi _{k'}}) - A_{\infty ,L}(k,k')|>\frac{\epsilon }{\sqrt{n}}}\right\}&\le 2 e^{-\frac{M\epsilon ^2 }{2nC^2}}, \end{aligned}$$

where the constant C is obtained from (4.12). In combination with the basic facts that \(\Vert b\Vert \le \sqrt{n} \max _{k=1,\dots ,n} |b(k)|\) for any \(b\in {\mathbb {R}}^n\) and \(\Vert A\Vert \le \sqrt{n} \max _{k,k'=1,\dots ,n} |A(k,k')|\) for any \(A\in {\mathbb {R}}^{n\times n}\), we obtain

$$\begin{aligned} {\mathbb {P}}\left\{ {\left\| b_{M,L} - b_{\infty ,L} \right\|> \epsilon }\right\}&\le \sum _k {\mathbb {P}}\left\{ {|\xi _{M,L}(f_{\psi _k}) - b_{\infty ,L}(k)|> \frac{\epsilon }{\sqrt{n}}}\right\} \le 4 n e^{-\frac{M\epsilon ^2 }{8nC^2}} , \\ {\mathbb {P}}\left\{ {\left\| A_{M,L} - A_{\infty ,L} \right\|> \epsilon }\right\}&\le \sum _{k,k'} {\mathbb {P}}\left\{ {|\eta _{M,L}(f_{\psi _k},f_{\psi _{k'}}) - A_{\infty ,L}(k,k')|>\frac{\epsilon }{\sqrt{n}}}\right\} \\&\le 2 n^2 e^{-\frac{M\epsilon ^2 }{2nC^2}}. \end{aligned}$$

The third exponential inequality follows directly by combining the first two. \(\square \)

Corollary 4.2

Denote by \(\lambda _{\mathrm {min}}(A_{M,L})\) the smallest eigenvalue of the normal matrix \(A_{M,L}\) defined in (4.6). We have

$$\begin{aligned} {\mathbb {P}}\left\{ {\lambda _{\mathrm {min}}(A_{M,L})> c_{{\mathcal {H}}} -\epsilon }\right\} >1-\delta \end{aligned}$$

with \(\delta =2n^2 \exp {\left( -\frac{M\epsilon _1^2 }{2nc_1^2} \right) } \), for any \(\epsilon _1>0\) and any \({\varDelta }t = T/L\) such that \( \epsilon _1 + c_3 {\varDelta }t ^{\frac{1}{2}}= \epsilon < c_{{\mathcal {H}}} \), where \(c_1 \) and \(c_3\) are defined in (4.2).

Proof

Note that for any \(a\in {\mathbb {R}}^n\) such that \(\Vert a\Vert =1\), we have, by Corollary 4.1, \(a^T A_{\infty , L}a \ge c_{{\mathcal {H}}} - c_3 {\varDelta }t ^{\frac{1}{2}} \). Meanwhile, Proposition 4.2 implies that

$$\begin{aligned} |a^TA_{M,L} a - a^T A_{\infty , L}a |&\le \Vert A_{M,L} - A_{\infty , L}\Vert \le \epsilon _1 \end{aligned}$$

with probability at least \(1-\delta \). Thus,

$$\begin{aligned} a^TA_{M,L} a \ge a^T A_{\infty , L}a -\epsilon _1 \ge c_{ {\mathcal {H}}}- c_3 {\varDelta }t ^{\frac{1}{2}} -\epsilon _1 = c_{ {\mathcal {H}}} - \epsilon , \end{aligned}$$

and the corollary follows. \(\square \)

Remark 4

The above corollary requires \(\epsilon > c_3 {\varDelta }t ^{\frac{1}{2}}\). This condition can be removed if the coercivity condition holds for the discrete-time observations on \({\mathcal {H}}\) with a constant \(c_{{\mathcal {H}},T,L}\), which can be tested numerically from a data set with a large M. In fact, we obtain directly from the above proof that \( {\mathbb {P}}\left\{ {\lambda _{\mathrm {min}}(A_{M,L})> c_{{\mathcal {H}},T,L} -\epsilon }\right\} >1-\delta \) with \(\delta =2n^2 \exp {\left( -\frac{M\epsilon ^2 }{2nc_1^2} \right) } \), for any \(\epsilon >0\).

Remark 5

In practice, the minimum eigenvalue of \(A_\infty \) may be small due to the redundancy of the local basis functions or due to the coercivity constant on \({\mathcal {H}}\) being small. Thus, the smallest eigenvalue of \(A_{M,L}\) may be close to zero or even vanish. On the other hand, these matrices are always symmetric and positive semi-definite, so it is advisable to regularize the inversion, for instance by computing the least squares solution via the pseudo-inverse.
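For illustration, a minimal sketch of such a regularized solve is given below, assuming that the empirical normal matrix and vector have already been assembled from the data; the function and variable names are ours and not part of the paper's algorithm.

```python
import numpy as np

def solve_normal_equations(A, b, rcond=1e-8):
    """Solve A c = b for the coefficients of the estimated kernel on the basis of H.

    A is the (symmetric, positive semi-definite) empirical normal matrix and b the
    empirical normal vector. When A is ill-conditioned, e.g. because some local basis
    functions are barely supported by the data, the Moore-Penrose pseudo-inverse
    discards directions with singular values below rcond times the largest one.
    """
    A = 0.5 * (A + A.T)                      # guard against round-off asymmetry
    return np.linalg.pinv(A, rcond=rcond) @ b

# Hypothetical usage, with A_ML, b_ML assembled from trajectory data and `basis`
# a list of the functions psi_p:
# c_hat = solve_normal_equations(A_ML, b_ML)
# phi_hat = lambda r: sum(c * psi(r) for c, psi in zip(c_hat, basis))
```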

5 Examples and Numerical Simulation Results

In this section, we present numerical experiments to validate that our estimator, defined in (2.5) and implemented by Algorithm 1, behaves in practice as predicted by the theory. We consider two examples: a stochastic opinion dynamical system and a stochastic Lennard-Jones system, using observations from simulated data.

The setup for the numerical simulations is as follows. We simulate sample paths on the time interval [0, T] with the standard Euler–Maruyama scheme (see (2.3)), with a sufficiently small time step length \(\mathrm{d}t\). When observations are made at every time step, i.e., \({\varDelta }t = t_{l+1}-t_l = \mathrm{d}t\) for each l, we view \(\varvec{X}_{\mathrm {train},M}:=\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) as continuous-time trajectories. When observations are spaced in time with an observation gap \({\varDelta }t\) equal to an integer multiple of \(\mathrm{d}t\), we refer to them as discrete-time observations.
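For concreteness, the data-generation step can be sketched as follows; the helper names (drift, simulate_paths) and the vectorization choices are ours, not part of Algorithm 1, and phi is assumed to act elementwise on arrays of pairwise distances.

```python
import numpy as np

def drift(X, phi):
    """Pairwise-interaction drift of Eq. (1.1) for a single state X of shape (N, d)."""
    diff = X[None, :, :] - X[:, None, :]      # diff[i, j] = x_j - x_i
    r = np.linalg.norm(diff, axis=-1)         # pairwise distances, shape (N, N)
    np.fill_diagonal(r, 1.0)                  # dummy value: the i = j term is zero anyway
    w = phi(r)
    np.fill_diagonal(w, 0.0)
    return (w[:, :, None] * diff).sum(axis=1) / X.shape[0]

def simulate_paths(phi, X0, T, dt, sigma, rng):
    """Euler-Maruyama sample paths on [0, T]; X0 has shape (M, N, d)."""
    M, N, d = X0.shape
    L = int(round(T / dt))
    X = np.empty((M, L + 1, N, d))
    X[:, 0] = X0
    for l in range(L):
        dB = rng.normal(scale=np.sqrt(dt), size=(M, N, d))
        for m in range(M):
            X[m, l + 1] = X[m, l] + drift(X[m, l], phi) * dt + sigma * dB[m]
    return X
```

In this sketch, continuous-time observations correspond to keeping every time step of the returned array, while discrete-time observations with gap \({\varDelta }t = k\,\mathrm{d}t\) correspond to keeping every k-th step.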

From the observations, we construct the empirical probability measure \(\rho ^{L,M}_T\) (defined in (2.10)), and let \([R_{\min },R_{\max }]\) be its support. We choose the hypothesis spaces \({\mathcal {H}}\) consisting of piecewise constant or piecewise linear polynomials on interval-based partitions of \([R_{\min },R_{\max }]\). This choice is dictated by the ease of obtaining an orthonormal basis for \({\mathcal {H}}\), ease and efficiency of computation, and ability to capture local features of the interaction kernel. To avoid discontinuities at the extremes of the intervals in the partition and to reduce stiffness of the equations of the system with the estimated interaction kernels, we interpolate the estimator linearly on a fine grid and extrapolate it with a constant to the left of \(R_{\min }\) and the right of \(R_{\max }\). This post-processing procedure ensures the Lipschitz continuity of the estimators. We use the post-processed estimators to predict and generate the dynamics with the estimated interaction kernels.
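A minimal sketch of this post-processing step for a piecewise-constant estimator is shown below; it is our own illustration and assumes a uniform partition of \([R_{\min },R_{\max }]\), with c_hat the estimated coefficient vector.

```python
import numpy as np

def postprocess_pw_constant(c_hat, R_min, R_max, n_fine=2000):
    """Turn a piecewise-constant estimator (coefficients c_hat on a uniform partition
    of [R_min, R_max]) into a Lipschitz function: linear interpolation between the
    bin midpoints on a fine grid, constant extrapolation outside [R_min, R_max]."""
    n = len(c_hat)
    edges = np.linspace(R_min, R_max, n + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    r_fine = np.linspace(R_min, R_max, n_fine)
    phi_fine = np.interp(r_fine, mids, c_hat)     # np.interp clamps outside the range
    return lambda r: np.interp(r, r_fine, phi_fine)
```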

We mainly focus on the case where T is small and report on the results as follows:

  • Interaction kernel estimation. We compare \(\phi \) and \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\), the true and estimated interaction kernels (after smoothing), by plotting them side by side, superimposed with an estimate of \(\rho _T\), obtained as in (2.10) by using \(\smash {M_{\rho _T}}\) (\(\smash {M_{\rho _T}}\gg M\)) independent trajectories. The estimated kernel is plotted in terms of its mean and standard deviation, computed over 10 independent learning trials. To demonstrate the dependence of the estimator on the sample size and the scale of the random noise, we report the above for different values of M and \(\sigma \).

  • Trajectory prediction. In the spirit of Proposition 2.1, we compare the discrepancy between the true trajectories (evolved using \(\phi \)) and the predicted trajectories (evolved using \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\)) on both the training time interval [0, T] and a future time interval \([T, T_{f}]\), over two different sets of initial conditions: one taken from the training data, and one consisting of new samples from \(\mu _0\). When simulating the trajectories for the systems driven by \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\) using the EM scheme, we use the same initial conditions and the same realization of the random noise as in the trajectory of the system driven by \(\phi \) (see the sketch after this list). The mean trajectory error is estimated using M test trajectories (the same number as in the training data).

  • Rate of convergence. We report the convergence rate of \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\) to \(\phi \) in the \({\left| \left| \left| \cdot \right| \right| \right| }\) norm on \(\smash {L^2(\rho _T)}\) as the sample size M increases, with the dimension of \({\mathcal {H}}\) growing with M according to Theorem 3.2, for different scales \(\sigma \) of the random noise. We also investigate numerically the convergence rate when both T and M increase, with the dimension of the hypothesis space \({\mathcal {H}}\) set according to the effective sample size as discussed in Sect. 2.2.

  • Discretization errors from discrete-time observations. To study the discretization error due to discrete-time observations, we report the convergence rate (in M) of estimators \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) obtained from data with different observation gaps \({\varDelta }t=T/L\). We also verify numerically that the \({\left| \left| \left| \cdot \right| \right| \right| }\) error of the estimators increases with \({\varDelta }t\) as predicted by Theorem 4.2. These experiments are carried out for different values of the square root of the diffusion constant \(\sigma \).
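The coupling of noise realizations used for trajectory prediction (second item above) can be sketched as follows, reusing the simulate_paths helper from the earlier sketch; the sup-norm discrepancy used here is merely an illustrative choice of error measure.

```python
import numpy as np

def compare_trajectories(phi_true, phi_hat, X0, T, dt, sigma, seed=0):
    """Evolve the system with the true and the estimated kernel using the SAME
    initial conditions and the SAME Brownian increments, then return the largest
    pointwise discrepancy over all paths, times, and particles."""
    rng_true = np.random.default_rng(seed)
    rng_hat = np.random.default_rng(seed)          # identical noise realization
    X_true = simulate_paths(phi_true, X0, T, dt, sigma, rng_true)
    X_hat = simulate_paths(phi_hat, X0, T, dt, sigma, rng_hat)
    return np.abs(X_true - X_hat).max()
```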

We will report the conclusions of our experiments in Sect. 5.3.

5.1 Example 1: Stochastic Opinion Dynamics

We first consider a 1D system of stochastic opinion dynamics with interaction kernel

$$\begin{aligned} \phi (r) = \left\{ \begin{array}{ll} 0.4, & \quad 0 \le r< \frac{1}{\sqrt{2}}-0.05, \\ -0.3\cos (10\pi (r-\frac{1}{\sqrt{2}}+0.05)) + 0.7, & \quad \frac{1}{\sqrt{2}}-0.05 \le r< \frac{1}{\sqrt{2}}+0.05, \\ 1, & \quad \frac{1}{\sqrt{2}}+0.05 \le r<0.95,\\ 0.5 \cos (10\pi (r -0.95)) + 0.5, & \quad 0.95 \le r < 1.05,\\ 0,& \quad 1.05 \le r. \end{array} \right. \end{aligned}$$

It is straightforward to see that \(\phi \) is in \(C_c^{1,1}([0,2])\) and nonnegative. Systems of this form are motivated by various applications, from biology to social science, where \(\phi \) models how the opinions of people influence each other (see [7, 11, 18, 33, 43] and references therein), and where one or several consensuses may eventually be reached. In the system we consider, each agent tries to align its opinion more with its farther neighbors than with its closer neighbors: such interactions are called heterophilious. For deterministic systems of this type, [43] shows that the opinions of agents merge into clusters, with the number of clusters significantly smaller than the number of agents. This is natural, as increased alignment with farther neighbors increases mixing and consensus. In our stochastic setting, the random noise prevents the opinions from converging to single values. Instead, soft clusters form at large time; these are metastable states for the dynamics, i.e., states in which agents dwell for long times, rarely switching between them.
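For reference, the kernel above can be coded directly (our own illustrative implementation); one can check that the pieces match in value and in first derivative at the break points.

```python
import numpy as np

def phi_opinion(r):
    """Heterophilious opinion-dynamics kernel defined above; C^1 and supported on [0, 1.05]."""
    r = np.asarray(r, dtype=float)
    c = 1.0 / np.sqrt(2.0)
    conds = [r < c - 0.05,
             (r >= c - 0.05) & (r < c + 0.05),
             (r >= c + 0.05) & (r < 0.95),
             (r >= 0.95) & (r < 1.05)]
    funcs = [0.4,
             lambda s: -0.3 * np.cos(10 * np.pi * (s - c + 0.05)) + 0.7,
             1.0,
             lambda s: 0.5 * np.cos(10 * np.pi * (s - 0.95)) + 0.5,
             0.0]                                  # default branch: r >= 1.05
    return np.piecewise(r, conds, funcs)
```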

Table 3 (OD) Parameters for the system

We study the performance of our estimators of the interaction kernel from trajectory data. Table 3 summarizes the parameters of the setup. In this example, we choose \({\mathcal {H}}_{n_M}\) to be the space of piecewise constant functions on a uniform partition of the interval [0, 10] into \(n_M\) subintervals.

Figure 2 shows that, as the number of trajectories increases, we obtain increasingly accurate approximations to the true interaction kernel, including at locations with sharp transitions of \(\phi \). The lack of artifacts at these locations is an advantage provided by the use of local basis functions. The estimators oscillate near 0, with amplitudes scaling with the level of noise. We believe the reason for this phenomenon is that, due to the structure of the equations, terms of the form \(\phi (0)\mathbf {0} = \mathbf {0}\) appear at, and near, 0, with a subsequent loss of information about the interaction kernel near 0.

We then use the learned interaction kernels \({\widehat{\phi }}\) in Fig. 2 to predict the dynamics and summarize the results in Fig. 3 and Table 4. Even with \(M=32\), our estimator produces very accurate approximations of the true trajectories both in the training time interval [0, 5] and in the future time interval [5, 50], including the number and location of clusters and the time of their formation. As M increases to 4096, we obtain more accurate predictions of the locations of the clusters. We attribute this improvement to the better reconstruction of the estimator at locations near 0.

Fig. 2

Stochastic opinion dynamics: comparison between true and estimated interaction kernels \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\) for different values of M and \(\sigma \), together with histograms (shaded regions) for \(\rho _T\) and \(\rho _T^M\). In black: the true interaction kernel. In blue: the mean of the estimators over 10 independent trials, with dashed lines representing the standard deviation. From top to bottom: learning from \(M=2^5,2^{12}\) trajectories for kernels in systems with \(\sigma =0.1\) (left) and \(\sigma =0.5\) (right). The standard deviation bars on the estimated interaction kernels become smaller as M increases and \(\sigma \) decreases. The mean of the estimation error can be found in Fig. 4a (color figure online)

Next we investigate the convergence rate of estimators. It is well known in approximation theory (see Theorem 6.1 in [48]) that \(\inf _{\varphi \in {\mathcal {H}}_n}\Vert \varphi -\phi \Vert _{\infty } \le \text {Lip}[\phi ]n^{-1}\). With the dimension n being proportional to \(\smash {(\frac{M}{\log M})^{\frac{1}{3}}}\), Fig. 4 shows that the learning rate in terms of M is around \(M^{-0.34}\), which matches the optimal min–max rate \(M^{-\frac{1}{3}}\) stated in Theorem 3.2 with \(s=1\).

We also study the convergence of the estimator as the length of the trajectory T increases, for the estimator \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\) from continuous-time trajectories (i.e., without gaps between observations). The auto-correlation time for this system is estimated to be about \(\tau = 10\) time units. Therefore, we use relatively long trajectories up to \(T=1500\) time units to test the convergence, contributing up to about 150 effective samples. We set the dimension of the hypothesis space to be \(n=4(\frac{MT/\mathrm{d}t}{\log (MT/\mathrm{d}t)})^{\frac{1}{3}} \) for each pair (MT), where \(\mathrm{d}t\) is the time step size of the Euler–Maruyama scheme. The convergence rate of the estimators in terms of MT is about 0.33, showing the equivalence of learning from a single long trajectory with multiple short trajectories when the underlying process is ergodic.
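The reported rates are obtained, as usual, from a least squares fit on the log–log scale; a minimal sketch of such a fit (with hypothetical arrays Ms and mean_errors) is:

```python
import numpy as np

def fit_rate(Ms, errors):
    """Fit errors ~ C * Ms**(-rate) by least squares in log-log coordinates
    and return the estimated rate (the negated slope)."""
    slope, _ = np.polyfit(np.log(Ms), np.log(errors), 1)
    return -slope

# Hypothetical usage, with mean_errors the L2(rho_T) errors averaged over 10 trials:
# Ms = np.array([2**5, 2**7, 2**9, 2**11])
# rate = fit_rate(Ms, mean_errors)     # expected to be close to 1/3 for s = 1
```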

Fig. 3

Stochastic opinion dynamics: trajectory prediction. In each panel: \(\varvec{X}_t\) (left column) and \({\widehat{\varvec{X}}}_t\) (right column) obtained with the true kernel \(\phi \) and the estimated interaction kernel \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\) from \(M=32\) (top panel) and 4096 (bottom panel) trajectories, for an initial condition in the training data (top row in each panel) and a (new) initial condition randomly chosen from \(\mu _0\) (bottom row in each panel). The black dashed vertical line at \(t = T = 5\) divides the “training” interval [0, T] from the “prediction” interval [5, 50]. As M increases, our estimators achieve a better approximation of the true kernel overall, and in regions near 0 (see Fig. 2). As a result, they produce more faithful predictions of the number and location of clusters at large time. Statistics of trajectory prediction errors are reported in Table 4

Table 4 Stochastic opinion dynamics: means and standard deviations of trajectory prediction errors

We also investigate the effects of the scale of the random noise, represented by \(\sigma \). Figure 2 shows that the estimators for the system with \(\sigma =0.5\) have much larger oscillations than those with \(\sigma =0.1\). The left plot in Fig. 4 shows that the scale of the random noise does not affect the learning rate, matching our theory. We also see that the absolute \(L^2(\rho _T)\) error of the estimators increases as the system noise increases; this may indicate that the coercivity constant decreases as the level of noise in the system increases. The right plot in Fig. 5 shows that the scale of the errors increases linearly in \(\sigma \) (in particular, when the observation gap is 1).

Fig. 4

Stochastic opinion dynamics: learning rates for continuous-time observations. Left: the convergence rate of the estimators in terms of M is 0.35 for \(\sigma = 0.5\) and 0.33 for \(\sigma = 0.1\), close to the theoretical optimal min–max rate 1/3 (shown as the black dotted line). Right: the convergence rate of the estimators in terms of MT, when both M and T increase, is about 0.33. The colors of the points are assigned according to M. The learning rate is still close to the theoretical optimal min–max rate 1/3, showing the equivalence of learning from a single long trajectory with learning from multiple short trajectories when the underlying process is ergodic (color figure online)

Fig. 5

Stochastic opinion dynamics: discretization error due to discrete-time observations. Left: the learning rates of the estimators \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) obtained from data with different observation gaps \({\varDelta }t= k \mathrm{d}t\), for k ranging from 11 to 100. Recall that \(L=T/{\varDelta }t\). As k increases, the learning rate curves become flat, due to the bias induced by the discretization of the likelihood function (2.1) on coarse time grids. Right: the log–log plot of the absolute error of the estimator in terms of the observation gap \({\varDelta }t= k \mathrm{d}t\), for k ranging from 1 to 100, for systems with different levels of random noise in terms of \(\sigma \), computed with \(M=1024\), \(T=5\) and \(\mathrm{d}t = 0.01\) fixed. The orders of the absolute error in both \(\sigma \) and \({\varDelta }t\) are bounded by the theoretical order \(\sigma O(({\varDelta }t)^{1/2})\), which dominates the statistical error due to sampling, finite-dimensional approximation, and noise. The slopes of the lines are calculated using points whose x coordinates fall in the range \([-1,0]\)

Finally, we study the discretization error due to the approximation of the integral in the likelihood using discrete-time observations. In the left plot of Fig. 5, as the observation gap \({\varDelta }t = k\mathrm{d}t\) increases, the learning rate curves become flat, due to the error induced by the discretization of the likelihood function (2.1). The right plot shows that the absolute error of the estimator is dominated by \(\sigma O(({\varDelta }t)^{1/2})\).

5.2 Example 2: Stochastic Lennard-Jones Dynamics

In this example, we consider the Lennard-Jones-type kernel \(\phi (r)=\frac{{\varPhi }'(r)}{r}\), with

$$\begin{aligned} {\varPhi }(r) = \frac{p\epsilon }{(p-q)}\left[ \frac{q}{p}\left( \frac{r_m}{r}\right) ^{p}-\left( \frac{r_m}{r}\right) ^{q} \right] \end{aligned}$$

for some \(p>q \in {\mathbb {N}}\). The system of particles is assumed to be associated with a potential energy function depending only on the pairwise distances through \({\varPhi }\), and the evolution is driven by the minimization of this energy function. In particular, \(\epsilon \) represents the depth of the potential well, r is the distance between the particles, and \(r_m\) is the distance at which the potential reaches its minimum; at \(r_m\), the potential has the value \(-\epsilon \). The \(r^{-p}\) term is the repulsive term, describing Pauli repulsion at short range due to overlapping electron orbitals, while the \(r^{-q}\) term is the attractive long-range term. The corresponding system has wide applications in molecular dynamics and material sciences, where \(\phi \) models atom–atom interactions. Note that \(\phi \) is singular at \(r=0\): we truncate it at \(r_{\text {trunc}}\) by connecting it with an exponential function of the form \(a\exp (-br^{12})\), so that it has a continuous derivative on \({\mathbb {R}}^+\).
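A sketch of one possible \(C^1\) truncation is shown below; it is our own construction, with a hypothetical default value for r_trunc, and the constants a, b are chosen so that the value and first derivative of the exponential piece match those of \(\phi \) at \(r_{\text {trunc}}\).

```python
import numpy as np

def make_truncated_lj_kernel(eps, r_m, p=12, q=6, r_trunc=0.8):
    """Lennard-Jones-type kernel phi(r) = Phi'(r)/r, replaced on (0, r_trunc] by
    a * exp(-b * r**12) with a, b chosen for a C^1 match at r_trunc."""
    C = p * q * eps / (p - q)

    def phi_raw(r):                               # phi(r) = Phi'(r)/r for r > 0
        return C * (r_m**q * r**(-q - 2) - r_m**p * r**(-p - 2))

    def dphi_raw(r):
        return C * (-(q + 2) * r_m**q * r**(-q - 3) + (p + 2) * r_m**p * r**(-p - 3))

    # Matching conditions: g(r) = a*exp(-b*r**12) satisfies g'(r) = -12*b*r**11*g(r).
    v, dv = phi_raw(r_trunc), dphi_raw(r_trunc)
    b = -dv / (12.0 * r_trunc**11 * v)
    a = v * np.exp(b * r_trunc**12)

    def phi(r):
        r = np.asarray(r, dtype=float)
        return np.where(r <= r_trunc,
                        a * np.exp(-b * r**12),
                        phi_raw(np.maximum(r, r_trunc)))   # maximum avoids 1/0 near r = 0
    return phi
```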

In this system, the particle–particle interactions are all short-range repulsions and long-range attractions. The short-range repulsion prevents the particles from colliding, while the long-range attraction keeps the particles in a flock. In the deterministic setting, the system evolves very quickly to equilibrium configurations, which are crystal-like structures whose pairwise distances correspond to local minimizers of the associated energy function. Tables 5 and 6 summarize the system and learning parameters.

Note that the true kernel \(\phi \) is not compactly supported. In our simulations, however, we observe the dynamics up to a time T which is a fraction of the equilibrium time. Since the particles only explore a bounded region due to the long-range attraction, \(\rho _T\) is essentially compactly supported on a bounded region (see the histogram background of Fig. 6), on which \(\phi \) belongs to our admissible space.

We use piecewise linear functions on a uniform partition of the learning interval into n subintervals to approximate the true kernel \(\phi \). With \(M=32\), Fig. 6 shows that we already obtain faithful approximations to the true interaction kernel, except on regions close to 0. Increasing the number of observations improves the accuracy of the estimators at locations near 0, which appears to be particularly helpful for the system with the larger noise level.

For trajectory prediction, we use the learned interaction kernels \({\widehat{\phi }}\) in Fig. 6 and summarize the results in Fig. 7 and Table 7. In the experiments, we study two cases: one with small random noise, where the particles still form an equilibrium configuration that has small fluctuations in space; and one with a medium level of random noise, where the noise begins to break the formation of a fixed equilibrium configuration and we observe transitions between different configurations. In both cases, our estimators produce good predictions of the true dynamics on both the training and the future time intervals.

Table 5 (Stochastic LJ) Parameters for the Lennard-Jones kernel
Table 6 (Stochastic LJ) Parameters for the system

We plot the convergence rate of the estimators in terms of M in the left plot of Fig. 8. In this case, we have \(\inf _{\varphi \in {\mathcal {H}}_n}\Vert \varphi -\phi \Vert _{\infty } \le \text {Lip}[\phi ']n^{-2}\). Choosing the dimension n proportional to \((\frac{M}{\log M})^{\frac{1}{5}}\), our numerical results show that the learning rate is around \(M^{-0.39}\), which matches the optimal min–max rate \(M^{-\frac{2}{5}}\) stated in Theorem 3.2 with \(s=2\).

We also study the convergence of the estimators as the length of the trajectory T increases. In this example with \(\sigma = 0.35\), the estimated auto-correlation time is about \(\tau = 10\) time units. Therefore, we use relatively long trajectories up to \(T=1200\) time units, contributing up to about 120 effective samples. We set the dimension of the hypothesis space to be \(n=4(\frac{MT/\mathrm{d}t}{\log (MT/\mathrm{d}t)})^{\frac{1}{5}} \) for each pair (MT), where \(\mathrm{d}t\) is the time step size of the Euler–Maruyama scheme. The right plot of Fig. 8 shows that the rate is 0.39, indicating the equivalence between a single long trajectory and multiple short trajectories for inference.

Fig. 6

Stochastic Lennard-Jones dynamics: comparison between true and estimated interaction kernels for different values of M, together with histograms (shaded regions) for \(\rho _T\) and \(\rho _T^{M}\). In black: the true interaction kernel. In blue: the mean of the estimators over 10 independent trials, with dashed lines representing the standard deviation. From top to bottom: learning from \(M=2^7,2^{10}\) trajectories for kernels in systems with \(\sigma =0.05\) (left) and \(\sigma =0.25\) (right). The standard deviation bars on the estimated interaction kernels become smaller as M increases and \(\sigma \) decreases. More details on the estimation errors can be found in Fig. 8a (color figure online)

Fig. 7

(Stochastic Lennard-Jones dynamics) In each panel: true trajectory \(\varvec{X}_t\) (left column) and learned trajectory \({\widehat{\varvec{X}}}_t\) (right column) obtained with the true kernel \(\phi \) and the estimated kernel \({\widehat{\phi }}\) from \(M=128\) and 1024 trajectories, for an initial condition in the training data (top row) and an initial condition randomly chosen (bottom row). The black dot at \(t = 0.5\) divides the “training” interval [0, 0.5] from the “prediction” interval [0.5, 20]. The trajectory prediction errors are small in all cases. The statistics of the errors are presented in Table 7

Table 7 (Stochastic Lennard-Jones dynamics) trajectory errors: ICs used in the training set (first two rows), new ICs randomly drawn from \(\mu _0\) (second set of two rows)
Fig. 8

Stochastic Lennard-Jones: learning rates for continuous-time observations. Left: the learning rate of the estimators in terms of M is 0.39 when \(\sigma = 0.05\) and 0.41 when \(\sigma = 0.25\), close to the theoretical optimal min–max rate 2/5 (shown as the black dotted line). Right: the convergence rate of the estimators in terms of MT, when both M and T increase. The colors of the points are assigned according to M. The rate is still close to the optimal min–max rate 2/5, showing the equivalence of learning from a single long trajectory with learning from multiple short trajectories when the underlying process is ergodic (color figure online)

Fig. 9

Stochastic Lennard-Jones: discretization error due to discrete-time observations. Left: the learning rate of the estimators for different observation gaps \({\varDelta }t= k \mathrm{d}t\), for \(k=5, 10, \ldots , 30\). The learning rate becomes flat, due to the bias induced by the discretization of the likelihood function on coarse time grids. Right: the log–log plot of the absolute error of the estimator in terms of the observation gap \({\varDelta }t= k \mathrm{d}t\), for \(k=5, 10, \ldots , 45\), for systems with different levels of random noise in terms of \(\sigma \), computed with \(M=1024\), \(T=0.5\) and \(\mathrm{d}t = 0.001\) fixed. The orders of the absolute error in both \(\sigma \) and \({\varDelta }t\) are close to the theoretical order \(\sigma O(({\varDelta }t)^{1/2})\). The slopes of the lines are calculated using points whose x coordinates fall in the range \([-1,0]\)

Next, we investigate the effects of the scale of the random noise on learning. We observe phenomena similar to those in Example 1. Figure 6 shows that the estimators for the system with \(\sigma =0.25\) oscillate more than those for the system with \(\sigma =0.05\) at locations near 0. The scale of the random noise also does not affect the learning rates, as suggested by the left plot of Fig. 8. As the random noise increases, the absolute \(L^2(\rho _T)\) error of the estimators also increases, suggesting that the coercivity constant becomes smaller.

Lastly, we study the effects of the discretization error induced by discrete-time observations. As the observation gap increases, the discretization errors flatten the learning rate curve in M; see the left plot of Fig. 9. Similarly to Example 1, the right plot of Fig. 9 shows that the absolute error of the estimator is of order close to the theoretical order \(\sigma O(({\varDelta }t)^{1/2})\).

5.3 Conclusions from the Numerical Experiments

Numerical results show that, in the case of continuous-time observations, the algorithm effectively estimates the interaction kernel, achieves the near-optimal learning rate in M, is robust to different magnitudes of the random noise, and the system with the estimated kernels accurately predicts trajectories. In the case of discrete-time observations, the estimator has an estimation error of order \({\varDelta }t^{1/2}\), due to the discretization error in the approximation of the likelihood ratio. These numerical results are in full agreement with the learning theory in Sects. 3 and 4:

  • In case of continuous-time observations, the estimators in 10 trials are faithful approximations of the true interaction kernels, with a mean close to the truth. The standard deviation of the estimators decreases as the sample size increases and gets larger as the diffusion constant increases.

  • The estimator from data achieves the min–max learning rate \((\log {M}/ M)^{s/(2s+1)}\) in Theorem 3.2 by an appropriate choice of the hypothesis spaces and of their dimension as a function of M. For \(\phi \) in \(C^{k+\alpha }\) with \(k+\alpha \ge 2\), the learning rate is around \(M^{-\frac{1}{3}}\) when using piecewise constant estimators (\(s=1\)) and around \(M^{-\frac{2}{5}}\) when using piecewise linear estimators (\(s=2\)), which is the min–max optimal rate for the case \(k+\alpha =2\).

  • The estimators predict the transient dynamics well in the training time interval, and the results validate Proposition 2.1: the trajectory discrepancy is controlled by the \(L^2({\rho _T})\) error of the estimators, demonstrating the effectiveness of the \(L^2(\rho _T)\) distance in quantifying the performance of estimators. In addition, the estimators predict in a remarkably accurate fashion the collective behavior of the particles even on larger future time intervals, indicating that the bound in Proposition 2.1 may be overly pessimistic in some cases. Our intuition is that this benign phenomenon benefits from the large support of \(\rho _T\), encouraged by the randomness of the initial conditions and the presence of stochastic noise.

  • In case of discrete-time observations with observation gap \({\varDelta }t\), the estimation error of the estimator is of order \({\varDelta }t^{1/2}\) and depends linearly on \(\sigma \), the square root of the diffusion constant. Therefore, as \({\varDelta }t\) increases, the discretization error dominates the estimation error, consistently with the learning theory in Sect. 4, which leads to bounding the estimation error of the estimator by \(M^{-\frac{s}{2s+1}}+\sigma {\mathcal {O}}({\varDelta }t^{1/2})\).

  • When the length T of the trajectories increases, the optimal learning rate (in M) is still achieved. The estimation errors of the estimator exhibit a convergence rate around \((\frac{\log (MT)}{MT})^{s/(2s+1)}\) with \(s=1,2\) respectively, demonstrating an equivalence of “information” between few long trajectories and many short trajectories initiated at suitably random initial conditions, as discussed above in Sect. 2.3.

6 Final Remarks and Future Work

There are many avenues along which the present work could be extended.

The first notable extension is to heterogeneous particle systems with multiple types of particles, which arise in many applications. In this case, one assumes that there are different interaction kernels, modeling the non-symmetric interactions between different types of particles. Examples of these systems are considered in [41] in the deterministic case, with the theoretical analysis achieved in [40], where the coercivity condition is generalized to the multiple-particle-types setting, and (near-)optimal convergence rates of the estimators were established. We believe a similar extension is possible in the stochastic case, combining the ideas of this work and [40].

Another notable extension is to second-order differential systems of interacting particles or systems with possible external potentials, where interaction kernels of more general forms than those considered here arise. In the deterministic case, [41] considers examples of such systems, with a forthcoming theoretical analysis. In the stochastic case, the extension would require significant effort, especially if important cases of systems with degenerate diffusion (e.g., stochastic Langevin) were considered. We also remark again that in this work we do not observe velocities, as done in the works just cited in the case of deterministic systems: here we fully take into account the discretization (in time) error, and if we let \(\sigma \rightarrow 0\), the results here would imply similar results in the deterministic case. Extending these considerations to second-order systems would be valuable.

Further work is also needed to formalize the considerations we put forward in Sect. 2.3 regarding ergodic systems, and to design robust and optimal algorithms in the regimes of observing a long trajectory or many independent trajectories.

We assume in this work that all particles are observed. A desirable extension is to the case of partial observations of a subset of particles or macroscopic observations of the population density, which is a practical concern when the system is large with millions of particles in high dimension. Since it is an ill-posed inverse problem to recover the missing trajectories of unobserved particles [54], a new formulation based on the corresponding mean field equations [27, 28, 43] is under investigation.

In this work, we assume that the noise coefficient is a known constant. There has, of course, been significant work on estimating the noise coefficient: see, for example, the recent work [26] and references therein for interacting particle systems, and [19] for model reduction of Langevin equations with state-dependent diffusion coefficients.