1 Introduction

The classical algorithm of Wynn (1970) for D-optimal design in linear regression models has motivated a particular scheme for sequential adaptive design in nonlinear regression models; see Freise (2016), Pronzato (2010) and Freise et al. (2019). We refer to this scheme as ‘the adaptive Wynn algorithm’. In a previous paper of the authors (Freise et al. 2019) the asymptotics of the sequences of designs and maximum likelihood estimators under the adaptive Wynn algorithm was studied for the important class of generalized linear models with univariate response. In the present paper the asymptotics of least squares estimators (LSEs) under the adaptive Wynn algorithm is studied, firstly, for the class of nonlinear models satisfying a condition of ‘saturated identifiability’ and, secondly, for a class of generalized linear models. As a main result, strong consistency of the adaptive LSEs is shown and, as a consequence, the asymptotic D-optimality of the generated design sequence is obtained (almost surely). Here ‘D-optimality’ means local D-optimality at the true parameter point. Moreover, asymptotic normality of the adaptive LSEs is obtained, where the asymptotic covariance matrix is given by the inverse of the locally D-optimal information matrix at the true parameter point. This shows, in particular, that compared to the nonadaptive LSEs under any fixed design, the adaptive LSEs from the adaptive Wynn algorithm are asymptotically more efficient in the sense of a smaller determinant of the asymptotic covariance matrix, unless, of course, a fixed design is used which is locally D-optimal at the true parameter point. However, the true parameter point is unknown, which prevents the use of that design. In contrast to the classical concept of a fixed design, the sequential adaptive method provided by the adaptive Wynn algorithm is not affected by the unknown parameter point: asymptotically the true parameter point can be identified and the adaptive designs become D-optimal.

Note that other, more practically motivated adaptive procedures for design and estimation restrict themselves to a finite number of adaptation stages, e.g. only two stages as in Dette et al. (2013) and Lane et al. (2014). Those papers addressed problems of asymptotic efficiency of adaptive maximum likelihood estimators for particular two-stage adaptive procedures, when the sample sizes of the stages go to infinity. The present paper, however, is exclusively concerned with the adaptive Wynn algorithm, which is an infinite-stage adaptive procedure.

Next we give an outline of our framework and the adaptive Wynn algorithm. Consider a nonlinear regression model with a real-valued mean response \(\mu (x,\theta )\), \(x\in \mathcal{X}\), \(\theta \in \varTheta \), where \(\mathcal{X}\) and \(\varTheta \) are the experimental region and the parameter space, respectively. Suppose that a family of \({\mathbb {R}}^p\)-valued functions \(f_\theta \), \(\theta \in \varTheta \), defined on \(\mathcal{X}\) has been identified such that the \(p\times p\) matrix \(f_\theta (x)\,f_\theta ^\mathsf{T}(x)\) is the elementary information matrix of \(x\in \mathcal{X}\) at \(\theta \in \varTheta \). Note that a vector \(a\in {\mathbb {R}}^p\) is written as a column vector and \(a^\mathsf{T}\) denotes its transpose, which is thus a p-dimensional row vector. A design \(\xi \) is a probability measure on \(\mathcal{X}\) with finite support. That is, \(\xi \) is described by its support, denoted by \(\mathrm{supp}(\xi )\), which is a nonempty finite subset of \(\mathcal{X}\), and by its weights \(\xi (x)\) for \(x\in \mathrm{supp}(\xi )\), which are positive real numbers with \(\sum _{x\in \mathrm{supp}(\xi )}\xi (x)\,=1\). The information matrix of a design \(\xi \) at \(\theta \in \varTheta \) is defined by

$$\begin{aligned} M(\xi ,\theta )\,=\,\sum _{x\in \mathrm{supp}(\xi )}\xi (x)\,f_\theta (x)\,f_\theta ^\mathsf{T}(x), \end{aligned}$$
(1.1)

which is a nonnegative definite \(p\times p\) matrix.
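As a concrete illustration of (1.1), the following minimal sketch (in Python with NumPy; the particular function f and the three-point design below are hypothetical choices, not part of the general framework) assembles the information matrix from the support points and weights of a design.

```python
import numpy as np

def information_matrix(support, weights, f_theta):
    """Information matrix M(xi, theta) of a design, cf. Eq. (1.1).

    support  : list of design points x in the experimental region
    weights  : positive weights xi(x) summing to one
    f_theta  : function x -> vector in R^p, i.e. f_theta(x) for a fixed theta
    """
    p = len(np.atleast_1d(f_theta(support[0])))
    M = np.zeros((p, p))
    for x, w in zip(support, weights):
        v = np.asarray(f_theta(x), dtype=float)
        M += w * np.outer(v, v)  # weighted elementary information matrix
    return M

# Hypothetical example with p = 2: f_theta(x) = (1, x)^T on three support points
M = information_matrix([0.0, 0.5, 1.0], [1/3, 1/3, 1/3], lambda x: np.array([1.0, x]))
print(M)  # a nonnegative definite 2 x 2 matrix
```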

In applications the family \(f_\theta \), \(\theta \in \varTheta \), will be related to the mean response \(\mu (x,\theta )\), \(x\in \mathcal{X}\), \(\theta \in \varTheta \). Usually, \(\varTheta \subseteq {\mathbb {R}}^p\) and \(f_\theta (x)\) is given by the gradient of \(\mu (x,\theta )\) w.r.t. \(\theta \) for each fixed x, or by a scalar multiple of that gradient where the scalar factor is a function of \(\theta \) and x, cf. Atkinson et al. (2014), Lemma 1. For particular classes of models described below, consistency of the adaptive LSEs will be achieved without any relation between the family \(f_\theta \), \(\theta \in \varTheta \), and the mean response \(\mu (x,\theta )\), \(x\in \mathcal{X}\), \(\theta \in \varTheta \), whereas asymptotic normality of the adaptive LSEs will require the gradient relation with scalar factor equal to 1. Throughout we assume the following basic conditions (B1)–(B4).

(B1) The experimental region \(\mathcal{X}\) is a compact metric space.

(B2) The parameter space \(\varTheta \) is a compact metric space.

(B3) The real-valued mean response function \((x,\theta )\mapsto \mu (x,\theta )\), defined on the Cartesian product space \(\mathcal{X}\times \varTheta \), is continuous.

(B4) The family \(f_\theta \), \(\theta \in \varTheta \), of \({\mathbb {R}}^p\)-valued functions on \(\mathcal{X}\) satisfies:

(i) for each \(\theta \in \varTheta \) the image \(f_\theta (\mathcal{X})\) spans \({\mathbb {R}}^p\);

(ii) the function \((x,\theta )\mapsto f_\theta (x)\) is continuous on \(\mathcal{X}\times \varTheta \).

By \({\mathbb {N}}\) and \({\mathbb {N}}_0\) we denote the set of all positive integers and the set of all nonnegative integers, respectively. By \(\delta _x\) for any \(x\in \mathcal{X}\) we denote the one-point probability distribution on \(\mathcal{X}\) concentrated at the point x. The adaptive Wynn algorithm iteratively collects design points \(x_i\in \mathcal{X}\), \(i\in {\mathbb {N}}\), while adaptively estimating \(\theta \) on the basis of the current design points and the observed (real-valued) responses at those points. In greater detail the algorithm reads as follows.

Adaptive Wynn algorithm

(o) Initialization

A positive integer \(n_{\mathrm{st}}\in {\mathbb {N}}\) and design points \(x_1,\ldots ,x_{n_{\tiny \mathrm{st}}}\in \mathcal{X}\) are chosen such that the starting design \(\xi _{n_{\tiny \mathrm{st}}}=\frac{1}{n_{\mathrm{st}}}\sum _{i=1}^{n_{\tiny \mathrm{st}}} \delta _{x_i}\) has positive definite information matrices, i.e., for all \(\theta \in \varTheta \) the information matrix \(M(\xi _{n_{\tiny \mathrm{st}}},\theta )\) is positive definite. Observed responses \(y_1,\ldots ,y_{n_{\tiny \mathrm{st}}}\in {\mathbb {R}}\) at the design points \(x_1,\ldots ,x_{n_{\tiny \mathrm{st}}}\) are taken, and an initial parameter estimate \(\theta _{n_{\tiny \mathrm{st}}}\in \varTheta \) is calculated,

$$\begin{aligned} \theta _{n_{\tiny \mathrm{st}}}={\widehat{\theta }}_{n_{\tiny \mathrm{st}}}(x_1,y_1,\ldots ,x_{n_{\tiny \mathrm{st}}},y_{n_{\tiny \mathrm{st}}})\,\in \varTheta . \end{aligned}$$

(i) Iteration

At stage \(n\ge n_{\mathrm{st}}\) the current data is given by the points \(x_1,\ldots ,x_n\in \mathcal{X}\) which form the design \(\xi _n=\frac{1}{n}\sum _{i=1}^n\delta _{x_i}\), and by the observed responses \(y_1,\ldots ,y_n\in {\mathbb {R}}\) at \(x_1,\ldots ,x_n\), respectively, along with a parameter estimate \(\theta _n\in \varTheta \),

$$\begin{aligned} \theta _n={\widehat{\theta }}_n(x_1,y_1,\ldots ,x_n,y_n)\,\in \varTheta . \end{aligned}$$
(1.2)

The iteration rule is given by

$$\begin{aligned} x_{n+1}\,=\,\arg \ \max _{x\in \mathcal{X}}f_{\theta _n}^\mathsf{T}(x)\,M^{-1}(\xi _n,\theta _n)\, f_{\theta _n}(x). \end{aligned}$$
(1.3)

An observation \(y_{n+1}\in {\mathbb {R}}\) of the response at \(x_{n+1}\) is taken and a new parameter estimate \(\theta _{n+1}\) based on the augmented data is computed,

$$\begin{aligned} \theta _{n+1}={\widehat{\theta }}_{n+1}(x_1,y_1,\ldots ,x_n,y_n,x_{n+1},y_{n+1})\,\in \varTheta . \end{aligned}$$

Replace n by \(n+1\) and repeat the iteration step (i). \(\square \)

Of course, Eq. (1.3) requires the information matrix \(M(\xi _n,\theta _n)\) to be positive definite at each stage \(n\ge n_{\mathrm{st}}\). In fact, this is ensured by the choice of the initial design \(\xi _{n_{\tiny \mathrm{st}}}\) since, obviously, the sequence of designs \(\xi _n\), \(n\ge n_{\mathrm{st}}\), satisfies

$$\begin{aligned}&\xi _{n+1}\,=\,\frac{n}{n+1}\,\xi _n\,+\,\frac{1}{n+1}\,\delta _{x_{n+1}}, \end{aligned}$$
(1.4)
$$\begin{aligned}&M(\xi _{n+1},\theta )\,=\,\frac{n}{n+1}\,M(\xi _n,\theta )\,+\,\frac{1}{n+1}\, f_\theta (x_{n+1})\,f_\theta ^\mathsf{T}(x_{n+1}), \quad \theta \in \varTheta , \end{aligned}$$
(1.5)

from which one concludes by induction that \(M(\xi _n,\theta )\) is positive definite for all \(n\ge n_{\mathrm{st}}\) and all \(\theta \in \varTheta \). The existence of an initial design \(\xi _{n_{\tiny \mathrm{st}}}\) as required will be shown in Sect. 2, Lemma 1. However, we have no general method or algorithm for constructing an initial design according to ‘step (o)’ of the algorithm. For some important classes of models there exists a saturated initial design (i.e., \(n_{\mathrm{st}}=p\)) which can easily be constructed, see Remark 1 in Sect. 2.
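For illustration, the following minimal sketch (Python with NumPy) performs one pass through step (i) on a finite candidate grid; the arguments f, candidate_grid, estimate_theta and observe are hypothetical placeholders for the family \(f_\theta \), a finite subset of the experimental region, the adaptive estimator and the data-generating mechanism, none of which are prescribed by the algorithm itself.

```python
import numpy as np

def wynn_step(points, responses, theta_n, f, candidate_grid, estimate_theta, observe):
    """One iteration of the adaptive Wynn algorithm on a finite candidate grid."""
    n = len(points)
    # Information matrix M(xi_n, theta_n) of the current design, cf. Eq. (1.1);
    # in an actual implementation it can be carried along via the rank-one
    # update (1.5) instead of being recomputed from scratch.
    M = sum(np.outer(f(x, theta_n), f(x, theta_n)) for x in points) / n
    M_inv = np.linalg.inv(M)
    # Iteration rule (1.3): maximize f_theta^T(x) M^{-1}(xi_n, theta_n) f_theta(x)
    values = [f(x, theta_n) @ M_inv @ f(x, theta_n) for x in candidate_grid]
    x_next = candidate_grid[int(np.argmax(values))]
    # Observe the response at the new point and update the parameter estimate
    y_next = observe(x_next)
    points, responses = points + [x_next], responses + [y_next]
    theta_next = estimate_theta(points, responses)
    return points, responses, theta_next
```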

The algorithm uses, in particular, an observed response \(y_i\) at each current design point \(x_i\). So the generated sequence of design points, \(x_i\), \(i\in {\mathbb {N}}\), and the corresponding sequence of designs \(\xi _n\), \(n\ge n_{\mathrm{st}}\), are random sequences with a particular dependence structure caused by Eqs. (1.2) and (1.3). An appropriate stochastic model, which was used in Freise et al. (2019) and goes back to Lai and Wei (1982), Lai (1994), and Chen et al. (1999), will be stated in Sect. 3. In particular, the generated sequence \(x_i\), \(i\in {\mathbb {N}}\), and the observed responses \(y_i\), \(i\in {\mathbb {N}}\), are viewed as values of random variables \(X_i\), \(i\in {\mathbb {N}}\), and \(Y_i\), \(i\in {\mathbb {N}}\), respectively, following a stochastic model which we call an ‘adaptive regression model’. Our formulation of the adaptive Wynn algorithm is a description of the paths of the stochastic process \((X_i,Y_i)\), \(i\in {\mathbb {N}}\).

The estimators \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), employed by the algorithm to produce the estimates \(\theta _n\), \(n\ge n_{\mathrm{st}}\), in (1.2), may be any estimators of \(\theta \) such that their values are in \(\varTheta \) and \({\widehat{\theta }}_n\) is a function of the data \(x_1,y_1,\ldots ,x_n,y_n\) available at stage n. Such estimators will be called adaptive estimators. Later, strong consistency of \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), will be required. In Sect. 3 we focus on adaptive LSEs \({\widehat{\theta }}_n^{(\mathrm{LS})}\), i.e.,

$$\begin{aligned} {\widehat{\theta }}_n^{(\mathrm{LS})}(x_1,y_1,\ldots ,x_n,y_n)\,=\,\arg \min _{\theta \in \varTheta } \sum _{i=1}^n \bigl \{ y_i-\mu (x_i,\theta ) \bigr \}^2. \end{aligned}$$
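A minimal sketch of how such an adaptive LSE can be computed numerically is given below (Python with SciPy); mu, theta_init and the box bounds describing \(\varTheta \) are hypothetical placeholders, and the solver only returns a local minimizer, so this is a sketch rather than a definitive implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def adaptive_lse(points, responses, mu, theta_init, lower, upper):
    """Least squares estimate based on the data (x_1, y_1), ..., (x_n, y_n)."""
    def residuals(theta):
        # Residuals y_i - mu(x_i, theta), whose sum of squares is minimized
        return np.array([y - mu(x, theta) for x, y in zip(points, responses)])

    fit = least_squares(residuals, x0=np.asarray(theta_init, dtype=float),
                        bounds=(lower, upper))
    return fit.x  # minimizer of the residual sum of squares over the box
```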

Note that when dealing with the adaptive LSEs we will not necessarily assume that the estimators \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), employed by the algorithm are given by the LSEs. Below two alternative conditions on the nonlinear model will be introduced. Either condition will ensure strong consistency of the LSEs, irrespective of which adaptive estimators \({\widehat{\theta }}_n\) are used in the algorithm. One condition is that of ‘saturated identifiability’ (SI), which was introduced and employed by Pronzato (2010) in the case of a finite experimental region \(\mathcal{X}\).

(SI) If \(z_1,\ldots ,z_p\in \mathcal{X}\) are pairwise distinct design points then the \({\mathbb {R}}^p\)-valued function on \(\varTheta \) given by \(\theta \mapsto \bigl (\mu (z_1,\theta ),\ldots ,\mu (z_p,\theta )\bigr )^\mathsf{T}\) is an injection, i.e., if \(\theta ,\theta '\in \varTheta \) and \(\mu (z_j,\theta )=\mu (z_j,\theta ')\), \(1\le j\le p\), then \(\theta =\theta '\).

Recall that p is the dimension of the functions \(f_\theta \). As mentioned above, in many applications one has \(\varTheta \subseteq {\mathbb {R}}^p\) and \(f_\theta (x)\) is given by the gradient of \(\mu (x,\theta )\) w.r.t. \(\theta \) for all \(x\in \mathcal{X}\). So, in these cases, p also coincides with the dimension of the parameter vector \(\theta \).

The other (alternative) condition states that the model is essentially a generalized linear model. We call this condition (\(\mathrm{GLM}^*\)), where the ‘star’ distinguishes it from a similar ‘condition (GLM)’ employed in Freise et al. (2019), which referred only to the family \(f_\theta \), ignoring the mean response function \(\mu \).

(\(\mathrm{GLM}^*\)) \(\varTheta \subseteq {\mathbb {R}}^p\), \(\mu (x,\theta )=G\bigl (f^\mathsf{T}(x)\,\theta \bigr )\), and \(f_\theta (x)=\psi (x,\theta )\,f(x)\) for all \((x,\theta )\in \mathcal{X}\times \varTheta \), where \(f\,:\,\mathcal{X}\longrightarrow {\mathbb {R}}^p\) and \(\psi \,:\,\mathcal{X}\times \varTheta \longrightarrow (\,0\,,\,\infty )\) are continuous functions, \(G\,:\,I\longrightarrow {\mathbb {R}}\) is a differentiable function on an open interval \(I\subseteq {\mathbb {R}}\) with continuous and positive derivative, \(G'(u)>0\) for all \(u\in I\), and I covers the values \(f^\mathsf{T}(x)\,\theta \) for all \((x,\theta )\in \mathcal{X}\times \varTheta \).

Note that G is called the inverse link function. If the functions \(f_\theta \) are assumed to be the gradients (w. r. t. \(\theta \)) of \(\mu \) then the form of \(\mu \) assumed in (\(\mathrm{GLM}^*\)) already implies \(f_\theta (x)=\psi (x,\theta )\,f(x)\) with \(\psi (x,\theta )=G'\bigl (f^\mathsf{T}(x)\,\theta \bigr )\) for all \((x,\theta )\in \mathcal{X}\times \varTheta \).
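In brief, for interior points \(\theta \) this is the chain rule applied to the composite mean response:

$$\begin{aligned} \nabla _\theta \,\mu (x,\theta )\,=\,\nabla _\theta \,G\bigl (f^\mathsf{T}(x)\,\theta \bigr )\,=\,G'\bigl (f^\mathsf{T}(x)\,\theta \bigr )\,f(x), \end{aligned}$$

so that \(\psi (x,\theta )=G'\bigl (f^\mathsf{T}(x)\,\theta \bigr )>0\).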

Example 1

(cf. Pronzato 2010, Examples 2 and 3; Hu 1998, Examples 2 and 3) Let \(p=2\), \(\varTheta \subseteq (\,0\,,\,\infty )^2\), \(\mathcal{X}\subseteq (\,0\,,\,\infty )\), and consider two regression models,

$$\begin{aligned} \mathrm{(a)}\ \ \mu (x,\theta )=\frac{\theta _1 x}{\theta _2+x}\,,\quad \mathrm{(b)}\ \ \mu (x,\theta )= \theta _1\exp \bigl (-\theta _2 x\bigr ), \end{aligned}$$

where \(\theta =(\theta _1,\theta _2)^\mathsf{T}\). Models (a) and (b) are called the ‘Michaelis-Menten model’ and the ‘exponential decay model’, respectively. Both models (a) and (b) satisfy condition (SI) which can be seen as follows. Consider model (a). Let \(z_1,z_2\in \mathcal{X}\) with \(z_1<z_2\) and \(\theta =(\theta _1,\theta _2)^\mathsf{T}, \theta '=(\theta _1',\theta _2')^\mathsf{T}\in \varTheta \) such that \(\theta _1z_j/(\theta _2 +z_j)=\theta _1' z_j/(\theta _2'+z_j)\) for \(j=1,2\). Then

$$\begin{aligned} \frac{\theta _1}{\theta _1'}=\frac{\theta _2+z_1}{\theta _2'+z_1}=\frac{\theta _2+z_2}{\theta _2'+z_2}. \end{aligned}$$
(1.6)

Since \(\partial /\partial t\,\bigl \{(\theta _2+t)/(\theta _2'+t)\bigr \}=(\theta _2'-\theta _2)/(\theta _2'+t)^2\) for \(t\in (\,0\,,\,\infty )\), the function \(t\mapsto (\theta _2+t)/(\theta _2'+t)\) on \((\,0\,,\,\infty )\) is either increasing (if \(\theta _2<\theta _2'\)) or decreasing (if \(\theta _2>\theta _2'\)) or constant (if \(\theta _2=\theta _2'\)). So, by \(z_1<z_2\) and (1.6), one gets \(\theta _2=\theta _2'\) and, again by (1.6), \(\theta _1=\theta _1'\), hence \(\theta =\theta '\). So (SI) holds for model (a). Consider model (b). Let \(z_1,z_2\in \mathcal{X}\) with \(z_1<z_2\) and \(\theta =(\theta _1,\theta _2)^\mathsf{T}, \theta '=(\theta _1',\theta _2')^\mathsf{T}\in \varTheta \) such that \(\theta _1\exp \bigl (-\theta _2 z_j\bigr )=\theta _1'\exp \bigl (-\theta _2' z_j\bigr )\) for \(j=1,2\). Then

$$\begin{aligned} \frac{\theta _1}{\theta _1'}=\exp \bigl \{(\theta _2-\theta _2')z_1\bigr \}=\exp \bigl \{(\theta _2-\theta _2')z_2\bigr \}, \end{aligned}$$

and by \(z_1<z_2\) this yields \(\theta _2'=\theta _2\), and hence \(\theta _1'=\theta _1\). So \(\theta '=\theta \) and (SI) has been verified for model (b). \(\square \)

Example 2

Generalized linear regression.

Let \(\varTheta \subseteq {\mathbb {R}}^p\) and \(\mu (x,\theta )=G\bigl (f^\mathsf{T}(x)\,\theta \bigr )\) for all \((x,\theta )\in \mathcal{X}\times \varTheta \), where \(f\,:\,\mathcal{X}\longrightarrow {\mathbb {R}}^p\) is a continuous function, and \(G\,:\,I\longrightarrow {\mathbb {R}}\) is an increasing continuous function on an interval \(I\subseteq {\mathbb {R}}\) with \(\bigl \{f^\mathsf{T}(x)\,\theta \,:\,(x,\theta )\in \mathcal{X}\times \varTheta \bigr \}\subseteq I\). Let \(z_1,\ldots ,z_p\in \mathcal{X}\) be pairwise distinct and \(\theta ,\theta '\in \varTheta \). Clearly, \(\mu (z_j,\theta )=\mu (z_j,\theta ')\) for all \(j=1,\ldots ,p\) is equivalent to \(f^\mathsf{T}(z_j)\,(\theta -\theta ')=0\) for all \(j=1,\ldots ,p\). So, for the present model, condition (SI) is equivalent to the following.

If \(\theta ,\theta '\in \varTheta \) and \(z_1,\ldots ,z_p\in \mathcal{X}\) are pairwise distinct such that \(\theta -\theta '\) is orthogonal to the vectors \(f(z_1),\ldots ,f(z_p)\), then \(\theta =\theta '\).

Assume that the parameter space \(\varTheta \) is nondegenerate in the sense that there is no hyperplane of \({\mathbb {R}}^p\) covering \(\varTheta \). Then the set of differences \(\{\theta -\theta '\,:\,\theta ,\theta '\in \varTheta \}\) spans \({\mathbb {R}}^p\), and hence condition (SI) is equivalent to the following condition.

(Ch) If \(z_1,\ldots ,z_p\in \mathcal{X}\) are pairwise distinct then the vectors \(f(z_1),\ldots ,f(z_p)\) are linearly independent.

The notation ‘Ch’ stands for Chebyshev since, as can easily be seen, condition (Ch) holds if and only if the (real-valued) component functions \(f_1,\ldots ,f_p\), say, of f constitute a Chebyshev system, i.e., every linear combination \(\sum _{j=1}^pc_jf_j(x)\), with coefficients \(c_j\in {\mathbb {R}}\) (\(1\le j\le p\)) not all equal to zero, has at most \(p-1\) distinct zeros on \(\mathcal{X}\), cf. Karlin (1968), p. 24. Usually, Chebyshev systems are considered to be functions of one real variable defined on an interval of the real line. The most prominent example is given by the monomials \(1,x,\ldots ,x^{p-1}\) on some interval. For example, condition (Ch) or (SI), respectively, does not hold for generalized first order regression in two variables on a rectangle \(\mathcal{X}=[a_1,b_1]\times [a_2,b_2]\), where \(f(x)=(1,x_1,x_2)^\mathsf{T}\) for all \(x=(x_1,x_2)\in \mathcal{X}\). In this example we have \(p=3\). Let \(z^{(1)},z^{(2)},z^{(3)}\in \mathcal{X}\) be pairwise distinct. One easily verifies that the vectors \(f(z^{(1)}),f(z^{(2)}),f(z^{(3)})\) are linearly independent if and only if the three points \(z^{(1)},z^{(2)},z^{(3)}\) do not lie on a common line. So (Ch) and hence (SI) do not hold. As a consequence of the present Example 2 one may suspect that the class of models satisfying condition (SI) is not very large. In particular, generalized linear regression models with more than one regressor will usually not be members of that class. However, such models will be included in our derivations by the class given by condition (\(\mathrm{GLM}^*\)). \(\square \)
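The failure of (Ch) in the two-variable example just discussed can also be checked numerically; a minimal sketch (Python with NumPy, using the unit square and hypothetical test points) evaluates the determinant of the matrix with columns \(f(z^{(1)}),f(z^{(2)}),f(z^{(3)})\):

```python
import numpy as np

f = lambda z: np.array([1.0, z[0], z[1]])  # f(x) = (1, x1, x2)^T, so p = 3

# Three pairwise distinct points on a common line of the square [0, 1]^2 ...
collinear = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
# ... and three points not lying on a common line
general = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

for points in (collinear, general):
    A = np.column_stack([f(z) for z in points])
    print(points, "det =", np.linalg.det(A))
# The determinant vanishes exactly when the three points lie on a common line;
# then f(z1), f(z2), f(z3) are linearly dependent and (Ch) fails.
```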

The employed stochastic model for the adaptive Wynn algorithm includes a martingale difference scheme for the error variables, see Sect. 3. Limit theorems for martingales can be applied: a Strong Law of Large Numbers and a Central Limit Theorem to prove strong consistency and asymptotic normality, respectively, of the adaptive LSEs, see Theorems 1 and 2. Some auxiliary results are the content of Sect. 2. The proofs of the results of Sects. 2 and 3 are presented in the Appendix.

2 Auxiliary results

Throughout we assume (B1)–(B4) as introduced in the previous section. Note, however, that (B3) will not play a role in this section. Firstly, we give a proof of the existence of an initial design as required in the algorithm.

Lemma 1

There exist an \(n_{\mathrm{st}}\in {\mathbb {N}}\) and design points \(x_1,\ldots ,x_{n_{\tiny \mathrm{st}}}\in \mathcal{X}\) such that for every \(\theta \in \varTheta \) the vectors \(f_\theta (x_1),\ldots ,f_\theta (x_{n_{\tiny \mathrm{st}}})\) span \({\mathbb {R}}^p\). Hence, for such \(x_i\), \(1\le i\le n_{\mathrm{st}}\), the design \(\xi _{n_{\tiny \mathrm{st}}}=\frac{1}{n_{\mathrm{st}}}\sum _{i=1}^{n_{\tiny \mathrm{st}}}\delta _{x_i}\) has the property that its information matrix \(M(\xi _{n_{\tiny \mathrm{st}}},\theta )\) is positive definite for all \(\theta \in \varTheta \).

Remark 1

Some popular nonlinear regression models, notably those with a scalar regressor variable x, i.e., \(\mathcal{X}\subseteq {\mathbb {R}}\), enjoy a further ‘Chebyshev property’ (\(\mathrm{Ch}^*\)), which was essentially assumed by Pronzato (2010) as condition \(\mathbf{H}_{\mathcal{X}^-}(\mathrm{iv})\). It states that condition (Ch) from Example 2 holds for each function \(f_\theta \):

(\(\mathrm{Ch}^*\)) If \(z_1,\ldots ,z_p\in \mathcal{X}\) are pairwise distinct and \(\theta \in \varTheta \) then the vectors \(f_\theta (z_1),\ldots ,f_\theta (z_p)\) are linearly independent.

If (\(\mathrm{Ch}^*\)) holds then a suitable initial design for the algorithm is provided by any saturated design, i.e., choose \(n_{\mathrm{st}}=p\) and pairwise distinct design points \(x_1,\ldots ,x_p\in \mathcal{X}\). Note also that in the proof of the lemma, assuming (\(\mathrm{Ch}^*\)), one has \(U(\theta )=\varTheta \) for all \(\theta \in \varTheta \) and hence \(r=1\) and \(n_{\mathrm{st}}=p\). As an example consider the regression models from Example 1, (a) and (b), together with the assumption that \(f_\theta \) is given by the gradient of \(\mu (x,\theta )\) w. r. t. \(\theta \), for all \((x,\theta )\in \mathcal{X}\times \varTheta \), which yields:

$$\begin{aligned} (\mathrm{a})\ \ f_\theta (x)=\Bigl (\frac{x}{\theta _2+x}\,,\,-\frac{\theta _1x}{(\theta _2+x)^2}\Bigr )^\mathsf{T},\quad (\mathrm{b})\ \ f_\theta (x)=\exp (-\theta _2x)\,\Bigl (\,1\,,\,-\theta _1x\Bigr )^\mathsf{T}\end{aligned}$$

for all \(x\in \mathcal{X}\subseteq (0\,,\,\infty )\) and \(\theta =(\theta _1,\theta _2)^\mathsf{T}\in \varTheta \subseteq (0\,,\,\infty )^2\). In either case (a) or (b), it is straightforward to show that for any given \(\theta \in \varTheta \) and \(x,x'\in \mathcal{X}\), \(x<x'\), the two vectors \(f_\theta (x),f_\theta (x')\) are linearly independent and hence both models (a) and (b) satisfy (\(\mathrm{Ch}^*\)).
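Indeed, in case (a) a direct computation of the determinant of the \(2\times 2\) matrix with columns \(f_\theta (x)\) and \(f_\theta (x')\) gives

$$\begin{aligned} \det \bigl (f_\theta (x),f_\theta (x')\bigr )\,=\, \frac{x}{\theta _2+x}\cdot \Bigl (-\frac{\theta _1x'}{(\theta _2+x')^2}\Bigr )\,-\, \frac{x'}{\theta _2+x'}\cdot \Bigl (-\frac{\theta _1x}{(\theta _2+x)^2}\Bigr )\,=\, \frac{\theta _1\,x\,x'\,(x'-x)}{(\theta _2+x)^2\,(\theta _2+x')^2}\,, \end{aligned}$$

which is positive for \(0<x<x'\) and \(\theta \in (\,0\,,\,\infty )^2\); case (b) is handled analogously.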

In general, however, Lemma 1 and its proof do not give a practical way of finding a value of r and thus of \(n_{\mathrm{st}}=rp\), or even an upper bound on them. Another condition ensuring the existence of an initial design with \(n_{\mathrm{st}}=p\) is condition (GLM) considered in Freise et al. (2019), which is the second half of condition (\(\mathrm{GLM}^*\)) from Sect. 1.

(GLM) \(f_\theta (x)\,=\,\psi (x,\theta )\,f(x)\) for all \((x,\theta )\in \mathcal{X}\times \varTheta \), where \(\psi \,:\,\mathcal{X}\times \varTheta \longrightarrow (\,0\,,\,\infty )\) and \(f\,:\mathcal{X}\longrightarrow {\mathbb {R}}^p\) are continuous functions.

In fact, (GLM) together with our basic assumption (B4) (i) implies that the image \(f(\mathcal{X})\) spans \({\mathbb {R}}^p\). So one can find p points \(x_1,\ldots ,x_p\in \mathcal{X}\) such that the vectors \(f(x_1),\ldots ,f(x_p)\) are linearly independent, and by (GLM) for every \(\theta \in \varTheta \) the vectors \(f_\theta (x_1),\ldots ,f_\theta (x_p)\) are linearly independent. So the saturated design \(\xi _p=\frac{1}{p}\sum _{i=1}^p\delta _{x_i}\) is an appropriate initial design for the algorithm. \(\square \)

Let any path of the adaptive Wynn algorithm be given as described in the previous section. In particular, \(x_i\), \(i\in {\mathbb {N}}\), is the sequence of design points and \(\xi _n=\frac{1}{n}\sum _{i=1}^n\delta _{x_i}\), \(n\ge n_{\mathrm{st}}\), is the corresponding sequence of designs. For the following two lemmas no assumption on the employed adaptive estimators \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), is needed. In other words, the sequence \(\theta _n\), \(n\ge n_{\mathrm{st}}\), of parameter estimates appearing in the path may be arbitrary.

We denote the distance function in the compact metric space \(\mathcal{X}\) by \(\mathrm{d}_\mathcal{X}(x,z)\), \(x,z\in \mathcal{X}\). If \(S_1\) and \(S_2\) are nonempty subsets of \(\mathcal{X}\) then the distance \(\mathrm{d}_\mathcal{X}(S_1,S_2)\) of \(S_1\) and \(S_2\) is defined by \(\mathrm{d}_\mathcal{X}(S_1,S_2)\,=\,\inf \{\mathrm{d}_\mathcal{X}(x,z)\,:\,x\in S_1,\ z\in S_2\}\). In case that \(S_1=\{x\}\) is a singleton we write \(\mathrm{d}_\mathcal{X}(x,S_2)\) instead of \(\mathrm{d}_\mathcal{X}(\{x\},S_2)\). If S is a nonempty subset of \(\mathcal{X}\) then the diameter of S is defined by \(\mathrm{diam}(S)\,=\,\sup \{\mathrm{d}_\mathcal{X}(x,z)\,:\,x,z\in S\}\).

Lemma 2

Suppose \(p\ge 2\). Let \(\varepsilon >0\) be given. Then there exist \(d>0\) and \(n_0\ge n_{\mathrm{st}}\) such that

$$\begin{aligned} \xi _n(S)\le \frac{1}{p}+\varepsilon \ \hbox { for all } \emptyset \not =S\subseteq \mathcal{X}\hbox { with }\mathrm{diam}(S)\le d\hbox { and all } n\ge n_0. \end{aligned}$$

Lemma 3

Suppose \(p\ge 2\). There exist \(n_0\ge n_{\mathrm{st}}\), \(\pi _0>0\), and \(d_0>0\) such that the following holds.

For each \(n\ge n_0\) there are p subsets \(S_{1,n},S_{2,n},\ldots ,S_{p,n}\) of \(\mathcal{X}\) such that

\(\xi _n(S_{j,n})\ge \pi _0\), \(1\le j\le p\), \(\mathrm{diam}(S_{j,n})\le d_0\), \(1\le j\le p\), and

\(\mathrm{d}_\mathcal{X}(S_{j,n},S_{k,n})\ge d_0\), \(1\le j< k\le p\).

Remark 2

In the case that \(\mathcal{X}\) is finite it is easily seen that in Lemma 3 the subsets \(S_{1,n},\ldots ,S_{p,n}\) can be chosen to be singletons for all \(n\ge n_0\). So, in this case, the lemma yields the result of Lemma 2 of Pronzato (2010). \(\square \)

3 Convergence of least squares estimators

For an analysis of the adaptive Wynn algorithm, the generated sequence \(x_i\), \(i\in {\mathbb {N}}\), of design points and the observed (real-valued) responses \(y_i\), \(i\in {\mathbb {N}}\), are viewed as values of random variables \(X_i\), \(i\in {\mathbb {N}}\), and \(Y_i\), \(i\in {\mathbb {N}}\), respectively, whose dependence structure is described by the following two assumptions (A1) and (A2), see Lai and Wei (1982), Lai (1994) and Chen et al. (1999), see also Freise et al. (2019). The model thereby stated might be called an ‘adaptive regression model’. By \({\overline{\theta }}\) we denote the true point of the parameter space \(\varTheta \) governing the data. All the random variables appearing in this section are thought to be defined on a common probability space \((\varOmega ,\mathcal{F},{\mathbb {P}}_{{\overline{\theta }}})\), where \(\varOmega \) is a nonempty set, \(\mathcal{F}\) is a sigma-field of subsets of \(\varOmega \), and \({\mathbb {P}}_{{\overline{\theta }}}\) is a probability measure on \(\mathcal{F}\) corresponding to the true parameter point \({\overline{\theta }}\). We assume, as before, the basic conditions (B1)–(B4), and now additionally the following conditions (A1) and (A2) constituting the adaptive regression model.

(A1) There is a given nondecreasing sequence of sub-sigma-fields of \(\mathcal{F}\), \(\mathcal{F}_0\subseteq \mathcal{F}_1\subseteq \,\ldots \,\subseteq \mathcal{F}_n\subseteq \,\ldots \) such that for each \(i\in {\mathbb {N}}\) the random variable \(X_i\) is \(\mathcal{F}_{i-1}\)-measurable and the random variable \(Y_i\) is \(\mathcal{F}_i\)-measurable.

(A2) \(Y_i\,=\,\mu (X_i,{\overline{\theta }})\,+\,e_i\) with real-valued square integrable random errors \(e_i\) such that \(\mathrm{E}\bigl (e_i\,\big |\,\mathcal{F}_{i-1}\bigr )\,=0 \ \hbox {\,a.s.}\) for all \(i\in {\mathbb {N}}\), and \(\sup _{i\in {\mathbb {N}}}\mathrm{E}\bigl (e_i^2\,\big |\,\mathcal{F}_{i-1}\bigr )\,<\infty \ \ \hbox {a.s.}\)

As before, \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), are the adaptive estimators employed by the algorithm, now viewed as random variables, \({\widehat{\theta }}_n={\widehat{\theta }}_n(X_1,Y_1,\ldots ,X_n,Y_n)\). Of course, a desirable property of these estimators would be strong consistency, i.e., almost sure convergence to \({\overline{\theta }}\) (as \(n\rightarrow \infty \)), for short \({\widehat{\theta }}_n\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,{\overline{\theta }}\).

Remark 3

As shown in our previous paper (Freise et al. 2019), Corollary 3.2, if the estimators \({\widehat{\theta }}_n\) are strongly consistent, then the sequence \(\xi _n\), \(n\ge n_{\mathrm{st}}\), of (random) designs generated by the algorithm is almost surely asymptotically D-optimal, in the sense that \(M(\xi _n,{\widehat{\theta }}_n)\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,M(\xi ^*_{{\overline{\theta }}},{\overline{\theta }})\), where \(\xi ^*_{{\overline{\theta }}}\) is a locally D-optimal design at \({\overline{\theta }}\). In fact, the conclusion of that corollary is stronger: if the estimators \({\widehat{\theta }}_n\) are strongly consistent then \(M(\xi _n,{\widetilde{\theta }}_n)\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,M(\xi ^*_{{\overline{\theta }}},{\overline{\theta }})\) holds for every strongly consistent sequence of \(\varTheta \)-valued estimators \({\widetilde{\theta }}_n\). \(\square \)

The next result yields strong consistency of the adaptive LSEs \({\widehat{\theta }}_n^{(\mathrm LS)}\) for any adaptive estimators \({\widehat{\theta }}_n\) employed by the algorithm, provided that condition (SI) or condition (\(\mathrm{GLM}^*\)) holds.

Theorem 1

Assume that condition (SI) or condition (\(\mathrm{GLM}^*\)) is satisfied. Then, irrespective of the employed sequence of adaptive estimators \({\widehat{\theta }}_n\) in the algorithm, the sequence of adaptive LSEs \({\widehat{\theta }}_n^{(\mathrm LS)}\) is strongly consistent: \({\widehat{\theta }}_n^{(\mathrm LS)}\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,{\overline{\theta }}\).

Remark 4

A crucial point in the proof of Theorem 1 is the result of ‘Step 2’ stating that \(\inf _{\theta \in C({\overline{\theta }},\varepsilon )}D_n(\theta ,{\overline{\theta }})\) goes to infinity at least as fast as n a.s. when \(n\rightarrow \infty \). This is due to the adaptive Wynn algorithm, while the results of ‘Step 1’ and ‘Step 3’ only use the model assumptions (A1) and (A2). More general adaptive sampling schemes modeled by (A1) and (A2) were addressed in Pronzato (2009). By Theorem 1 of that paper, in the case of a finite design space, strong consistency of adaptive LSEs already holds if the adaptive scheme is such that

$$\begin{aligned} \inf _{\theta \in C({\overline{\theta }},\varepsilon )}D_n(\theta ,{\overline{\theta }})\Big /(\log n)^\rho \,\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,\infty \end{aligned}$$

for all \(\varepsilon >0\), for some \(\rho >1\). In Theorem 1 of Pronzato (2010), see also Remark 1 in Pronzato (2009), it was claimed that in the case of i.i.d. error variables \(e_i\), \(i\in {\mathbb {N}}\), and for a finite design space the condition may be weakened to

$$\begin{aligned} \inf _{\theta \in C({\overline{\theta }},\varepsilon )}D_n(\theta ,{\overline{\theta }})\Big /(\log \log n)\,\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,\infty . \end{aligned}$$

However, the proof of the latter in Pronzato (2010), pp. 210–211, is doubtful since the classical Law of Iterated Logarithm is applied to random subsequences of the error variables. A proof should rather use a martingale structure and, in particular, a Law of Iterated Logarithm for martingales. Unfortunately, we have not found the appropriate arguments. \(\square \)

For deriving asymptotic normality of the adaptive least squares estimators further assumptions are needed. Firstly, the ‘gradient condition’ (B5) on the family of functions \(f_\theta \), \(\theta \in \varTheta \), and the mean response \(\mu \) is added to conditions (B1)–(B4). Secondly, two additional conditions (L) and (AH) on the error variables in (A1)–(A2) are imposed, where ‘L’ stands for ‘Lindeberg’ and ‘AH’ for ‘asymptotic homoscedasticity’.

(B5) \(\varTheta \subseteq {\mathbb {R}}^p\) (endowed with the usual Euclidean metric), \(\mathrm{int}(\varTheta )\not =\emptyset \), where \(\mathrm{int}(\varTheta )\) denotes the interior of \(\varTheta \) as a subset of \({\mathbb {R}}^p\), the function \(\theta \mapsto \mu (x,\theta )\) is twice differentiable on \(\mathrm{int}(\varTheta )\) for each fixed \(x\in \mathcal{X}\), with gradients and Hessian matrices denoted by \(\nabla \mu (x,\theta )= \Bigl (\frac{\partial }{\partial \theta _1}\mu (x,\theta ),\ldots ,\frac{\partial }{\partial \theta _p}\mu (x,\theta )\Bigr )^\mathsf{T}\) and \(\nabla ^2\mu (x,\theta )=\Bigl (\frac{\partial ^2}{\partial \theta _i\partial \theta _j}\mu (x,\theta )\Bigr )_{1\le i,j\le p}\), respectively, for \(\theta =(\theta _1,\ldots ,\theta _p)^\mathsf{T}\in \mathrm{int}(\varTheta )\) and \(x\in \mathcal{X}\). It is assumed that the functions \((x,\theta )\mapsto \nabla \mu (x,\theta )\) and \((x,\theta )\mapsto \nabla ^2\mu (x,\theta )\) are continuous on \(\mathcal{X}\times \mathrm{int}(\varTheta )\) and

$$\begin{aligned} f_\theta (x)\,=\,\nabla \mu (x,\theta )\quad \hbox {for all } x\in \mathcal{X} \hbox { and all } \theta \in \mathrm{int}(\varTheta ). \end{aligned}$$

For a subset \(A\subseteq \varOmega \) we denote by \({\mathbb {1}}(A)\) the indicator function on \(\varOmega \), which is constantly equal to 1 on A and constantly equal to 0 on \(\varOmega {\setminus } A\).

(L) \(\frac{1}{n}\sum _{i=1}^n \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\,\Big |\mathcal{F}_{i-1}\Bigr ) \,\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,0\) for all \(\varepsilon >0\).

(AH) \(\mathrm{E}\bigl (e_n^2\big |\mathcal{F}_{n-1}\bigr )\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,\sigma ^2({\overline{\theta }})\) for some positive real constant \(\sigma ({\overline{\theta }})\).

The following two conditions (L’) and (L”) are less technical than the Lindeberg condition (L), and each of them implies (L).

(L’) \(\sup _{i\in {\mathbb {N}}}\mathrm{E}\bigl (|e_i|^\alpha \big |\mathcal{F}_{i-1}\bigr )\,<\,\infty \) a.s. for some real \(\alpha >2\).

(L”) The random variables \(e_i\), \(i\in {\mathbb {N}}\), are identically distributed, and \(e_i\), \(\mathcal{F}_{i-1}\) are independent for each \(i\in {\mathbb {N}}\).

In fact, from (L’), observing the trivial inequality \(e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\le |e_i|^\alpha /(\varepsilon \sqrt{n})^{\alpha -2}\), it follows that

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\,\Big |\mathcal{F}_{i-1}\Bigr )\, \le \,\frac{1}{(\varepsilon \sqrt{n})^{\alpha -2}}\,\sup _{i\in {\mathbb {N}}}\mathrm{E}\bigl (|e_i|^\alpha \big |\mathcal{F}_{i-1}\bigr ) \ \,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,0. \end{aligned}$$

From (L”) it follows that, for all \(i\in {\mathbb {N}}\),

$$\begin{aligned} \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\,\big |\mathcal{F}_{i-1}\Bigr )= \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\Bigr )= \mathrm{E}\Bigl (e_1^2\,{\mathbb {1}}\bigl (|e_1|>\varepsilon \sqrt{n}\bigr )\Bigr ) \ \hbox { a.s.} \end{aligned}$$

Hence

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\,\Big |\mathcal{F}_{i-1}\Bigr ) = \mathrm{E}\Bigl (e_1^2\,{\mathbb {1}}\bigl (|e_1|>\varepsilon \sqrt{n}\bigr )\Bigr )\ \hbox {a.s.} \end{aligned}$$

and the expectation on the r.h.s. converges to zero as \(n\rightarrow \infty \) by dominated convergence, since \(e_1\) is square integrable. Note also that (L”) implies \(\mathrm{E}\bigl (e_i^2\big |\mathcal{F}_{i-1}\bigr )=\mathrm{E}\bigl (e_1^2\bigr )=\sigma ^2({\overline{\theta }})\), say. Excluding the trivial case \(\sigma ^2({\overline{\theta }})=0\), we see that condition (L”) also implies condition (AH).

Remark 5

Condition (L’) was employed by Lai and Wei (1982), Theorem 1 of that paper, and by Chen et al. (1999), condition (C4) on p. 1161 of that paper. Condition (L”) meets the assumption of i.i.d. error variables of Pronzato (2010) for a particular choice of the sequence of sub-sigma-fields \(\mathcal{F}_i\), \(i\in {\mathbb {N}}_0\). \(\square \)

The k-dimensional normal distribution with expectation 0 and covariance matrix C is denoted by \(\mathrm{N}(0,C)\), where C is a positive definite \(k\times k\) matrix. In particular, \(\mathrm{N}(0,I_k)\) is the k-dimensional standard normal distribution, where \(I_k\) denotes the \(k\times k\) identity matrix. For a sequence \(W_n\) of \({\mathbb {R}}^k\)-valued random variables, convergence in distribution of \(W_n\) (as \(n\rightarrow \infty \)) to a k-dimensional normal distribution \(\mathrm{N}(0,C)\) is abbreviated by \(W_n\,{\mathop {\longrightarrow }\limits ^\mathrm{d}}\,\mathrm{N}(0,C)\). In the following theorem asymptotic normality of the adaptive least squares estimators \({\widehat{\theta }}_n^{\mathrm{(LS)}}\) is established. To some extent our proof is similar to that of Theorem 2 in Pronzato (2009), though the assumptions are different. Note that, by our Theorem 1, the assumptions of strong consistency of the adaptive estimators \({\widehat{\theta }}_n\) employed by the algorithm and of strong consistency of the adaptive LSEs \({\widehat{\theta }}_n^{\mathrm{(LS)}}\) are met if \({\widehat{\theta }}_n={\widehat{\theta }}_n^{\mathrm{(LS)}}\), \(n\ge n_{\mathrm{st}}\), and if one of the conditions (SI) and (\(\mathrm{GLM}^*\)) holds.

Theorem 2

Assume conditions (B5), (L), and (AH). Moreover, assume that \({\overline{\theta }}\in \mathrm{int}(\varTheta )\) and the sequences \({\widehat{\theta }}_n\) and \({\widehat{\theta }}_n^{\mathrm{(LS)}}\) of adaptive estimators employed by the algorithm and adaptive LSEs, respectively, are strongly consistent, i.e., \({\widehat{\theta }}_n\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,{\overline{\theta }}\) and \({\widehat{\theta }}_n^{\mathrm{(LS)}}\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,{\overline{\theta }}\). Then, denoting by \(M_*({\overline{\theta }})=M\bigl (\xi ^*_{{\overline{\theta }}},{\overline{\theta }}\bigr )\) the information matrix of a locally D-optimal design at \({\overline{\theta }}\), one has

$$\begin{aligned} \sqrt{n}\,\bigl ({\widehat{\theta }}_n^\mathrm{(LS)}-{\overline{\theta }}\bigr ) \,\,{\mathop {\longrightarrow }\limits ^\mathrm{d}}\,\,\mathrm{N}\bigl (0,\sigma ^2({\overline{\theta }})\,M_*^{-1}({\overline{\theta }})\bigr ). \end{aligned}$$
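In practical terms, the theorem justifies the normal approximation

$$\begin{aligned} {\widehat{\theta }}_n^{\mathrm{(LS)}}\ \approx \ \mathrm{N}\Bigl ({\overline{\theta }}\,,\,\tfrac{1}{n}\,\sigma ^2({\overline{\theta }})\,M_*^{-1}({\overline{\theta }})\Bigr ) \end{aligned}$$

for large n, which underlies the comparison with the simulated mean squared errors and the normal qq-plot in the example below.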

To illustrate the convergence results obtained, we present some simulations for the Michaelis-Menten model from Example 1, case (a). For the exponential decay model of part (b) of Example 1 we obtained similar simulation results, which are not reported here.

Example 3: Simulation. Assume the Michaelis-Menten model with \(p=2\) parameters,

$$\begin{aligned}&\mu (x,\theta )=\frac{\theta _1 x}{\theta _2+x},\\&f_\theta (x)=\nabla \mu (x,\theta )\,=\,\Bigl (\frac{x}{\theta _2+x}\,,\, -\frac{\theta _1x}{(\theta _2+x)^2}\Bigr )^\mathsf{T}\end{aligned}$$

for \(x\in \mathcal{X}\) and \(\theta =(\theta _1,\theta _2)^\mathsf{T}\in \varTheta \). Let the experimental region be given by the interval \(\mathcal{X}=[\,0.5\,,\,5\,]\) and the parameter space be the square \(\varTheta =[\,0.1\,,\,10\,]^2\). The true parameter point is chosen to be \({\overline{\theta }}=(1,1)^\mathsf{T}\). The error variables \(e_i\), \(i\in {\mathbb {N}}\), in (A2) are assumed to be i.i.d. normally distributed with expectation zero and variance \(\sigma ^2({\overline{\theta }})=0.04\). By simulation, S=10,000 paths \(X_i^{(s)}, Y_i^{(s)}\), \(1\le i\le 500\), where \(s=1,\ldots ,S\), of the adaptive Wynn algorithm were generated, where the employed adaptive estimators \({\widehat{\theta }}_n\) were chosen to be the adaptive LSEs: \({\widehat{\theta }}_n={\widehat{\theta }}_n^{\mathrm{(LS)}}\) for all n. The computation of the paths of adaptive LSEs was carried out in R (R Core Team 2020). A fixed initial design \(\xi _{n_{\mathrm{st}}}\) was used: the three-point design with equal weights at the boundary points and the midpoint of the experimental interval, that is, \(n_{\mathrm{st}}=3\) and \(x_1=0.5\), \(x_2=2.75\), \(x_3=5\). Figure 1 illustrates the (almost sure) asymptotic D-optimality of the design sequence generated by the adaptive Wynn algorithm, which is ensured by Theorem 1 and Remark 3. The simulated paths \(\xi _n^{(s)}\), \(3\le n\le 500\), where \(s=1,\ldots ,S\), yield D-efficiencies

$$\begin{aligned} \mathrm{eff}(\xi _n^{(s)})\,=\,\Bigl \{\det \bigl (M(\xi _n^{(s)},{\overline{\theta }})\bigr )\big /\det \bigl (M_*({\overline{\theta }})\bigr )\Bigr \}^{1/2}, \end{aligned}$$

where \(M_*({\overline{\theta }})=M(\xi ^*_{{\overline{\theta }}},{\overline{\theta }})\) is the information matrix of the locally D-optimal design at \({\overline{\theta }}\), which is the two-point design giving equal weights 1/2 to the points 5/7 and 5, see Bates and Watts (1988), pp. 125–126, and hence

$$\begin{aligned} M_*({\overline{\theta }})\,=\,\tfrac{1}{2}\,f_{{\overline{\theta }}}(5/7)\,f_{{\overline{\theta }}}^\mathsf{T}(5/7)\,+\,\tfrac{1}{2}\,f_{{\overline{\theta }}}(5)\,f_{{\overline{\theta }}}^\mathsf{T}(5). \end{aligned}$$

For each \(n\in \{3,\ldots ,500\}\), particular quantiles of the ‘data’ \(\mathrm{eff}(\xi _n^{(s)})\), \(1\le s\le S\), are reported in Fig. 1: minimum, 10%-quantile, 25%-quantile, median, 75%-quantile, 90%-quantile, and maximum. For example, the 10%-quantile curve shows that after less than 50 iterations more than 90% of the simulated paths yield efficiencies of at least 0.9. The asymptotic normality of the adaptive LSEs ensured by Theorem 2 suggests that n times the mean squared error matrix of \({\widehat{\theta }}_n^{\mathrm{(LS)}}\) at \({\overline{\theta }}\) should converge to the asymptotic covariance matrix \(\sigma ^2({\overline{\theta }})\,M_*^{-1}({\overline{\theta }})\). In fact, this was observed for n times the simulated mean squared error matrix. Figure 2 shows a plot of the (2, 2)-entry of that matrix, that is, n times the simulated mean squared error of the second component \({\widehat{\theta }}_{2,n}^{\mathrm{(LS)}}\) of \({\widehat{\theta }}_n^{\mathrm{(LS)}}\). The (2, 2)-entry of the asymptotic covariance matrix is approximately 3.32, indicated by the horizontal line in the figure. Finally, by Fig. 3 the approximate normal distribution of \({\widehat{\theta }}_{2,n}^{\mathrm{(LS)}}\) is visualized at \(n=250\) via a histogram of the simulated estimates (along with a fitted normal density) and a normal qq-plot. The simulation yielded similar graphics for the first component of \({\widehat{\theta }}_n^{\mathrm{(LS)}}\).
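The quantities used in this example can be reproduced with a few lines of code; the following sketch (Python with NumPy, as a small counterpart to the R computations actually used for the simulations) evaluates \(M_*({\overline{\theta }})\) from the two-point locally D-optimal design, the scaled inverse \(\sigma ^2({\overline{\theta }})\,M_*^{-1}({\overline{\theta }})\), whose (2, 2)-entry is approximately 3.32, and the D-efficiency of an arbitrary design.

```python
import numpy as np

theta_bar = np.array([1.0, 1.0])   # true parameter point
sigma2 = 0.04                      # error variance sigma^2(theta_bar)

def f(x, th):
    """Gradient of the Michaelis-Menten mean response with respect to theta."""
    return np.array([x / (th[1] + x), -th[0] * x / (th[1] + x) ** 2])

def info_matrix(points, weights, th):
    return sum(w * np.outer(f(x, th), f(x, th)) for x, w in zip(points, weights))

# Locally D-optimal design at theta_bar: equal weights 1/2 at the points 5/7 and 5
M_star = info_matrix([5 / 7, 5.0], [0.5, 0.5], theta_bar)
asymptotic_cov = sigma2 * np.linalg.inv(M_star)
print(asymptotic_cov[1, 1])        # approximately 3.32, cf. Fig. 2

def d_efficiency(points, weights, th):
    """D-efficiency of a design relative to the locally D-optimal design (p = 2)."""
    M = info_matrix(points, weights, th)
    return np.sqrt(np.linalg.det(M) / np.linalg.det(M_star))

print(d_efficiency([0.5, 2.75, 5.0], [1/3, 1/3, 1/3], theta_bar))  # initial design
```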

Fig. 1 Quantile curves of the simulated D-efficiencies. From bottom to top: minimum (dots and dashes), 10%-quantile (dotted), 25%-quantile (dashed), median (solid), 75%-quantile (dashed), 90%-quantile (dotted), maximum (dots and dashes)

Fig. 2 Plot of n times the simulated variance of \({\widehat{\theta }}_{2,n}^{\mathrm{(LS)}}\). The horizontal line indicates the asymptotic variance (\(\approx 3.32\))

Fig. 3 Step \(n=250\): histogram estimate and fitted normal density (left) and normal QQ-plot (right) of \({\widehat{\theta }}_{2,n}^{\mathrm{(LS)}}\), from the simulation

4 Discussion

The adaptive Wynn algorithm for nonlinear regression provides a particular adaptive sampling scheme which has been motivated by the classical iterative procedure established by Wynn (1970) for generating D-optimal designs under a linear model. In the nonlinear situation the strong consistency of the adaptive estimators employed by the algorithm is crucial for ensuring that the generated adaptive design sequence is asymptotically D-optimal in the sense of local D-optimality at the true parameter point (which, of course, is unknown). Note that choosing a locally optimal design would be best if the true parameter point were known. The focus of the present paper is on adaptive least squares estimators (LSEs), and their strong consistency has been proved for two relevant classes of nonlinear (univariate) regression models: firstly, those which have the property of ‘saturated identifiability’ (SI) and, secondly, those which satisfy a condition (\(\mathrm{GLM}^*\)) including the class of generalized linear models. Condition (SI) seems to be restricted to models with a real-valued regressor variable, as the examples in Sect. 1 have shown. Generalized linear models (GLMs) constitute a large and important class of models. However, for a GLM, maximum likelihood or weighted least squares will usually be preferable to ordinary (unweighted) least squares as studied in the present paper. We note that adaptive maximum likelihood estimation in GLMs under the adaptive Wynn algorithm was studied in our previous paper (Freise et al. 2019). A challenging question for future research is to find extensions to other classes of models than just the (SI) class and, with regard to GLMs, to include weighted least squares estimation.

On the basis of strong consistency of the adaptive LSEs their asymptotic normality has been established, under further suitable assumptions. The asymptotic covariance matrix is given by the inverse information matrix of the locally D-optimal design at the true parameter point. Thus, when measuring the size of a (nonsingular) covariance matrix by its determinant, the result implies the asymptotic efficiency of the adaptive LSEs. Two assumptions may be restrictive: firstly, the information matrices are built by a local first order Taylor expansion of the response (locally at a parameter point), without any scaling adjustment for possible variance heterogeneity. Secondly, a condition of ‘asymptotic homoscedasticity’ (AH) on the random errors has been imposed, that is, their (conditional) variances are assumed to become asymptotically constant, where the asymptotic variance may depend on the true parameter point. Both assumptions correspond to the ordinary (unweighted) least squares employed by the adaptive LSEs under consideration. Again, extensions of the results employing weaker assumptions are desirable, in order to include models with variance heterogeneity such as the majority of GLMs.

While the adaptive Wynn algorithm collects one point at each step and was therefore called ‘one-step-ahead algorithm’ by Pronzato (2010), an alternative approach is to collect more points at each step. For a special model, a related concept of ‘batch sequential design’ was employed by Müller and Pötscher (1992). In a forthcoming paper (Freise et al. 2020) we study a sequential adaptive algorithm for D-optimal design which we have called a ‘p-step-ahead algorithm’, since p design points are collected at each step. An idea of that algorithm was sketched by Ford et al. (1992) in the introduction of their paper, p. 570.