1 Introduction

The classical algorithm of Wynn (1970) for D-optimal design in linear regression models has motivated a particular scheme for sequential adaptive design in nonlinear regression models; see Freise (2016), Pronzato (2010) and Freise et al. (2019). We refer to this scheme as ‘the adaptive Wynn algorithm’. In a previous paper of the authors (Freise et al. 2019) the asymptotics of the sequences of designs and maximum likelihood estimators under the adaptive Wynn algorithm was studied for the important class of generalized linear models with univariate response. In the present paper the asymptotics of least squares estimators (LSEs) under the adaptive Wynn algorithm is studied, firstly, for the class of nonlinear models satisfying a condition of ‘saturated identifiability’ and, secondly, for a class of generalized linear models. As a main result, strong consistency of the adaptive LSEs is shown and, as a consequence, the asymptotic D-optimality of the generated design sequence is obtained (almost surely). Here ‘D-optimality’ means local D-optimality at the true parameter point. Moreover, asymptotic normality of the adaptive LSEs is obtained, where the asymptotic covariance matrix is given by the inverse of the locally D-optimal information matrix at the true parameter point. This shows, in particular, that compared to the nonadaptive LSEs under any fixed design, the adaptive LSEs from the adaptive Wynn algorithm are asymptotically more efficient in the sense of a smaller determinant of the asymptotic covariance matrix, unless, of course, a fixed design is used which is locally D-optimal at the true parameter point. However, the true parameter point is unknown, which prevents the use of that design. In contrast to the classical concept of a fixed design, the sequential adaptive method provided by the adaptive Wynn algorithm is not affected by the unknown parameter point: asymptotically the true parameter point can be identified and the adaptive designs become D-optimal.

Note that other, more practically motivated adaptive procedures for design and estimation restrict themselves to a finite number of adaptation stages, e.g. only two stages as in Dette et al. (2013) and Lane et al. (2014). Those papers addressed problems of asymptotic efficiency of adaptive maximum likelihood estimators for particular two-stage adaptive procedures, when the sample sizes of the stages go to infinity. The present paper, however, is exclusively concerned with the adaptive Wynn algorithm, which is an infinite-stage adaptive procedure.

Next we give an outline of our framework and the adaptive Wynn algorithm. Consider a nonlinear regression model with a real-valued mean response \(\mu (x,\theta )\), \(x\in \mathcal{X}\), \(\theta \in \varTheta \), where \(\mathcal{X}\) and \(\varTheta \) are the experimental region and the parameter space, respectively. Suppose that a family of \({\mathbb {R}}^p\)-valued functions \(f_\theta \), \(\theta \in \varTheta \), defined on \(\mathcal{X}\) has been identified such that the \(p\times p\) matrix \(f_\theta (x)\,f_\theta ^\mathsf{T}(x)\) is the elementary information matrix of \(x\in \mathcal{X}\) at \(\theta \in \varTheta \). Note that a vector \(a\in {\mathbb {R}}^p\) is written as a column vector and \(a^\mathsf{T}\) denotes its transpose, which is thus a p-dimensional row vector. A design \(\xi \) is a probability measure on \(\mathcal{X}\) with finite support. That is, \(\xi \) is described by its support, denoted by \(\mathrm{supp}(\xi )\), which is a nonempty finite subset of \(\mathcal{X}\), and by its weights \(\xi (x)\) for \(x\in \mathrm{supp}(\xi )\), which are positive real numbers with \(\sum _{x\in \mathrm{supp}(\xi )}\xi (x)\,=1\). The information matrix of a design \(\xi \) at \(\theta \in \varTheta \) is defined by

$$\begin{aligned} M(\xi ,\theta )\,=\,\sum _{x\in \mathrm{supp}(\xi )}\xi (x)\,f_\theta (x)\,f_\theta ^\mathsf{T}(x), \end{aligned}$$
(1.1)

which is a nonnegative definite \(p\times p\) matrix.
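As a concrete illustration of (1.1), the following minimal sketch (in Python with NumPy; the particular function f and the three-point design below are hypothetical choices, not part of the general framework) assembles the information matrix from the support points and weights of a design.

```python
import numpy as np

def information_matrix(support, weights, f_theta):
    """Information matrix M(xi, theta) of a design, cf. Eq. (1.1).

    support  : list of design points x in the experimental region
    weights  : positive weights xi(x) summing to one
    f_theta  : function x -> vector in R^p, i.e. f_theta(x) for a fixed theta
    """
    p = len(np.atleast_1d(f_theta(support[0])))
    M = np.zeros((p, p))
    for x, w in zip(support, weights):
        v = np.asarray(f_theta(x), dtype=float)
        M += w * np.outer(v, v)  # weighted elementary information matrix
    return M

# Hypothetical example with p = 2: f_theta(x) = (1, x)^T on three support points
M = information_matrix([0.0, 0.5, 1.0], [1/3, 1/3, 1/3], lambda x: np.array([1.0, x]))
print(M)  # a nonnegative definite 2 x 2 matrix
```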

In applications the family \(f_\theta \), \(\theta \in \varTheta \), will be related to the mean response \(\mu (x,\theta )\), \(x\in \mathcal{X}\), \(\theta \in \varTheta \). Usually, \(\varTheta \subseteq {\mathbb {R}}^p\) and \(f_\theta (x)\) is given by the gradient of \(\mu (x,\theta )\) w.r.t. \(\theta \) for each fixed x, or by a scalar multiple of that gradient where the scalar factor is a function of \(\theta \) and x, cf. Atkinson et al. (2014), Lemma 1. For particular classes of models described below, consistency of the adaptive LSEs will be achieved without any relation between the family \(f_\theta \), \(\theta \in \varTheta \), and the mean response \(\mu (x,\theta )\), \(x\in \mathcal{X}\), \(\theta \in \varTheta \), whereas asymptotic normality of the adaptive LSEs will require the gradient relation with scalar factor equal to 1. Throughout we assume the following basic conditions (B1)–(B4).

(B1) The experimental region \(\mathcal{X}\) is a compact metric space.

(B2) The parameter space \(\varTheta \) is a compact metric space.

(B3) The real-valued mean response function \((x,\theta )\mapsto \mu (x,\theta )\), defined on the Cartesian product space \(\mathcal{X}\times \varTheta \), is continuous.

(B4) The family \(f_\theta \), \(\theta \in \varTheta \), of \({\mathbb {R}}^p\)-valued functions on \(\mathcal{X}\) satisfies:

(i) for each \(\theta \in \varTheta \) the image \(f_\theta (\mathcal{X})\) spans \({\mathbb {R}}^p\);

(ii) the function \((x,\theta )\mapsto f_\theta (x)\) is continuous on \(\mathcal{X}\times \varTheta \).

By \({\mathbb {N}}\) and \({\mathbb {N}}_0\) we denote the set of all positive integers and the set of all nonnegative integers, respectively. By \(\delta _x\) for any \(x\in \mathcal{X}\) we denote the one-point probability distribution on \(\mathcal{X}\) concentrated at the point x. The adaptive Wynn algorithm iteratively collects design points \(x_i\in \mathcal{X}\), \(i\in {\mathbb {N}}\), while adaptively estimating \(\theta \) on the basis of the current design points and the observed (real-valued) responses at those points. In greater detail the algorithm reads as follows.

Adaptive Wynn algorithm

(o) Initialization

A positive integer \(n_{\mathrm{st}}\in {\mathbb {N}}\) and design points \(x_1,\ldots ,x_{n_{\tiny \mathrm{st}}}\in \mathcal{X}\) are chosen such that the starting design \(\xi _{n_{\tiny \mathrm{st}}}=\frac{1}{n_{\mathrm{st}}}\sum _{i=1}^{n_{\tiny \mathrm{st}}} \delta _{x_i}\) has positive definite information matrices, i.e., for all \(\theta \in \varTheta \) the information matrix \(M(\xi _{n_{\tiny \mathrm{st}}},\theta )\) is positive definite. Observed responses \(y_1,\ldots ,y_{n_{\tiny \mathrm{st}}}\in {\mathbb {R}}\) at the design points \(x_1,\ldots ,x_{n_{\tiny \mathrm{st}}}\) are taken, and an initial parameter estimate \(\theta _{n_{\tiny \mathrm{st}}}\in \varTheta \) is calculated,

$$\begin{aligned} \theta _{n_{\tiny \mathrm{st}}}={\widehat{\theta }}_{n_{\tiny \mathrm{st}}}(x_1,y_1,\ldots ,x_{n_{\tiny \mathrm{st}}},y_{n_{\tiny \mathrm{st}}})\,\in \varTheta . \end{aligned}$$

(i) Iteration

At stage \(n\ge n_{\mathrm{st}}\) the current data is given by the points \(x_1,\ldots ,x_n\in \mathcal{X}\) which form the design \(\xi _n=\frac{1}{n}\sum _{i=1}^n\delta _{x_i}\), and by the observed responses \(y_1,\ldots ,y_n\in {\mathbb {R}}\) at \(x_1,\ldots ,x_n\), respectively, along with a parameter estimate \(\theta _n\in \varTheta \),

$$\begin{aligned} \theta _n={\widehat{\theta }}_n(x_1,y_1,\ldots ,x_n,y_n)\,\in \varTheta . \end{aligned}$$
(1.2)

The iteration rule is given by

$$\begin{aligned} x_{n+1}\,=\,\arg \ \max _{x\in \mathcal{X}}f_{\theta _n}^\mathsf{T}(x)\,M^{-1}(\xi _n,\theta _n)\, f_{\theta _n}(x). \end{aligned}$$
(1.3)

An observation \(y_{n+1}\in {\mathbb {R}}\) of the response at \(x_{n+1}\) is taken and a new parameter estimate \(\theta _{n+1}\) based on the augmented data is computed,

$$\begin{aligned} \theta _{n+1}={\widehat{\theta }}_{n+1}(x_1,y_1,\ldots ,x_n,y_n,x_{n+1},y_{n+1})\,\in \varTheta . \end{aligned}$$

Replace n by \(n+1\) and repeat the iteration step (i). \(\square \)

Of course, Eq. (1.3) requires the information matrix \(M(\xi _n,\theta _n)\) to be positive definite at each stage \(n\ge n_{\mathrm{st}}\). In fact, this is ensured by the choice of the initial design \(\xi _{n_{\tiny \mathrm{st}}}\) since, obviously, the sequence of designs \(\xi _n\), \(n\ge n_{\mathrm{st}}\), satisfies

$$\begin{aligned}&\xi _{n+1}\,=\,\frac{n}{n+1}\,\xi _n\,+\,\frac{1}{n+1}\,\delta _{x_{n+1}}, \end{aligned}$$
(1.4)
$$\begin{aligned}&M(\xi _{n+1},\theta )\,=\,\frac{n}{n+1}\,M(\xi _n,\theta )\,+\,\frac{1}{n+1}\, f_\theta (x_{n+1})\,f_\theta ^\mathsf{T}(x_{n+1}), \quad \theta \in \varTheta , \end{aligned}$$
(1.5)

from which one concludes by induction that \(M(\xi _n,\theta )\) is positive definite for all \(n\ge n_{\mathrm{st}}\) and all \(\theta \in \varTheta \). The existence of an initial design \(\xi _{n_{\tiny \mathrm{st}}}\) as required will be shown in Sect. 2, Lemma 1. However, we have no general method or algorithm for constructing an initial design according to ‘step (o)’ of the algorithm. For some important classes of models there exists a saturated initial design (i.e., \(n_{\mathrm{st}}=p\)) which can easily be constructed, see Remark 1 in Sect. 2.
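For illustration, the following minimal sketch (Python with NumPy) performs one pass through step (i) on a finite candidate grid; the arguments f, candidate_grid, estimate_theta and observe are hypothetical placeholders for the family \(f_\theta \), a finite subset of the experimental region, the adaptive estimator and the data-generating mechanism, none of which are prescribed by the algorithm itself.

```python
import numpy as np

def wynn_step(points, responses, theta_n, f, candidate_grid, estimate_theta, observe):
    """One iteration of the adaptive Wynn algorithm on a finite candidate grid."""
    n = len(points)
    # Information matrix M(xi_n, theta_n) of the current design, cf. Eq. (1.1);
    # in an actual implementation it can be carried along via the rank-one
    # update (1.5) instead of being recomputed from scratch.
    M = sum(np.outer(f(x, theta_n), f(x, theta_n)) for x in points) / n
    M_inv = np.linalg.inv(M)
    # Iteration rule (1.3): maximize f_theta^T(x) M^{-1}(xi_n, theta_n) f_theta(x)
    values = [f(x, theta_n) @ M_inv @ f(x, theta_n) for x in candidate_grid]
    x_next = candidate_grid[int(np.argmax(values))]
    # Observe the response at the new point and update the parameter estimate
    y_next = observe(x_next)
    points, responses = points + [x_next], responses + [y_next]
    theta_next = estimate_theta(points, responses)
    return points, responses, theta_next
```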

The algorithm uses, in particular, an observed response \(y_i\) at each current design point \(x_i\). So the generated sequence of design points, \(x_i\), \(i\in {\mathbb {N}}\), and the corresponding sequence of designs \(\xi _n\), \(n\ge n_{\mathrm{st}}\), are random sequences with a particular dependence structure caused by Eqs. (1.2) and (1.3). An appropriate stochastic model, which was used in Freise et al. (2019) and goes back to Lai and Wei (1982), Lai (1994), and Chen et al. (1999), will be stated in Sect. 3. In particular, the generated sequence \(x_i\), \(i\in {\mathbb {N}}\), and the observed responses \(y_i\), \(i\in {\mathbb {N}}\), are viewed as values of random variables \(X_i\), \(i\in {\mathbb {N}}\), and \(Y_i\), \(i\in {\mathbb {N}}\), respectively, following a stochastic model which we call an ‘adaptive regression model’. Our formulation of the adaptive Wynn algorithm is a description of the paths of the stochastic process \((X_i,Y_i)\), \(i\in {\mathbb {N}}\).

The estimators \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), employed by the algorithm to produce the estimates \(\theta _n\), \(n\ge n_{\mathrm{st}}\), in (1.2), may be any estimators of \(\theta \) such that their values are in \(\varTheta \) and \({\widehat{\theta }}_n\) is a function of the data \(x_1,y_1,\ldots ,x_n,y_n\) available at stage n. Such estimators will be called adaptive estimators. Later, strong consistency of \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), will be required. In Sect. 3 we focus on adaptive LSEs \({\widehat{\theta }}_n^{(\mathrm{LS})}\), i.e.,

$$\begin{aligned} {\widehat{\theta }}_n^{(\mathrm{LS})}(x_1,y_1,\ldots ,x_n,y_n)\,=\,\arg \min _{\theta \in \varTheta } \sum _{i=1}^n \bigl \{ y_i-\mu (x_i,\theta ) \bigr \}^2. \end{aligned}$$
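A minimal sketch of how such an adaptive LSE can be computed numerically is given below (Python with SciPy); mu, theta_init and the box bounds describing \(\varTheta \) are hypothetical placeholders, and the solver only returns a local minimizer, so this is a sketch rather than a definitive implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def adaptive_lse(points, responses, mu, theta_init, lower, upper):
    """Least squares estimate based on the data (x_1, y_1), ..., (x_n, y_n)."""
    def residuals(theta):
        # Residuals y_i - mu(x_i, theta), whose sum of squares is minimized
        return np.array([y - mu(x, theta) for x, y in zip(points, responses)])

    fit = least_squares(residuals, x0=np.asarray(theta_init, dtype=float),
                        bounds=(lower, upper))
    return fit.x  # minimizer of the residual sum of squares over the box
```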

Note that when dealing with the adaptive LSEs we will not necessarily assume that the estimators \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), employed by the algorithm are given by the LSEs. Below two alternative conditions on the nonlinear model will be introduced. Either condition will ensure strong consistency of the LSEs, irrespective of which adaptive estimators \({\widehat{\theta }}_n\) are used in the algorithm. One condition is that of ‘saturated identifiability’ (SI), which was introduced and employed by Pronzato (2010) in the case of a finite experimental region \(\mathcal{X}\).

(SI) If \(z_1,\ldots ,z_p\in \mathcal{X}\) are pairwise distinct design points then the \({\mathbb {R}}^p\)-valued function on \(\varTheta \) given by \(\theta \mapsto \bigl (\mu (z_1,\theta ),\ldots ,\mu (z_p,\theta )\bigr )^\mathsf{T}\) is an injection, i.e., if \(\theta ,\theta '\in \varTheta \) and \(\mu (z_j,\theta )=\mu (z_j,\theta ')\), \(1\le j\le p\), then \(\theta =\theta '\).

Recall that p is the dimension of the functions \(f_\theta \). As mentioned above, in many applications one has \(\varTheta \subseteq {\mathbb {R}}^p\) and \(f_\theta (x)\) is given by the gradient of \(\mu (x,\theta )\) w.r.t. \(\theta \) for all \(x\in \mathcal{X}\). So, in these cases, p also coincides with the dimension of the parameter vector \(\theta \).

The other (alternative) condition states that the model is essentially a generalized linear model. We call this condition (\(\mathrm{GLM}^*\)), where the ‘star’ distinguishes it from a similar ‘condition (GLM)’ employed in Freise et al. (2019), which referred only to the family \(f_\theta \), ignoring the mean response function \(\mu \).

(\(\mathrm{GLM}^*\)) \(\varTheta \subseteq {\mathbb {R}}^p\), \(\mu (x,\theta )=G\bigl (f^\mathsf{T}(x)\,\theta \bigr )\), and \(f_\theta (x)=\psi (x,\theta )\,f(x)\) for all \((x,\theta )\in \mathcal{X}\times \varTheta \), where \(f\,:\,\mathcal{X}\longrightarrow {\mathbb {R}}^p\) and \(\psi \,:\,\mathcal{X}\times \varTheta \longrightarrow (\,0\,,\,\infty )\) are continuous functions, \(G\,:\,I\longrightarrow {\mathbb {R}}\) is a differentiable function on an open interval \(I\subseteq {\mathbb {R}}\) with continuous and positive derivative, \(G'(u)>0\) for all \(u\in I\), and I covers the values \(f^\mathsf{T}(x)\,\theta \) for all \((x,\theta )\in \mathcal{X}\times \varTheta \).

Note that G is called the inverse link function. If the functions \(f_\theta \) are assumed to be the gradients (w. r. t. \(\theta \)) of \(\mu \) then the form of \(\mu \) assumed in (\(\mathrm{GLM}^*\)) already implies \(f_\theta (x)=\psi (x,\theta )\,f(x)\) with \(\psi (x,\theta )=G'\bigl (f^\mathsf{T}(x)\,\theta \bigr )\) for all \((x,\theta )\in \mathcal{X}\times \varTheta \).
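In brief, for interior points \(\theta \) this is the chain rule applied to the composite mean response:

$$\begin{aligned} \nabla _\theta \,\mu (x,\theta )\,=\,\nabla _\theta \,G\bigl (f^\mathsf{T}(x)\,\theta \bigr )\,=\,G'\bigl (f^\mathsf{T}(x)\,\theta \bigr )\,f(x), \end{aligned}$$

so that \(\psi (x,\theta )=G'\bigl (f^\mathsf{T}(x)\,\theta \bigr )>0\).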

Example 1

(cf. Pronzato 2010, Examples 2 and 3; Hu 1998, Examples 2 and 3) Let \(p=2\), \(\varTheta \subseteq (\,0\,,\,\infty )^2\), \(\mathcal{X}\subseteq (\,0\,,\,\infty )\), and consider two regression models,

$$\begin{aligned} \mathrm{(a)}\ \ \mu (x,\theta )=\frac{\theta _1 x}{\theta _2+x}\,,\quad \mathrm{(b)}\ \ \mu (x,\theta )= \theta _1\exp \bigl (-\theta _2 x\bigr ), \end{aligned}$$

where \(\theta =(\theta _1,\theta _2)^\mathsf{T}\). Models (a) and (b) are called the ‘Michaelis-Menten model’ and the ‘exponential decay model’, respectively. Both models (a) and (b) satisfy condition (SI) which can be seen as follows. Consider model (a). Let \(z_1,z_2\in \mathcal{X}\) with \(z_1<z_2\) and \(\theta =(\theta _1,\theta _2)^\mathsf{T}, \theta '=(\theta _1',\theta _2')^\mathsf{T}\in \varTheta \) such that \(\theta _1z_j/(\theta _2 +z_j)=\theta _1' z_j/(\theta _2'+z_j)\) for \(j=1,2\). Then

$$\begin{aligned} \frac{\theta _1}{\theta _1'}=\frac{\theta _2+z_1}{\theta _2'+z_1}=\frac{\theta _2+z_2}{\theta _2'+z_2}. \end{aligned}$$
(1.6)

Since \(\partial /\partial t\,\bigl \{(\theta _2+t)/(\theta _2'+t)\bigr \}=(\theta _2'-\theta _2)/(\theta _2'+t)^2\) for \(t\in (\,0\,,\,\infty )\), the function \(t\mapsto (\theta _2+t)/(\theta _2'+t)\) on \((\,0\,,\,\infty )\) is either increasing (if \(\theta _2<\theta _2'\)) or decreasing (if \(\theta _2>\theta _2'\)) or constant (if \(\theta _2=\theta _2'\)). So, by \(z_1<z_2\) and (1.6), one gets \(\theta _2=\theta _2'\) and, again by (1.6), \(\theta _1=\theta _1'\), hence \(\theta =\theta '\). So (SI) holds for model (a). Consider model (b). Let \(z_1,z_2\in \mathcal{X}\) with \(z_1<z_2\) and \(\theta =(\theta _1,\theta _2)^\mathsf{T}, \theta '=(\theta _1',\theta _2')^\mathsf{T}\in \varTheta \) such that \(\theta _1\exp \bigl (-\theta _2 z_j\bigr )=\theta _1'\exp \bigl (-\theta _2' z_j\bigr )\) for \(j=1,2\). Then

$$\begin{aligned} \frac{\theta _1}{\theta _1'}=\exp \bigl \{(\theta _2-\theta _2')z_1\bigr \}=\exp \bigl \{(\theta _2-\theta _2')z_2\bigr \}, \end{aligned}$$

and by \(z_1<z_2\) this yields \(\theta _2'=\theta _2\), and hence \(\theta _1'=\theta _1\). So \(\theta '=\theta \) and (SI) has been verified for model (b). \(\square \)

Example 2

Generalized linear regression.

Let \(\varTheta \subseteq {\mathbb {R}}^p\) and \(\mu (x,\theta )=G\bigl (f^\mathsf{T}(x)\,\theta \bigr )\) for all \((x,\theta )\in \mathcal{X}\times \varTheta \), where \(f\,:\,\mathcal{X}\longrightarrow {\mathbb {R}}^p\) is a continuous function, and \(G\,:\,I\longrightarrow {\mathbb {R}}\) is an increasing continuous function on an interval \(I\subseteq {\mathbb {R}}\) with \(\bigl \{f^\mathsf{T}(x)\,\theta \,:\,(x,\theta )\in \mathcal{X}\times \varTheta \bigr \}\subseteq I\). Let \(z_1,\ldots ,z_p\in \mathcal{X}\) be pairwise distinct and \(\theta ,\theta '\in \varTheta \). Clearly, \(\mu (z_j,\theta )=\mu (z_j,\theta ')\) for all \(j=1,\ldots ,p\) is equivalent to \(f^\mathsf{T}(z_j)\,(\theta -\theta ')=0\) for all \(j=1,\ldots ,p\). So, for the present model, condition (SI) is equivalent to the following.

If \(\theta ,\theta '\in \varTheta \) and \(z_1,\ldots ,z_p\in \mathcal{X}\) are pairwise distinct such that \(\theta -\theta '\) is orthogonal to the vectors \(f(z_1),\ldots ,f(z_p)\), then \(\theta =\theta '\).

Assume that the parameter space \(\varTheta \) is nondegenerate in the sense that there is no hyperplane of \({\mathbb {R}}^p\) covering \(\varTheta \). Then the set of differences \(\{\theta -\theta '\,:\,\theta ,\theta '\in \varTheta \}\) spans \({\mathbb {R}}^p\), and hence condition (SI) is equivalent to the following condition.

(Ch) If \(z_1,\ldots ,z_p\in \mathcal{X}\) are pairwise distinct then the vectors \(f(z_1),\ldots ,f(z_p)\) are linearly independent.

The notation ‘Ch’ stands for Chebyshev since, as can easily be seen, condition (Ch) holds if and only if the (real-valued) component functions \(f_1,\ldots ,f_p\), say, of f constitute a Chebyshev system, i.e., every linear combination \(\sum _{j=1}^pc_jf_j(x)\), with coefficients \(c_j\in {\mathbb {R}}\) (\(1\le j\le p\)) not all equal to zero, has at most \(p-1\) distinct zeros on \(\mathcal{X}\), cf. Karlin (1968), p. 24. Usually, Chebyshev systems are considered to be functions of one real variable defined on an interval of the real line. The most prominent example is given by the monomials \(1,x,\ldots ,x^{p-1}\) on some interval. For example, condition (Ch) or (SI), respectively, does not hold for generalized first order regression in two variables on a rectangle \(\mathcal{X}=[a_1,b_1]\times [a_2,b_2]\), where \(f(x)=(1,x_1,x_2)^\mathsf{T}\) for all \(x=(x_1,x_2)\in \mathcal{X}\). In this example we have \(p=3\). Let \(z^{(1)},z^{(2)},z^{(3)}\in \mathcal{X}\) be pairwise distinct. One easily verifies that the vectors \(f(z^{(1)}),f(z^{(2)}),f(z^{(3)})\) are linearly independent if and only if the three points \(z^{(1)},z^{(2)},z^{(3)}\) do not lie on a common line. So (Ch) and hence (SI) do not hold. As a consequence of the present Example 2 one may suspect that the class of models satisfying condition (SI) is not very large. In particular, generalized linear regression models with more than one regressor will usually not be members of that class. However, such models will be included in our derivations by the class given by condition (\(\mathrm{GLM}^*\)). \(\square \)
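The failure of (Ch) in the two-variable example just discussed can also be checked numerically; a minimal sketch (Python with NumPy, using the unit square and hypothetical test points) evaluates the determinant of the matrix with columns \(f(z^{(1)}),f(z^{(2)}),f(z^{(3)})\):

```python
import numpy as np

f = lambda z: np.array([1.0, z[0], z[1]])  # f(x) = (1, x1, x2)^T, so p = 3

# Three pairwise distinct points on a common line of the square [0, 1]^2 ...
collinear = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
# ... and three points not lying on a common line
general = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

for points in (collinear, general):
    A = np.column_stack([f(z) for z in points])
    print(points, "det =", np.linalg.det(A))
# The determinant vanishes exactly when the three points lie on a common line;
# then f(z1), f(z2), f(z3) are linearly dependent and (Ch) fails.
```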

The employed stochastic model for the adaptive Wynn algorithm includes a martingale difference scheme for the error variables, see Sect. 3. Limit theorems for martingales can be applied: a Strong Law of Large Numbers and a Central Limit Theorem to prove strong consistency and asymptotic normality, respectively, of the adaptive LSEs, see Theorems 1 and 2. Some auxiliary results are the content of Sect. 2. The proofs of the results of Sects. 2 and 3 are presented in the Appendix.

2 Auxiliary results

Throughout we assume (B1)–(B4) as introduced in the previous section. Note, however, that (B3) will not play a role in this section. Firstly, we give a proof of the existence of an initial design as required in the algorithm.

Lemma 1

There exist an \(n_{\mathrm{st}}\in {\mathbb {N}}\) and design points \(x_1,\ldots ,x_{n_{\tiny \mathrm{st}}}\in \mathcal{X}\) such that for every \(\theta \in \varTheta \) the vectors \(f_\theta (x_1),\ldots ,f_\theta (x_{n_{\tiny \mathrm{st}}})\) span \({\mathbb {R}}^p\). Hence, for such \(x_i\), \(1\le i\le n_{\mathrm{st}}\), the design \(\xi _{n_{\tiny \mathrm{st}}}=\frac{1}{n_{\mathrm{st}}}\sum _{i=1}^{n_{\tiny \mathrm{st}}}\delta _{x_i}\) has the property that its information matrix \(M(\xi _{n_{\tiny \mathrm{st}}},\theta )\) is positive definite for all \(\theta \in \varTheta \).

Remark 1

Some popular nonlinear regression models, notably those with a scalar regressor variable x, i.e., \(\mathcal{X}\subseteq {\mathbb {R}}\), enjoy a further ‘Chebyshev property’ (\(\mathrm{Ch}^*\)), which was essentially assumed by Pronzato (2010) as condition \(\mathbf{H}_{\mathcal{X}^-}(\mathrm{iv})\). It states that condition (Ch) from Example 2 holds for each function \(f_\theta \):

(\(\mathrm{Ch}^*\)) If \(z_1,\ldots ,z_p\in \mathcal{X}\) are pairwise distinct and \(\theta \in \varTheta \) then the vectors \(f_\theta (z_1),\ldots ,f_\theta (z_p)\) are linearly independent.

If (\(\mathrm{Ch}^*\)) holds then a suitable initial design for the algorithm is provided by any saturated design, i.e., choose \(n_{\mathrm{st}}=p\) and pairwise distinct design points \(x_1,\ldots ,x_p\in \mathcal{X}\). Note also that in the proof of the lemma, assuming (\(\mathrm{Ch}^*\)), one has \(U(\theta )=\varTheta \) for all \(\theta \in \varTheta \) and hence \(r=1\) and \(n_{\mathrm{st}}=p\). As an example consider the regression models from Example 1, (a) and (b), together with the assumption that \(f_\theta \) is given by the gradient of \(\mu (x,\theta )\) w. r. t. \(\theta \), for all \((x,\theta )\in \mathcal{X}\times \varTheta \), which yields:

$$\begin{aligned} (\mathrm{a})\ \ f_\theta (x)=\Bigl (\frac{x}{\theta _2+x}\,,\,-\frac{\theta _1x}{(\theta _2+x)^2}\Bigr )^\mathsf{T},\quad (\mathrm{b})\ \ f_\theta (x)=\exp (-\theta _2x)\,\Bigl (\,1\,,\,-\theta _1x\Bigr )^\mathsf{T}\end{aligned}$$

for all \(x\in \mathcal{X}\subseteq (0\,,\,\infty )\) and \(\theta =(\theta _1,\theta _2)^\mathsf{T}\in \varTheta \subseteq (0\,,\,\infty )^2\). In either case (a) or (b), it is straightforward to show that for any given \(\theta \in \varTheta \) and \(x,x'\in \mathcal{X}\), \(x<x'\), the two vectors \(f_\theta (x),f_\theta (x')\) are linearly independent and hence both models (a) and (b) satisfy (\(\mathrm{Ch}^*\)).
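Indeed, in case (a) a direct computation of the determinant of the \(2\times 2\) matrix with columns \(f_\theta (x)\) and \(f_\theta (x')\) gives

$$\begin{aligned} \det \bigl (f_\theta (x),f_\theta (x')\bigr )\,=\, \frac{x}{\theta _2+x}\cdot \Bigl (-\frac{\theta _1x'}{(\theta _2+x')^2}\Bigr )\,-\, \frac{x'}{\theta _2+x'}\cdot \Bigl (-\frac{\theta _1x}{(\theta _2+x)^2}\Bigr )\,=\, \frac{\theta _1\,x\,x'\,(x'-x)}{(\theta _2+x)^2\,(\theta _2+x')^2}\,, \end{aligned}$$

which is positive for \(0<x<x'\) and \(\theta \in (\,0\,,\,\infty )^2\); case (b) is handled analogously.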

In general, however, Lemma 1 and its proof do not give a practical way of finding a value of r and thus of \(n_{\mathrm{st}}=rp\), or even an upper bound on them. Another condition ensuring the existence of an initial design with \(n_{\mathrm{st}}=p\) is condition (GLM) considered in Freise et al. (2019), which is the second half of condition (\(\mathrm{GLM}^*\)) from Sect. 1.

(GLM) \(f_\theta (x)\,=\,\psi (x,\theta )\,f(x)\) for all \((x,\theta )\in \mathcal{X}\times \varTheta \), where \(\psi \,:\,\mathcal{X}\times \varTheta \longrightarrow (\,0\,,\,\infty )\) and \(f\,:\mathcal{X}\longrightarrow {\mathbb {R}}^p\) are continuous functions.

In fact, (GLM) together with our basic assumption (B4) (i) implies that the image \(f(\mathcal{X})\) spans \({\mathbb {R}}^p\). So one can find p points \(x_1,\ldots ,x_p\in \mathcal{X}\) such that the vectors \(f(x_1),\ldots ,f(x_p)\) are linearly independent, and by (GLM) for every \(\theta \in \varTheta \) the vectors \(f_\theta (x_1),\ldots ,f_\theta (x_p)\) are linearly independent. So the saturated design \(\xi _p=\frac{1}{p}\sum _{i=1}^p\delta _{x_i}\) is an appropriate initial design for the algorithm. \(\square \)

Let any path of the adaptive Wynn algorithm be given as described in the previous section. In particular, \(x_i\), \(i\in {\mathbb {N}}\), is the sequence of design points and \(\xi _n=\frac{1}{n}\sum _{i=1}^n\delta _{x_i}\), \(n\ge n_{\mathrm{st}}\), is the corresponding sequence of designs. For the following two lemmas no assumption on the employed adaptive estimators \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), is needed. In other words, the sequence \(\theta _n\), \(n\ge n_{\mathrm{st}}\), of parameter estimates appearing in the path may be arbitrary.

We denote the distance function in the compact metric space \(\mathcal{X}\) by \(\mathrm{d}_\mathcal{X}(x,z)\), \(x,z\in \mathcal{X}\). If \(S_1\) and \(S_2\) are nonempty subsets of \(\mathcal{X}\) then the distance \(\mathrm{d}_\mathcal{X}(S_1,S_2)\) of \(S_1\) and \(S_2\) is defined by \(\mathrm{d}_\mathcal{X}(S_1,S_2)\,=\,\inf \{\mathrm{d}_\mathcal{X}(x,z)\,:\,x\in S_1,\ z\in S_2\}\). In case that \(S_1=\{x\}\) is a singleton we write \(\mathrm{d}_\mathcal{X}(x,S_2)\) instead of \(\mathrm{d}_\mathcal{X}(\{x\},S_2)\). If S is a nonempty subset of \(\mathcal{X}\) then the diameter of S is defined by \(\mathrm{diam}(S)\,=\,\sup \{\mathrm{d}_\mathcal{X}(x,z)\,:\,x,z\in S\}\).

Lemma 2

Suppose \(p\ge 2\). Let \(\varepsilon >0\) be given. Then there exist \(d>0\) and \(n_0\ge n_{\mathrm{st}}\) such that

$$\begin{aligned} \xi _n(S)\le \frac{1}{p}+\varepsilon \ \hbox { for all } \emptyset \not =S\subseteq \mathcal{X}\hbox { with }\mathrm{diam}(S)\le d\hbox { and all } n\ge n_0. \end{aligned}$$

Lemma 3

Suppose \(p\ge 2\). There exist \(n_0\ge n_{\mathrm{st}}\), \(\pi _0>0\), and \(d_0>0\) such that the following holds.

For each \(n\ge n_0\) there are p subsets \(S_{1,n},S_{2,n},\ldots ,S_{p,n}\) of \(\mathcal{X}\) such that

\(\xi _n(S_{j,n})\ge \pi _0\), \(1\le j\le p\), \(\mathrm{diam}(S_{j,n})\le d_0\), \(1\le j\le p\), and

\(\mathrm{d}_\mathcal{X}(S_{j,n},S_{k,n})\ge d_0\), \(1\le j< k\le p\).

Remark 2

In the case that \(\mathcal{X}\) is finite it is easily seen that in Lemma 3 the subsets \(S_{1,n},\ldots ,S_{p,n}\) can be chosen to be singletons for all \(n\ge n_0\). So, in this case, the lemma yields the result of Lemma 2 of Pronzato (2010). \(\square \)

3 Convergence of least squares estimators

For an analysis of the adaptive Wynn algorithm, the generated sequence \(x_i\), \(i\in {\mathbb {N}}\), of design points and the observed (real-valued) responses \(y_i\), \(i\in {\mathbb {N}}\), are viewed as values of random variables \(X_i\), \(i\in {\mathbb {N}}\), and \(Y_i\), \(i\in {\mathbb {N}}\), respectively, whose dependence structure is described by the following two assumptions (A1) and (A2), see Lai and Wei (1982), Lai (1994) and Chen et al. (1999), see also Freise et al. (2019). The model thereby stated might be called an ‘adaptive regression model’. By \({\overline{\theta }}\) we denote the true point of the parameter space \(\varTheta \) governing the data. All the random variables appearing in this section are thought to be defined on a common probability space \((\varOmega ,\mathcal{F},{\mathbb {P}}_{{\overline{\theta }}})\), where \(\varOmega \) is a nonempty set, \(\mathcal{F}\) is a sigma-field of subsets of \(\varOmega \), and \({\mathbb {P}}_{{\overline{\theta }}}\) is a probability measure on \(\mathcal{F}\) corresponding to the true parameter point \({\overline{\theta }}\). We assume, as before, the basic conditions (B1)–(B4), and now additionally the following conditions (A1) and (A2) constituting the adaptive regression model.

(A1) There is a given nondecreasing sequence of sub-sigma-fields of \(\mathcal{F}\), \(\mathcal{F}_0\subseteq \mathcal{F}_1\subseteq \,\ldots \,\subseteq \mathcal{F}_n\subseteq \,\ldots \) such that for each \(i\in {\mathbb {N}}\) the random variable \(X_i\) is \(\mathcal{F}_{i-1}\)-measurable and the random variable \(Y_i\) is \(\mathcal{F}_i\)-measurable.

(A2) \(Y_i\,=\,\mu (X_i,{\overline{\theta }})\,+\,e_i\) with real-valued square integrable random errors \(e_i\) such that \(\mathrm{E}\bigl (e_i\,\big |\,\mathcal{F}_{i-1}\bigr )\,=0 \ \hbox {\,a.s.}\) for all \(i\in {\mathbb {N}}\), and \(\sup _{i\in {\mathbb {N}}}\mathrm{E}\bigl (e_i^2\,\big |\,\mathcal{F}_{i-1}\bigr )\,<\infty \ \ \hbox {a.s.}\)

As before, \({\widehat{\theta }}_n\), \(n\ge n_{\mathrm{st}}\), are the adaptive estimators employed by the algorithm, now viewed as random variables, \({\widehat{\theta }}_n={\widehat{\theta }}_n(X_1,Y_1,\ldots ,X_n,Y_n)\). Of course, a desirable property of these estimators would be strong consistency, i.e., almost sure convergence to \({\overline{\theta }}\) (as \(n\rightarrow \infty \)), for short \({\widehat{\theta }}_n\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,{\overline{\theta }}\).

Remark 3

As shown in our previous paper (Freise et al. 2019), Corollary 3.2, if the estimators \({\widehat{\theta }}_n\) are strongly consistent, then the sequence \(\xi _n\), \(n\ge n_{\mathrm{st}}\), of (random) designs generated by the algorithm is almost surely asymptotically D-optimal, in the sense that \(M(\xi _n,{\widehat{\theta }}_n)\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,M(\xi ^*_{{\overline{\theta }}},{\overline{\theta }})\), where \(\xi ^*_{{\overline{\theta }}}\) is a locally D-optimal design at \({\overline{\theta }}\). In fact, the conclusion of that corollary is stronger: if the estimators \({\widehat{\theta }}_n\) are strongly consistent then \(M(\xi _n,{\widetilde{\theta }}_n)\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,M(\xi ^*_{{\overline{\theta }}},{\overline{\theta }})\) holds for every strongly consistent sequence of \(\varTheta \)-valued estimators \({\widetilde{\theta }}_n\). \(\square \)

The next result yields strong consistency of the adaptive LSEs \({\widehat{\theta }}_n^{(\mathrm LS)}\) for any adaptive estimators \({\widehat{\theta }}_n\) employed by the algorithm, provided that condition (SI) or condition (\(\mathrm{GLM}^*\)) holds.

Theorem 1

Assume that condition (SI) or condition (\(\mathrm{GLM}^*\)) is satisfied. Then, irrespective of the employed sequence of adaptive estimators \({\widehat{\theta }}_n\) in the algorithm, the sequence of adaptive LSEs \({\widehat{\theta }}_n^{(\mathrm LS)}\) is strongly consistent: \({\widehat{\theta }}_n^{(\mathrm LS)}\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,{\overline{\theta }}\).

Remark 4

A crucial point in the proof of Theorem 1 is the result of ‘Step 2’ stating that \(\inf _{\theta \in C({\overline{\theta }},\varepsilon )}D_n(\theta ,{\overline{\theta }})\) goes to infinity at least as fast as n a.s. when \(n\rightarrow \infty \). This is due to the adaptive Wynn algorithm, while the results of ‘Step 1’ and ‘Step 3’ only use the model assumptions (A1) and (A2). More general adaptive sampling schemes modeled by (A1) and (A2) were addressed in Pronzato (2009). By Theorem 1 of that paper, in the case of a finite design space, strong consistency of adaptive LSEs already holds if the adaptive scheme is such that

$$\begin{aligned} \inf _{\theta \in C({\overline{\theta }},\varepsilon )}D_n(\theta ,{\overline{\theta }})\Big /(\log n)^\rho \,\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,\infty \end{aligned}$$

for all \(\varepsilon >0\), for some \(\rho >1\). In Theorem 1 of Pronzato (2010), see also Remark 1 in Pronzato (2009), it was claimed that in the case of i.i.d. error variables \(e_i\), \(i\in {\mathbb {N}}\), and for a finite design space the condition may be weakened to

$$\begin{aligned} \inf _{\theta \in C({\overline{\theta }},\varepsilon )}D_n(\theta ,{\overline{\theta }})\Big /(\log \log n)\,\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,\infty . \end{aligned}$$

However, the proof of the latter in Pronzato (2010), pp. 210–211, is doubtful since the classical Law of Iterated Logarithm is applied to random subsequences of the error variables. A proof should rather use a martingale structure and, in particular, a Law of Iterated Logarithm for martingales. Unfortunately, we have not found the appropriate arguments. \(\square \)

For deriving asymptotic normality of the adaptive least squares estimators further assumptions are needed. Firstly, the ‘gradient condition’ (B5) on the family of functions \(f_\theta \), \(\theta \in \varTheta \), and the mean response \(\mu \) is added to conditions (B1)–(B4). Secondly, two additional conditions (L) and (AH) on the error variables in (A1)–(A2) are imposed, where ‘L’ stands for ‘Lindeberg’ and ‘AH’ for ‘asymptotic homoscedasticity’.

(B5) \(\varTheta \subseteq {\mathbb {R}}^p\) (endowed with the usual Euclidean metric), \(\mathrm{int}(\varTheta )\not =\emptyset \), where \(\mathrm{int}(\varTheta )\) denotes the interior of \(\varTheta \) as a subset of \({\mathbb {R}}^p\), the function \(\theta \mapsto \mu (x,\theta )\) is twice differentiable on \(\mathrm{int}(\varTheta )\) for each fixed \(x\in \mathcal{X}\), with gradients and Hessian matrices denoted by \(\nabla \mu (x,\theta )= \Bigl (\frac{\partial }{\partial \theta _1}\mu (x,\theta ),\ldots ,\frac{\partial }{\partial \theta _p}\mu (x,\theta )\Bigr )^\mathsf{T}\) and \(\nabla ^2\mu (x,\theta )=\Bigl (\frac{\partial ^2}{\partial \theta _i\partial \theta _j}\mu (x,\theta )\Bigr )_{1\le i,j\le p}\), respectively, for \(\theta =(\theta _1,\ldots ,\theta _p)^\mathsf{T}\in \mathrm{int}(\varTheta )\) and \(x\in \mathcal{X}\). It is assumed that the functions \((x,\theta )\mapsto \nabla \mu (x,\theta )\) and \((x,\theta )\mapsto \nabla ^2\mu (x,\theta )\) are continuous on \(\mathcal{X}\times \mathrm{int}(\varTheta )\) and

$$\begin{aligned} f_\theta (x)\,=\,\nabla \mu (x,\theta )\quad \hbox {for all } x\in \mathcal{X} \hbox { and all } \theta \in \mathrm{int}(\varTheta ). \end{aligned}$$

For a subset \(A\subseteq \varOmega \) we denote by \({\mathbb {1}}(A)\) the indicator function on \(\varOmega \), which is constantly equal to 1 on A and constantly equal to 0 on \(\varOmega {\setminus } A\).

(L) \(\frac{1}{n}\sum _{i=1}^n \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\,\Big |\mathcal{F}_{i-1}\Bigr ) \,\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,0\) for all \(\varepsilon >0\).

(AH) \(\mathrm{E}\bigl (e_n^2\big |\mathcal{F}_{n-1}\bigr )\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,\sigma ^2({\overline{\theta }})\) for some positive real constant \(\sigma ({\overline{\theta }})\).

The following two conditions (L’) and (L”) are less technical than the Lindeberg condition (L), and each of them implies (L).

(L’) \(\sup _{i\in {\mathbb {N}}}\mathrm{E}\bigl (|e_i|^\alpha \big |\mathcal{F}_{i-1}\bigr )\,<\,\infty \) a.s. for some real \(\alpha >2\).

(L”) The random variables \(e_i\), \(i\in {\mathbb {N}}\), are identically distributed, and \(e_i\), \(\mathcal{F}_{i-1}\) are independent for each \(i\in {\mathbb {N}}\).

In fact, from (L’), observing the trivial inequality \(e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\le |e_i|^\alpha /(\varepsilon \sqrt{n})^{\alpha -2}\), it follows that

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\,\Big |\mathcal{F}_{i-1}\Bigr )\, \le \,\frac{1}{(\varepsilon \sqrt{n})^{\alpha -2}}\,\sup _{i\in {\mathbb {N}}}\mathrm{E}\bigl (|e_i|^\alpha \big |\mathcal{F}_{i-1}\bigr ) \ \,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,0. \end{aligned}$$

From (L”) it follows that, for all \(i\in {\mathbb {N}}\),

$$\begin{aligned} \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\,\big |\mathcal{F}_{i-1}\Bigr )= \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\Bigr )= \mathrm{E}\Bigl (e_1^2\,{\mathbb {1}}\bigl (|e_1|>\varepsilon \sqrt{n}\bigr )\Bigr ) \ \hbox { a.s.} \end{aligned}$$

Hence

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n \mathrm{E}\Bigl (e_i^2\,{\mathbb {1}}\bigl (|e_i|>\varepsilon \sqrt{n}\bigr )\,\Big |\mathcal{F}_{i-1}\Bigr ) = \mathrm{E}\Bigl (e_1^2\,{\mathbb {1}}\bigl (|e_1|>\varepsilon \sqrt{n}\bigr )\Bigr )\ \hbox {a.s.} \end{aligned}$$

and the expectation on the r.h.s. converges to zero as \(n\rightarrow \infty \) by dominated convergence, since \(e_1\) is square integrable. Note also that (L”) implies \(\mathrm{E}\bigl (e_i^2\big |\mathcal{F}_{i-1}\bigr )=\mathrm{E}\bigl (e_1^2\bigr )=\sigma ^2({\overline{\theta }})\), say. Excluding the trivial case \(\sigma ^2({\overline{\theta }})=0\), we see that condition (L”) also implies condition (AH).

Remark 5

Condition (L’) was employed by Lai and Wei (1982), Theorem 1 of that paper, and by Chen et al. (1999), condition (C4) on p. 1161 of that paper. Condition (L”) meets the assumption of i.i.d. error variables of Pronzato (2010) for a particular choice of the sequence of sub-sigma-fields \(\mathcal{F}_i\), \(i\in {\mathbb {N}}_0\). \(\square \)

The k-dimensional normal distribution with expectation 0 and covariance matrix C is denoted by \(\mathrm{N}(0,C)\), where C is a positive definite \(k\times k\) matrix. In particular, \(\mathrm{N}(0,I_k)\) is the k-dimensional standard normal distribution, where \(I_k\) denotes the \(k\times k\) identity matrix. For a sequence \(W_n\) of \({\mathbb {R}}^k\)-valued random variables, convergence in distribution of \(W_n\) (as \(n\rightarrow \infty \)) to a k-dimensional normal distribution \(\mathrm{N}(0,C)\) is abbreviated by \(W_n\,{\mathop {\longrightarrow }\limits ^\mathrm{d}}\,\mathrm{N}(0,C)\). In the following theorem asymptotic normality of the adaptive least squares estimators \({\widehat{\theta }}_n^{\mathrm{(LS)}}\) is established. To some extent our proof is similar to that of Theorem 2 in Pronzato (2009), though the assumptions are different. Note that, by our Theorem 1, the assumptions of strong consistency of the adaptive estimators \({\widehat{\theta }}_n\) employed by the algorithm and of strong consistency of the adaptive LSEs \({\widehat{\theta }}_n^{\mathrm{(LS)}}\) are met if \({\widehat{\theta }}_n={\widehat{\theta }}_n^{\mathrm{(LS)}}\), \(n\ge n_{\mathrm{st}}\), and if one of the conditions (SI) and (\(\mathrm{GLM}^*\)) holds.

Theorem 2

Assume conditions (B5), (L), and (AH). Moreover, assume that \({\overline{\theta }}\in \mathrm{int}(\varTheta )\) and the sequences \({\widehat{\theta }}_n\) and \({\widehat{\theta }}_n^{\mathrm{(LS)}}\) of adaptive estimators employed by the algorithm and adaptive LSEs, respectively, are strongly consistent, i.e., \({\widehat{\theta }}_n\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,{\overline{\theta }}\) and \({\widehat{\theta }}_n^{\mathrm{(LS)}}\,{\mathop {\longrightarrow }\limits ^\mathrm{a.s.}}\,{\overline{\theta }}\). Then, denoting by \(M_*({\overline{\theta }})=M\bigl (\xi ^*_{{\overline{\theta }}},{\overline{\theta }}\bigr )\) the information matrix of a locally D-optimal design at \({\overline{\theta }}\), one has

$$\begin{aligned} \sqrt{n}\,\bigl ({\widehat{\theta }}_n^\mathrm{(LS)}-{\overline{\theta }}\bigr ) \,\,{\mathop {\longrightarrow }\limits ^\mathrm{d}}\,\,\mathrm{N}\bigl (0,\sigma ^2({\overline{\theta }})\,M_*^{-1}({\overline{\theta }})\bigr ). \end{aligned}$$
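In practical terms, the theorem justifies the normal approximation

$$\begin{aligned} {\widehat{\theta }}_n^{\mathrm{(LS)}}\ \approx \ \mathrm{N}\Bigl ({\overline{\theta }}\,,\,\tfrac{1}{n}\,\sigma ^2({\overline{\theta }})\,M_*^{-1}({\overline{\theta }})\Bigr ) \end{aligned}$$

for large n, which underlies the comparison with the simulated mean squared errors and the normal qq-plot in the example below.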

To illustrate the convergence results obtained, we present some simulations for the Michaelis-Menten model from Example 1, case (a). For the exponential decay model of part (b) of Example 1 we obtained similar simulation results, which are not reported here.

Example 3: Simulation. Assume the Michaelis-Menten model with \(p=2\) parameters,

$$\begin{aligned}&\mu (x,\theta )=\frac{\theta _1 x}{\theta _2+x},\\&f_\theta (x)=\nabla \mu (x,\theta )\,=\,\Bigl (\frac{x}{\theta _2+x}\,,\, -\frac{\theta _1x}{(\theta _2+x)^2}\Bigr )^\mathsf{T}\end{aligned}$$

for \(x\in \mathcal{X}\) and \(\theta =(\theta _1,\theta _2)^\mathsf{T}\in \varTheta \). Let the experimental region be given by the interval \(\mathcal{X}=[\,0.5\,,\,5\,]\) and the parameter space be the square \(\varTheta =[\,0.1\,,\,10\,]^2\). The true parameter point is chosen to be \({\overline{\theta }}=(1,1)^\mathsf{T}\). The error variables \(e_i\), \(i\in {\mathbb {N}}\), in (A2) are assumed to be i.i.d. normally distributed with expectation zero and variance \(\sigma ^2({\overline{\theta }})=0.04\). By simulation, S=10,000 paths \(X_i^{(s)}, Y_i^{(s)}\), \(1\le i\le 500\), where \(s=1,\ldots ,S\), of the adaptive Wynn algorithm were generated, where the employed adaptive estimators \({\widehat{\theta }}_n\) were chosen to be the adaptive LSEs: \({\widehat{\theta }}_n={\widehat{\theta }}_n^{\mathrm{(LS)}}\) for all n. The computation of the paths of adaptive LSEs was carried out in R (R Core Team 2020). A fixed initial design \(\xi _{n_{\mathrm{st}}}\) was used: the three-point design with equal weights at the boundary points and the midpoint of the experimental interval, that is, \(n_{\mathrm{st}}=3\) and \(x_1=0.5\), \(x_2=2.75\), \(x_3=5\). Figure 1 illustrates the (almost sure) asymptotic D-optimality of the design sequence generated by the adaptive Wynn algorithm, which is ensured by Theorem 1 and Remark 3. The simulated paths \(\xi _n^{(s)}\), \(3\le n\le 500\), where \(s=1,\ldots ,S\), yield D-efficiencies

$$\begin{aligned} \mathrm{eff}(\xi _n^{(s)})\,=\,\Bigl \{\det \bigl (M(\xi _n^{(s)},{\overline{\theta }})\bigr )\big /\det \bigl (M_*({\overline{\theta }})\bigr )\Bigr \}^{1/2}, \end{aligned}$$

where \(M_*({\overline{\theta }})=M(\xi ^*_{{\overline{\theta }}},{\overline{\theta }})\) is the information matrix of the locally D-optimal design at \({\overline{\theta }}\), which is the two-point design giving equal weights 1/2 to the points 5/7 and 5, see Bates and Watts (1988), pp. 125–126, and hence

$$\begin{aligned} M_*({\overline{\theta }})\,=\,\tfrac{1}{2}\,f_{{\overline{\theta }}}(5/7)\,f_{{\overline{\theta }}}^\mathsf{T}(5/7)\,+\,\tfrac{1}{2}\,f_{{\overline{\theta }}}(5)\,f_{{\overline{\theta }}}^\mathsf{T}(5). \end{aligned}$$

For each \(n\in \{3,\ldots ,500\}\), particular quantiles of the ‘data’ \(\mathrm{eff}(\xi _n^{(s)})\), \(1\le s\le S\), are reported in Fig. 1: minimum, 10%-quantile, 25%-quantile, median, 75%-quantile, 90%-quantile, and maximum. For example, the 10%-quantile curve shows that after less than 50 iterations more than 90% of the simulated paths yield efficiencies of at least 0.9. The asymptotic normality of the adaptive LSEs ensured by Theorem 2 suggests that n times the mean squared error matrix of \({\widehat{\theta }}_n^{\mathrm{(LS)}}\) at \({\overline{\theta }}\) should converge to the asymptotic covariance matrix \(\sigma ^2({\overline{\theta }})\,M_*^{-1}({\overline{\theta }})\). In fact, this was observed for n times the simulated mean squared error matrix. Figure 2 shows a plot of the (2, 2)-entry of that matrix, that is, n times the simulated mean squared error of the second component \({\widehat{\theta }}_{2,n}^{\mathrm{(LS)}}\) of \({\widehat{\theta }}_n^{\mathrm{(LS)}}\). The (2, 2)-entry of the asymptotic covariance matrix is approximately 3.32, indicated by the horizontal line in the figure. Finally, by Fig. 3 the approximate normal distribution of \({\widehat{\theta }}_{2,n}^{\mathrm{(LS)}}\) is visualized at \(n=250\) via a histogram of the simulated estimates (along with a fitted normal density) and a normal qq-plot. The simulation yielded similar graphics for the first component of \({\widehat{\theta }}_n^{\mathrm{(LS)}}\).
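The quantities used in this example can be reproduced with a few lines of code; the following sketch (Python with NumPy, as a small counterpart to the R computations actually used for the simulations) evaluates \(M_*({\overline{\theta }})\) from the two-point locally D-optimal design, the scaled inverse \(\sigma ^2({\overline{\theta }})\,M_*^{-1}({\overline{\theta }})\), whose (2, 2)-entry is approximately 3.32, and the D-efficiency of an arbitrary design.

```python
import numpy as np

theta_bar = np.array([1.0, 1.0])   # true parameter point
sigma2 = 0.04                      # error variance sigma^2(theta_bar)

def f(x, th):
    """Gradient of the Michaelis-Menten mean response with respect to theta."""
    return np.array([x / (th[1] + x), -th[0] * x / (th[1] + x) ** 2])

def info_matrix(points, weights, th):
    return sum(w * np.outer(f(x, th), f(x, th)) for x, w in zip(points, weights))

# Locally D-optimal design at theta_bar: equal weights 1/2 at the points 5/7 and 5
M_star = info_matrix([5 / 7, 5.0], [0.5, 0.5], theta_bar)
asymptotic_cov = sigma2 * np.linalg.inv(M_star)
print(asymptotic_cov[1, 1])        # approximately 3.32, cf. Fig. 2

def d_efficiency(points, weights, th):
    """D-efficiency of a design relative to the locally D-optimal design (p = 2)."""
    M = info_matrix(points, weights, th)
    return np.sqrt(np.linalg.det(M) / np.linalg.det(M_star))

print(d_efficiency([0.5, 2.75, 5.0], [1/3, 1/3, 1/3], theta_bar))  # initial design
```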

Fig. 1 Quantile curves of the simulated D-efficiencies. From bottom to top: minimum (dots and dashes), 10%-quantile (dotted), 25%-quantile (dashed), median (solid), 75%-quantile (dashed), 90%-quantile (dotted), maximum (dots and dashes)

Fig. 2 Plot of n times the simulated variance of \({\widehat{\theta }}_{2,n}^{\mathrm{(LS)}}\). The horizontal line indicates the asymptotic variance (\(\approx 3.32\))

Fig. 3 Step \(n=250\): histogram estimate and fitted normal density (left) and normal QQ-plot (right) of \({\widehat{\theta }}_{2,n}^{\mathrm{(LS)}}\), from the simulation

4 Discussion

The adaptive Wynn algorithm for nonlinear regression provides a particular adaptive sampling scheme which has been motivated by the classical iterative procedure established by Wynn (1970) for generating D-optimal designs under a linear model. In the nonlinear situation the strong consistency of the adaptive estimators employed by the algorithm is crucial for ensuring that the generated adaptive design sequence is asymptotically D-optimal in the sense of local D-optimality at the true parameter point (which, of course, is unknown). Note that choosing a locally optimal design would be best if the true parameter point were known. The focus of the present paper is on adaptive least squares estimators (LSEs), and their strong consistency has been proved for two relevant classes of nonlinear (univariate) regression models: firstly, those which have the property of ‘saturated identifiability’ (SI) and, secondly, those which satisfy a condition (\(\mathrm{GLM}^*\)) including the class of generalized linear models. Condition (SI) seems to be restricted to models with a real-valued regressor variable, as the examples in Sect. 1 have shown. Generalized linear models (GLMs) constitute a large and important class of models. However, for a GLM, maximum likelihood or weighted least squares will usually be preferable to ordinary (unweighted) least squares as studied in the present paper. We note that adaptive maximum likelihood estimation in GLMs under the adaptive Wynn algorithm was studied in our previous paper (Freise et al. 2019). A challenging question for future research is to find extensions to other classes of models than just the (SI) class and, with regard to GLMs, to include weighted least squares estimation.

On the basis of strong consistency of the adaptive LSEs their asymptotic normality has been established, under further suitable assumptions. The asymptotic covariance matrix is given by the inverse information matrix of the locally D-optimal design at the true parameter point. Thus, when measuring the size of a (nonsingular) covariance matrix by its determinant, the result implies the asymptotic efficiency of the adaptive LSEs. Two assumptions may be restrictive: firstly, the information matrices are built by a local first order Taylor expansion of the response (locally at a parameter point), without any scaling adjustment for possible variance heterogeneity. Secondly, a condition of ‘asymptotic homoscedasticity’ (AH) on the random errors has been imposed, that is, their (conditional) variances are assumed to become asymptotically constant, where the asymptotic variance may depend on the true parameter point. Both assumptions correspond to the ordinary (unweighted) least squares employed by the adaptive LSEs under consideration. Again, extensions of the results employing weaker assumptions are desirable, in order to include models with variance heterogeneity such as the majority of GLMs.

While the adaptive Wynn algorithm collects one point at each step and was therefore called ‘one-step-ahead algorithm’ by Pronzato (2010), an alternative approach is to collect more points at each step. For a special model, a related concept of ‘batch sequential design’ was employed by Müller and Pötscher (1992). In a forthcoming paper (Freise et al. 2020) we study a sequential adaptive algorithm for D-optimal design which we have called a ‘p-step-ahead algorithm’, since p design points are collected at each step. An idea of that algorithm was sketched by Ford et al. (1992) in the introduction of their paper, p. 570.