1 Introduction

Let \(\{F_{\theta }\), \(\theta \in {\varTheta }\}\) be a family of absolutely continuous distribution functions on the real line and denote the corresponding densities by \(\{f_{\theta }\), \(\theta \in {\varTheta }\}\). For any convex function \(\phi \) on the positive half of the real line, the quantity

$$\begin{aligned} S_\phi (\theta )=\int _{-\infty }^{\infty }\phi \left( \frac{f_{\theta }(x)}{f_{\theta _0}(x)}\right) f_{\theta _0}(x)\mathrm{d}x \end{aligned}$$

is called the \(\phi \)-divergence between the distributions \(F_{\theta }\) and \(F_{\theta _0}\). The \(\phi \)-divergences, introduced by Csiszár (1963) as information-type measures, have several statistical applications including estimation. Although Csiszár (1977) describes how this measure can also be used for discrete distributions, we are concerned with the case of absolutely continuous distributions in the present paper.

Let \(\xi _1,\ldots ,\xi _{n-1}\) be a sequence of independent and identically distributed (i.i.d.) random variables (r.v.’s) from \(F_{\theta _0}\), \(\theta _0\in {\varTheta }\). The goal is to estimate the unknown parameter \(\theta _0\). If \(\phi (x)=-\log x\), then \(S_\phi (\theta )\) is known as the Kullback–Leibler divergence between \(F_{\theta }\) and \(F_{\theta _0}\). In this case, a consistent estimator of this \(S_\phi (\theta )\) is given by

$$\begin{aligned} \frac{1}{n-1}\sum _{i=1}^{n-1}\phi \left( \frac{f_{\theta }(\xi _i)}{f_{\theta _0}(\xi _i)}\right) =-\frac{1}{n-1}\sum _{i=1}^{n-1}\log \frac{f_{\theta }(\xi _i)}{f_{\theta _0}(\xi _i)} . \end{aligned}$$
(1)

Minimization of this statistic with respect to \(\theta \) is equivalent to maximization of the log-likelihood function, \(\sum _{i=1}^{n-1}\log f_{\theta }(\xi _i)\). Thus, by finding a value of \(\theta \in {\varTheta }\) that minimizes (1), we obtain the well-known maximum likelihood estimator (MLE) of \(\theta _0\). Note that, in order to minimize the right-hand side of (1) with respect to \(\theta \), we do not need to know the value of \(\theta _0\). On the other hand, for convex functions other than \(\phi (x)=-\log x\), minimization of the left-hand side of (1) with respect to \(\theta \) would require knowledge of \(\theta _0\), the very parameter that is to be estimated. Thus, for general convex \(\phi \) functions, it is not obvious how to approximate \(S_\phi (\theta )\) in order to obtain a statistic that can be used for estimating the parameters.

One solution to this problem was provided by Beran (1977), who proposed that \(f_{\theta _0}\) could be estimated by a suitable nonparametric estimator \({\hat{f}}\) (e.g., a kernel estimator) in the first stage, and in the second stage, the estimator of \(\theta _0\) should be chosen as any parameter value \(\theta \in {\varTheta }\) that minimizes the approximation

$$\begin{aligned} {{\hat{S}}}_{\phi }(\theta )=\int _{-\infty }^{\infty }{\hat{f}}(x)\phi \left( \frac{f_{\theta }(x)}{{\hat{f}}(x)}\right) \mathrm{d}x \end{aligned}$$
(2)

of \(S_\phi (\theta )\). In the estimation method suggested by Beran (1977), the function \(\phi (x)= \frac{1}{2}|1-\sqrt{x}|^2\) was used, and this particular case of \(\phi \)-divergence is known as the squared Hellinger distance. Read and Cressie (1988, p. 124) put forward a similar idea based on power divergences, and it should be noted that the family of power divergences is a subclass of the family of \(\phi \)-divergences. The general approach of estimating parameters by minimizing a distance (or a divergence) between a nonparametric density estimate and the model density over the parameter space has been further extended by subsequent authors, and many of these procedures combine strong robustness features with asymptotic efficiency. See Basu et al. (2011) and the references therein for details.

Here, we propose an alternate approach obtained by approximating \(S_\phi (\theta )\) using the sample spacings. Let \(\xi _{(1)} \le \cdots \le \xi _{(n-1)}\) denote the ordered sample of \(\xi _1,\ldots ,\xi _{n-1}\), and let \(\xi _{(0)}=-\,\infty \) and \(\xi _{(n)}=\infty \). For a positive integer \(m=m(n)\), small relative to \(n\), we put \(k=n/m\). Without loss of generality, when stating asymptotic results, we may assume that \(k=k(n)\) is an integer and define non-overlapping mth-order spacings as

$$\begin{aligned} D_{j,m}(\theta )=F_{\theta }(\xi _{(jm)})-F_{\theta }(\xi _{((j-1)m)}),\,\,\,j=1,\ldots ,k. \end{aligned}$$

Let

$$\begin{aligned} S_{\phi ,n}(\theta )=\frac{1}{k}\sum _{j=1}^{k}\phi (kD_{j,m}(\theta )),\,\,\,\theta \in {\varTheta }. \end{aligned}$$
(3)

In Eq. (3), the reciprocal of the argument of \(\phi \) is related to a nonparametric histogram density estimator considered in Prakasa Rao (1983, Section 2.4). More precisely, \(g_n(x)=(kD_{j,m}(\theta ))^{-1}\), for \(F_{\theta }(\xi _{((j-1)m)})\le x<F_{\theta }(\xi _{(jm)})\), is an estimator of the density of the r.v. \(F_{\theta }(\xi _1)\), i.e., of \(g(x)=f_{\theta _0}(F_\theta ^{-1}(x))/f_{\theta }(F_\theta ^{-1}(x))\), where \(F_\theta ^{-1}(x)=\inf \{u:F_\theta (u)\ge x\}\). When both k and m increase with the sample size, then, for large n,

$$\begin{aligned} kD_{j,m}(\theta )\approx \frac{f_\theta (\xi _{((j-1)m+\lfloor m/2\rfloor )})}{f_{\theta _0}(\xi _{((j-1)m+\lfloor m/2\rfloor )})},\,\,\,j=1,\ldots ,k, \end{aligned}$$
(4)

where \(\lfloor m/2 \rfloor \) denotes the largest integer less than or equal to \(m/2\). Thus, intuitively, if \(k,m\rightarrow \infty \) as \(n\rightarrow \infty \), then \(S_{\phi ,n}(\theta )\) should converge in probability to \(S_\phi (\theta )\). Furthermore, since \(\phi \) is a convex function, by Jensen’s inequality, we have \(S_\phi (\theta )\ge S_\phi (\theta _0)=\phi (1)\). This suggests that if the distribution \(F_{\theta }\) is a smooth function of \(\theta \), then an argument minimizing \(S_{\phi ,n}(\theta )\), \(\theta \in {\varTheta }\), should be close to the true value \(\theta _0\) and hence be a reasonable estimator.

An argument \(\theta ={\hat{\theta }}_{\phi ,n}\) which minimizes \(S_{\phi ,n}(\theta )\), \(\theta \in {\varTheta }\), will be referred to as a Generalized Spacings Estimator (GSE) of \(\theta _0\). When convenient, a root of the equation \(({\mathrm{d}}/{\mathrm{d}}\theta )S_{\phi ,n}(\theta )=0\) will also be referred to as a GSE.
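To make the construction concrete, the following is a minimal sketch (not from the paper) of the criterion (3) and of a GSE for a one-parameter model; the \(N(\theta ,1)\) location family, the choice of an \(m\) that divides \(n\), the function names, and the use of a bounded scalar minimizer are all our illustrative assumptions.

```python
# Sketch of S_{phi,n}(theta) in (3) and of the corresponding GSE for a
# hypothetical N(theta, 1) location family.  The sample has n - 1 observations
# and m is assumed to divide n, so k = n/m, as in the text.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def gse_criterion(theta, sample, m, phi=lambda x: -np.log(x)):
    """S_{phi,n}(theta): mean of phi(k * D_{j,m}(theta)) over the k spacings."""
    n = len(sample) + 1                       # the paper's n (sample size is n - 1)
    k = n // m                                # number of non-overlapping spacings
    u = norm.cdf(np.sort(sample), loc=theta)  # F_theta at the order statistics
    u = np.concatenate(([0.0], u, [1.0]))     # F_theta(xi_(0)) = 0, F_theta(xi_(n)) = 1
    d = u[m::m] - u[:-m:m]                    # D_{j,m}(theta), j = 1, ..., k
    return float(np.mean(phi(np.maximum(k * d, 1e-300))))  # guard against zero spacings

def gse(sample, m, phi=lambda x: -np.log(x)):
    """A GSE of the location parameter: a minimizer of the criterion above."""
    res = minimize_scalar(gse_criterion, args=(sample, m, phi),
                          bounds=(sample.min(), sample.max()), method="bounded")
    return res.x

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=839)   # n - 1 = 839 observations, n = 840
print(gse(x, m=15))                            # should be close to 2.0
```

With \(\phi (x)=-\log x\) and \(m=1\), this sketch reduces to the MSP estimator discussed below.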

By using different functions \(\phi \) and different values of m, we get various criteria for statistical estimation. The ideas behind this proposed family of estimation methods generalize the ideas behind the maximum spacing (MSP) method, as introduced by Ranneby (1984); the same method was introduced from a different point of view by Cheng and Amin (1983). Ranneby derived the MSP method from an approximation of the Kullback–Leibler divergence between \(F_{\theta }\) and \(F_{\theta _0}\), i.e., \(S_\phi (\theta )\) with \(\phi (x)=-\log x\), and defined the MSP estimator as any parameter value in \({\varTheta }\) that minimizes (3) with \(\phi (x)=-\log x\) and \(m=1\). This estimator has been shown, under general conditions, to be consistent (Ranneby 1984; Ekström 1996, 1998; Shao and Hahn 1999) and asymptotically efficient (Shao and Hahn 1994; Ghosh and Jammalamadaka 2001). Based on the maximum entropy principle, Kapur and Kesavan (1992) proposed to estimate \(\theta _0\) by selecting the value of \(\theta \in {\varTheta }\) that minimizes (3) with \(\phi (x)=x\log x\) and \(m=1\). With this particular choice of \(\phi \)-function, \(S_\phi (\theta )\) becomes the Kullback–Leibler divergence between \(F_{\theta _0}\) and \(F_{\theta }\) (rather than between \(F_{\theta }\) and \(F_{\theta _0}\)). We refer to Ekström (2008) for a survey of estimation methods based on spacings and the Kullback–Leibler divergence. Ekström (1997, 2001) and Ghosh and Jammalamadaka (2001) considered a subclass of the estimation methods proposed in the current paper, namely the GSEs with \(m=1\). Under general regularity conditions, it turns out that this subclass produces estimators that are consistent and asymptotically normal and that the MSP estimator, which corresponds to the special case \(\phi (x)=-\log x\), has the smallest asymptotic variance in this subclass (Ekström 1997; Ghosh and Jammalamadaka 2001). Estimators based on overlapping (rather than non-overlapping) mth-order spacings were considered in Ekström (1997, 2008), where small Monte Carlo studies indicated that, in an asymptotic sense, larger orders of spacings are always better (when \(\phi (x)\) is a convex function other than the negative log function). Menéndez et al. (1997, 2001a, b) and Mayoral et al. (2003) consider minimum divergence estimators based on spacings that are related to our GSEs. In their asymptotics, they use only a fixed number of spacings, and their results suggest that GSEs will not be asymptotically fully efficient when k in (3) is held fixed.

In the present paper, it is shown that GSEs are consistent and asymptotically normal under general conditions. In contrast to the aforementioned papers, we allow both the number of spacings, k, and the order of spacings, m, to increase to infinity with the sample size. We show that if both of them do tend to infinity, then there exists a class of asymptotically efficient GSEs that we call the Minimum Power Divergence Estimators (MPDEs). In contrast, if m is held fixed, then the only asymptotically optimal GSE is the one based on \(\phi (x)=-\log x\). The main results are stated in the next section, followed by a simulation study assessing (i) the performance of these estimators for different m and n and (ii) the robustness of these estimators under contamination. In the latter case, we also assess a suggested data-driven rule for choosing the order of spacings, m. Detailed proofs can be found in the Appendix and in the online Supplementary Material.

2 Main results

In this section, we state the main results and the assumptions that are needed. Unless otherwise stated, it will henceforth be assumed that \(\xi _1,\ldots ,\xi _{n-1}\) are i.i.d. r.v.’s from \(F_{\theta _0}\), \(\theta _0\in {\varTheta }\).

We will prove the consistency of the GSEs under the following assumptions:

Assumption 1

The family of distributions \(\{F_\theta (\cdot )\), \(\theta \in {\varTheta }\}\) has common support, with continuous densities \(\{f_\theta (\cdot )\), \(\theta \in {\varTheta }\}\), and \(F_\theta (x)\ne F_{\theta _0}(x)\) for at least one x when \(\theta \ne \theta _0\).

Assumption 2

The parameter space \({\varTheta }\subset R\) contains an open interval of which \(\theta _0\) is an interior point, and \(F_\theta (\cdot )\) is differentiable with respect to \(\theta \).

Assumption 3

The function \(\phi (t)\), \(t>0\), satisfies the following conditions:

  (a) it is strictly convex;

  (b) \(\min \{0,\phi (t)\}/t\rightarrow 0\) as \(t\rightarrow \infty \);

  (c) it is bounded from above by \(\psi (t)=c_1(t^{-c_2}+t^{c_3})\) for all \(t>0\), where \(c_1,c_2\), and \(c_3\) are some nonnegative constants;

  (d) it is twice differentiable.

Assumption 3 is satisfied by a wide class of convex functions, including the following:

$$\begin{aligned} \phi (x)=\phi _\lambda (x)=\left\{ \begin{array}{ll} \lambda ^{-1}(1+\lambda )^{-1}(x^{\lambda +1}-1), &{}\quad \text{ if } \,\,\lambda \ne -1,0, \\ -\log x, &{}\quad \text{ if } \,\,\lambda =-\,1,\\ x\log x, &{}\quad \text{ if } \,\,\lambda =0, \end{array}\right. \end{aligned}$$
(5)

where the case \(\lambda =-\,1\) is obtained as a pointwise limit, and the case \(\lambda =0\) follows, using that \(\lim _{\lambda \rightarrow 0}(x^\lambda -1)/\lambda =\log x\), after dropping an affine term in \(x\); affine terms in \(x\) shift \(S_\phi (\theta )\) and \(S_{\phi ,n}(\theta )\) only by constants that do not depend on \(\theta \) and hence do not affect the resulting estimator.
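For completeness, here is a small helper (ours, not from the paper) implementing \(\phi _\lambda \) with the two boundary cases written out explicitly; passing it to the `gse` sketch above, e.g., as `phi=lambda x: phi_lambda(x, -0.5)`, gives the corresponding spacings criterion.

```python
# phi_lambda from (5), with the limiting cases lambda = -1 and lambda = 0
# handled explicitly.
import numpy as np

def phi_lambda(x, lam):
    x = np.asarray(x, dtype=float)
    if lam == -1:
        return -np.log(x)
    if lam == 0:
        return x * np.log(x)
    return (x ** (lam + 1) - 1.0) / (lam * (lam + 1))
```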

Theorem 1

Under Assumptions 1–3, when \(m>c_2\) is fixed, or \(m\rightarrow \infty \) such that \(m=o(n)\), the equation \(({\mathrm{d}}/{\mathrm{d}}\theta )S_{\phi ,n}(\theta )=0\) has a root \({{\hat{\theta }}}_{\phi ,n}\) with probability tending to 1 as \(n\rightarrow \infty \), such that

$$\begin{aligned} {{\hat{\theta }}}_{\phi ,n}={\hat{\theta }}_{\phi ,n}(\xi _1,\ldots ,\xi _{n-1}){\mathop {\longrightarrow }\limits ^{p}}\theta _0. \end{aligned}$$

For the purpose of the next theorem, we will use the notation \(f(x,\theta )\) for the density \(f_\theta (x)\), and we denote its partial derivatives by

$$\begin{aligned} f_{ij} (x,\theta )=\frac{\partial ^{i+j} }{\partial x^{i} \partial \theta ^{j} }f(x,\theta ). \end{aligned}$$

Let \(W_{1},W_{2},\ldots \) be a sequence of independent standard exponentially distributed r.v.’s, \({G}_{m}=W_1+\cdots +W_m\), and \({{\bar{G}}}_m=m^{-1}G_m\).

We then have the following important result:

Theorem 2

Let \(m=o(n)\). In addition to Assumptions 1 and 2, assume the following conditions:

  (i) The function \(\phi \) is strictly convex and thrice continuously differentiable.

  (ii) The quantities \(\mathrm{Var}(W_{1} \phi '({\bar{G}}_{m} ))\), \(E(W_{1}^{2} \phi ''({\bar{G}}_{m} ))\), and \(E(W_{1}^{3} \phi '''({\bar{G}}_{m} ))\) exist and are bounded away from zero.

  (iii) The density function \(f(x,\theta )\), the inverse \(F_{\theta }^{-1} (x)\), and the partial derivatives \(f_{10}\) and \(f_{11} \) are continuous in x at \(\theta =\theta _{0}\), and \(f_{01}\), \(f_{02}\), and \(f_{03}\) are continuous in x and \(\theta \) in an open neighborhood of \(\theta _0\).

  (iv) The Fisher information,

    $$\begin{aligned} I(\theta )=\int _{-\infty }^{\infty }\left[ \frac{f_{01}(x,\theta )}{f(x,\theta )} \right] ^{2} f(x,\theta )\mathrm{d}x, \end{aligned}$$

    takes values in the interval \((0,\infty )\) for \(\theta \) in a neighborhood of \(\theta _0\).

Then, for any consistent root \(\hat{\theta }_{\phi ,n} \) of \(({\mathrm{d}}/{\mathrm{d}}\theta )S_{\phi ,n}(\theta )=0\), we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\sup _x\left| P\left( \sqrt{n}({{\hat{\theta }}}_{\phi ,n}-\theta _0)\le x\right) -{\varPhi }\left( x\sqrt{\frac{I(\theta _{0})}{e_m(\phi )}}\right) \right| =0, \end{aligned}$$

where \({\varPhi }\) is the standard normal cumulative distribution function,

$$\begin{aligned} e_{m} (\phi )= & {} \sigma _{\phi }^{2} \left( E\left( {\bar{G}}_{m}^{2} \phi ''({\bar{G}}_{m} )\right) \right) ^{-2},\nonumber \\ \sigma _{\phi }^{2}= & {} m\mathrm{Var}\left( {\bar{G}}_{m} \phi '({\bar{G}}_{m} )\right) +(2m+1)\mu _{m}^{2} -2m\mu _{m} E\left( {\bar{G}}_{m} ^{2} \phi '({\bar{G}}_{m} )\right) , \end{aligned}$$
(6)

and

$$\begin{aligned} \mu _{m}=E\left( W_{1} \phi '({\bar{G}}_{m} )\right) =E\left( {\bar{G}}_{m} \phi '({\bar{G}}_{m} )\right) . \end{aligned}$$
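For the functions \(\phi _\lambda \) of (5) with \(\lambda \ne 0\), every expectation in (6) reduces to a moment \(E({\bar{G}}_m^p)=\Gamma (m+p)/(\Gamma (m)\,m^p)\), so \(e_m(\phi _\lambda )\) can be evaluated exactly. The sketch below (ours, for illustration; the function names are not from the paper) does this and illustrates the behavior discussed after the theorem: the values are at least 1, equal 1 for \(\lambda =-1\), and approach 1 as m grows.

```python
# Exact evaluation of e_m(phi_lambda) in (6) via Gamma moments, for lambda != 0
# (for lambda = 0 the derivatives of phi involve log terms, and for lambda < -1
# the moments require m > -lambda - 1).
from math import lgamma, log, exp

def moment(m, p):
    """E(Gbar_m**p), where Gbar_m = G_m / m and G_m ~ Gamma(m, 1)."""
    return exp(lgamma(m + p) - lgamma(m) - p * log(m))

def e_m_phi(lam, m):
    """e_m(phi_lambda), using phi'(x) = x**lam / lam and phi''(x) = x**(lam - 1)."""
    mu = moment(m, lam + 1) / lam                                   # E(Gbar * phi'(Gbar))
    var_gphi = (moment(m, 2 * lam + 2) - moment(m, lam + 1) ** 2) / lam ** 2
    e_g2_phi1 = moment(m, lam + 2) / lam                            # E(Gbar^2 * phi'(Gbar))
    e_g2_phi2 = moment(m, lam + 1)                                  # E(Gbar^2 * phi''(Gbar))
    sigma2 = m * var_gphi + (2 * m + 1) * mu ** 2 - 2 * m * mu * e_g2_phi1
    return sigma2 / e_g2_phi2 ** 2

for lam in (-1.0, -0.9, -0.5, 1.0):
    # equals 1 for lam = -1 (the MSP case); exceeds 1 otherwise, approaching 1 as m grows
    print(lam, [round(e_m_phi(lam, m), 4) for m in (1, 5, 15, 50, 200)])
```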

The theorems will be proved using methods related to those used for the MLE in, e.g., Lehmann and Casella (1998). A generalization to the multiparameter case is possible, much as in the case of maximum likelihood estimation. However, as in Lehmann and Casella (1998) for the MLE, this would require somewhat more complex assumptions and proofs, and we refrain from attempting it here.

For the case \(m=1\) and by assuming that \(\lim _{x\rightarrow 0}\phi '(x)x^2\mathrm{e}^{-x}=\lim _{x\rightarrow \infty }\phi '(x)x^2\mathrm{e}^{-x}=0\), Ekström (1997) and Ghosh and Jammalamadaka (2001) showed that \(e_{m} (\phi )\ge 1\), with equality if and only if \(\phi (x)=a\log x+bx+c\), for some constants a, b, and c. That this inequality holds true for general m can be seen by integrating by parts, i.e., assuming that \(\lim _{x\rightarrow 0}\phi '(x/m)x^{m+1}\mathrm{e}^{-x}=\lim _{x\rightarrow \infty }\phi '(x/m)x^{m+1}\mathrm{e}^{-x}=0\), we get

$$\begin{aligned} \left( E({\bar{G}}_{m}^{2} \phi ''({\bar{G}}_{m} ))\right) ^{2}= & {} \left( mE({\bar{G}}_{m}^{2} \phi '({\bar{G}}_{m} ))-(m+1)\mu _{m} \right) ^{2}\\= & {} \left( m\, \mathrm{Cov}\left( {\bar{G}}_{m} \phi '({\bar{G}}_{m} ),{\bar{G}}_{m} \right) -\mu _{m} \right) ^{2} \\= & {} m^{2} \left( \mathrm{Cov}\left( {\bar{G}}_{m} \phi '({\bar{G}}_{m} ),{\bar{G}}_{m} \right) \right) ^2\\&\quad +\,(2m+1)\mu _{m}^{2}-2m\mu _{m} E({\bar{G}}_{m}^{2} \phi '({\bar{G}}_{m} )) \le \sigma _{\phi }^{2} , \end{aligned}$$

where the inequality on the right-hand side follows by noting that

$$\begin{aligned} \left( \mathrm{Cov}\left( {\bar{G}}_{m} \phi '({\bar{G}}_{m} ),{\bar{G}}_{m} \right) \right) ^2\le \mathrm{Var}\left( {\bar{G}}_{m} \phi '({\bar{G}}_{m} )\right) \mathrm{Var}({\bar{G}}_{m}) \end{aligned}$$

and \(\mathrm{Var}({\bar{G}}_{m}) =m^{-1} \). Hence, \(e_{m} (\phi )=\sigma _{\phi }^{2} \left( E({\bar{G}}_{m}^{2} \phi ''({\bar{G}}_{m} ))\right) ^{-2} \ge 1\), with equality if and only if \(x\phi '(x)=a+bx\) or, equivalently, if and only if \(\phi (x)=a\log x+bx+c\), where \(a<0\). It should be observed that the corresponding estimator \({{\hat{\theta }}}_{\phi ,n}\) does not depend on the chosen values of \(a<0\), b, and c. Thus, without loss of generality, we may choose \(a=-\,1\) and \(b=c=0\), i.e., for m fixed, the asymptotically optimal estimator \({{\hat{\theta }}}_{\phi ,n}\) is based on the function \(\phi (x)=-\log x\). If, however, we let \(m\rightarrow \infty \), then the asymptotically optimal estimator is no longer unique. For example, let us consider the family of power divergences (Read and Cressie 1988; Pardo 2006) given by

$$\begin{aligned} T_{\lambda }(\theta )= & {} \frac{1}{\lambda (\lambda +1)}\int _{-\infty }^\infty \left( \left( \frac{f_\theta (x)}{f_{\theta _0}(x)}\right) ^{\lambda +1}-1\right) f_{\theta _0}(x)\mathrm{d}x, \end{aligned}$$

where the cases \(\lambda =-\,1\) and \(\lambda =0\) are given by continuity, i.e.,

$$\begin{aligned} T_{-1}(\theta )=\lim _{\lambda \rightarrow -1}T_{\lambda }(\theta )= \int _{-\infty }^{\infty }\left( -\log \left( \frac{f_\theta (x)}{f_{\theta _0}(x)}\right) \right) f_{\theta _0}(x)\mathrm{d}x \end{aligned}$$

and

$$\begin{aligned} T_{0}(\theta )=\lim _{\lambda \rightarrow 0}T_{\lambda }(\theta )= \int _{-\infty }^{\infty }\left( \frac{f_\theta (x)}{f_{\theta _0}(x)}\log \left( \frac{f_\theta (x)}{f_{\theta _0}(x)}\right) \right) f_{\theta _0}(x)\mathrm{d}x, \end{aligned}$$

where \(T_{-1}(\theta )\) is the Kullback–Leibler divergence and \(T_{0}(\theta )\) is the reversed Kullback–Leibler divergence (cf. Pardo and Pardo 2000).

If we set \(\phi (x)=\phi _\lambda (x)\), as defined in Eq. (5), then \(T_{\lambda }(\theta )=S_{\phi _\lambda }(\theta )\), i.e., the family of power divergences is a subclass of the family of \(\phi \)-divergences. The divergence \(T_{\lambda }(\theta )=S_{\phi _\lambda }(\theta )\) is estimated by \(S_{\phi _\lambda ,n}(\theta )\), and Theorem 1 establishes the existence of a consistent root of the equation \(({\mathrm{d}}/{\mathrm{d}}\theta )S_{\phi _\lambda ,n}(\theta )=0\). The next result asserts that any such sequence of roots is asymptotically normal and efficient when \(k,m\rightarrow \infty \) as \(n\rightarrow \infty \).

Corollary 1

Let \(m\rightarrow \infty \) such that \(m=o(n)\). Suppose that Assumptions 1 and 2 and conditions (iii) and (iv) of Theorem 2 hold true and that the function \(\phi _\lambda \) is defined by (5) for some \(\lambda \in (-\infty ,\infty )\). Then, for any consistent root \({\hat{\theta }}_{\lambda ,n} \) of \(({\mathrm{d}}/{\mathrm{d}}\theta )S_{\phi _\lambda ,n}(\theta )=0\), we have

$$\begin{aligned} \sqrt{n} ({\hat{\theta }}_{\lambda ,n} -\theta _{0} ){\mathop {\longrightarrow }\limits ^{d}}N\left( 0,\frac{1}{I(\theta _{0} )} \right) \text{ as } n\rightarrow \infty . \end{aligned}$$

Such a sequence \({\hat{\theta }}_{\lambda ,n} \) of roots is typically provided by \(\arg \min _{\theta \in {\varTheta }}S_{\phi _\lambda ,n}(\theta )\), and in this case, the estimator may be referred to as a Minimum Power Divergence Estimator (MPDE). Another family of divergence measures was introduced by Rényi (1961) and extended in Liese and Vajda (1987). These divergences are given by

$$\begin{aligned} R_{\alpha }(\theta )= & {} \frac{1}{\alpha (\alpha -1)}\log \int _{-\infty }^\infty f_\theta (x)^\alpha f_{\theta _0}(x)^{1-\alpha }\mathrm{d}x , \end{aligned}$$

where the cases \(\alpha =0\) and \(\alpha =1\) are given by continuity, i.e., \(R_{0}(\theta )=\lim _{\alpha \rightarrow 0}R_{\alpha }(\theta )=T_{-1}(\theta )\) and \(R_{1}(\theta )=\lim _{\alpha \rightarrow 1}R_{\alpha }(\theta )=T_{0}(\theta ).\) Thus, \(R_{0}(\theta )\) and \(R_{1}(\theta )\) are Kullback–Leibler divergences and belong to the family of power divergences. When \(\alpha \ne 0,1\), \(R_{\alpha }(\theta )\) may be estimated by \(\alpha ^{-1}(\alpha -1)^{-1}\log S_{\phi _\alpha ,n}(\theta )\), where \(\phi _\alpha (x)=x^\alpha \), and if \(\lambda =\alpha -1\) then note that \(\arg \min _{\theta \in {\varTheta }}\alpha ^{-1}(\alpha -1)^{-1}\log S_{\phi _\alpha ,n}(\theta )=\arg \min _{\theta \in {\varTheta }}S_{\phi _\lambda ,n}(\theta )\). Thus, each MPDE may also be regarded as a minimum Rényi divergence estimator.

The Hellinger distance between \(F_\theta \) and \(F_{\theta _0}\) is given by

$$\begin{aligned} H(\theta )= & {} \left( \frac{1}{2}\int _{-\infty }^\infty \left( f_\theta (x)^{1/2}-f_{\theta _0}(x)^{1/2}\right) ^2\mathrm{d}x\right) ^{1/2} , \end{aligned}$$

and may be estimated by \((1+S_{\phi ,n}(\theta ))^{1/2}\), where \(\phi (x)=-x^{1/2}\). In this case, we have, for \(\lambda =-\,1/2\), \(\arg \min _{\theta \in {\varTheta }}(1+S_{\phi ,n}(\theta ))^{1/2}=\arg \min _{\theta \in {\varTheta }}S_{\phi _\lambda ,n}(\theta )\). Thus, the MPDE with \(\lambda =-\,1/2\) may also be referred to as a minimum Hellinger distance estimator.

It is desirable to have statistical estimation procedures that perform well when an assumed parametric model is correctly specified, thereby attaining high efficiency at the assumed model. A problem is that parametric model assumptions are almost never literally true. Thus, in addition, it is desirable to have estimation procedures that are relatively insensitive to small departures from the model assumptions and for which somewhat larger deviations from the model assumptions do not cause a “catastrophe” (Huber and Ronchetti 2009). Procedures satisfying these features are called robust. Due to its relationship with the estimator suggested by Beran (1977), we conjecture that our minimum Hellinger distance estimator, i.e., the MPDE with \(\lambda =-\,1/2\), is robust (in the sense of Beran 1977 and Lindsay 1994). In addition, by arguments put forward in Lindsay (1994), we conjecture that GSEs based on bounded \(\phi \) functions are robust with respect to contaminations of the original data (cf. Mayoral et al. 2003). In the next section, we consider the MPDE with \(\lambda =-\,1/2\) and apply Monte Carlo simulations to compare its performance with those of the MLE and the MPDEs with \(\lambda = -\,1\) and \(-\,0.9\).

3 A simulation study

In this section, we explore the finite sample properties of the Minimum Power Divergence Estimators (MPDEs) \({\hat{\theta }}_{\lambda ,n}=\arg \min _{\theta \in {\varTheta }}S_{\phi _\lambda ,n}(\theta )\), where \(\phi _\lambda \) is defined in Eq. (5).

First, we consider estimating the mean \(\theta \) of a \(N(\theta , 1)\) distribution and compare the root mean square errors (RMSEs) of various MPDEs with the RMSE of the MLE. These are shown in Fig. 1. MPDEs are computed for \(\lambda =-\,1,-\,0.9\), and \(-\,0.5\), and for all values of m, the order of the spacings, that are divisors of n. We define the relative RMSE of an MPDE to be its RMSE divided by the RMSE of the MLE. Each RMSE was estimated from 1000 Monte Carlo samples with \(n=840\) from the \(N(\theta ,1)\) distribution with \(\theta =0\). We present relative RMSEs for m up to 150, because for larger m the relative RMSEs tend to be considerably larger. From Fig. 1, we see that the MPDEs with \(\lambda \) equal to \(-\,1\) or \(-\,0.9\) are about as good as the MLE for comparatively small values of m (and perform less well for larger m). The MPDE with \(\lambda =-\,0.5\) is not quite as good in terms of RMSE; for example, with an optimally chosen m, it had an RMSE about 0.5% larger than that of the MLE. The simulation results indicate that the optimal choice of m is 15, 15, and 14 for \(\lambda =-\,1,-\,0.9\), and \(-\,0.5\), respectively. (Although for \(\lambda =-\,1\) and \(-\,0.9\) there are about ten other candidates, respectively, that perform almost as well.)
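The comparison just described can be reproduced in outline as follows (a rough sketch assuming the `gse` and `phi_lambda` helpers from the earlier sketches are in scope; it uses far fewer replications than the 1000 behind Fig. 1, so the numbers are only indicative).

```python
# Monte Carlo RMSE of an MPDE for the N(theta, 1) mean, relative to the MLE.
import numpy as np

def rmse_mpde(lam, m, n=840, reps=200, theta0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(reps):
        x = rng.normal(loc=theta0, scale=1.0, size=n - 1)
        errors.append(gse(x, m=m, phi=lambda t: phi_lambda(t, lam)) - theta0)
    return float(np.sqrt(np.mean(np.square(errors))))

# For N(theta, 1) the MLE is the sample mean, with RMSE 1/sqrt(n - 1), so the
# relative RMSE can be approximated by
print(rmse_mpde(lam=-0.5, m=14) * np.sqrt(839))
```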

Fig. 1

RMSEs for different MPDEs relative to the RMSE of the MLE when estimating the mean of a normal distribution. The relative RMSE of an MPDE is its RMSE divided by the RMSE of the MLE (color figure online)

Fig. 2

RMSEs for the MLE and for different MPDEs when estimating the mean of a normal distribution under contamination, where \(\varepsilon \) denotes the level of contamination. Note, black plot symbols are often (partly) hidden behind green plot symbols and red plot symbols behind blue ones (color figure online)

Next, we consider the issue of stability/robustness of these estimators. It is known that the MSP estimator (Cheng and Amin 1983; Ranneby 1984), i.e., the MPDE with \(\lambda =-\,1\) and \(m=1\), much like the MLE, suffers from a lack of stability under even small deviations from the underlying model; i.e., the distributions of the MSP and ML estimators can be greatly perturbed if the assigned model is only approximately true. This is demonstrated in a simulation study by Nordahl (1992) and, for the MLE, in a numerical study by Fujisawa and Eguchi (2008). As in Nordahl (1992), we will assume that a proportion \(1-\varepsilon \) of the data is generated from a normal distribution, while a proportion \(\varepsilon \) is generated by some unknown mechanism that produces “outliers.” For example, measurements may be correct 95% of the time, while 5% of the time operator reading/writing errors are made or the recording instrument malfunctions. Therefore, we assume that a random sample \(\xi _1,\ldots ,\xi _{n-1}\) is generated from an \(\varepsilon \)-contaminated normal distribution \(G( x- \theta _0)\), where

$$\begin{aligned} G(x)=(1-\varepsilon ){\varPhi }(x)+\varepsilon H( x), \end{aligned}$$

in which H(x) denotes an arbitrary distribution that models the outliers and \(\varepsilon \) is the level of contamination. The goal is to estimate \(\theta _0\), the mean of the observations in the case when no recording errors occur. If \(n\varepsilon \) is rather small, we may have few observations from H, making it difficult to assess a model for H. In such a case, instead of modeling the mixture distribution G, one may (wrongly) assume that all observations come from \({\varPhi }(x-\theta )\) with \(\theta =\theta _0\), and then use an estimation method that provides a good estimate of \(\theta _0\) even in the presence of outliers coming from H. That is, robust estimation aims at finding an estimator \({{\hat{\theta }}}\) that efficiently estimates \(\theta _0\) even though the data are contaminated by an outlier distribution H (Fujisawa and Eguchi 2008).
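To make the contamination model concrete, here is a minimal sampler (ours, not from the paper) for an \(\varepsilon \)-contaminated sample from \(G(x-\theta _0)\), with the outlier distribution taken to be the normal \(H(x)={\varPhi }((x-\rho )/\tau )\) used in the simulations below.

```python
# Draw a sample from G(x - theta0), where G = (1 - eps) * Phi + eps * H and
# H is a N(rho, tau^2) distribution modeling the outliers.
import numpy as np

def rcontam(size, theta0=0.0, eps=0.05, rho=10.0, tau=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    from_h = rng.random(size) < eps                        # indicators of outliers
    clean = rng.normal(loc=0.0, scale=1.0, size=size)      # Phi component
    outlier = rng.normal(loc=rho, scale=tau, size=size)    # H component
    return theta0 + np.where(from_h, outlier, clean)
```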

Fig. 3

RMSEs for the one-step MHDE and for two MPDEs, when estimating the mean of a normal distribution under contamination and where \(\varepsilon \) is the level of contamination. Note, red and blue plot symbols are sometimes (partly) hidden behind black plot symbols (color figure online)

In our Monte Carlo simulation, we used \(H(x) = {\varPhi }((x - \rho )/\tau )\), \(\tau > 0\), with \(\rho =10\) and \(\tau =1\). For each \(\varepsilon =0.0,0.1,\ldots ,0.4\), we generated 1000 Monte Carlo samples with \(n=840\) from \(G(x- \theta _0)\) with \(\theta _0=0\), and for every sample, the MLE of \(\theta _0\) was computed using the model \(F_\theta (x) = {\varPhi }(x-\theta )\). MPDEs for this case were also computed for \(\lambda =-\,1,-\,0.9\), and \(-\,0.5\) and for the previously found optimal values of m for the respective values of \(\lambda \). For each level of contamination, we used the 1000 samples for computing estimated RMSEs for the respective estimators of \(\theta _0\). The resulting RMSEs are shown in Fig. 2. In case of contamination, we see that the MPDEs with \(\lambda \) equal to \(-\,0.9\) or \(-\,0.5\) are superior to the MLE and the MPDE with \(\lambda =-\,1\). In other words, the MLE and MPDEs such as the MSP estimator, which all can be derived from the Kullback–Leibler divergence, perform poorly under contamination, and other MPDEs are to be preferred.

Looking at Fig. 1, it is clear that the choice of m, the order of spacings, is important for the quality of the MPDEs. We now propose a data-based approach for choosing m for MPDEs (and more generally for GSEs). No asymptotic optimality is claimed for the approach; the main purpose is rather to provide sensible answers for finite sample sizes. For a given \(\lambda \) (or \(\phi \)-function), let \({{\hat{\theta }}}_m\) denote the MPDE (or GSE) using the order of spacings m. The suggested approach is given by the following algorithm (a code sketch follows the list):

  • Step 1 Compute \({{\hat{\theta }}}_{1}\).

  • Step 2 For r in \(1,\ldots ,R\): Draw a bootstrap sample \(x_{r,1}^\star ,\ldots ,x_{r,n-1}^\star \) from \(F_{{{\hat{\theta }}}_{1}}\). For some set of positive integers, \({{\mathcal {M}}}\), compute \({{\hat{\theta }}}_{r,m}^\star \) for each \(m\in {{\mathcal {M}}}\), where \({{\hat{\theta }}}_{r,m}^\star \) denotes the rth bootstrap replicate of \({{\hat{\theta }}}_m\).

  • Step 3 Choose \(m_{\mathrm{opt}}={{{\,\mathrm{arg\,min}\,}}}_{m\in {{\mathcal {M}}}}\frac{1}{R}\sum _{r=1}^R\left( {{\hat{\theta }}}_{r,m}^\star -{{\hat{\theta }}}_{1}\right) ^2\).
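Here is a compact sketch of Steps 1–3 (the interface and names are ours): `estimator(sample, m)` stands for the GSE or MPDE computed with order-m spacings, e.g., the `gse` helper sketched earlier, and `rdist(theta, size, rng)` draws a parametric bootstrap sample from \(F_{\theta }\).

```python
import numpy as np

def choose_m(sample, estimator, rdist, m_grid, R=100, rng=None):
    """Bootstrap choice of the order of spacings m (Steps 1-3 above)."""
    rng = np.random.default_rng() if rng is None else rng
    theta_1 = estimator(sample, 1)                     # Step 1: estimate with m = 1
    sse = {m: 0.0 for m in m_grid}
    for _ in range(R):                                 # Step 2: parametric bootstrap
        boot = rdist(theta_1, len(sample), rng)
        for m in m_grid:
            sse[m] += (estimator(boot, m) - theta_1) ** 2
    return min(m_grid, key=lambda m: sse[m])           # Step 3: minimize bootstrap MSE

# Example use for the N(theta, 1) location model of the simulations (hypothetical grid):
# m_opt = choose_m(x,
#                  estimator=lambda s, m: gse(s, m, phi=lambda t: phi_lambda(t, -0.5)),
#                  rdist=lambda th, size, rng: rng.normal(loc=th, scale=1.0, size=size),
#                  m_grid=(1, 2, 5, 10, 14, 15, 20, 30, 60))
```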

Under the same settings as in Fig. 2 and with m chosen according to the above algorithm, with \({{\mathcal {M}}}\) defined as the set of divisors of n, we consider two MPDEs, with \(\lambda =-\,0.9\) and \(-\,0.5\), respectively. We compare these with Karunamuni and Wu’s (2011) one-step minimum Hellinger distance estimator (MHDE), obtained from a one-step Newton–Raphson approximation to the solution of the equation \({{\hat{S}}}_\phi '(\theta )=0\), where \({{\hat{S}}}_\phi (\theta )\) is defined as in (2), with \(\phi (x)= \frac{1}{2}|1-\sqrt{x}|^2\) and \({{\hat{f}}}\) a kernel density estimator. Karunamuni and Wu (ibid.) show that their one-step MHDE has the same asymptotic behavior as Beran’s (1977) MHDE, as long as the initial estimator in the Newton–Raphson algorithm is reasonably good, and that it retains the excellent robustness properties of the MHDE. In our simulations, we used the median as the initial estimator of \(\theta _0\), and the kernel estimator was based on the Epanechnikov kernel with bandwidth chosen to be \((15e)^{1/5}(\pi /32)^{1/10}{{\hat{\sigma }}} n^{-1/5}\), where \({\hat{\sigma }}=\text{ median }\{|\xi _i-\text{ median }\{\xi _j\}|\}/{\varPhi }^{-1}(3/4)\) (cf. Basu et al. 2011, pp. 108–109). The resulting RMSEs are shown in Fig. 3. When \(\varepsilon =0\), the MPDE with \(\lambda =-\,0.9\) is the winner in terms of RMSE. (In comparison, the MPDEs with \(\lambda =-\,0.9\) and \(-\,0.5\) and the one-step MHDE had RMSEs that were about 0.0, 1.8, and 0.4 percent larger than that of the MLE, respectively.) For \(\varepsilon =0.1\), the one-step MHDE performs somewhat better than the two MPDEs, but for \(\varepsilon =0.2, 0.3,\) and 0.4, the most efficient estimator is the MPDE with \(\lambda =-\,0.5\). Under heavy contamination, i.e., for levels of contamination equal to or larger than 0.3, both MPDEs are clearly more robust than the one-step MHDE.
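As a concrete illustration of the nonparametric stage of this comparison, the Epanechnikov kernel density estimate with the bandwidth and MAD-type scale estimate quoted above can be computed as follows (our sketch; whether \(n\) or \(n-1\) is used in the bandwidth is immaterial here).

```python
import numpy as np
from scipy.stats import norm

def epanechnikov_kde(x_eval, sample):
    """Kernel density estimate as used in the one-step MHDE comparison."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    sigma_hat = np.median(np.abs(sample - np.median(sample))) / norm.ppf(0.75)
    h = (15 * np.e) ** (1 / 5) * (np.pi / 32) ** (1 / 10) * sigma_hat * n ** (-1 / 5)
    u = (np.asarray(x_eval, dtype=float)[:, None] - sample[None, :]) / h
    weights = 0.75 * np.maximum(1.0 - u ** 2, 0.0)     # Epanechnikov kernel
    return weights.mean(axis=1) / h
```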

By Corollary 1, MPDEs are asymptotically normal and efficient under a general set of regularity conditions. When applying an MPDE, a particular value of \(\lambda \) needs to be chosen. In our simulations, we considered three choices, \(\lambda =-\,1,-\,0.9\), and \(-\,0.5\). Much like the MLE, the MPDE with \(\lambda =-\,1\) can be greatly perturbed if the assigned model is only approximately true. In the choice between \(\lambda = -\,0.9\) and \(\lambda = -\,0.5\), the former appears to provide better estimates if there is no contamination, while the latter seems to give more robust estimators when some contamination is suspected.

4 Concluding remarks

In this paper, we propose classes of estimators, called Generalized Spacings Estimators or GSEs, based on non-overlapping higher-order spacings and show that, under some regularity conditions, they are consistent and asymptotically normal. Within these classes, we demonstrate the existence of asymptotically efficient estimators, called MPDEs. Through simulations, we demonstrate that they also perform well relative to the MLE in moderate sample sizes. However, unlike the MLE, some of these spacings estimators are quite robust under contamination. In this article, we also propose a data-driven choice for the order of spacings, m, based on bootstrapping, and the Monte Carlo studies indicate that this practical way of choosing m leads to MPDEs that perform comparably well to, and at higher levels of contamination much better than, the one-step MHDE proposed in the literature. Moreover, the GSEs suggested here can be suitably extended and used in more general situations. For example, by using mth nearest neighbor balls as a multidimensional analogue of univariate mth-order spacings, our proposed classes of estimators can be extended to multivariate observations (Kuljus and Ranneby (2015) studied this problem for \(m=1\)), but the specifics need further exploration. Another possibility is to define GSEs based on overlapping spacings of order m. The derivation of the asymptotic distribution of such estimators is an open problem.