1 Introduction

We consider the two-sample model introduced by Abedin (2018). Specifically, suppose there is a sample from a two-component mixture population \(H=(1-\lambda )F+\lambda G\), where the mixing proportion \(\lambda \) is strictly between 0 and 1, and F, G and H are cumulative distribution functions (c.d.f.) such that \(F\ne G\) and F and G satisfy the stochastic dominance constraint \(F\ge G\). In addition, we have a separate sample identified as coming from the first component F. As a result, the data and model are formulated as

$$\begin{aligned} \begin{array}{ccl} X_{1},\dots ,X_{m} &{} \overset{\mathrm{i.i.d.}}{\sim } &{} f(x),\\ Y_{1},\dots ,Y_{n} &{} \overset{\mathrm{i.i.d.}}{\sim } &{} h(x)=(1-\lambda )f(x)+\lambda g(x), \ \ \ x\in \mathbb {R}, \end{array} \end{aligned}$$
(1)

where f, g and h are the corresponding probability density functions (p.d.f.) of F, G and H, respectively. Here f, g and \(\lambda \) are all unknown. The problem of our interest is to make inferences for \(\lambda \), treating f and g as nuisance parameters, and to estimate the likelihood of a new observation coming from a particular component.

A motivating example for model (1) is the problem of identifying differentially expressed genes in case-control studies (e.g., contracting vs. not contracting COVID-19) of genetic data, such as in Chen and Wu (2013), Kharchenko et al. (2014), Lu et al. (2015) and Ficklin et al. (2017), among many others. To solve this problem, a proposed test is frequently applied repeatedly to each single gene. The distribution of those thousands of test statistic values is thus a two-component mixture \((1-\lambda )f+\lambda g\), where f is the p.d.f. of the statistic under the null hypothesis that a particular gene is not differentially expressed, g is that under the alternative hypothesis that a particular gene is differentially expressed, and \(\lambda \) is the proportion of differentially expressed genes. Usually f is much easier to derive theoretically than g; even when f is unknown in practice, in many studies pathologists or experts can confidently identify some genes that are not differentially expressed, so that we have a sample (test statistic values of those genes) from f. The latter case is exactly model (1). The dominance constraint \(F\ge G\) is very intuitive in many cases where the statistic values of marker genes are likely larger (or smaller) than those of non-marker genes (e.g., Student’s t, ANOVA test statistic). In this motivating example one may argue that the genes, and thus the values of a particular statistic, are not i.i.d. as assumed in model (1). Nevertheless, the distribution of the statistic over all genes is left nonparametric, which provides enough flexibility to weaken, if not remove, the effect of dependence. By Bayes’ rule the probability of a gene with test statistic value y being a marker gene is given by

$$\begin{aligned} p(y) \ := \ P (\mathrm{marker \ gene}|y) \ = \ \frac{\lambda g(y)}{(1-\lambda ) f(y)+\lambda g(y)}. \end{aligned}$$
(2)

Once \(\lambda \), f and g are estimated, one can estimate according to (2) the probability that a particular gene is differentially expressed.

Besides the motivating example, model (1) could also be used to model many other real data structures. For more examples readers are referred to Abedin (2018) and Wu and Abedin (2021), which, to the best of our knowledge, are the only works in the literature on model (1). In these two works, the authors proposed and studied two estimators. The first is based on c.d.f. estimation with use of the dominance inequality, while the second is a maximum likelihood estimator (MLE) based on a multinomial approximation. Another work on a model closely related to model (1) is given in Smith and Vounatsou (1997). However, their model does not assume the stochastic dominance constraint but instead makes a generally stronger assumption on the p.d.f.s f and g.

This paper is organized in the following way. In Sect. 2, we introduce a semiparametric mixture model in order to accommodate the stochastic dominance constraint. A maximum empirical likelihood estimator (MELE) and a minimum Hellinger distance estimator (MHDE) of the unknown parameters are proposed in Sects. 3 and 4, respectively. Their asymptotic properties, such as consistency and asymptotic normality, are also presented in these two sections. Section 5 is devoted to testing the validity of the proposed semiparametric model with use of the MELE and MHDE. The finite-sample performance of the two proposed estimators and the goodness-of-fit tests is assessed in Sect. 6 via thorough Monte Carlo simulation studies, while the implementation of the proposed methods is demonstrated in Sect. 7 through two real data examples. Finally, concluding remarks are presented in Sect. 8. Some conditions for the theoretical results are deferred to the Appendix, while the derivations and proofs of the theorems and lemmas are given in a separate supplementary document to save space.

2 A semiparametric modelling

Abedin (2018) and Wu and Abedin (2021) proposed and studied two consistent estimators for model (1). However, due to the non-identifiability of the model in general, it is difficult to obtain an estimator with good asymptotic properties such as asymptotic normality. Therefore, we introduce in what follows a semiparametric model which we will prove to be identifiable and which ensures the stochastic dominance.

Let Z denote a binary response variable and Y the associated covariate. Then the logistic regression model is given by

$$\begin{aligned} P(Z=1|Y=y) \ = \ \frac{\exp \left[\alpha ^{*}+\beta ^\top r(y)\right]}{1+\exp \left[\alpha ^{*}+\beta ^\top r(y)\right]}, \end{aligned}$$

where \(r(y)=(r_{1}(y),\ldots ,r_{p}(y))^\top \) is a given \(p\times 1\) vector of functions of y, \(\alpha ^{*}\) is the intercept parameter and \(\beta =(\beta _{1},\ldots ,\beta _{p})^\top \) is the \(p \times 1\) coefficient parameter vector. In case-control studies data are collected retrospectively. For example, a random sample of subjects with disease \(Z=1\) (‘case’) and a separate random sample of subjects without disease \(Z=0\) (‘control’) are selected, with Y observed for each subject. Let \(\pi =P(Z=1)=1-P(Z=0)\), and let f(y) and g(y) denote the conditional p.d.f.s of Y given \(Z=0\) and \(Z=1\) respectively; then it follows from Bayes’ rule that

$$\begin{aligned} g(y) \ =\ \exp \left[\alpha +\beta ^\top r(y)\right] f(y), \end{aligned}$$
(3)

where \(\alpha =\alpha ^{*}+\log [(1-\pi )/\pi ]\). Now model (1) is reduced to the semiparametric mixture model

$$\begin{aligned} \begin{array}{ccl} X_{1},\dots ,X_{m} &{} \overset{\mathrm{i.i.d.}}{\sim } &{} f(x),\\ Y_{1},\dots ,Y_{n} &{} \overset{\mathrm{i.i.d.}}{\sim } &{} h_\theta (x) \ := \ h(x) \ = \ \displaystyle \left\{ (1-\lambda )+\lambda \exp \left[ \alpha +\beta ^\top r(x)\right] \right\} f(x), \end{array} \end{aligned}$$
(4)

with \(f\in {\mathcal {F}}\) and the parameter vector of interest \(\theta =\left(\lambda ,\alpha ,\beta ^\top \right)^\top \in \varTheta \subseteq {\mathbb {R}}^{p+2}\) such that \(\int e^{\alpha +\beta ^\top r(x)} f(x){\text {d}}x=1\), where

$$\begin{aligned} \begin{array}{c} \displaystyle \varTheta =\left\{(\lambda ,\alpha ,\beta ^\top )^\top : \lambda \in (0,1), \beta \ne 0 \right\}, \\ \displaystyle {\mathcal {F}}:=\left\{f: f\ge 0, \int f(x){\text {d}}x=1, \int \exp \left[\beta ^\top r(x)\right]f(x){\text {d}}x<\infty , f \mathrm{\ is \ continuous}\right\}. \end{array} \end{aligned}$$
(5)

The relationship (3) between two p.d.f.s was first proposed by Anderson (1972). It essentially assumes that the log-likelihood ratio of the two p.d.f.s is linear in the observations. With \(r(x)=x\) or \(r(x)=(x,x^{2})^\top \), it has wide applications in logistic discriminant analysis (Anderson, 1972, 1979) and case-control studies (Prentice and Pyke, 1979; Breslow and Day, 1980). For \(r(x)=x\), (3) encompasses many common distributions, including two exponential distributions with different means and two normal distributions with common variance but different means. Model (3) with \(r(x)=(x,x^2)^\top \) also coincides with the exponential family of densities considered in Efron and Tibshirani (1996). Moreover, model (3) can be viewed as a biased sampling model with the ‘tilt’ weight function \(\exp \left[\alpha +\beta ^\top r(x)\right]\) depending on the unknown parameters \(\alpha \) and \(\beta \). Note that the test of equality of f and g can be regarded as a special case of model (4) with \(\beta =0\).
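For instance, if f is the \(N(0,1)\) density and g is the \(N(\mu ,1)\) density, then

$$\begin{aligned} \frac{g(x)}{f(x)} \ = \ \exp \left( -\frac{\mu ^{2}}{2}+\mu x\right) , \end{aligned}$$

so (3) holds with \(r(x)=x\), \(\alpha =-\mu ^{2}/2\) and \(\beta =\mu \); these are the parameter settings that appear in Table 1 and in our later simulation studies.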

Qin and Zhang (1997) discussed a goodness-of-fit test for logistic regression based on case-control data, where the first sample comes from the control group f and independently the second sample comes from the case group g. They proposed a Kolmogorov–Smirnov type statistic to test the validity of (3) with \(r(y)=y\). When data from both the mixture and the two individual components satisfying (3) are available, Qin (1999) developed an empirical likelihood ratio-based statistic for constructing confidence intervals of the mixing proportion. For the same model and data structure, Zhang (2002) proposed an EM algorithm to calculate the MELE, while Zhang (2006) proposed a score statistic to test the mixing proportion. Chen and Wu (2013) employed (3) to model differentially expressed genes of acute lymphoblastic leukemia patients and acute myeloid leukemia patients.

For model (4) with \(r(y)=y\), if \(\beta >0\) then we can easily check that p(y) in (2), the probability of y being from g, is a monotonically increasing function. Further, we prove in the next theorem that if \(\beta >0\), then the stochastic dominance constraint \(F\ge G\) is implied by (3). We call \((1,r(y))^\top \) linearly independent on the support, say \(\varOmega \), of f if \((1,r(y))^\top \), as a vector of functions of y, is linearly independent over \(\varOmega \).

Theorem 1

If \((1,r(y))^\top \) is linearly independent on the support of f, then model (4) with parameter space (5) is identifiable. In particular, (4) is identifiable when \(r(y)=y\); if further \(\beta >0\), then \(F\ge G\).

Even though Theorem 1 tells us that condition (3) is stronger than the original stochastic dominance constraint, the resulting semiparametric mixture model (4) is identifiable and has better interpretation than the nonparametric mixture model (1). In addition, the estimators of (4) may possess better asymptotic properties than those of (1). One may also consider higher-order polynomials for r(y) and find sufficient conditions for \(F\ge G\). For model simplicity, we focus on model (4) with \(r(x)=x\) and \(\beta >0\) (to ensure the dominance \(F\ge G\)) throughout this paper. In the next two sections we propose two estimators for this model.

3 MELE of the parameters

Let \((T_1,\ldots ,T_{m+n})^\top = ( X_1,\ldots ,X_m,Y_1,\ldots ,Y_n)^\top \) be the pooled data. For the MELE, F is modeled as a step function with jump \(p_i=dF(T_i)\) at \(T_i\) such that \(\sum _{i=1}^{m+n}p_i=1\). As a result, \(dH(Y_j)=p_{m+j}\left[ (1-\lambda )+\lambda e^{\alpha +\beta Y_j}\right] \). Consequently, the empirical likelihood function of (4) with \(r(y)=y\) and \(\beta >0\) is

$$\begin{aligned} L(\lambda ,\alpha , \beta ) \ = \ \prod _{i=1}^m dF(X_i)\prod _{j=1}^n dH(Y_j) \ = \ \prod _{i=1}^{m+n}p_i \prod _{j=1}^n \left[ (1-\lambda )+\lambda e^{\alpha +\beta Y_j}\right] , \end{aligned}$$

subject to the constraints \(\beta > 0\), \(0<\lambda <1\), \(p_i\ge 0\), \(\sum \nolimits _{i=1}^{m+n}p_i=1\), and \(\sum \nolimits _{i=1}^{m+n}p_{i} e^{\alpha +\beta T_i}=1\). The constraint \(\beta >0\) does not limit the use of model (4) and the proposed estimation methods, as we can always switch F with G if \(\beta <0\) in model (4). Though in this work we assume both F and G are continuous populations, we can see that the above likelihood is also valid for discrete populations. To find the MELE, we use Lagrange multipliers and maximize

$$\begin{aligned} \sum _{i=1}^{m+n}\log p_i + \sum _{j=1}^{n} \log \left[ (1-\lambda )+\lambda e^{\alpha +\beta Y_j}\right] -t_1 \left[ \sum _{i=1}^ {m+n} p_i-1 \right] -t_2\left[ \sum _{i=1}^{m+n}p_{i} e^{\alpha +\beta T_i}-1 \right] . \end{aligned}$$

By taking partial derivatives we obtain (details in the supplementary document)

$$\begin{aligned} p_{i} \ = \ \frac{1}{(m+n)\left[ 1+\rho _N \lambda \left( e^{\alpha +\beta T_{i}}-1\right) \right] }, \end{aligned}$$
(6)

where \(\rho _N=n/(m+n)\) with \(N=m+n\). Therefore, ignoring a constant, the empirical log-likelihood function is

$$\begin{aligned} l(\lambda ,\alpha ,\beta ) \ \propto \ \sum _{j=1}^{n} \log \left[(1-\lambda )+\lambda e^{\alpha +\beta Y_{j}}\right]-\sum _{i=1}^{m+n} \log \left[1+\rho _N \lambda (e^{\alpha +\beta T_{i}} -1)\right]. \end{aligned}$$
(7)

Let \({\hat{\theta }}_{\text {MELE}}=({\hat{\lambda }}_{\text {MELE}},{\hat{\alpha }}_{\text {MELE}},\hat{\beta }_{\text {MELE}})^\top \) denote the MELE of \(\theta \), i.e., the maximizer of the log-likelihood function l in (7). In our numerical studies, we used the optim function in R to find \(\hat{\theta }_{\text {MELE}}\). Then the MELE of p(y) in (2) is

$$\begin{aligned} {\hat{p}}_{\text {MELE}} (y)= \frac{{\hat{\lambda }}_{ \text {MELE}} \exp [{\hat{\alpha }}_{ \text {MELE}}+{\hat{\beta }}_{ \text {MELE}} y]}{(1-{\hat{\lambda }}_{ \text {MELE}})+{\hat{\lambda }}_{ \text {MELE}} \exp [{\hat{\alpha }}_{ \text {MELE}}+{\hat{\beta }}_{ \text {MELE}} y]}. \end{aligned}$$
(8)
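To make the computation concrete, the following R sketch maximizes (7) with optim and then evaluates (8); the function names and the use of L-BFGS-B box constraints to keep \(\lambda \in (0,1)\) and \(\beta >0\) are our own illustrative choices, not a prescription of the paper.

```r
## A minimal sketch of computing the MELE by maximizing (7);
## x, y are the two samples and init an initial value for (lambda, alpha, beta).
mele <- function(x, y, init) {
  t <- c(x, y)
  rho <- length(y) / length(t)                  # rho_N = n / (m + n)
  negloglik <- function(par) {                  # negative of l in (7)
    lam <- par[1]; a <- par[2]; b <- par[3]
    -sum(log(1 - lam + lam * exp(a + b * y))) +
      sum(log(1 + rho * lam * (exp(a + b * t) - 1)))
  }
  optim(init, negloglik, method = "L-BFGS-B",   # box constraints keep
        lower = c(1e-6, -Inf, 1e-6),            # lambda in (0,1) and beta > 0
        upper = c(1 - 1e-6, Inf, Inf))$par
}

## MELE of p(y) in (2), following (8)
p_hat <- function(ynew, theta) {
  num <- theta[1] * exp(theta[2] + theta[3] * ynew)
  num / (1 - theta[1] + num)
}
```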

We now examine the asymptotic properties of the MELE \({\hat{\theta }}_{\text {MELE}}\). Define

$$\begin{aligned} S_{N} \ = \ -\frac{1}{N} \begin{pmatrix} \displaystyle \frac{\partial ^{2}l}{\partial \lambda ^2} &{}&{} \displaystyle \frac{\partial ^{2}l}{\partial \lambda \partial \alpha } &{}&{}\displaystyle \frac{\partial ^{2}l}{\partial \lambda \partial \beta } \\ \\ \displaystyle \frac{\partial ^{2}l}{\partial \alpha \partial \lambda } &{}&{} \displaystyle \frac{\partial ^{2}l}{\partial \alpha ^{2}} &{}&{} \displaystyle \frac{\partial ^{2}l}{\partial \alpha \partial \beta } \\ \\ \displaystyle \frac{\partial ^{2}l}{\partial \beta \partial \lambda } &{}&{} \displaystyle \frac{\partial ^{2}l}{\partial \beta \partial \alpha } &{}&{} \displaystyle \frac{\partial ^{2}l}{\partial \beta ^{2}} \end{pmatrix} \end{aligned}$$
(9)

with l given in (7), and let

$$\begin{aligned} S \ = \ \int \left( \frac{\partial w_1(y)}{\partial \theta } \right) \left( \frac{\partial w_1(y)}{\partial \theta } \right) ^\top \frac{f}{w_1w_2}(y){\text {d}}y \ = \ \begin{pmatrix} S_{11} &{}&{} S_{12} &{}&{} S_{13}\\ S_{12}&{}&{} S_{22} &{}&{} S_{23} \\ S_{13} &{}&{} S_{23} &{}&{} S_{33}\\ \end{pmatrix}, \end{aligned}$$
(10)

where

$$\begin{aligned} \begin{array}{l} \displaystyle S_{11} \ = \ \int (e^{\alpha +\beta y}-1)^2 \cdot \frac{f}{w_1w_2}(y) \; {\text {d}}y, \\ \displaystyle S_{22} \ = \ \lambda ^2 \int e^{2\alpha +2\beta y} \cdot \frac{f}{w_1w_2}(y) \; {\text {d}}y, \\ \displaystyle S_{33} \ = \ \lambda ^2 \int y^2 e^{2\alpha +2\beta y} \cdot \frac{f}{w_1w_2}(y) \; {\text {d}}y, \\ \displaystyle S_{12} \ = \ \lambda \int e^{\alpha +\beta y}(e^{\alpha +\beta y}-1 ) \cdot \frac{f}{w_1w_2}(y)\; {\text {d}}y, \\ \displaystyle S_{13} \ = \ \lambda \int y e^{\alpha +\beta y}(e^{\alpha +\beta y}-1) \cdot \frac{f}{w_1w_2}(y)\; {\text {d}}y, \\ \displaystyle S_{23} \ = \ \displaystyle \lambda ^2 \int ye^{2\alpha +2\beta y} \cdot \frac{f}{w_1w_2}(y)\; {\text {d}}y \\ \end{array} \end{aligned}$$

with

$$\begin{aligned} w_1(y) \ = \ 1-\lambda +\lambda e^{\alpha +\beta y} \ \ \mathrm{and \ } \ \ w_2(y) \ = \ 1-\rho \lambda +\rho \lambda e^{\alpha +\beta y}. \end{aligned}$$
(11)

Since \(f\in {\mathcal {F}}\) and \(r(x)=x\), which implies \(\int e^{\beta y}f(y){\text {d}}y<\infty \), it is easy to show that all the \(S_{ij}\)’s defined above are finite.

Lemma 1

Assume \(\rho _{N}\rightarrow \rho \) as \(N\rightarrow \infty \), with \(0<\rho <1\). Then \(S_{N} \overset{P}{\longrightarrow }\rho (1-\rho ) S\) as \(N\rightarrow \infty \), where \(S_N\) and S are defined in (9) and (10) respectively and \( \overset{P}{\rightarrow }\) denotes convergence in probability.

The condition \(\rho _{N}\rightarrow \rho \) with \(0<\rho <1\), as \(N\rightarrow \infty \), in Lemma 1 requires that the sample sizes m and n converge to infinity at the same rate. With the results in Lemma 1, the following theorem gives the asymptotic normality of the MELE \({\hat{\theta }}_{\text {MELE}}\). Let

$$\begin{aligned} \begin{array}{lll} V &{} = &{} \displaystyle \int \left( \frac{\partial w_1(y)}{\partial \theta } \right) \left( \frac{\partial w_1(y)}{\partial \theta } \right) ^ \top \frac{f}{w_1w_2}(y){\text {d}}y - \int \frac{\partial w_1(y)}{\partial \theta } \frac{f}{w_2}(y){\text {d}}y \int \left( \frac{\partial w_1(y)}{\partial \theta }\right) ^\top \frac{f}{w_2}(y){\text {d}}y \\ &{} = &{} \displaystyle S- \int \frac{\partial w_1(y)}{\partial \theta } \frac{f}{w_2}(y){\text {d}}y \int \left( \frac{\partial w_1(y)}{\partial \theta }\right) ^\top \frac{f}{w_2}(y){\text {d}}y \\ &{} = &{} \begin{pmatrix} V_{11} &{} V_{12} &{} V_{13}\\ V_{12} &{} V_{22} &{} V_{23}\\ V_{13} &{} V_{23} &{} V_{33} \end{pmatrix}, \end{array} \end{aligned}$$
(12)

where

$$\begin{aligned} \begin{array}{l} V_{11} \ = \ \int (e^{\alpha +\beta y}-1)^2\frac{f}{w_1w_2}(y)\;{\text {d}}y-\left[ \int (e^{\alpha +\beta y}-1)\frac{f}{w_2}(y)\; {\text {d}}y\right] ^2, \\ V_{22} \ = \ \lambda ^2 \int e^{2\alpha +2\beta y}\frac{f}{w_1 w_2}(y)\; {\text {d}}y -\lambda ^2 \left[ \int e^{\alpha +\beta y}\frac{f}{w_2}(y)\; {\text {d}}y\right] ^2, \\ V_{33} \ = \ \lambda ^2 \int y^2 e^{2\alpha +2\beta y}\frac{f}{w_1 w_2}(y)\; {\text {d}}y -\lambda ^2\left[ \int ye^{\alpha +\beta y}\frac{f}{w_2}(y)\; {\text {d}}y\right] ^2, \\ V_{12} \ = \ \lambda \int e^{\alpha +\beta y} (e^{\alpha +\beta y}-1)\frac{f}{w_1w_2}(y)\;{\text {d}}y - \ \lambda \int (e^{\alpha +\beta y}-1)\frac{f}{w_2}(y)\; {\text {d}}y \int e^{\alpha +\beta y}\frac{f}{w_2}(y) \; {\text {d}}y, \\ V_{13} \ = \ \lambda \int ye^{\alpha +\beta y} (e^{\alpha +\beta y}-1)\frac{f}{w_1w_2}(y)\;{\text {d}}y - \ \lambda \int (e^{\alpha +\beta y}-1)\frac{f}{w_2}(y)\; {\text {d}}y \int y e^{\alpha +\beta y}\frac{f}{w_2}(y) \; {\text {d}}y , \\ V_{23} \ = \ \lambda ^2 \int ye^{2\alpha +2\beta y} \frac{f}{w_1w_2}(y)\;{\text {d}}y - \ \lambda ^2 \int e^{\alpha +\beta y}\frac{f}{w_2}(y)\; {\text {d}}y \int y e^{\alpha +\beta y}\frac{f}{w_2}(y) \; {\text {d}}y, \end{array} \end{aligned}$$

with \(w_1\) and \(w_2\) defined in (11).

Theorem 2

Assume \(\rho _{N}\rightarrow \rho \) as \(N\rightarrow \infty \) with \(0<\rho <1\) and S defined in (10) is invertible. Then under model (4) with parameter space (5) and some regularity conditions (for MELE in general; see Qin and Lawless, 1994),

$$\begin{aligned} \sqrt{N}\begin{pmatrix} \hat{\lambda }_{\text {MELE}}-\lambda \\ \hat{\alpha }_{\text {MELE}}-\alpha \\ \hat{\beta }_{\text {MELE}}-\beta \end{pmatrix} \ {\mathop {\longrightarrow }\limits ^{D}} \ N\left( 0,\varSigma \right) , \end{aligned}$$

where \(\varSigma =\frac{1}{\rho (1-\rho )}S^{-1}VS^{-1}\) with S and V defined in (10) and (12) respectively, and \( \overset{D}{\rightarrow }\) denotes convergence in distribution. In addition, V and further \(\varSigma \) are positive definite.

In Theorem 2, the assumption of S being invertible is generally true except for some special cases such as \(\lambda =0\), or \(\alpha =\beta =0\), or the support of f being a single point. When \(0<\lambda <1\), \(\beta \ne 0\) and f is continuous (or under the even weaker condition that the support of f contains at least two points), we would expect S to be invertible. With the results in Theorem 2, one can easily make MELE-based inferences about the parameter, such as the Wald test, when the unknown \(\theta \) in \(\varSigma \) is replaced with a consistent estimator such as the MELE.
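As an illustration, the following R sketch carries out such a Wald-type inference for \(\lambda \); the integrals in (10) and (12) are estimated by sample averages over the X-sample, a plug-in choice of ours that is not prescribed by the theory above.

```r
## A minimal sketch of a Wald confidence interval for lambda based on
## Theorem 2; theta is a consistent estimate such as the MELE.
wald_ci_lambda <- function(x, y, theta, level = 0.95) {
  N <- length(x) + length(y); rho <- length(y) / N
  lam <- theta[1]; a <- theta[2]; b <- theta[3]
  e <- exp(a + b * x)
  w1 <- 1 - lam + lam * e                        # w_1 in (11) at the X's
  w2 <- 1 - rho * lam + rho * lam * e            # w_2 in (11) at the X's
  dw <- rbind(e - 1, lam * e, lam * x * e)       # dw_1/dtheta at the X's
  S  <- (dw / rep(w1 * w2, each = 3)) %*% t(dw) / length(x)  # estimates (10)
  m1 <- rowMeans(dw / rep(w2, each = 3))
  V  <- S - m1 %o% m1                            # estimates (12)
  Sigma <- solve(S) %*% V %*% solve(S) / (rho * (1 - rho))
  theta[1] + c(-1, 1) * qnorm((1 + level) / 2) * sqrt(Sigma[1, 1] / N)
}
```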

4 MHDE of the parameters

Though the MELE is efficient, it is generally not robust against outliers and model misspecification. As a robust alternative, we propose in this section an MHDE for the unknown parameters in model (4).

The Hellinger distance between two functions \(f_1\) and \(f_2\) is defined as \(\Vert f_1^{1/2}-f_2^{1/2}\Vert \), the \(L^2\)-norm of the difference of their square roots. For a fully parametric model \(\{h_\theta :\theta \in \varTheta \}\) with \(\varTheta \) the parameter space, the MHDE of \(\theta \) is defined as

$$\begin{aligned} {\hat{\theta }}_{\text {MHDE}} \ = \ \underset{t\in \varTheta }{\arg \min } \left\| h_t^{1/2}-{\hat{h}}^{1/2}\right\| , \end{aligned}$$
(13)

where \({\hat{h}}\) is an appropriate nonparametric estimator of \(h_\theta \). The MHDE was first introduced by Beran (1977) for fully parametric models, who showed that it has both full efficiency and good robustness properties. Lindsay (1994) outlined the comparison between MHDE and MLE in terms of robustness and efficiency and showed that MHDE and MLE are members of a larger class of efficient estimators with various second-order efficiency properties. However, the literature on the MHDE for mixture models is not abundant. Lu et al. (2003) considered the MHDE for mixtures of Poisson regression models. The MHDE of mixture complexity for finite mixture models was investigated by Woo and Sriram (2006, 2007). Recently, the MHDE has been extended from parametric models to semiparametric models. Wu et al. (2010) proposed an MHDE for two-sample case-control data under model (3) and investigated the asymptotic properties and robustness of the proposed estimator. Xiang et al. (2014) proposed a minimum profile Hellinger distance estimator (MPHDE) for the two-component semiparametric mixture model studied by Bordes et al. (2006), where one component is known and the other is an unknown symmetric function with unknown location parameter. Wu et al. (2017) and Wu and Zhou (2018) proposed algorithms for calculating the MPHDE and the MHDE, respectively, for the two-component semiparametric location-shifted mixture model. Inspired by these works, we propose to use the MHDE to estimate the parameters in (4).

In model (4), even though \(\alpha \) and \(\beta \) can possibly take any value on the real line, we can essentially use intervals that are large enough to cover their true values. So for practical purposes, we can assume that \(\theta \in \varTheta \) with \(\varTheta \) a compact subset of \(\mathbb {R}^{3}\). To give the MHDE for model (4), note that the MHDE defined in (13) is not available in practice since the f in \(h_t (x)=(1-t_1+t_1 e^{t_2+t_3 x})f(x)\), with \(t=(t_1,t_2,t_3)^\top \), is unknown. Intuitively, we can use an appropriate nonparametric density estimator, say \(f_{m}\), based on \(X_1,\dots ,X_m\) to replace f and apply the plug-in rule to estimate \(h_t\) by

$$\begin{aligned} {\hat{h}}_t(x) \ = \ (1-t_1+t_1 e^{t_2+t_3 x}) f_{m}(x). \end{aligned}$$
(14)

With another appropriate nonparametric density estimator, say \(h_n\), of \(h_\theta \) based on \(Y_1,\dots ,Y_n\), we define the MHDE of \(\theta =(\lambda ,\alpha ,\beta )^\top \) as

$$\begin{aligned} \hat{\theta }_{\text {MHDE}} \ = \ T(f_m,h_n) \ = \ \underset{t\in \varTheta }{\arg \min } \left\| {\hat{h}}_t^{1/2}-h_n^{1/2}\right\| . \end{aligned}$$
(15)

That is, \(\hat{\theta }_{\text {MHDE}}\) is the minimizer t of the Hellinger distance between the estimated parametric model \(\hat{h}_t\) and the nonparametric density estimator \(h_{n}\). In this work, we use kernel density estimators

$$\begin{aligned} f_{m}(x) \ = \ \frac{1}{mb_{m}}\sum _{i=1}^{m}K_{0}\left( \frac{x-X_{i}}{b_{m}}\right) , \end{aligned}$$
(16)
$$\begin{aligned} h_{n}(x) \ = \ \frac{1}{nb_{n}}\sum _{j=1}^{n}K_{1}\left( \frac{x-Y_{j}}{b_{n}}\right) , \end{aligned}$$
(17)

where \(K_{0}\) and \(K_{1}\) are kernel p.d.f.s and the bandwidths \(b_{m}\) and \(b_{n}\) are positive sequences such that \(b_{m}\rightarrow 0\) as \(m \rightarrow \infty \) and \(b_{n} \rightarrow 0\) as \(n \rightarrow \infty \). The MHDE of p(y) in (2) is given by

$$\begin{aligned} {\hat{p}}_{\text {MHDE}} (y)= \frac{{\hat{\lambda }}_{\text {MHDE}} \exp [{\hat{\alpha }}_{\text {MHDE}}+{\hat{\beta }}_{\text {MHDE}} y]}{(1-{\hat{\lambda }}_{\text {MHDE}})+{\hat{\lambda }}_{\text {MHDE}} \exp [{\hat{\alpha }}_{\text {MHDE}}+{\hat{\beta }}_{\text {MHDE}} y]}. \end{aligned}$$
(18)

Note that in (15) we do not impose any restriction on \(\hat{h}_t\) to make it a density function. The reason is that a \(t\in \varTheta \) that fails to make \(\hat{h}_t\) a density can still make \(h_t\) a density. The true parameter value \(\theta \) may not make \(\hat{h}_{\theta }\) a density, but it is not reasonable to exclude \(\theta \) as the estimate \({\hat{\theta }}_{\text {MHDE}}\) of itself. As an explicit expression of \({\hat{\theta }}_{\text {MHDE}}\) does not exist, one needs to use iterative methods such as Newton–Raphson to calculate it numerically. Karunamuni and Wu (2011) have shown for parametric models that, with an appropriate initial value, even a one-step iteration works well and gives a quite accurate approximation of the MHDE.
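As a concrete alternative to a Newton–Raphson iteration, the following R sketch minimizes the squared Hellinger distance directly, with the integral replaced by a Riemann sum on a grid and the kernel estimates (16)–(17) computed by density with its default Gaussian-kernel settings; these numerical choices are our own.

```r
## A minimal sketch of computing the MHDE in (15);
## x, y are the two samples and init an initial value for (lambda, alpha, beta).
mhde <- function(x, y, init) {
  grid <- seq(min(c(x, y)) - 3, max(c(x, y)) + 3, length.out = 512)
  dx <- grid[2] - grid[1]
  fm <- density(x, from = grid[1], to = grid[512], n = 512)$y   # (16)
  hn <- density(y, from = grid[1], to = grid[512], n = 512)$y   # (17)
  hd2 <- function(par) {                    # squared Hellinger distance
    lam <- par[1]; a <- par[2]; b <- par[3]
    ht <- (1 - lam + lam * exp(a + b * grid)) * fm    # hat{h}_t in (14)
    sum((sqrt(ht) - sqrt(hn))^2) * dx
  }
  optim(init, hd2, method = "L-BFGS-B",
        lower = c(1e-6, -Inf, 1e-6), upper = c(1 - 1e-6, Inf, Inf))$par
}
```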

We now examine the asymptotic properties of the MHDE \(\hat{\theta }_{\text {MHDE}}\) given in (15). Let \({\mathcal {H}}\) be the set of all p.d.f.s with respect to Lebesgue measure on the real line. The conditions (D1)–(D3) and (C0)–(C11) are listed in the Appendix. The following theorem gives the existence, Hellinger-metric consistency, and uniqueness of the proposed MHDE.

Theorem 3

Suppose \(\varTheta \) is a compact parameter space and (D1) holds for all \(t\in \varTheta \). Then

  (i) For every \(\varphi \in {\mathcal {H}}\) and \(f_m\) defined in (16) with \(K_0\) compactly supported, there exist \(T(f,\varphi )\) and \(T(f_m, \varphi )\) in \(\varTheta \) satisfying (15).

  (ii) Suppose that \(m,n\rightarrow \infty \) as \(N\rightarrow \infty \) and \(\theta =T(f, \varphi )\) is unique for \(\varphi \in {\mathcal {H}}\). Then \(\theta _N = T(f_m,\varphi _n)\rightarrow \theta \) as \(N\rightarrow \infty \) for any density sequences \(f_m\) and \(\varphi _n\) such that \(\Vert \varphi _n^{1/2}-\varphi ^{1/2}\Vert \rightarrow 0\) and \(\sup _{t\in \varTheta }\Vert {\hat{h}}_t^{1/2}-h_t^{1/2}\Vert \rightarrow 0\) as \(N\rightarrow \infty \) with \({\hat{h}}_t\) given in (14).

  (iii) \(T(f,h_\theta )=\theta \) uniquely for any \(\theta \in \varTheta \).

With the results given in Theorem 3, the consistency of our proposed MHDE can be proved and is presented in the following theorem.

Theorem 4

Let \(m,n\rightarrow \infty \) as \(N\rightarrow \infty \). Suppose that (D1) holds for any \(\theta \in \varTheta \) and the bandwidths \(b_{m}\) and \(b_{n}\) in (16) and (17) respectively satisfy \(b_{m},b_{n}\rightarrow 0\) and \(m b_{m},n b_{n}\rightarrow \infty \) as \(N\rightarrow \infty \). Further suppose that either (D2) or (D3) holds. Then \(\Vert f_m^{1/2}-f^{1/2}\Vert {\mathop {\rightarrow }\limits ^{P}} 0\), \(\Vert h_{n}^{1/2}-h_{\theta }^{1/2}\Vert {\mathop {\rightarrow }\limits ^{P}} 0\) and \(\sup _{t\in \varTheta }\Vert \hat{h}_t^{1/2}-h_t^{1/2}\Vert {\mathop {\rightarrow }\limits ^{P}}0\) as \(N\rightarrow \infty \). Furthermore, \(\hat{\theta }_{\text {MHDE}}{\mathop {\rightarrow }\limits ^{P}}\theta \) as \(N\rightarrow \infty \), where \(\hat{\theta }_{\text {MHDE}}\) is defined in (15) with \(f_{m}\), \(h_{n}\) and \(\hat{h}_t\) given in (16), (17) and (14), respectively.

We derive in the next theorem an expression for the estimation error \(\hat{\theta }_{\text {MHDE}}-\theta \). Define the symmetric matrix

$$\begin{aligned} \varDelta (\theta )\ = \ \int \left( \frac{\partial w_1(x)}{\partial \theta }\right) \left( \frac{\partial w_1(x)}{\partial \theta }\right) ^\top \frac{f}{w_1}(x){\text {d}}x \ = \ \begin{pmatrix} \varDelta _{11}(\theta ) &{} \varDelta _{12}(\theta ) &{} \varDelta _{13}(\theta ) \\ \varDelta _{12}(\theta ) &{} \varDelta _{22}(\theta ) &{} \varDelta _{23}(\theta ) \\ \varDelta _{13}(\theta ) &{} \varDelta _{23}(\theta ) &{} \varDelta _{33}(\theta ) \end{pmatrix}, \end{aligned}$$
(19)

where

$$\begin{aligned} \begin{array}{l} \varDelta _{11} (\theta ) \ = \ \int \left( \frac{\partial w_1(x)}{\partial \lambda }\right) ^2 \frac{f}{w_1}(x){\text {d}}x \ = \ \int (e^{\alpha +\beta x}-1)^{2}\frac{f}{w_1}(x){\text {d}}x, \\ \varDelta _{22} (\theta ) \ = \ \int \left( \frac{\partial w_1(x)}{\partial \alpha }\right) ^2 \frac{f}{w_1}(x){\text {d}}x \ = \ \lambda ^2 \int e^{2\alpha +2\beta x}\frac{f}{w_1}(x){\text {d}}x, \\ \varDelta _{33} (\theta ) \ = \ \int \left( \frac{\partial w_1(x)}{\partial \beta }\right) ^2 \frac{f}{w_1}(x){\text {d}}x \ = \ \lambda ^2 \int x^2 e^{2\alpha +2\beta x}\frac{f}{w_1}(x){\text {d}}x, \\ \varDelta _{12} (\theta ) \ = \ \int \frac{\partial w_1(x)}{\partial \lambda } \frac{\partial w_1(x)}{\partial \alpha } \frac{f}{w_1}(x){\text {d}}x \ = \ \lambda \int e^{\alpha +\beta x} (e^{\alpha +\beta x}-1)\frac{f}{w_1}(x){\text {d}}x, \\ \varDelta _{13} (\theta ) \ = \ \int \frac{\partial w_1(x)}{\partial \lambda } \frac{\partial w_1(x)}{\partial \beta } \frac{f}{w_1}(x){\text {d}}x \ = \ \lambda \int x e^{\alpha +\beta x} (e^{\alpha +\beta x}-1)\frac{f}{w_1}(x){\text {d}}x, \\ \varDelta _{23} (\theta ) \ = \ \int \frac{\partial w_1(x)}{\partial \alpha } \frac{\partial w_1(x)}{\partial \beta } \frac{f}{w_1}(x){\text {d}}x \ = \ \lambda ^2 \int x e^{2\alpha +2\beta x}\frac{f}{w_1}(x){\text {d}}x. \end{array} \end{aligned}$$

Let

$$\begin{aligned} A_N(\theta ) \ = \ \int \frac{\partial w_1}{\partial \theta }(x) \left[ \frac{f_{m}^{1/2}h_{n}^{1/2}}{ w_1^{1/2}}(x) -f_{m}(x)\right] \; {\text {d}}x \ = \ \begin{pmatrix} A_{N1}(\theta )\\ A_{N2}(\theta )\\ A_{N3}(\theta ) \end{pmatrix}, \end{aligned}$$
(20)

where

$$\begin{aligned} \begin{array}{l} A_{N1}(\theta ) \ = \ \int (e^{\alpha +\beta x}-1) \left[ \frac{f_{m}^{1/2}h_{n}^{1/2}}{ w_1^{1/2}}(x) -f_{m}(x)\right] \;{\text {d}}x, \\ A_{N2}(\theta ) \ = \ \int \lambda e^{\alpha +{\beta } x} \left[ \frac{f_{m}^{1/2}h_{n}^{1/2}}{ w_1^{1/2}}(x) -f_{m}(x)\right] \;{\text {d}}x, \\ A_{N3}(\theta ) \ = \ \int \lambda x e^{\alpha +{\beta } x} \left[ \frac{f_{m}^{1/2}h_{n}^{1/2}}{ w_1^{1/2}}(x)-f_{m}(x)\right] \;{\text {d}}x. \end{array} \end{aligned}$$

Theorem 5

Suppose \(\theta \in \mathrm{int}(\varTheta )\), \(\varDelta (\theta )\) defined in (19) is invertible, and (C2) and the assumptions in Theorem 4 hold. Then it follows that

$$\begin{aligned} \hat{\theta }_{\text {MHDE}}-\theta \ = \ 2\left[ \varDelta ^{-1}(\theta )+R_{N}\right] A_N(\theta ), \end{aligned}$$
(21)

where \(\hat{\theta }_{\text {MHDE}}\) is defined by (15) and \(R_{N}\) is a \(3\times 3\) matrix with elements tending to zero in probability as \(N\rightarrow \infty \).

From this result, we obtain immediately the asymptotic distribution of the MHDE.

Theorem 6

Suppose that \({\hat{\theta }}_{\text {MHDE}}\) defined in (15) satisfies (21). Further suppose that conditions (C0)–(C9) hold. Then the asymptotic distribution of \(\sqrt{N} (\hat{\theta }_{\text {MHDE}}-\theta )\) is \(N(0,\varSigma )\), where \(\varSigma \) is defined by

$$\begin{aligned} \begin{array}{lll} \varSigma &{} = &{} \displaystyle \varDelta ^{-1}(\theta )\left[ \frac{1}{1-\rho }{\bar{\varDelta }}(\theta )+\frac{1}{\rho }\varDelta (\theta )\right] \varDelta ^{-1}(\theta ) \\ &{} = &{} \displaystyle \frac{1}{\rho (1-\rho )} \varDelta ^{-1}(\theta ) \left[ \varDelta (\theta )+\rho \left( {\bar{\varDelta }}(\theta )-\varDelta (\theta ) \right) \right] \varDelta ^{-1}(\theta ) \end{array} \end{aligned}$$

with

$$\begin{aligned} \begin{array}{c} {\bar{\varDelta }}(\theta ) \ = \ \int \frac{\partial w_1}{\partial \theta }(x)\left[ \frac{\partial w_1}{\partial \theta }(x)\right] ^\top f(x) {\text {d}}x, \\ \varDelta (\theta ) \ = \ \int \frac{\partial w_1}{\partial \theta }(x)\left[ \frac{\partial w_1}{\partial \theta }(x)\right] ^\top \frac{f}{w_1}(x) {\text {d}}x. \end{array} \end{aligned}$$

In Theorems 5 and 6, the assumption of \(\varDelta (\theta )\) being invertible is generally true except for some special cases such as \(\lambda =0\), or \(\alpha =\beta =0\), or the support of f being a single point.

Since the MELE is a likelihood-based method, we expect it to be asymptotically more efficient than the MHDE. Looking at the asymptotic covariance matrices given in Theorems 2 and 6 for the MELE and MHDE respectively, we note that, roughly speaking, \(|S_{ij}|\) is smaller than \(|\varDelta _{ij}|\) while \(|{\bar{\varDelta }}_{ij}|\) is much bigger than both, due to the fact that the denominator \(w_1\) or \(w_2\) can control the exponential terms in the integrands. As a result, the asymptotic variances of the MELEs of \(\lambda \), \(\alpha \) and \(\beta \) are smaller than those of the corresponding MHDEs.

To see the asymptotic relative efficiency of the MHDE with respect to the MELE, we take the normal mixture \((1-\lambda )N(0,1)+\lambda N(\mu ,1)\), a model examined also in our later simulation studies, as an example. Table 1 below presents the asymptotic covariance matrices of the MELE and MHDE for \(\rho =0.5\) and various parameter settings \(\mu =.5, 1, 2\) (i.e., \((\alpha ,\beta )=(-.125, .5), (-.5,1), (-2,2)\) respectively) and \(\lambda =.05, .20, .50, .80, .95\). From Table 1 we see that, as expected, the elements in the covariance matrices of the MELE are almost always smaller, in absolute value, than those of the MHDE. The difference in the asymptotic variances of the MELE and MHDE of \(\lambda \) or \(\beta \) is very small for the smaller \(\mu \) (\(=.5\)) and smaller \(\lambda \) values. The relative efficiency of the MHDE with respect to the MELE deteriorates when \(\lambda \) increases or, especially, when \(\mu \) increases. We also observe from Table 1 that the asymptotic variances of both the MELE and MHDE of \(\alpha \) and \(\beta \) decrease when \(\lambda \) increases, which is explained by the fact that the information about \(\alpha \) and \(\beta \) is contained in the second component only and the amount of such information increases when the mixing proportion \(\lambda \) of the second component increases. Comparatively, when \(\lambda \) increases, the asymptotic variances of both the MELE and MHDE of \(\lambda \) increase first, reach their maxima around \(\lambda =0.5\) and then decrease. This phenomenon can be explained by the fact that the estimation of \(\lambda \) is associated with a Bernoulli random variable with parameter \(\lambda \), whose variance \(\lambda (1-\lambda )\), as a function of \(\lambda \), has the same increasing-decreasing trend.

Table 1 Asymptotic covariance matrices of the MELE and MHDE under normal mixture \((1-\lambda )N(0,1)+\lambda N(\mu ,1)\) with \(\rho =0.5\)
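The MELE entries of Table 1 can be reproduced numerically; the following R sketch computes \(\varSigma \) of Theorem 2 for this normal mixture by one-dimensional numerical integration and should match the tabulated values up to integration error.

```r
## A minimal sketch of the asymptotic covariance of the MELE (Theorem 2)
## under the normal mixture (1 - lambda) N(0,1) + lambda N(mu,1),
## for which alpha = -mu^2/2 and beta = mu.
asym_cov_mele <- function(lam, mu, rho = 0.5) {
  a <- -mu^2 / 2; b <- mu
  dw <- function(y, i) switch(i, exp(a + b * y) - 1,       # dw1/dlambda
                                 lam * exp(a + b * y),     # dw1/dalpha
                                 lam * y * exp(a + b * y)) # dw1/dbeta
  w1 <- function(y) 1 - lam + lam * exp(a + b * y)
  w2 <- function(y) 1 - rho * lam + rho * lam * exp(a + b * y)
  S <- outer(1:3, 1:3, Vectorize(function(i, j) integrate(function(y)
         dw(y, i) * dw(y, j) * dnorm(y) / (w1(y) * w2(y)), -Inf, Inf)$value))
  m1 <- sapply(1:3, function(i) integrate(function(y)
         dw(y, i) * dnorm(y) / w2(y), -Inf, Inf)$value)
  V <- S - m1 %o% m1
  solve(S) %*% V %*% solve(S) / (rho * (1 - rho))   # Sigma in Theorem 2
}
## e.g., asym_cov_mele(.20, 1) gives the 3 x 3 matrix for lambda = .20, mu = 1
```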

5 Test of the semiparametric model

In this section we discuss testing the validity of the semiparametric mixture model (4), or equivalently the relationship (3) between f and g, with \(r(x)=x\). Several goodness-of-fit test statistics for testing the relationship (3) in case-control studies are available in the literature; see, for example, Qin and Zhang (1997), Zhang (1999, 2001, 2006), and Deng et al. (2009). For our semiparametric mixture model (4), we will construct Kolmogorov–Smirnov (K–S) type statistics based on the proposed MELE and MHDE.

The idea of the K–S test statistic is to use the discrepancy between two c.d.f. estimates, one with the model assumption and the other without, to assess the validity of a model. For model (4) with \(r(x)=x\), we can use the empirical c.d.f. based on the first sample \(X_i\)’s as the first estimate and the MELE or MHDE based on both samples \(X_i\)’s and \(Y_i\)’s exploiting (4) as the second. We first look at the special case of testing \(\beta =0\) in model (4). When \(\beta =0\), the normalization constraint \(\int e^{\alpha +\beta x} f(x){\text {d}}x=1\) forces \(\alpha =0\) as well, which implies the equality of the two components F and G and further the equality of F and H. The two-sample K–S statistic for testing the equality of F and H is

$$\begin{aligned} \sup _{t}|\hat{F}(t)-\hat{H}(t)| \ = \ \frac{N}{n}\displaystyle \sup _{t}\left| \hat{F}(t)-\tilde{F}_{0}(t)\right| , \end{aligned}$$

where \(N=m+n\), and \({\hat{F}}\), \({\hat{H}}\) and \({\tilde{F}}_0\) denote the empirical c.d.f.s based on the \(X_i\)’s, \(Y_i\)’s and \(T_i\)’s respectively, with \((T_1,\dots , T_N)^\top =(X_1,\dots ,X_m,Y_1,\dots , Y_n)^\top \) the pooled sample. Note that \(\hat{F}\) and \(\hat{H}\) are the nonparametric MLEs of F and H respectively without the assumption \(F=H\), whereas \({\tilde{F}}_{0}\) is the nonparametric MLE of F under the assumption \(F=H\). Now consider the general case of testing (4) with any \(\beta \). Motivated by the construction of the K–S statistic for testing \(\beta =0\), we propose to test the validity of model (4) with \(r(x)=x\) using the test statistic

$$\begin{aligned} KS \ = \ N^{1/2}\sup _{t}\left| \hat{F}(t)-\tilde{F}(t)\right| , \end{aligned}$$
(22)

where \(\tilde{F}\) is an estimator of F based on either the MELE or the MHDE of \(\theta =(\lambda , \alpha ,\beta )^\top \) under model assumption (4). Recall from Sect. 3 that, with \(p_i=dF(T_i)\) and \(\rho _N=n/N\), the MELE \({\hat{p}}_i\) of \(p_i\) is given by (6). Then an estimator \({\tilde{F}}\) of F under model (4) is given by

$$\begin{aligned} \tilde{F}(t) \ = \ \sum _{i=1}^{N}\hat{p}_{i}I(T_{i}\le t) \ = \ \frac{1}{N}\sum _{i=1}^{N}\frac{I(T_{i}\le t)}{1-\rho _N \hat{\lambda }+\rho _N\hat{\lambda } e^{\hat{\alpha }+\hat{\beta }T_{i}}}. \end{aligned}$$
(23)

If the \({\hat{\theta }}=({\hat{\lambda }}, {\hat{\alpha }},{\hat{\beta }})^\top \) in (6) and (23) is the MELE \({\hat{\theta }}_{\text {MELE}}\) that we constructed in Sect. 3, then the resulting \({\tilde{F}}_{\text {MELE}}\) is the actual MELE of F under (4), and we denote the corresponding test statistic in (22) as \({\text {KS}}_{\text {MELE}}\). This test statistic is similar to that in Qin and Zhang (1997), but they used it for case-control data instead of our more complicated mixture model (4). Intuitively, we can also use the MHDE \({\hat{\theta }}_{\text {MHDE}}\) given in (15) for \({\hat{\theta }}\), and then we denote the resulting \({\tilde{F}}\) in (23) and KS in (22) as \({\tilde{F}}_{\text {MHDE}}\) and \({\text {KS}}_{\text {MHDE}}\), respectively.
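In R, the statistic (22) with \({\tilde{F}}\) from (23) can be computed along the following lines; evaluating the supremum at the pooled data points is a standard simplification for step functions and is our own shortcut.

```r
## A minimal sketch of the K-S statistic in (22), with F-tilde from (23);
## theta is either the MELE or the MHDE of (lambda, alpha, beta).
ks_stat <- function(x, y, theta) {
  t <- c(x, y); N <- length(t); rho <- length(y) / N
  lam <- theta[1]; a <- theta[2]; b <- theta[3]
  p <- 1 / (N * (1 - rho * lam + rho * lam * exp(a + b * t)))    # (6)
  Ftilde <- sapply(t, function(s) sum(p * (t <= s)))             # (23)
  sqrt(N) * max(abs(ecdf(x)(t) - Ftilde))     # sup over the jump points
}
```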

In our later numerical studies, we use a bootstrap procedure to find the approximate distributions and critical values of \({\text {KS}}_{\text {MELE}}\) and \({\text {KS}}_{\text {MHDE}}\). To generate bootstrap data, we randomly select independent samples \(X_i^*\)’s from \(d\tilde{F}(x)\) and \(Y_i^*\)’s from \((1-{\hat{\lambda }}+{\hat{\lambda }} e^{{\hat{\alpha }}+{\hat{\beta }} x})d\tilde{F}(x)\), where \({\hat{\theta }}\) and \({\tilde{F}}\) are either the MELEs \({\hat{\theta }}_{\text {MELE}}\) and \({\tilde{F}}_{\text {MELE}}\) or the MHDEs \({\hat{\theta }}_{\text {MHDE}}\) and \({\tilde{F}}_{\text {MHDE}}\), respectively. Note that both the \(X_i^*\)’s and \(Y_i^*\)’s are selected from the pooled data \((T_1,\dots ,T_N)\) but with different probability distribution functions. This means that some of the selected \(X_{i}^{*}\)’s could be values in the original second sample \(Y_i\)’s and some of the selected \(Y_{i}^{*}\)’s could be values in the original first sample \(X_i\)’s. Let \((T_{1}^{*},\ldots ,T_N^{*})\) denote the combined bootstrap sample and \({\hat{\theta }}^*=({\hat{\lambda }}^{*},{\hat{\alpha }}^{*},{\hat{\beta }}^{*})^\top \) be either the MELE or the MHDE based on the bootstrap samples \(X_i^*\)’s and \(Y_i^*\)’s. Then we can calculate the empirical distribution function \({\hat{F}}\) based on the \(X_i^*\)’s, and the quantities in (6) and the function in (23) based on the \(T_i^*\)’s and \({\hat{\theta }}^*\), with results denoted by \({\hat{F}}^*\), \({\hat{p}}^*_i\) and \({\tilde{F}}^*\), respectively. Then finally the bootstrap KS test statistic is

$$\begin{aligned} KS^* \ = \ N^{1/2} \sup _t \left| \hat{F}^{*}(t)-\tilde{F}^{*}(t)\right| . \end{aligned}$$

We can generate 1000 bootstrap samples to produce 1000 bootstrap KS statistic values for both \({\text {KS}}_{\text {MELE}}\) and \({\text {KS}}_{\text {MHDE}}\) at the same time. Then the distributions, and further the critical values, of \({\text {KS}}_{\text {MELE}}\) and \({\text {KS}}_{\text {MHDE}}\) can be estimated from these respective 1000 statistic values.
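One bootstrap replicate of \(KS^*\) can then be sketched in R as follows, reusing ks_stat from the earlier sketch and assuming fit is a wrapper that recomputes the MELE or MHDE (including its initial values) from a pair of samples.

```r
## A minimal sketch of one bootstrap replicate of KS*; theta and the
## weights p from (6) come from the original pooled data t of sizes m and n.
one_boot_ks <- function(t, m, n, theta, p, fit) {
  lam <- theta[1]; a <- theta[2]; b <- theta[3]
  xstar <- sample(t, m, replace = TRUE, prob = p)           # from dF-tilde
  ystar <- sample(t, n, replace = TRUE,                     # from the fitted mixture
                  prob = p * (1 - lam + lam * exp(a + b * t)))
  ks_stat(xstar, ystar, fit(xstar, ystar))
}
## e.g., crit <- quantile(replicate(1000, one_boot_ks(t, m, n, theta, p, fit)), .95)
```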

6 Simulation studies

In this section we examine through Monte Carlo simulation studies the finite-sample performance of the proposed MELE and MHDE of the parameters in model (4) and the proposed K–S tests of this model. In our simulation study, we consider the mixtures of normals, Poissons or uniforms given in Table 2. We can easily check that all five models satisfy the stochastic dominance condition; however, only models M1–M4 satisfy relationship (4). Model M5 does not satisfy (4) since the two components have different supports. Even though the focus of this paper is on continuous mixture models, we also want to check the performance of the proposed methods for discrete mixture models such as M3 and M4. For each model, the true values of \(\alpha \) and \(\beta \) are calculated and listed in Table 2.

Table 2 Mixture models considered in simulation study

6.1 Efficiency of the estimators

For each of the mixture models in Table 2, we consider different \(\lambda \) values varying from .05 to .95. Under each model, we take 1000 repeated random samplings, and for each sampling we use two different sample sizes \((m,n)=(30,30)\) and (100, 100). In the kernel density estimators \(f_m\) and \(h_n\) given in (16) and (17) respectively, the standard normal p.d.f. is used for both kernels \(K_0\) and \(K_1\), and the bandwidths \(b_m\) and \(b_n\) are chosen the same as in Silverman (1986) and Wu and Abedin (2021). The effect of the choice of kernel on the kernel estimator is minor, while the bandwidth has more influence on the finite-sample performance and the convergence rate of the nonparametric kernel estimator. Nevertheless, Theorems 4–6 indicate that the MHDE has the stated consistency and asymptotic normality if the bandwidths are of the form \(b_m,b_n=O(N^{-r})\) with r from a subinterval of (0, 1). This subinterval may depend on the populations of the two components. With the component populations unknown, we choose \(r=1/5\) in our simulation, as in Silverman (1986) and Wu and Abedin (2021), which achieves the optimal mean squared error of the kernel estimator. In addition, the bandwidths \(b_m\) and \(b_n\) of our choice are adaptive in the sense that they involve a robust scale estimate factor to incorporate the different variations of f and h.

To compute the estimates numerically, we use \(\lambda _+\), an estimator based on odds ratios given by Smith and Vounatsou (1997), as the initial value of \(\lambda \). Initial values of \(\alpha \) and \(\beta \) are calculated by exploiting the relationship (3), or equivalently

$$\begin{aligned} \log \frac{h(x)/f(x)-(1-\lambda )}{\lambda } \ = \ \alpha +\beta x. \end{aligned}$$

Thus for each \(T_i\) in the pooled sample, we generate the pair \((T_i, R_i)\), where \(R_i=\log \frac{h_n(T_i)/f_m(T_i)-(1-\lambda _+)}{\lambda _+}\). Then we use \((T_i,R_i)\), \(i=1,\dots ,N\), to fit a least-squares regression line, and the fitted coefficients are used as the initial values of \(\alpha \) and \(\beta \); a sketch of this initialization is given below. To evaluate the finite-sample performance of an estimator \({\hat{\theta }}\) of \(\theta \), we calculate the bias and mean squared error (MSE) over the 1000 replications. The results are presented in Tables 3, 4 and 5.
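A minimal R sketch of this initialization, with approxfun-based evaluation of the kernel estimates and a small guard against taking the log of a non-positive ratio (both our own choices), is:

```r
## Regression-based initial values for (lambda, alpha, beta);
## lam0 is the initial lambda (lambda_+ of Smith and Vounatsou 1997).
init_values <- function(x, y, lam0) {
  t <- c(x, y)
  fm <- approxfun(density(x), rule = 2)     # kernel estimate (16)
  hn <- approxfun(density(y), rule = 2)     # kernel estimate (17)
  ratio <- (hn(t) / fm(t) - (1 - lam0)) / lam0
  r <- log(pmax(ratio, 1e-8))               # guard against negative ratios
  c(lam0, unname(coef(lm(r ~ t))))          # (lambda, alpha, beta) initials
}
```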

Table 3 Bias (MSE) of \({\hat{\lambda }}_{\text {MELE}}\), \({\hat{\lambda }}_{\text {MHDE}}\), \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) with \(m=n=30\)
Table 4 Bias (MSE) of \({\hat{\lambda }}_{\text {MELE}}\), \({\hat{\lambda }}_{\text {MHDE}}\), \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) with \(m=n=100\)
Table 5 Bias (MSE) of \({\hat{\beta }}_{\text {MELE}}\) and \({\hat{\beta }}_{\text {MHDE}}\)

Table 3 gives the biases and MSEs of \({\hat{\lambda }}_{\text {MELE}}\) and \({\hat{\lambda }}_{\text {MHDE}}\) for all five models with varying \(\lambda \) values and sample size \(m=n=30\), whereas Table 4 is for \(m=n=100\). For the purpose of comparison, we also report the results of the two estimators \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) proposed in Wu and Abedin (2021) for model (1) directly, without assuming the relationship (3). The \({\hat{\lambda }}\) is based on cumulative distribution estimation while \({\hat{\lambda }}_L\) is the MLE of the multinomial approximation of model (1). From Tables 3 and 4 we can see that all four methods are very competitive as far as \(\lambda \) is concerned. The nonparametric estimators \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) perform slightly better than the semiparametric estimators \({\hat{\lambda }}_{\text {MELE}}\) and \({\hat{\lambda }}_{\text {MHDE}}\) when the two components are close to each other (such as M1 and M3) or the relationship (3) does not hold (M5), while \({\hat{\lambda }}_{\text {MELE}}\) and \({\hat{\lambda }}_{\text {MHDE}}\) perform better than \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) when the two components are relatively far apart (such as M2). Both the MELE and the MHDE perform surprisingly well for model M5, even though the assumed relationship (3) is not valid for M5.

The simulation results for estimating \(\beta \) are given in Table 5. Since (3) is not assumed in Wu and Abedin (2021), their methods estimate \(\lambda \) only, and thus in Table 5 we compare \({\hat{\beta }}_{\text {MELE}}\) and \({\hat{\beta }}_{\text {MHDE}}\) only. In addition, we omit the results for \(\alpha \) and model M5, as \(\alpha \) depends on \(\beta \) through the normalization and M5 does not satisfy (3). From Table 5 we observe that the MHDE performs better than the MELE in terms of both smaller bias and smaller MSE. However, the MHDE tends to give very large bias and MSE when \(\lambda \) is small, such as M2 with \(\lambda =.05\). We also observe from Table 5 that the MSEs of both the MHDE and MELE of \(\beta \) decrease as the true \(\lambda \) value increases, which is very consistent with the observations from Table 1 on their asymptotic variances. This phenomenon is justified by the fact that when \(\lambda \) increases the expected number of observations from the second component g gets larger, and as a result the data contain more and more information about \(\beta \). This is also the reason why the MHDE and the MELE give large biases and inflated MSEs for small \(\lambda \), especially when the sample size is small. For example, for model M2 with \(\lambda =0.05\) and \(m=n=30\) (or 100), we would expect only \(n\lambda =1.5\) (or 5) observations on average from g that contain information about \(\beta \), which thus produces estimates of \(\beta \) with large bias and MSE.

In order to examine the effect of unbalanced sample sizes on the performance of the MELE and MHDE, in Table 6 we present the simulation results for sample sizes \((m,n)=(50, 150)\) and (150, 50). Comparing Table 6 with Table 4, we observe that, as far as \(\lambda \) is concerned, the MELE and MHDE under the unbalanced sample sizes \((m,n)=(50, 150)\) and (150, 50) perform slightly worse, though still comparably, than under the balanced sample sizes \((m,n)=(100, 100)\), especially in terms of MSE. When \(\beta \) is concerned and Table 6 is compared with Table 5, we observe that the performance of the MELE and MHDE under unbalanced sample sizes is significantly worse than that under balanced sample sizes, especially in terms of MSE. These phenomena can possibly be explained by the fact that both asymptotic covariance matrices of the MELE and MHDE, given in Theorems 2 and 6 respectively, have a factor \(\frac{1}{\rho (1-\rho )}\) in front. The quantity \(\rho \) also appears elsewhere in the covariance matrix expressions, though not as influentially as this factor, which makes the level of deterioration different for \(\lambda \) and \(\beta \) when \(m/(m+n)\) deviates from 1/2 (i.e., balanced samples).

Table 6 Bias (MSE) of \({\hat{\lambda }}_{\text {MELE}}\), \({\hat{\beta }}_{\text {MELE}}\), \({\hat{\lambda }}_{\text {MHDE}}\) and \({\hat{\beta }}_{\text {MHDE}}\) for unbalanced sample sizes

To examine the performance of \({\hat{p}}_{\text {MELE}}\) and \({\hat{p}}_{\text {MHDE}}\) given in (8) and (18) respectively, we calculate misclassification rates (MR) based on these estimates and a simple classification rule with .5 as a hard threshold, i.e., an individual with observation y is classified as from G if \({\hat{p}}(y)>.5\) and as from F otherwise, where \({\hat{p}}\) is either \({\hat{p}}_{\text {MELE}}\) or \({\hat{p}}_{\text {MHDE}}\); a sketch of this rule is given below. Considering the fact that the MR is higher for some models, such as M1 where the two components significantly overlap, than for some other models such as M2, we use the optimal misclassification rate (OMR) in Wu and Abedin (2021) as the baseline for comparison. The OMR is the misclassification rate calculated assuming p(y) is completely known (the ideal scenario), with the same value .5 used as the classification threshold. The simulation results are presented in Table 7. For the purpose of comparison, we also report the results of this classification rule based on the nonparametric estimators \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) in Wu and Abedin (2021).
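A minimal R sketch of this rule, with illustrative names of our own, is:

```r
## Classify a new observation via the hard threshold .5 on p(y) in (2);
## theta is an estimate of (lambda, alpha, beta), e.g., the MELE or MHDE.
classify <- function(ynew, theta) {
  lam <- theta[1]; a <- theta[2]; b <- theta[3]
  p <- lam * exp(a + b * ynew) / (1 - lam + lam * exp(a + b * ynew))
  ifelse(p > .5, "G", "F")
}
```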

Table 7 Misclassification rate (MR, in %) of p(y) in (2) with threshold .5 based on \({\hat{\theta }}_{\text {MELE}}\), \({\hat{\theta }}_{\text {MHDE}}\), \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\)

From Table 7 we observe that \({\hat{\lambda }}_L\) performs worst, while the other three methods are quite competitive. In particular, \({\hat{\lambda }}_{\text {MELE}}\) and \({\hat{\lambda }}_{\text {MHDE}}\) perform much better than \({\hat{\lambda }}\) for model M2, while \({\hat{\lambda }}\) performs slightly better for M1, M3 and M5. Even though the assumed relationship (3) is not valid for M5, the classification results based on the MELE and MHDE for M5 are surprisingly reasonable, and the MRs do not deviate too much from the OMR for larger sample sizes.

6.2 Robustness of the estimators

In this subsection, we examine the robustness properties of the proposed MELE and MHDE. In order to compare with the two nonparametric estimators in Wu and Abedin (2021), we check the performance of the estimators of \(\lambda \) only, but not of \(\alpha \) and \(\beta \), which Wu and Abedin (2021) did not consider.

Specifically, we examine the behavior of the four estimators, \({\hat{\lambda }}_{\text {MELE}}\), \({\hat{\lambda }}_{\text {MHDE}}\), \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\), when the data are contaminated by a single outlying observation. The case of several outliers is similar and thus omitted here. Note that the outlying observation can be in either the first sample from f(x) or the second sample from the mixture h(x). We report results for the latter case; similar results are observed for the former. We look at the change in the estimate before and after data contamination. For this purpose, the \(\alpha \)-influence function (\(\alpha \)-IF) given in Beran (1977) is an appropriate measure of the change in the estimate. However, its application in the mixture context is very difficult, as discussed in Karlis and Xekalaki (1998). Therefore, we employ an adaptive version of the \(\alpha \)-IF, as in Lu et al. (2003), which uses the change in the estimate, before and after outlying observations are included, divided by the contamination rate. Taking model M1 as an example, after drawing two independent samples, one from N(0, 1) and the other from \((1-\lambda ) N(0,1)+\lambda N(1,1)\), we replace the last observation generated from the mixture with a single outlier, an integer from the interval \([-30,20]\). Then the \(\alpha \)-IF is given by

$$\begin{aligned} IF(x)=\frac{W((X_{i})_{i=1}^{m},(x,Y_{i})_{i=1}^{n-1})-W((X_{i})_{i=1}^{m},(Y_{i})_{i=1}^{n})}{1/N} \end{aligned}$$

averaged over 100 replications, where the total sample size is \(N=60\) or 200 and W is an estimator of \(\lambda \). In our study, W is \({\hat{\lambda }}_{\text {MELE}}\), \({\hat{\lambda }}_{\text {MHDE}}\), \(\hat{\lambda }\) or \(\hat{\lambda }_{L}\). The results are presented in Figs. 1, 2 and 3. Figure 1 is for M1 with \(\lambda =.15\) and .55, Fig. 2 for M2 with \(\lambda =.25\) and .75, and Fig. 3 for M3 with \(\lambda =.25\) and .75. The results for other models and \(\lambda \) values are very similar and thus omitted to save space.
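A minimal R sketch of this adaptive \(\alpha \)-IF, assuming est is a wrapper around one of the four estimators of \(\lambda \), is:

```r
## Adaptive alpha-influence function of Sect. 6.2: the change in the
## estimate of lambda, divided by the contamination rate 1/N.
alpha_if <- function(x, y, outlier, est) {
  N <- length(x) + length(y)
  ycont <- c(y[-length(y)], outlier)     # replace the last Y by the outlier
  (est(x, ycont) - est(x, y)) * N
}
## e.g., sapply(-30:20, function(o) alpha_if(x, y, o, est)) traces one IF curve
```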

Fig. 1

The \(\alpha \)-IFs of \(\hat{\lambda }_{\text {MELE}}\) (dashed), \(\hat{\lambda }_{\text {MHDE}}\) (solid), \(\hat{\lambda }\) (dotted) and \(\hat{\lambda }_{L}\) (dot-dashed) for model M1: a \(\lambda =.15\) and \(m=n=30\); b \(\lambda =.15\) and \(m=n=100\); c \(\lambda =.55\) and \(m=n=30\); d \(\lambda =.55\) and \(m=n=100\)

Fig. 2

The \(\alpha \)-IFs of \(\hat{\lambda }_{\text {MELE}}\) (dashed), \(\hat{\lambda }_{\text {MHDE}}\) (solid), \(\hat{\lambda }\) (dotted) and \(\hat{\lambda }_{L}\) (dot-dashed) for model M2: a \(\lambda =.25\) and \(m=n=30\); b \(\lambda =.25\) and \(m=n=100\); c \(\lambda =.75\) and \(m=n=30\); d \(\lambda =.75\) and \(m=n=100\)

Fig. 3

The \(\alpha \)-IFs of \(\hat{\lambda }_{\text {MELE}}\) (dashed), \(\hat{\lambda }_{\text {MHDE}}\) (solid), \(\hat{\lambda }\) (dotted) and \(\hat{\lambda }_{L}\) (dot-dashed) for model M3: a \(\lambda =.25\) and \(m=n=30\); b \(\lambda =.25\) and \(m=n=100\); c \(\lambda =.75\) and \(m=n=30\); d \(\lambda =.75\) and \(m=n=100\)

From Figs. 1, 2 and 3 we observe that, regardless of the model and sample size, \({\hat{\lambda }}_{\text {MHDE}}\) always performs the best and \({\hat{\lambda }}_{\text {MELE}}\) the worst. The \(\alpha \)-IF of \({\hat{\lambda }}_{\text {MELE}}\) is generally unbounded, while those of \({\hat{\lambda }}_{\text {MHDE}}\), \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) appear bounded as the outlying observation increases in both directions for the mixture of normals and in the right direction for the mixture of Poissons. This indicates that \({\hat{\lambda }}_{\text {MELE}}\) is generally not resistant to outliers whereas \({\hat{\lambda }}_{\text {MHDE}}\), \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) are. The bad performance of \({\hat{\lambda }}_{\text {MELE}}\) occurs mostly when the outlying observation is larger than 10; when it is less than 10, the performance of \({\hat{\lambda }}_{\text {MELE}}\) is generally reasonable and similar to that of the other three estimators. The \(\alpha \)-IFs of \({\hat{\lambda }}_{\text {MHDE}}\), \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) are almost constant for outliers beyond the range \([-10,5]\) for the mixture of normals and [0, 5] for the mixture of Poissons, though the constants are of different magnitudes, and they fluctuate within the respective ranges. When \({\hat{\lambda }}_{\text {MHDE}}\), \({\hat{\lambda }}\) and \({\hat{\lambda }}_L\) are compared, \({\hat{\lambda }}_L\) behaves the worst, in terms of having the largest \(\alpha \)-IF, for the mixture of normals, and \({\hat{\lambda }}\) behaves the worst for the mixture of Poissons.

6.3 Test of model

In this part, we examine the performance of the K–S tests proposed in Sect. 5 for testing the validity of model (3). We take model (3) with \(r(x)=(x,x^2)^\top \) as the full model and test whether the reduced model (3) with \(r(x)=x\) is the true model. For demonstration purposes, we consider the mixture of normals \(H(x)=(1-\lambda )F(x)+\lambda G(x)\) with \(F\sim N(0,1)\) and \(G\sim N(\mu ,\sigma ^{2})\). Then f(x) and h(x) are related by

$$\begin{aligned} h_\theta (x) \ := \ h(x) \ = \ \left( 1-\lambda +\lambda e^{\alpha +\beta x+\gamma x^{2}}\right) f(x), \end{aligned}$$
(24)

where \(\alpha =-\frac{1}{2}\left( \log \sigma ^{2}+\frac{\mu ^{2}}{\sigma ^{2}}\right) \), \(\beta =\frac{\mu }{\sigma ^{2}}\) and \(\gamma =\frac{1}{2}\left( 1-\frac{1}{\sigma ^{2}}\right) \). Note that (24) is a special case of (3) with \(r(x)=(x,x^2)^\top \). If \(\sigma =1\), then \(\gamma =0\) and thus model (3) holds with \(r(x)=x\). So testing the validity of model (3) with \(r(x)=x\) is equivalent to testing the null hypothesis \(H_{0}:\gamma =0\) under model (24). We consider \(\gamma =0\), \(-.9\) and \(-1.5\), \(\lambda =.35\) and .65, and sample sizes \(m=n=30\) and \(m=n=100\). For simplicity, we fix \(\mu =1\), and as a result \(\sigma =1\), .6 and .5 for \(\gamma =0\), \(-.9\) and \(-1.5\), respectively. For each \(\lambda \), \(\gamma \) and sample size considered, we use 500 replications. Within each replication, we use a total of 1000 bootstrap samples to estimate the distributions and critical values of the test statistics \({\text {KS}}_{\text {MELE}}\) and \({\text {KS}}_{\text {MHDE}}\). We choose the levels of significance \(a=.10\), .05 and .01. The simulation results are presented in Table 8. Note that \(\gamma =0\) means model (3) with \(r(x)=x\) is correct, and thus the corresponding values in Table 8 are the estimated significance levels. When \(\gamma \ne 0\), model (3) with \(r(x)=x\) is not correct, and thus the corresponding values in Table 8 are the estimated powers at that value of \(\gamma \).

From Table 8 we can see that the two test statistics \({\text {KS}}_{\text {MELE}}\) and \({\text {KS}}_{\text {MHDE}}\) are quite competitive in terms of achieved significance level and power. The achieved significance levels are quite close to the nominal levels in most cases, except for \({\text {KS}}_{\text {MHDE}}\) with \(\lambda =0.65\) and \(m=n=30\). The power of \({\text {KS}}_{\text {MHDE}}\) becomes larger as \(\gamma \) moves away from 0, except for the case \(\lambda =0.65\) and \(m=n=30\). Surprisingly, the power of \({\text {KS}}_{\text {MELE}}\) becomes smaller as \(\gamma \) moves away from 0, except for the case \(\lambda =0.65\) and \(m=n=100\). Note that \({\text {KS}}_{\text {MELE}}\) and \({\text {KS}}_{\text {MHDE}}\) use respectively the MELE and MHDE of \(\beta \), which were shown in Sect. 6.1 to have large bias and MSE when sample sizes are small. Consequently, the results for \(m=n=30\) show some anomalies, while the results for \(m=n=100\) are much better and more reliable. As expected, when the significance level a decreases, both the observed significance level and the power decrease. For both \({\text {KS}}_{\text {MELE}}\) and \({\text {KS}}_{\text {MHDE}}\), the powers are generally high at significance levels \(a=0.10\) and 0.05.

Table 8 Estimated significance level and power of \({\text {KS}}_{\text {MELE}}\) and \({\text {KS}}_{\text {MHDE}}\)

7 Real data examples

In this section we consider two real-life data examples to demonstrate the application of our proposed methods.

Example 1: Grain data.

Smith and Vounatsou (1997) analyzed a dataset to determine the proportion \(\lambda \) of cells in a test sample of size \(n=94\) that were exposed to radioactive material; the cells in the control group of size \(m=94\) were not exposed to radioactivity. The number of grains shown in the autoradiograph of a cell measures the amount of radioactive material; however, grains can appear in an autoradiograph either due to radioactive material or due to background fogging. The dataset and further details are available in Smith and Vounatsou (1997), and Wu and Abedin (2021) showed that the dominance constraint \(F\ge G\) is valid for this dataset. We apply the two proposed estimators to this data and compare them with the estimators in Wu and Abedin (2021), Smith and Vounatsou (1997) and Smith et al. (1986). The results are given in Table 9, in which the bootstrap method with 1000 repetitions is used to calculate the confidence intervals for \(\lambda \).
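The exact bootstrap variant used is not detailed here, but a percentile interval of the kind reported in Table 9 can be computed roughly as follows; `estimate_lambda` is a hypothetical stand-in for any of the compared point estimators, taking the control sample x (from F) and the test sample y (from H).

```python
import numpy as np

def bootstrap_ci(estimate_lambda, x, y, level=0.95, n_boot=1000, seed=0):
    """Percentile-bootstrap confidence interval for lambda.

    Resamples the two samples independently with replacement, since each
    is i.i.d. under model (1), and takes empirical quantiles of the
    bootstrap estimates.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        boot[b] = estimate_lambda(xb, yb)
    lo, hi = np.quantile(boot, [(1.0 - level) / 2.0, (1.0 + level) / 2.0])
    return lo, hi
```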

From Table 9 we can see that our two proposed methods produce point estimates of \(\lambda \) similar to most other methods in the literature, except for the two-by-two table and logistic power methods. As for interval estimation, the two proposed estimators produce confidence intervals of width similar to those of \({\hat{\lambda }}\), \({\hat{\lambda }}_L\) and the latent class method, but much smaller than those of the Poisson mixture, two-by-two table and monotone logistic methods. In Table 9, the confidence intervals in parentheses for \({\hat{\lambda }}_{\text {MELE}}\) and \({\hat{\lambda }}_{\text {MHDE}}\) are calculated using the asymptotic covariance matrices given in Theorems 2 and 6, respectively. From the results we see that the bootstrap approximation is quite accurate. So when making inferences about \(\lambda \), the computationally intensive bootstrap is not necessary for \({\hat{\lambda }}_{\text {MELE}}\) and \({\hat{\lambda }}_{\text {MHDE}}\), while it is necessary for all other methods due to the unavailability of results on their asymptotic distributions. This is another notable benefit of using \({\hat{\lambda }}_{\text {MELE}}\) and \({\hat{\lambda }}_{\text {MHDE}}\) over other methods.
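The asymptotics-based intervals just mentioned are Wald-type intervals; a minimal sketch is given below, assuming \(\sqrt{n}({\hat{\lambda }}-\lambda )\) is asymptotically normal with estimated asymptotic variance \(v\). The exact normalization and the extraction of \(v\) from the covariance matrices should follow Theorems 2 and 6; `v_hat` here is a hypothetical input.

```python
import numpy as np
from scipy.stats import norm

def wald_ci(lam_hat, v_hat, n, level=0.95):
    """Wald interval lam_hat +/- z * sqrt(v_hat / n), where v_hat estimates
    the asymptotic variance of sqrt(n) * (lam_hat - lambda)."""
    z = norm.ppf(0.5 + level / 2.0)
    half = z * np.sqrt(v_hat / n)
    return lam_hat - half, lam_hat + half
```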

Table 9 Estimation of \(\lambda \) for the grain data

Example 2: Malaria data.

We also examine a clinical malaria dataset first described by Kitua et al. (1996) and further studied by Vounatsou et al. (1998). Parasite densities in children with fever can be modelled as a two-component mixture, where one component represents the parasite densities of children without clinical malaria (F), the other represents those of children with clinical malaria (G), and \(\lambda \) is the proportion of children whose fever is attributable to malaria. In this dataset, levels of parasitaemia were observed in children between 6 and 9 months of age in a village in Tanzania, for both the wet season (\(m=94\), \(n=251\)) and the dry season (\(m=122\), \(n=245\)). Wu and Abedin (2021) showed that the dominance constraint \(F\ge G\) is valid for both seasons.

We apply the proposed \({\hat{\lambda }}_{\text {MELE}}\) to both seasons and compare the results with \({\hat{\lambda }}_L\) in Wu and Abedin (2021) and the Bayesian approach in Vounatsou et al. (1998). Note that this is discretized data (parasitaemia levels were grouped into 10 categories), so kernel smoothing, and hence \({\hat{\lambda }}_{\text {MHDE}}\) and \({\hat{\lambda }}\), are not applicable to this data. Table 10 presents the estimation results, where the numbers in parentheses are estimated standard errors based on 500 bootstrap samples. From this table we observe that \({\hat{\lambda }}_{\text {MELE}}\), \({\hat{\lambda }}_L\) and the Bayesian approach produce similar estimates.

Table 10 Estimation of \(\lambda \) for the malaria data

8 Concluding remarks

In this work, we proposed a two-component semiparametric mixture model (4) to accommodate the stochastic dominance constraint. We not only proposed and studied two estimators, the MELE and the MHDE, but also proposed K–S statistics to test the validity of model (4). For both estimators, we proved consistency and asymptotic normality. The finite-sample performance of the proposed estimators and model tests, in terms of both efficiency and robustness, was examined through simulation studies and real data analysis.

The two-component semiparametric mixture model (4) has several advantages over the fully nonparametric mixture model (1). First, the semiparametric model automatically resolves the identifiability issue that the fully nonparametric model generally has. Second, the semiparametric model easily accommodates the stochastic dominance constraint through an equivalent and natural positivity constraint on the parameter \(\beta \), whereas in the original nonparametric model the dominance constraint is imposed on two functions, which is much harder to handle and assess directly. Third, the semiparametric model is much easier to interpret: the value of \(\beta \) quantifies the difference between the two components, with higher values of \(\beta \) indicating a larger difference. Fourth, for the semiparametric model we proposed two estimators with proved asymptotic normality and derived asymptotic covariance matrices; such asymptotic results are lacking in the literature for the nonparametric model. As a result, we can easily conduct statistical inference for the mixing proportion \(\lambda \) (and \(\beta \)) based on these asymptotics, while for the nonparametric model a computationally intensive method such as the bootstrap is necessary in order to make inferences about \(\lambda \), due to the unavailability of asymptotic properties of the available estimators. Of course, using the semiparametric model instead of the fully nonparametric model sacrifices some flexibility, but we believe this loss is not significant given that many commonly used population families satisfy the exponential tilt relationship (3), and it is worthwhile considering the aforementioned advantages.

When the two proposed estimators for the semiparametric mixture model (4) are compared, our numerical studies suggest the following preferences. When the estimation of the mixing proportion \(\lambda \) or classification is of interest, the MELE and MHDE perform equally well in terms of bias and MSE or misclassification rate, and neither dominates the other. When the estimation accuracy of \(\beta \) is the focus, the MHDE is highly preferred over the MELE when \(\lambda \) is moderate or large, while the MELE is preferred when \(\lambda \) is small. When the presence of outliers is a concern, the MHDE is much more robust and thus highly preferred over the MELE.

Since the proposed model and methods were motivated by the example of case-control genetic studies based on test statistic values (one dimension), this paper focused only on the univariate case. Nevertheless, the proposed MELE and MHDE methods can be easily extended to the multivariate case. On the other hand, the kernel estimation used in the MHDE method may not handle high dimensionality well.

For future work, we may consider using minimum profile Hellinger distance estimation (MPHDE) for the semiparametric mixture model (4). Wu and Karunamuni (2015) introduced the profile Hellinger distance particularly for semiparametric models, investigated the MPHDE for semiparametric models of general form, and proved that the MPHDE is as robust as the MHDE and achieves full efficiency at the true model. Xiang et al. (2014) used the MPHDE for a two-component semiparametric mixture model in which one component is known up to some unknown parameters while the other is unspecified. Wu et al. (2017) and Wu and Zhou (2018) applied the MPHDE to semiparametric location-shifted mixture models. Another direction for the estimation problem of model (4) could be Bayesian methods; Vounatsou et al. (1998) applied a Gibbs sampling approach to estimate the unknown mixing proportion in a two-component mixture model with discretized data.