Abstract

We extend mean empirical likelihood inference to the response mean with data missing at random. Empirical likelihood ratio confidence regions perform poorly when the response is missing at random, especially when the covariate is high-dimensional and the sample size is small. Hence, we develop three bias-corrected mean empirical likelihood approaches to obtain efficient inference for the response mean. For each of the three bias-corrected estimating equations, we obtain a new set of estimating terms by forming a pairwise-mean dataset. The method increases the size of the sample used for estimation and reduces the impact of the curse of dimensionality. Consistency and asymptotic normality of the maximum mean empirical likelihood estimators are established. The finite-sample performance of the proposed estimators is examined through simulation, and an application to the Boston Housing dataset is presented.

1. Introduction

Missing data problems arise widely in the social sciences, political science, medical research, and many other fields. Missing patterns include single-variable nonresponse, multivariate nonresponse, monotone nonresponse, and general nonresponse, and there are three types of missing mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Under MAR, missingness depends only on the observed data. Let Y be the value of a response that is subject to nonresponse and let X be a d-dimensional vector of auxiliary variables that are fully observed. Let $Y_{\mathrm{obs}}$ denote the part of Y that is observed and $Y_{\mathrm{mis}}$ the part of Y that is missing. We are interested in the mean of Y, i.e., $\theta = E(Y)$. Let M be the indicator for Y, where M = 1 if Y is observed and M = 0 otherwise. Under the MAR mechanism, $P(M \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(M \mid Y_{\mathrm{obs}})$, which means that the missingness of Y is related only to $Y_{\mathrm{obs}}$ and has nothing to do with $Y_{\mathrm{mis}}$.

Empirical likelihood (EL) [1, 2], which is widely used for nonparametric and semiparametric statistical inference, is a competitive and powerful method for constructing confidence intervals (CIs). The empirical log-likelihood ratio (ELR) usually has an asymptotic chi-squared distribution, and CIs based on EL have many excellent properties, as shown by Owen [3]. Owing to these properties, Wang and Rao [4], Liang et al. [5], and Stute et al. [6] extended EL to missing data by imputing the missing values with a kernel regression function of the observed data. For example, bias-corrected EL methods were constructed via inverse propensity weighting (IPW), mean imputation (MI), and augmented inverse propensity weighting (AIPW). Guan et al. [7] proposed an efficient estimator of the response mean that uses estimated response probabilities and is based on semiparametric maximum likelihood inference. Xie and Zhang [8] constructed empirical likelihood confidence regions for regression parameters, without and with constraints, using unconstrained and constrained empirical likelihood ratio statistics. Wang and Deng [9] proposed a dimension-reduced empirical likelihood, a two-stage estimation procedure that applies the sufficient dimension reduction (SDR) technique in the kernel estimation of the propensity and of the conditional mean response function. The method avoids the well-known curse of dimensionality in the multivariate situation, but it works only with a large sample. Many studies have sought to improve the accuracy of empirical likelihood ratio confidence regions for small samples and multidimensional data. Hall and Scala [10], Diciccio et al. [11], and Tsao [12] found that empirical likelihood ratio confidence intervals can have poor accuracy, especially in small-sample and multidimensional situations. For confidence estimation, Diciccio et al. [11], Chen et al. [13], and Abraham and Wu [14] proposed the Bartlett-corrected empirical likelihood (BEL), the adjusted empirical likelihood, and the extended empirical likelihood (EEL), respectively. However, these methods require the Bartlett correction constant, which is difficult to compute. Liang et al. [15] therefore proposed a new method, the mean empirical likelihood, which constructs an empirical likelihood function from a set of pseudodata and is easy to compute. This motivates a new empirical likelihood inference method for the response mean with data missing at random when the sample is small and the data are multidimensional.

The mean empirical likelihood (MEL) is built on the pairwise-mean dataset, that is, the set of means of pairs of observations. It constructs an empirical likelihood function from this set of pseudodata and is easy to compute. The large-sample properties of MEL are presented by Liang et al. [15]. Their simulation studies indicate that confidence regions based on MEL are much more accurate than those based on other empirical likelihood methods. Larger coverage probability and shorter average interval length are further advantages of MEL. Nonetheless, there are two main problems in inference for the response mean with data missing at random. First, the sample is often too small for the asymptotic normal approximation to be accurate. Second, the covariate may be high-dimensional, in which case the EL method is ineffective. MEL addresses both problems by increasing the size of the sample used for estimation.

In this paper, we propose a new empirical likelihood method for efficient inference on the response mean, especially for its confidence region, by applying mean empirical likelihood inference to small-sample and multidimensional situations. Specifically, we first apply the mean empirical likelihood (MEL) method (see [15]) to the response mean with data missing at random. We construct three bias-corrected nonparametric MEL functions from the pairwise-mean dataset based on the IPW, MI, and AIPW approaches. We show that the resulting three MEL methods yield asymptotically equivalent estimators that achieve the desirable asymptotic unbiasedness and asymptotic normality. After an adjustment factor is applied to the mean empirical likelihood ratio (MELR) function based on IPW or MI, the three proposed MELR functions are shown to be asymptotically standard chi-squared with (d − 1) degrees of freedom, where d is the rank of the covariance matrix of the pairwise-mean dataset. Simulation results show that the proposed method not only has more accurate coverage probabilities and smaller average lengths and standard deviations but also yields efficient point estimators even when the sample size is only 40.

This paper is organized as follows. Section 2 presents three types of nonparametric MELRs. Then, we introduce our main idea and establish a number of asymptotic properties. The simulation studies are given in Section 3. Section 4 provides a real data analysis and the paper concludes with a discussion in Section 5.

2. Mean Empirical Likelihood Inference for Response Mean with Data Missing at Random

2.1. Some Methods for Missing Response Problem

Let $(X_i, Y_i, M_i)$, $i = 1, \ldots, n$, be independent and identically distributed realizations from $(X, Y, M)$, where $X$ is a fully observed multivariate covariate, $Y$ is a univariate response having missing values, and $M$ is a binary response indicator that equals 1 if and only if $Y$ is observed.

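In model (1), we make the MAR assumption about the full data, and the missing pattern is general nonresponse. To adjust the weights, we introduce the propensity function $\pi(x) = P(M = 1 \mid X = x)$ and the conditional mean response function $m(x) = E(Y \mid X = x)$. The unspecified functions $\pi(\cdot)$ and $m(\cdot)$ can be estimated by the kernel regression estimators as follows:

$$\hat{\pi}(x) = \frac{\sum_{i=1}^{n} M_i K_h(x - X_i)}{\sum_{i=1}^{n} K_h(x - X_i)}, \qquad \hat{m}(x) = \frac{\sum_{i=1}^{n} M_i Y_i K_h(x - X_i)}{\sum_{i=1}^{n} M_i K_h(x - X_i)}, \tag{2}$$

where $K_h(\cdot) = K(\cdot/h)$, $K(\cdot)$ is a symmetric kernel function, and h is a bandwidth.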
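The kernel regression step can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a Gaussian product kernel and a single common bandwidth `h`, and the names `kernel_weights` and `kernel_estimates` are hypothetical. The propensity estimate is the kernel-weighted share of observed responses, and the conditional mean estimate is the kernel-weighted average of the observed responses.

```python
import numpy as np

def kernel_weights(x0, X, h):
    # Gaussian product kernel evaluated at (x0 - X_i) / h for every row X_i.
    # The kernel choice is an assumption; the text only requires symmetry.
    u = (X - x0) / h
    return np.exp(-0.5 * np.sum(u ** 2, axis=1))

def kernel_estimates(X, Y, M, h):
    """Return (pi_hat, m_hat), both evaluated at every sample point X_i."""
    M = np.asarray(M, dtype=float)
    n = len(M)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    # Missing responses (possibly stored as NaN) never enter the sums.
    Y_obs = np.where(M == 1, np.asarray(Y, dtype=float), 0.0)
    pi_hat = np.empty(n)
    m_hat = np.empty(n)
    for i in range(n):
        w = kernel_weights(X[i], X, h)
        pi_hat[i] = np.sum(w * M) / np.sum(w)          # estimated propensity
        m_hat[i] = np.sum(w * M * Y_obs) / np.sum(w * M)  # estimated conditional mean
    return pi_hat, m_hat

# Illustrative data only; the regression model and missing rate are made up.
rng = np.random.default_rng(0)
n = 40
X = rng.normal(1.0, 1.0, size=n)
M = rng.binomial(1, 0.8, size=n)
Y = np.where(M == 1, 1.0 + X + rng.normal(size=n), np.nan)
pi_hat, m_hat = kernel_estimates(X, Y, M, h=0.5)
```

Any normalizing constant of the kernel cancels in both ratios, so only the shape of the kernel and the bandwidth matter here.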

Next, we introduce three bias-corrected estimating equations that use the nonparametric kernel estimators of the propensity and of the conditional mean response in equation (2) to handle missing data.
(i) Nonparametric inverse propensity weighting (IPW) approach: this method assigns each observed response a weight proportional to the inverse of the estimated propensity in (2).
(ii) Nonparametric mean imputation (MI) approach: for each unit with a missing response, we use the conditional mean estimator in equation (2) to estimate the response, which yields a nonparametric imputation estimating equation for the response mean.
(iii) Nonparametric augmented inverse propensity weighting (AIPW) approach: we combine the nonparametric IPW and MI approaches, which leads to the nonparametric AIPW estimating equation for the response mean.
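As a concrete illustration of the three approaches, the sketch below computes per-unit terms whose sample means give IPW, MI, and AIPW point estimates of the response mean. The exact estimating equations are not reproduced in this text, so the forms used here are the common textbook ones and should be read as assumptions: `M*Y/pi` for IPW, `M*Y + (1-M)*m` for MI, and `M*Y/pi + (1 - M/pi)*m` for AIPW. The function name `bias_corrected_terms` is hypothetical, and `pi_hat` and `m_hat` are taken as given inputs.

```python
import numpy as np

def bias_corrected_terms(Y, M, pi_hat, m_hat):
    """Per-unit IPW, MI, and AIPW terms (assumed textbook forms)."""
    M = np.asarray(M, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    # Zero out missing responses so NaN placeholders cannot propagate.
    Y_obs = np.where(M == 1, np.asarray(Y, dtype=float), 0.0)
    return {
        "IPW": M * Y_obs / pi_hat,
        "MI": M * Y_obs + (1.0 - M) * m_hat,
        "AIPW": M * Y_obs / pi_hat + (1.0 - M / pi_hat) * m_hat,
    }

def response_mean_estimates(Y, M, pi_hat, m_hat):
    # Each point estimate is the sample mean of the corresponding terms.
    terms = bias_corrected_terms(Y, M, pi_hat, m_hat)
    return {name: float(values.mean()) for name, values in terms.items()}

# Tiny hand-checkable example: the second response is missing.
Y = [2.0, np.nan, 4.0]
M = [1, 0, 1]
pi_hat = [0.5, 0.5, 1.0]
m_hat = [2.0, 3.0, 4.0]
estimates = response_mean_estimates(Y, M, pi_hat, m_hat)
```

In this toy example the MI and AIPW estimates coincide because the imputed values equal the observed responses wherever both are available.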

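In general, the empirical likelihood method is used to estimate the response mean. Each term of one of the estimating equations above is allocated a probability weight. The profile empirical log-likelihood ratio function for the response mean with data missing at random is then obtained by maximizing the empirical likelihood over weights that are nonnegative, sum to one, and make the weighted estimating equation hold.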

However, when the sample size is less than 100 or the data are high-dimensional, the finite-sample performance of this method may be poor because the asymptotic approximation is imprecise. Hence, the efficiency of the EL method for the response mean with data missing at random is low. In particular, the estimated coverage probability and average length are far from their theoretical values. The mean empirical likelihood can handle small-sample and multidimensional situations for the response mean problem with data missing at random.

2.2. Mean Empirical Likelihood for Response Mean with Data Missing at Random

In this paper, we apply the MEL to estimate the response mean with data missing at random in small-sample and high-dimensional covariate situations.

Here we follow the notation of the previous section, and for simplicity, we denote , , and further denote the pairwise-mean dataset as follows: which can also be written as with . Let denote the probability weight allocated to for . Based on this pairwise-mean dataset, the empirical log-likelihood ratio for is defined as which is named the mean empirical likelihood ratio. It follows that where satisfying
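To make the construction concrete, the sketch below builds a pairwise-mean dataset from scalar estimating terms and evaluates the empirical log-likelihood ratio on it through the usual Lagrange multiplier equation. Several points are assumptions rather than details taken from the text: the pairwise set is taken to include each term paired with itself (so n terms give n(n+1)/2 means), the multiplier is found by bisection, and no calibration factor is applied to the resulting ratio. The function names are hypothetical.

```python
import numpy as np

def pairwise_means(z):
    """All means (z_i + z_j) / 2 with i <= j (assumed index set)."""
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(len(z))  # upper triangle, diagonal included
    return 0.5 * (z[i] + z[j])

def el_log_ratio(w):
    """Empirical log-likelihood ratio, 2 * sum(log(1 + lam * w_k)), for E[w] = 0."""
    w = np.asarray(w, dtype=float)
    if w.min() >= 0.0 or w.max() <= 0.0:
        return np.inf  # zero is not strictly inside the convex hull
    # Feasible multipliers keep 1 + lam * w_k > 0 for every k.
    lo, hi = -1.0 / w.max(), -1.0 / w.min()
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps
    # g(lam) = sum(w / (1 + lam * w)) is strictly decreasing on (lo, hi).
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.sum(w / (1.0 + mid * w)) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * float(np.sum(np.log1p(lam * w)))

# Illustrative values standing in for per-unit estimating terms.
z = np.array([0.2, 1.5, -0.3, 0.9, 2.1])
theta = 1.0
ratio = el_log_ratio(pairwise_means(z - theta))
```

Because every term appears equally often in the pairwise set, the mean of the pairwise means equals the mean of the original terms, so the ratio is zero at the sample mean and grows as the candidate value moves away from it.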

Let be the unknown true parameter value. Then, the mean empirical log-likelihood ratio is given by

Now, we have the following main theorem.

Theorem 1. Under the conditions listed in the Appendix, as $n \to \infty$, we have where denotes IPW, MI, and AIPW, respectively, and $\xrightarrow{d}$ denotes convergence in distribution.

Theorem 2. Under the conditions listed in the Appendix, as $n \to \infty$, we have where is the $\chi^2$-distributed random variable with one degree of freedom, , .

Theorem 3. Under the conditions that exists and , we have , in dist.

Proof. See Appendix.

Following Theorem 1, a confidence region for the parameter with asymptotic coverage probability can be defined as

3. Simulation Studies

We present two simulation studies in this section to assess the finite-sample performance of the MEL methods of Section 2. We compare MEL with the original empirical likelihood (OEL) for the response mean with data missing at random. For the one-dimensional covariate model, the estimating equation is equal to the estimating equation . For the multivariate covariate model, we compare the performance for different , correlation structures, and sample sizes.

3.1. Simulation 1: Single Covariate Model

Suppose that is the estimating equation for . We aim to compare the confidence intervals derived from MEL and OEL, for a given sample size n. We consider different scenarios by generating observations from , respectively.

We use four indicators to evaluate the performance of each method. The bias is the difference between the average of the estimates and the true value, and it measures the accuracy of the point estimator. The coverage probability is the proportion of replications in which the confidence interval contains the true value, and it measures the reliability of the interval estimator. The average length is the mean length of the confidence intervals over the replications, and the standard deviation measures the dispersion of the point estimates; together they describe the precision of the interval and point estimators.
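The four indicators can be computed from the replication output as in the following sketch. It assumes that each replication returns one point estimate and one confidence interval; the function name `summarize_replications` and the toy numbers are illustrative only.

```python
import numpy as np

def summarize_replications(estimates, lower, upper, true_value):
    """Bias, standard deviation, coverage probability, and average CI length."""
    estimates = np.asarray(estimates, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        "bias": float(estimates.mean() - true_value),
        "sd": float(estimates.std(ddof=1)),   # dispersion of point estimates
        "cp": float(covered.mean()),          # share of intervals containing the truth
        "al": float((upper - lower).mean()),  # mean interval length
    }

# Three toy replications with true value 2.
summary = summarize_replications(
    estimates=[1.0, 2.0, 3.0],
    lower=[0.0, 1.5, 2.5],
    upper=[2.0, 2.5, 3.5],
    true_value=2.0,
)
```

In the toy example two of the three intervals contain the true value, so the coverage proportion is 2/3.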

Based on 5000 replicates, the bias, coverage proportions, average length, and standard deviation were calculated. The simulation results are presented in Table 1.

Table 1 shows the confidence region characteristics of different empirical likelihood methods when the covariate follows the one-dimensional normal distribution with mean 1 and variance 1, denoted as . with for . Then, we generate from the Bernoulli distribution with probability function .

From Tables 1 and 2, we conclude the following:
(1) Comparison for different sample sizes: as the sample size increases, all coverage probabilities estimated by OEL or MEL increase for small sample sizes n, and all coverage probabilities estimated by OEL or MEL decrease. However, the bias of the AIPW and MI methods is reduced significantly, whereas the bias of the complete-case and IPW methods changes little.
(2) Comparison of different estimation methods: MEL is much better than OEL. All coverage probabilities estimated by MEL are closer to the nominal levels than those of the OEL method. All standard deviations by MEL are much smaller than those of the OEL method. All average lengths by MEL are much smaller than those of the OEL method.
(3) Comparison for different estimating equations: the point estimator based on MI is superior to those based on the other estimating equations. In particular, the coverage probabilities estimated by the MI method perform very well in all simulations. In short, in terms of accuracy, the MI method is superior to the AIPW method, and the AIPW method is superior to IPW.
(4) Comparison for different error situations: the average length estimated by OEL and MEL with is much longer and more stable than the method with . But the standard deviation estimated by OEL and MEL with is much bigger than the method with . Hence, when the errors are heteroscedastic with , the MEL method for the response mean also performs well.

In short, IPW, MI, and AIPW based on MEL outperform the same methods based on OEL, and AIPW based on MEL achieves the most accurate results.

3.2. Simulation 2: Multivariate Covariate Model

In simulation 2, the covariates follow a 6-dimensional normal distribution with mean 0 and covariance matrix , denoted by , where is a symmetric matrix with and for , with for . Two kinds of are considered:
(i) The first generates homoscedastic errors, where and are independent and
(ii) The second generates heteroscedastic errors with

We generate from the Bernoulli distribution with probability and consider three choices of :

The coefficients in the propensity models are chosen so that the unconditional rates of missing data are between 20% and 40%. When are linear and is logistic linear under M1–M3, so and under M1–M3 are one-dimensional.

In addition, we also conduct two simulations with four correlation structures:
(i) Independent covariates, where the correlation coefficient is 0
(ii) Slightly correlated covariates, where the correlation coefficient is 0.2
(iii) Moderately correlated covariates, where the correlation coefficient is 0.5
(iv) Strongly correlated covariates, where the correlation coefficient is 0.8
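The four correlation structures can be generated as in the sketch below. The exact covariance matrix of the study is not reproduced in this text, so the sketch assumes unit variances and a single common correlation coefficient `rho` between every pair of the six covariates; the function name `equicorrelated_normal` is hypothetical.

```python
import numpy as np

def equicorrelated_normal(n, d, rho, rng):
    """Draw n rows from N_d(0, Sigma) with unit variances and common correlation rho."""
    sigma = np.full((d, d), float(rho))
    np.fill_diagonal(sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    return X, sigma

rng = np.random.default_rng(2024)
# One covariate sample of size 40 for each correlation level listed above.
samples = {rho: equicorrelated_normal(40, 6, rho, rng)[0] for rho in (0.0, 0.2, 0.5, 0.8)}
```

With a common correlation between 0 and 1 the matrix is positive definite, so the draw is well defined for all four settings.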

All results are based on 1000 replications and the sample sizes are n = 20 and n = 40. Tables 3 and 4 show the biases and standard deviations (SDs) of the point estimators, the average lengths (ALs), and the coverage probabilities (CPs) of confidence intervals (CIs) at the nominal level of 95%. The CIs based on the mean empirical likelihood ratio are obtained by in equation (15). The CIs based on and are obtained by the normal approximation , with standard error obtained by the square roots of bootstrap variance estimators based on n/4 bootstrap replications, where n is the size of the sample.
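The normal-approximation intervals with bootstrap standard errors can be sketched as follows. This is a generic illustration under stated assumptions: whole observations are resampled with replacement, the estimator is passed in as a function, and the sample mean stands in for the estimators of the study. The number of bootstrap replications follows the n/4 rule quoted above, and the name `bootstrap_normal_ci` is hypothetical.

```python
import numpy as np

def bootstrap_normal_ci(data, estimator, n_boot, rng, z=1.959964):
    """Point estimate plus/minus z times the bootstrap standard error (95% by default)."""
    data = np.asarray(data)
    n = len(data)
    point = estimator(data)
    # Resample rows with replacement and re-estimate each time.
    boots = np.array([
        estimator(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)
    ])
    se = boots.std(ddof=1)  # square root of the bootstrap variance estimator
    return point - z * se, point + z * se, se

rng = np.random.default_rng(1)
data = rng.normal(loc=1.0, scale=1.0, size=40)
lower, upper, se = bootstrap_normal_ci(data, np.mean, n_boot=len(data) // 4, rng=rng)
```

With n = 40 this rule gives only 10 bootstrap replications, so the standard error itself is quite noisy; a larger `n_boot` would stabilize it at extra computational cost.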

From Tables 3 and 4, we can see the following.
(1) Comparison for different sample sizes: when the sample sizes increase, estimators , , and have small biases, the coverage probability is big, and the AL and the SD are small. So, the finite sample performance of the mean empirical likelihood is as that of the original empirical likelihood.
(2) Comparison for different estimation methods: estimators , , and based on MEL have similar point estimation compared with those based on OEL. In addition, estimators , , and based on MEL have bigger CP and smaller SD than those based on OEL. So, the ALs based on MEL are comparable, close to the method based on
(3) Comparison for different estimating equations: as to the bias, the estimator is closer than and . The CP of the estimator performs better than and in many cases. So, the AL of the estimator performs better than and in many cases.
(4) Comparison for different error situations: the variation range of the bias estimated by OEL and MEL with is much smaller and more stable than the method with . But the standard deviation estimated by OEL and MEL with is close to the method with
(5) Comparison for different correlation structures: when is large, the biases based on MEL are not stable but are compared to . But the CPs and the ALs still perform well, and the ALs are stable. When and , the MEL method performs better than the OEL method.

In short, IPW, MI, and AIPW based on MEL outperform the same methods based on OEL, and AIPW based on MEL achieves the most accurate results.

4. Real Data Analysis

In this section, we apply the proposed MEL method for missing data to the Boston Housing data. The data are taken from the UCI Irvine Machine Learning Repository. The distribution of the per capita crime rate by town (CRIM) is unknown. Inference on the CRIM distribution based on the full data has also been analyzed by Liang et al. [15]. The dataset consists of 506 observations and 14 variates, with CRIM being the response variate and the other 11 being the covariates. We are interested in the mean of CRIM and its confidence region. We set the unconditional rates of missing data between 5% and 15%; all real data analysis results are based on 1000 replications, and the sample sizes are n = 20, n = 40, and n = 80. Tables 5–7 show the biases and standard deviations (SDs) of the point estimators, the average lengths (ALs), and the coverage probabilities (CPs) of confidence intervals (CIs) at the nominal level of 95%.

In Tables 5–7, it is observed that as the sample size increases, the estimators , , and based on the MEL method have larger coverage probabilities and smaller ALs and SDs. In particular, the confidence regions of the response mean are close to the confidence region based on the full data. So, the finite-sample performance of the mean empirical likelihood is better than that of the original empirical likelihood.

5. Conclusions

This paper proposed a new empirical likelihood method, based on the mean empirical likelihood, for inference on the response mean with a small sample and multidimensional covariates. Its large-sample properties were proved for different bias-corrected estimating equations. The new method outperforms the original empirical likelihood methods.

Building on IPW, MI, and AIPW, the new method gains its advantages from a pairwise-mean dataset. Compared with the original EL method, it obtains a similar point estimator and better CPs, ALs, and SDs. Compared with the other EL inference methods, this method has two characteristics: one is that the size of the sample is much larger than the dimension, and the other is that the number of samples is large. In addition, AIPW based on MEL achieves the most accurate results. Theoretically, EL-based inference on the response mean deteriorates as the dimension of the covariate increases. The method of Liang et al. [15] can solve this problem, and it also performs well when the sample is small. Therefore, we recommend that the three bias-corrected models be carefully constructed and used to infer the confidence region of the response mean with data missing at random. We will further study the MEL method for categorical data with covariates missing at random.

Appendix

(C1) The propensity , the X-density function , and all have continuous and bounded partial derivatives with respect to X up to order with and , where is the order of the kernel K, and the propensity is bounded away from 0 to 1.
(C2) is finite.
(C3) As , .
(C4) exists, where .
(C5) When , then .

Lemma A.1. Under conditions (C1)–(C4), (i) (ii)

Proof. (i) According to Lemma A.1 (ii) in Liang et al. [15], noticing and following the assumption , we can obtain ; therefore, (i) holds.
(ii) According to Lemma A.1 (iv) in Liang et al. [15], we notice that Therefore, .

Proof of Theorem 1. Similar to the proof of Theorem 1 in Wang and Deng [9], we can prove Theorem 1.

Proof of Theorem 2. For , , using the arguments in Wang and Rao [4], we have Using Theorem 2 and Lemma (i), we have . Therefore, we have where , , and .

Proof of Theorem 3. According to Lemma 11.1 in Owen [3], with probability tending to 1, 0 is inside the convex hull of , . By using the Lagrange multiplier, we have where satisfies the equation . Then, applying Lemma A.1 in Liang et al. [15], we have On the other hand, with the following equation [5, 16]: we get Using Taylor's expansion, we can get Substituting into (A.10), we obtain

Data Availability

The Boston Housing data used in the real data analysis come from an open database, the UCI Irvine Machine Learning Repository (http://archive.ics.UCI.edu/ml).

Conflicts of Interest

The authors declare that there are no conflicts of interest in the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (grant no. 71963008) and the support program of the Guangxi China Science Foundation (grant no. 2018GXNSFAA294131).