Abstract
This paper considers estimation and inference for a class of varying coefficient models in which some of the responses and some of the covariates are missing at random and outliers are present. The paper proposes two general estimators—and a computationally attractive and asymptotically equivalent one-step version of them—that combine inverse probability weighting and robust local linear estimation. The paper also considers inference for the unknown infinite-dimensional parameter and proposes two Wald statistics that are shown to have power under a sequence of local Pitman drifts and are consistent as the drifts diverge. The results of the paper are illustrated with three examples: robust local generalized estimating equations, robust local quasi-likelihood and robust local nonlinear least squares estimation. A simulation study shows that the proposed estimators and test statistics have competitive finite sample properties, whereas two empirical examples illustrate the applicability of the proposed estimation and testing methods.
Similar content being viewed by others
1 Introduction
This paper considers estimation and inference for a general class of varying coefficient models where some of the responses and possibly some of the covariates are not always observed and outliers can be present. In the absence of outliers and when all the variables are observable, the estimation of the unknown infinite-dimensional parameter for specific examples of the model considered here can be carried out using a number of alternative methods, such as spline approximation [see, for example, Eubank et al. (2004) and Hastie and Tibshirani (1993) for varying coefficient models, and Verhasselt (2014) for generalized varying coefficient models], series estimation [see, for example, Huang et al. (2002)] or local smoothing [see, for example, Fan et al. (1998) for local maximum likelihood, Fan et al. (1995) for local quasi-likelihood and Ruppert and Wand (1994) for local least squares among many others]. However, the influence function of the resulting estimators is unbounded, and thus, outliers or large deviations from the response variable to its conditional mean can have potentially a very negative effect on them. Furthermore, ignoring the fact that some of the data is not always observable or simply excluding the missing observations, the so-called complete case analysis, may also negatively affect the estimators and cause a great loss of information. Clearly, these potential problems might also have negative effects on the quality of any inference about the unknown infinite-dimensional parameter, since these inferences are typically based on test statistics that rely on these estimators.
This paper uses local smoothing, which is flexible enough to accommodate the estimation of the unknown infinite-dimensional parameter of the general model considered here and it makes the computation of the asymptotic covariance matrices required for inference relatively easy—especially when some of the observations in the sample are missing and/or are characterized by the presence of outliers. The paper assumes that some of the responses and possibly some of the covariates are missing at random (MAR) and proposes estimators that combine the ideas of smoothing and robust estimation with the inverse probability weighting (IPW) method (Horvitz and Thompson 1952). Robust estimation in nonparametric and semiparametric models has been considered by Fan et al. (1994), Boente et al. (2006), Bianco et al. (2011) and Hu and Cui (2010) among many others. Nonparametric and semiparametric estimation with MAR observations has been considered by Cheng (1994), Liang et al. (2004), Chen et al. (2006), Liang (2008) and Bianco et al. (2019) among others. Robust nonparametric and semiparametric estimation with missing data has been considered recently by Boente et al. (2009) and Bravo (2015). The estimators of this paper use a real-valued function that downweights high leverage covariates and either a robustified loss function (M estimator) or a robustified set of estimating equations (Z estimators) that yield bounded influence functions. The unknown infinite-dimensional parameter is estimated using the local linear estimator (Fan and Gijbels 1996), whereas the probability of missing—the so-called selection probability—is estimated using either a robust parametric or a robust nonparametric estimator.
For inference, the paper focuses on a robust Wald statistic. A similar Wald statistic was used by Bianco and Spano (2019) in the context of parametric nonlinear regression models with MAR responses and by Bianco et al. (2006) for the finite-dimensional parameter in a partially linear model with all the variables observable. The Wald statistics considered here are different from those considered by Bianco and Spano (2019) and Bianco et al. (2006), because they use IPW-based robust local estimators. Furthermore, one of the proposed Wald statistics is characterized by a nonstandard asymptotic distribution. Alternatively, a robust “distance” type of statistic, which is in the same spirit of the robust deviance statistic proposed by Cantoni and Ronchetti (2001) in the context of parametric quasi-likelihood estimation, could be used for inference. However, as opposed to the robust Wald statistics considered here, this statistic is not asymptotic distribution-free (that is it depends on nuisance parameters) under the null hypothesis [see Remark 4 in Sect. 3.2], which makes it less attractive for inferential purposes.
The new results of the paper are the following: first, it establishes the asymptotic normality of the proposed robust local M and Z estimators and it shows that the presence of MAR observations affects their asymptotic variance but not the asymptotic bias. This result is consistent with that obtained by Chen et al. (2006) and Bravo and Jacho-Chavez (2016) for semiparametric estimators in the presence of MAR responses and with some general results obtained by Robins and Rotnitzky (1995) (albeit for statistical models with finite-dimensional parameters). Second, it considers asymptotically equivalent one-step version of the proposed estimators that are computationally attractive and seem to perform well in the simulations. These results are rather general as can be applied to both single and multiple equation models (i.e., models for longitudinal or repeated outcomes data) and can be used to robustify a number of estimators including the local quasi-likelihood estimator for generalized linear models of Fan et al. (1995) and Chen et al. (2006), the local maximum likelihood estimator for varying coefficient models of Cai et al. (2000) and the local nonlinear least squares estimator for varying coefficient models of Kurum et al. (2013). They can also be applied to construct estimators for marginal parameters of the model, such as the marginal mean of the response, see Remark 3 in Sect. 3.1 for an example. Third, it considers two Wald statistics that can be used to test linear hypotheses about the infinite-dimensional parameter. Both statistics are shown to have power against a sequence of local alternatives and are consistent when the local alternatives diverge. The second Wald statistic has a nonstandard asymptotic distribution, but it is asymptotically distribution-free under the null hypothesis, which makes it easy to simulate and therefore appealing to be used in the applied research. Fourth, the paper considers three examples, that have been previously considered in the literature but not with outliers and MAR observations: estimation and inference for models defined by a quasi-likelihood function, for nonlinear regression models and for generalized estimating equation models. Finally, this paper presents Monte Carlo evidence about the finite sample performance for the proposed estimators and test statistics for the three examples and considers two real data applications that illustrate the applicability and usefulness of the proposed methods.
The rest of the paper is structured as follows: The next section introduces the statistical models and the estimators. Section 3 contains the main results of the paper; Sect. 4 presents one of the three examples (the robust local generalized estimation equations model) and reports the results of a simulation study. Section 5 contains one of the two real data applications, and Sect. 6 contains some concluding remarks. A supplemental appendix available online contains the other two examples (with related simulation studies), the other real data application and all the proofs of the results of the paper.
The following notation is used throughout the paper: “\(^{T}\)” and “\(\otimes \)” denote, respectively, transpose and the standard Kronecker product, \( \left\| \cdot \right\| \) is the Euclidean norm and finally for any vector v \(v^{\otimes 2}=vv^{T}\).
2 The statistical models and estimators
Let \(\left\{ Y_{i},X_{i},U_{i}\right\} _{i=1}^{n}\) denote a random sample from \(\left[ Y,X,U\right] \), where both Y and U are scalar random variables and \(X=\left[ X_{1}^{T},X_{2}^{T}\right] ^{T}\) is an \(\mathbb {R} ^{k}\) \(\left( k=k_{1}+k_{2}\right) \)-valued random vector.Footnote 1 Assume that the response variable Y is related to the covariates X and U through the semiparametric specification \(\eta \left( X,\alpha _{0}\left( U\right) \right) \), where \( \eta \left( \cdot \right) :\mathcal {X}\times \mathcal {U}\rightarrow \mathbb {R }\) is a known smooth function, that could represent, for example, a regression function or the link function in a generalized linear model, and \( \alpha _{0}\left( \cdot \right) \) is an \(\mathbb {R}^{k}\)-valued unknown infinite-dimensional parameter assumed to be sufficiently smooth.
To introduce the M-type estimator for \(\alpha _{0}\left( \cdot \right) \), let
denote a loss function, where \(\omega \left( \cdot \right) \) is a real-valued function that downweights high leverage covariates. Examples of \( \zeta \left( \cdot \right) \) include Huber’s \(\rho \) and Tukey’s bisquare \( \rho \) functions and the loss functions used, for example, by Boente et al. (2006) and Bianco et al. (2011) to bound the deviances and/or the Pearson residuals. Let \(\delta ^{Y}\) and \(\delta ^{X_{1}}\) denote the binary indicators of missingness for Y and \(X_{1}\), respectively; for \( \delta =\delta ^{Y}\delta ^{X_{1}}\) let
denote the selection probability, which allows for the observed responses \( Y_{i}\) and the covariates in the vector \(X_{1i}\) to be MAR for the same units i (i.e., \(\delta _{i}^{Y}\delta _{i}^{X_{1}}=0\)) as well as for different units \(i\ne j\) (i.e., \(\delta _{i}^{Y}\delta _{j}^{X_{1}}=0\), ) in the sample \(1\le i,j\le n\).Footnote 2 Note that by the law of iterated expectations
which forms the basis for the estimators of this paper. Let
denote the linear approximation of \(\alpha _{0}\left( U\right) \) at the point u, where \(a_{1}=\alpha _{0}\left( u\right) ,\) \(a_{2}=h\partial \alpha _{0}\left( \cdot \right) /\partial u\), \(W=\left[ 1,\left( U-u\right) /h\right] ^{T}\), \(h\left( n\right) :=h\) is the bandwidth, and let \(\widehat{ \pi }\left( X_{2i},U_{i}\right) \) denote an estimator for \(\pi \left( X_{2i},U_{i}\right) \), which can be either parametric or nonparametric. For the former, let \(\pi \left( X_{2i},U_{i},\gamma _{0}\right) \) denote a parametric specification for \(\pi \left( X_{2i},U_{i}\right) \) (for example, a logit or a probit model), where \(\gamma _{0}\in \Gamma \) is a vector of unknown finite-dimensional parameters, and let \(\widehat{\gamma }\) denote a robust alternative to the maximum likelihood estimator such as the one suggested by Bianco and Yohai (1996) for the logistic regression; then, \( \widehat{\pi }\left( X_{2i},U_{i}\right) =\pi \left( X_{2i},U_{i},\widehat{ \gamma }\right) \). For the latter
where \(\omega \left( \cdot \right) \) is real-valued function given in (1), \(V_{i}=\left[ X_{2i}^{T},U_{i}\right] ^{T}\) and \( L_{b}\left( \cdot \right) =L\left( \cdot /b\right) /b^{k_{2}+1}\) is a product kernel function with bandwidth \(b\left( n\right) :=b\).
The IPW-based robust local M estimator for \(\alpha _{0}\left( \cdot \right) \) evaluated at \(U=u\) is \(\widehat{a}_{\widehat{\pi }}^{M}=\left[ \widehat{a}_{1 \widehat{\pi }}^{M}{}^{T},\widehat{a}_{2\widehat{\pi }}^{M}{}^{T}\right] ^{T} \), where
and \(K_{h}\left( \cdot \right) =K\left( \cdot /h\right) /h\) is a kernel function with bandwidth \(h\left( n\right) :=h\). Alternatively, \(\widehat{a}_{ \widehat{\pi }}^{M}\) can be defined as the solution to the first-order conditions:
The latter estimator suggests the second class of robust local estimators considered in this paper. Let
denote an \(\mathbb {R}^{k}\)-valued vector of robust estimating equations. Estimating equations arise naturally in statistics, with the derivative of a quasi-likelihood (the quasi-score) (Wedderburn 1974) being a prominent example. Other important examples include generalized estimating equations of Liang and Zeger (1986), the variance function estimating equations of Carroll and Ruppert (1988) and the first-order conditions used in the Gauss–Newton method to solve nonlinear least squares problems. However, estimating equations are not robust to outlier and hence the use of their robust analog (4). The vector of robust estimating equations \(\mu \left( \cdot \right) \) is such that \(E\left[ \mu _{G_{\omega }}\left( Y,\eta \left( X,\alpha \left( U\right) \right) \right) \omega \left( X\right) \right] =0\) for a unique \(\alpha \left( U\right) =\alpha _{0}\left( U\right) \), where \(\mu _{G_{\omega }}\left( \cdot \right) \) is the centered robust estimating equations with the centering factor \( G_{\omega }\left( \cdot \right) \) used to achieve Fisher consistency, see Sect. 4.1 for an example of \(G_{\omega }\left( \cdot \right) \).
The IPW-based robust local Z estimator \(\widehat{a}{}^{Z}\) for \(\alpha _{0}\left( \cdot \right) \) evaluated at \(U=u\) is defined as the solution to
where \(\eta ^{\prime }\left( \cdot \right) =\partial \eta \left( \cdot \right) /\partial a.\)
3 Asymptotic results
3.1 Estimation
Let \(\mathcal {U}\) denote the support of U and, for simplicity of notation, let \(\zeta \left( \cdot \right) \omega \left( \cdot \right) :=\zeta _{\omega }\left( \cdot \right) \), \(\mu _{G_{\omega }}\left( \cdot \right) \omega \left( \cdot \right) :=\mu _{\omega }\left( \cdot \right) \), \(H=diag\left[ 1,h\right] \otimes I_{k}\),
where \(\eta ^{\prime \prime }\left( \cdot \right) =\partial ^{2}\eta \left( \cdot \right) /\left( \partial a\right) ^{\otimes 2}.\)
Let \(v_{1k}=\int u^{k}K\left( u\right) \mathrm{d}u\), \(v_{2k}=\int u^{k}K^{2}\left( u\right) \mathrm{d}u\),
Theorem 1
Under assumptions A1–A6 in the supplemental appendix
Furthermore, if \(K\left( \cdot \right) \) is symmetric
To reduce the computational cost of \(\widehat{a }{}_{\widehat{\pi } }^{M} \), a one-step version of it is proposed. This procedure, which is effectively one iteration of the Raphson–Newton method, is appealing when the minimization of the loss function is difficult (or very time-consuming) to achieve to the desired degree of accuracy. If the initial estimator \(\widehat{a}{}_{\widehat{\pi }}^{M}=\left[ \widehat{a}_{1 \widehat{\pi }}^{M}{}^{T},\widehat{a}_{2\widehat{\pi }}^{M}{}^{T}\right] ^{T} \) is close enough to \(\alpha _{0}\left( U\right) \) for \(U \approx u\)—see Assumption A7 in the supplemental appendix, then the estimator from applying one iteration will have the same asymptotic variance as that of the minimizer of the loss function. To be specific, the one-step IPW-based robust M local estimator has the form
Theorem 2
Under the same assumptions of Theorem 1 and A7 in the supplemental appendix, the IPW-based one-step robust local M estimator given in (6) has the same asymptotic distribution as that given in Theorem 1. In particular, if \(K\left( \cdot \right) \) is symmetric
The next theorem establishes the asymptotic normality of the IPW-based robust local Z estimator defined in (5).
Theorem 3
Under assumptions A1–A3, A4–A6 in the supplemental appendix
Furthermore, if \(K\left( \cdot \right) \) is symmetric
As with the previous class of M estimators, it is possible to consider a one-step version of the IPW-based robust local Z estimator, which has the form
Theorem 4
Under the same assumptions of Theorem 3 and A7 (with \( \widehat{a}{}{}_{\widehat{\pi }}^{Z}\) replacing \(\widehat{a}{}{}_{\widehat{ \pi }}^{M}\)), the IPW-based one-step version of the robust local Z estimator given in (7) has the same asymptotic distribution as that given in Theorem 3. In particular, if \(K\left( \cdot \right) \) is symmetric
Remark 1
In the case where all the variables are observable, the resulting robust local M and Z estimators have the same asymptotic distributions as those given in Theorems 1–4 without the selection probability \(\pi \left( X_{2},U\right) \).
Remark 2
In the case where all the X covariates are MAR, that is if the selection probability is \(\Pr \left( \delta =1|Y,X,U\right) =\Pr \left( \delta =1|U\right) :=\pi \left( U\right) \), the resulting IPW-based robust local M and Z estimators have the same asymptotic distributions as those given in Theorems 1–4 with \( \Sigma _{\pi \times 0}\left( u\right) =\Sigma _{\times 0}\left( u\right) /\pi \left( u\right) \) and \(\times =\zeta \) or \(\mu \).
Remark 3
The robust IPW estimation method of this paper can also be used to construct estimators for the unknown marginal mean \(\mu _{0}\) of the response Y. We first consider the case when only some of the observed responses \(Y_{i}\) are MAR, that is the selection probability is
We consider two estimators: the first one is
whereas the second one is based on the assumption that \(E\left( Y|X,U\right) =\eta \left( X,\alpha _{0}\left( U\right) \right) \) and is
where \(\widehat{\pi }\left( X_{i},U_{i}\right) \) is either the robust logit parametric or nonparametric estimate of the selection probability (8) and \(\widehat{\alpha } \left( \cdot \right) \) can be either a robust M or Z estimator as discussed in Sect. 2. The first estimator is the standard IPW sample mean dating back to Horvitz and Thompson (1952); it is fully nonparametric in the sense that it does not include the additional information that the response Y is related to the covariates X and U through the function \(\eta \left( \cdot \right) \) and is robust to the presence of outliers in the covariates X, but is not robust to a possibly misspecified parametric model \(\pi \left( X_{i},U_{i},\gamma _{0}\right) \) for the selection probability (8). The second estimator is an imputation-type estimator, in the same spirit of the doubly robust estimators often used in the missing data literature, see, for example, Robins et al. (1994) and Scharfstein et al. (1999). By construction, it is robust to possible misspecification of the regression function \(E\left( Y|X,U\right) \) or of the parametric model \(\pi \left( X_{i},U_{i},\gamma _{0}\right) \) (i.e., it is doubly robust) but is sensitive to the presence of outliers in the covariates X, although this sensitivity is mitigated by the fact that \( \widehat{\alpha }\left( \cdot \right) \) is a robust estimator. The following theorem establishes the asymptotic normality of \(n^{1/2}\left( \widehat{\mu } _{\bullet }-\mu _{0}\right) \), where \(\bullet \) is either IPW or DR.
Theorem 5
Under Assumption A8
where
Theorem 5 shows that the two proposed estimators are asymptotically equivalent. The asymptotic variance \(V_{NP}\left( \mu _{0}\right) \) corresponds to the semiparametric efficiency bound of Hahn (1998); the asymptotic variance \(V_{P_{r}}\left( \mu _{0}\right) \) will be smaller than \(V_{NP}\left( \mu _{0}\right) \) if \(V_{1P_{r}}\left( \gamma _{0}\right) \le V_{2P_{r}}\left( \gamma _{0}\right) \). To this end, note that if the selection probability (8) was estimated by an ordinary parametric logit, the resulting asymptotic variance would be \( V_{P}\left( \mu _{0}\right) =V_{NP}\left( \mu _{0}\right) -V_{1P}\left( \mu _{0}\right) \), where
and \(I\left( \gamma _{0}\right) \) is the information matrix for a logit estimator, which implies that \(\widehat{\mu }_{\bullet }^{P}\) would be more efficient than \(\widehat{\mu }_{\bullet }^{NP}\). Thus, the closer (numerically) the influence function \(M^{-1}\left( \gamma _{0}\right) r_{\omega }\left( \gamma _{0}\right) \) of the Bianco and Yohai (1996) estimator \(\widehat{\gamma }\) is to that of the ordinary logit estimator, the more likely \(\widehat{\mu }_{\bullet }^{P_{r}}\) will be more efficient than \(\widehat{\mu }_{\bullet }^{NP}.\)
We conclude this remark by briefly discussing the case where some of the observed covariates \(X_{i}\), say \(X_{1i}\), are also MAR. In this case, the IPW estimator can still be used, as it relies only on the observable covariates \(X_{2}\) and U; on the other hand, the imputation estimator becomes unfeasible because of its dependence on the missing \(X_{1}\). To obtain a feasible imputation estimator additional assumptions, such as specifying the joint distribution of \(X_{1}\) and \(X_{2}\), or the existence of an additional of set of covariates, say Z, that are related (parametrically or nonparametrically) to \(X_{1}\), would be required.
3.2 Inference
The results of the previous section can be used to construct Wald statistics to test local statistical hypotheses about \(\alpha \left( \cdot \right) \). To investigate the asymptotic properties of such statistics, we consider the following local hypothesis with a Pitman drift
where R is an \(l\times k\) matrix of constants and \(\gamma _{n}\left( \cdot \right) \) is a bounded continuous function that may depend on n. Let \(S= \left[ I_{k},O_{k}\right] \) denote a selection matrix, where \(O_{k}\) is a \( k\times k\) matrix of zeros, and let
denote the Wald statistic, where, for \(*=M\) or Z, \(\widetilde{\widehat{ a}}{}_{1\widehat{\pi }}^{*}{}\) can be either the IPW-based robust local M or Z local estimator \(\widehat{a}{}_{1\widehat{\pi }}^{*}\) or its one-step version \(\widetilde{a}{}_{1\widehat{\pi }}^{*}\), \(\widehat{\Gamma } _{\times }^{v_{1}}\left( \cdot \right) \) and \(\widehat{\Sigma }_{\widehat{ \pi }\times }^{v_{2}}\left( \cdot \right) \) are consistent estimatorsFootnote 3 of \(\Gamma _{\times 0}^{v_{1}}\left( \cdot \right) \) and \(\Sigma _{\pi \times 0}^{v_{2}}\left( \cdot \right) \), and \(\widehat{\pi }\left( \cdot \right) \) is either the parametric or nonparametric estimator of \(\pi \left( \cdot \right) \).
Proposition 1
Under the assumptions of Theorems 1–4, if \( rank\left( R\right) =l\) \(\left( l\le k\right) \), and \(nh^{5}\rightarrow 0\), then under (9) (i) for \(\left( nh\right) ^{1/2}\gamma _{n}\left( u\right) \rightarrow \gamma \left( u\right) >0\) (for some \(\left\| \gamma \left( u\right) \right\| <\infty \))
where \(\chi ^{2}\left( \kappa ,l\right) \) is a noncentral Chi-squared distribution with l degrees of freedom and noncentrality parameter
(ii) for \(\left( nh\right) ^{1/2}\gamma _{n}\left( u\right) \rightarrow \infty \),
Proposition 1 shows that with undersmoothing the proposed test has power against local Pitman-type alternatives and is consistent against any fixed alternatives of the form \(\gamma _{n}\left( \cdot \right) =\gamma \left( \cdot \right) .\) Under the null hypothesis \(H_{0}:\) \(R\alpha \left( u\right) =r_{0}\left( u\right) \), the proposition can be used to construct confidence regions for \(R\alpha \left( u\right) \) with nominal coverage \( 1-\alpha \), that is for \(\Pr \left( \chi ^{2}\left( l\right) \le c_{\alpha }\right) =1-\alpha \) and \(C_{\alpha }\left( u\right) =\Pr \left( r_{0}\left( u\right) |W\left( u\right) \le c_{\alpha }\right) ,\)
Note that in the case of \(K\left( \cdot \right) \) being symmetric, the Wald statistic \(W\left( u\right) \) simplifies to
Proposition 1 can also be used to test the important hypothesis of constancy of the varying coefficients \(\alpha \left( \cdot \right) \), corresponding to
The test can be implemented using the finite-dimensional analog of the IPW-based robust local M and Z estimators defined in (3) and (5), that is
where A is a compact set and \(\alpha _{0}\in int\left( A\right) \). Let \( \widehat{\alpha }_{\widehat{\pi }}^{*}\) \(\left( *=M\text { or } Z\right) \) denote either of the estimators defined in (12) and note that under the null hypothesis (11) and the assumption that \(n^{1/2}\left( \widehat{\alpha }_{\widehat{\pi } }^{*}-\alpha _{0}\right) =O_{p}\left( 1\right) ,\)
hence by Proposition 1
It is important to note that the test statistics \(W\left( u\right) \), \( W_{s}\left( u\right) \) and \(W_{c}\left( u\right) \) are asymptotically valid at a single point u. To increase their power, one can consider them over a fixed range of values of u, say \(\left\{ u_{j}\right\} _{j=1}^{s}\), and use instead the test statistics \(\max _{j}W\left( u_{j}\right) \) and \( \max _{j}W_{c}\left( u_{j}\right) \) \(\left( j=1,\ldots ,s\right) \), as the following proposition shows.
Proposition 2
Under the assumptions of Proposition 1, (i) for \(\left( nh\right) ^{1/2}\gamma _{n}\left( u_{j}\right) \rightarrow \gamma \left( u_{j}\right) >0\) (for some \(\left\| \gamma \left( u_{j}\right) \right\| <\infty \mathrm{)}\)
where
(ii) for \(\left( nh\right) ^{1/2}\gamma _{n}\left( u_{j}\right) \rightarrow \infty \)
Note that the distribution of the test statistics in Proposition 2 is nonstandard, since it involves the maximum of s independent noncentral Chi-squared distributions. However, under the null hypothesis \(R\alpha \left( u\right) =r_{0}\left( u\right) \), the test statistic is asymptotically distribution-free; hence, its distribution can be evaluated numerically or easily simulated.
Remark 4
As mentioned in Introduction, a robust distance statistic can also be used to test (9); however, the resulting statistic would not be asymptotically distribution-free as the following proposition shows for the simple null hypothesis \(H_{0}:\alpha \left( u\right) =\alpha _{0}\left( u\right) \) and \(K\left( \cdot \right) \) symmetric. Let
Proposition 3
Under the same assumptions of Proposition 1
where \(\lambda _{j}\) are the eigenvalues of \(\Sigma _{\pi \zeta 0}\left( u\right) \Gamma _{\zeta 0}\left( u\right) ^{-1}.\)
4 Example and Monte Carlo
4.1 Robust generalized estimating equations (GEE) estimation
This section considers estimation of a varying coefficient GEE model for longitudinal data [see, for example, Liang and Zeger (1986) for GEE estimation with unknown finite-dimensional parameters]. Suppose that
where Y is an m-dimensional random vector (m is the number of subjects or clusters) and \(\eta \left( X^{T}\alpha \left( U\right) \right) =\left[ \eta \left( X_{1}^{T}\alpha \left( U_{1}\right) \right) ,\ldots ,\eta \left( X_{m}^{T}\alpha \left( U_{m}\right) \right) \right] ^{T}.\) In this example, the robust estimating equations \(\mu _{\omega }\left( \cdot \right) \) are
where \(V\left( \eta ,\beta _{0}\right) =R\left( \beta _{0}\right) A_{0}^{1/2} \), \(R\left( \beta _{0}\right) \) is the working correlation matrix indexed by the q-dimensional unknown parameter \(\beta _{0}\), \(\psi \left( .\right) \) is a robust function such as the Huber function defined as
with tuning constant c,
\(\phi _{0}\) is the unknown dispersion parameter,
is the correction factor used to achieve Fisher consistency and a monotone missing data patternFootnote 4 is assumed, that is, for \(\delta _{1}\ge \delta _{2}\ge \cdots \ge \delta _{m}\), \(\delta _{1}=1,\)
Calculations show that
where \(s_{\alpha }\left( X,U\right) =\partial \log f\left( Y|X,U\right) /\partial \alpha \) and \(f\left( Y|X,U\right) \) is the joint conditional density of the response Y. Consistent estimators for \(\Gamma _{\mu 0}\left( u\right) \) and \(\Sigma _{\pi \mu 0}\left( u\right) \) can be found in the supplemental appendix.
4.2 Monte Carlo results
This section investigates the finite sample performance of the estimator and test statistic \(\max _{j}W\left( u_{j}\right) \) given in Proposition 2 for the GEE model considered in the previous section using a varying coefficient logit regression
where Y is a three-dimensional binary response variable Y (i.e., \(m=3\) is number of subjects or clusters), the covariates \(X=\left[ X_{k1},X_{k2} \right] ^{T}\) \(\left( k=1,2,3\right) \) are independently normally distributed with mean zero and unit variances, the three-dimensional covariate U is independent of X and uniformly distributed between 0 and 1 and \( \alpha _{0}\left( U\right) =\left[ \sin \left( \pi U/2\right) ,\cos \left( \pi U\right) \right] ^{T}\). To generate the responses with an exchangeable covariance structure (with correlation coefficient set equal to 0.3), Parzen (2009) approach is used, in which a random effect is added to the marginal probability of success
The selection probabilities for subjects \(k=2\) and \(k=3\) are specified as
with \(\left[ \gamma _{10},\gamma _{20},\gamma _{30},\gamma _{40},\gamma _{50} \right] ^{T}=\left[ 1,1.5,0.3,1,0.7\right] ^{T}\), which implies that the average percentages of missing are about 25% for \(k=2\) and 20% for \(k=3\). We consider three cases: the first one (case 0) has no missing values nor outliers; the second one (case C0) has missing values but no outliers; and finally the last case (case C3) has both missing values and outliers generated as
that is in (C3) the first ten elements of the three covariates \(X_{k2}\) are outliers. The computation of \(\widehat{a}_{\widehat{\pi }}^{Z}\) is carried out under the working independence assumption with \(\phi _{0}=1\) using the Huber function (15) with \(c=1.2\) and the Newton–Raphson algorithm. In the simulations, the Epanechnikov kernel is used, whereas the weight function is
where M and S are robust location-scatter estimators such as the minimum covariance determinant estimator (MCD), see, for example, He et al. (2005). The MCD estimator is computed using the R routine CovNAMcd, which uses imputation to deal with missing observations. The bandwidth h is selected by a robust cross-validation procedure in which in the first step for a given h, \(\widehat{t}_{-i}=\sum _{j\ne i}^{n}\Psi _{\widehat{\pi } c}\left( \widehat{t}_{j},u\right) /n=0\) and in the second step the robust bandwidth is chosen as \(\widehat{h}=\arg \min _{h}\sum _{i=1}^{n}\delta _{i}\zeta _{\omega }\left( \widehat{t}_{-i}\right) /\widehat{\pi }\left( X_{2i},U_{i}\right) ,\) that is \(\widehat{h}\) is the minimizer of the robust quasi deviance. Note that the selection probabilities (16) are estimated only using the robust parametric logit estimator.
Table 1 reports the mean absolute bias (B)
and standard deviation (SD) of four different estimators of \(\alpha _{k0}\left( \cdot \right) \): the robust local complete case estimator \( \widehat{a}_{1kc}\)—that is the estimator based on the sample where all the missing observations are dropped, the IPW local robust estimator \( \widehat{a}_{1k\widehat{\pi }p}\) with robust logit estimation of \(\pi \left( X_{2i},U_{i}\right) \) and their nonrobust analogs computed with the Huber function (15) set to \(c=\infty \), no correction term \(G_{\omega }\left( \cdot \right) \) and an ordinary logit estimator for \(\pi \left( X_{2i},U_{i}\right) .\)
Inference is based on the statistic \(\max _{1\le j\le s}W\left( u_{j}\right) \) (with \(s=5\)) for the hypothesis
where \(u_{kj}=\left[ 0.1,0.3,0.5,0.7,0.9\right] \), \(\alpha _{10}\left( u_{kj}\right) =\left[ 0.156,0.453,0.707,0.891,0.987\right] \) and \(\alpha _{20}\left( u_{kj}\right) =\left[ 0.951,0.587,0,0.587,0.951\right] \) \(\left( k=1,2,3\right) \) (that is, there are six parameters), \(\widehat{\alpha }_{ \widehat{\pi }}^{Z}\) is the parametric estimator defined in (12) and \(\gamma \in \mathbb {R}\) index the departure from the null hypothesis (corresponding to \(\gamma =0\)). The upper 10% and 5% critical values of the nonstandard distribution given in Proposition 2 are calculated using \(10^{5}\) simulations and are \(\left[ 11.307,13.452\right] \) for \(n=200\) and \(\left[ 11.077,13.168\right] \) for \(n=500\). Table 2 reports the finite sample size at the 0.10 and 0.05 nominal level of \(\max _{1\le j\le 5}W\left( u_{kj}\right) \) based on the four estimators and the three cases (0, C0 and C3) used in Table 1.
A full evaluation of the finite sample power of \(\max _{1\le j\le 5}W\left( u_{kj}\right) \) under (18) is not feasible as it would have to be calculated over the hypergrid \(\gamma \times \cdots \times \gamma =\left[ -1,-0.75,\ldots ,0,\ldots 0.75,1\right] ^{4}\) (6, 561 evaluation points); therefore, Fig. 1 reports the contour plots at the level 0.45 of the size-adjusted finite sample power curves for the test of \(H_{\gamma 2}\) (that is only for the second subject (or cluster)) over the grid \(\gamma \times \gamma =\left[ -1,-0.75,\ldots ,0,\ldots 0.75,1\right] \times \left[ -1,-0.75,\ldots ,0,\ldots 0.75,1\right] \) for the C0 and C3 cases. Note that smaller contour plots indicate higher finite sample power.
The results of Tables 1 and 2 (together with those of Tables 3–6 in the supplemental appendix) can be summarized as follows: For estimation (Tables 1 and 3, 5 in the supplemental appendix), first, without outliers and missing observations (case 0) the nonrobust local estimators perform better than the robust ones both in terms of bias and standard deviation, with the bias and standard deviation up to, respectively, 16% and 8% smaller for the GEE example, 13% and 10% smaller for the Poisson regression example l and 7% and 11% smaller for the nonlinear regression example in the supplemental appendix. Note that for both estimators the bias and standard deviation decrease as the sample size increases, implying the validity of the asymptotic results of Sect. 3. Second, without outliers but with missing observations (case C0) the performance of the nonrobust and robust local estimators is fairly similar, with biases and standard deviations being, respectively, between 1% and 8% smaller and 2% and 6% (for the Poisson regression example—see Table 3 in the supplemental appendix, and the GEE example). Third, when outliers are present (case C3 and cases C1–C2 in the supplemental appendix) the robust estimators clearly outperform the nonrobust ones with biases being up to 60% smaller and standard deviations up to 50% smaller. Fourth, among the three robust local estimators, those based on the inverse probability weighting perform better in terms of bias (for the GEE example on average around 18% smaller, for the Poisson regression example on average around 9% smaller and for the nonlinear regression example on average around 10% smaller) but not in terms of standard deviation (for the GEE example on average around 2%, for the Poisson regression example on average around 1% and for the nonlinear regression example on average around 2%) than those based on the complete case analysis, which is not surprising because their asymptotic variance is affected by the estimated selection probabilities in the denominator. Note, however, that in terms of the mean-squared error (MSE) the local robust estimators have typically a smaller one than that of the nonrobust estimators, the exceptions being the GEE and the Poisson regression examples both with 200 observations, where the MSEs are, respectively, 0.272 and 0.279 and 0.312 and 0.315 (for the IPW estimator with nonparametric estimation of the selection probabilities). Finally, between the two inverse probability weighting estimators, those based on the parametric estimator of the selection probabilities seem to have an edge over those based on the nonparametric estimator in terms of both bias and standard deviation, which is again not surprising, given that the selection probabilities are estimated using the (robust) maximum likelihood estimator for a correctly specified parametric model. For inference (Tables 2 and 4 and 6 in the supplemental appendix), first, without outliers and missing observations (case 0) or without outliers but with missing observations (case C0), the tests based on the nonrobust local estimators are characterized by a slightly better (i.e., closer to the nominal level) size than that of the tests based on the robust local estimators (up to 3% smaller for the GEE-based test with 200 observation). Second, when outliers are present (cases C1–C3) the size distortion of the tests based on the nonrobust local estimators worsen significantly (up to 30% larger size distortions in the GEE case with 200 observations), whereas that of the tests based on the robust local estimator remain similar to that of case C0. Third among the three test statistics, those based on the inverse probability weighting are more accurate (that is they have the smallest size distortion) with those based on the parametric estimator of the selection probabilities being the most accurate. Finally, Fig. 1 (combined with Figures 3 and 4 in the supplemental appendix) shows that in terms of finite sample power the tests based on the inverse probability weighting robust local estimators have typically larger power compared to those based on the complete case estimators both in the case of no outliers present (case C0) or with outliers (cases C2 and C3).
5 Empirical application
This section illustrates the applicability of the proposed estimation and testing methods by considering the New York air quality measurements data (from May to September 1973, available in the R package datasets which consists of 154 daily observations of mean ozone parts (per billion), solar radiations, wind speed (in mph) and temperature (in degrees F) and contains 37 missing ozone part observations and 7 missing solar radiation observations. The same data set was considered by Bianco and Spano (2019), who fitted an exponential growth regression model between the ozone parts and the temperature. Here, a linear varying coefficients specification is considered
where Y represents the ozone parts, \(X=\left[ 1,X_{1},X_{2}\right] ^{T}\) represent, respectively, the solar radiation and temperature, U is the wind speed and \(\varepsilon \) is a standard normal. The same Huber function with \(c=1.2\) and weight function \(\omega \left( \cdot \right) \) as those given in (15) and (17) are used, with the computation of \(\widehat{\alpha }_{\widehat{\pi }}\left( \cdot \right) \) based on the Newton–Raphson algorithm for the local estimating equations
where \(\widehat{\varepsilon }_{i}=Y_{i}-X_{i}^{T}\widehat{a}W_{i}\).
Figure 2 shows the three varying coefficients \(\widehat{\alpha }_{j}\left( u_{i}\right) \) estimated using (19) with the Bianco and Yohai (1996) robust logit estimator for the selection probabilities \(\pi \left( X_{2i},U_{i}\right) \) and their nonrobust analogs, together with their associated 95% confidence intervals.
The first estimated coefficient (intercept) represents the direct effect of the wind speed on the ozone parts and is a decreasing function of it. (The same decreasing relationship was found by Bianco and Spano (2019).) The other two estimated varying coefficients represent the combined effect on the ozone parts of the solar radiation and temperature with the wind speed. The former shows a clear decreasing relationship between the ozone parts and the wind speed combined with the solar radiations up to a speed of around 12 mph followed by a flatter relationship, whereas the latter shows an initial increasing relationship between the ozone parts and the wind speed and the temperature followed by a less clear relationship. Note that the robust local estimators are characterized by a more regular (and therefore easier to interpret) pattern. In terms of the mean effect (i.e., \(\overline{\widehat{ \alpha }_{j}}=\sum _{i=1}^{n}\widehat{\alpha }_{j}\left( U_{i}\right) /n\) for \(j=1,2,3\)), the nonrobust estimation procedure has \(\left[ \overline{ \widehat{\alpha }_{1}},\overline{\widehat{\alpha }_{2}},\overline{\widehat{ \alpha }_{3}}\right] ^{T}=\left[ -0.94,0.08,1.56\right] ^{T}\), whereas the robust one has \(\left[ \overline{\widehat{\alpha }_{1}},\overline{\widehat{ \alpha }_{2}},\overline{\widehat{\alpha }_{3}}\right] ^{T}=\left[ -0.84,0.07,1.36\right] ^{T}\). For inference, the null hypotheses of joint and individual constancy of the three varying coefficients are considered, that is
are tested using the \(\max _{1\le j\le 25}W_{c}\left( u_{j}\right) \) statistic (2) evaluated at 25 points \(u_{j}\). The sample values of \(\max _{1\le j\le 25}W_{c}\left( u_{j}\right) \) for the null hypotheses of joint and individual constancy of the three varying coefficients are, respectively, 11.84, 7.84, 7.04 and 8.89 with corresponding p values of 0.028, 0.022, 0.028 and 0.016. Thus, the null hypotheses of joint and individual constancy of the varying coefficients are rejected at the 0.05 significance level.
6 Conclusions
This paper has considered robust local estimation and inference for a general class of varying coefficients models where some of the responses and covariates are missing at random and outliers might be present. The paper has proposed a general estimation method (and a computationally attractive one-step version of it) based on inverse probability of weighting of the selection probabilities that can be used to obtain both M and Z estimators and can accommodate longitudinal data. The paper has also proposed two Wald statistics that can be used to test hypotheses on the infinite-dimensional parameter, including that of constancy. A Monte Carlo study shows that the proposed estimators and Wald statistics perform well in finite samples, while two empirical applications illustrate their practical usefulness.
Notes
Note that the results of the paper are valid also for multivariate models, in which case Y and U are, say, \(\mathbb {R}^{m}\)-valued random vectors and X is an \(\mathbb {R}^{m}\times \mathbb {R}^{k}\)-valued random matrix; see Sect. 4 for an example.
See the supplemental appendix for some examples of \( \widehat{\Gamma }_{\times }^{v_{1}}\left( \cdot \right) \) and \(\widehat{ \Sigma }_{\widehat{\pi }\times }^{v_{2}}\left( \cdot \right) \).
The monotone missing data pattern assumption is fairly common in missing data models for longitudinal studies, see, for example, Ibrahim and Molenberghs (2009) for a review. For nonmonotone missing data patters, the IPW estimation method of this paper is still valid; however, the estimation of the selection probabilities is substantially more challenging. One possible method is to use the randomized monotone missingness model proposed by Robins and Gill (1997), which is unfortunately quite complex to implement in practice and computationally intensive, see Lia et al. (2013) for further details.
References
Bianco A, Spano P (2019) Robust inference in nonlinear regression models. Test 28:369–398
Bianco A, Yohai V (1996) Robust estimation in the logistic regression model. In: Robust statistics, data analysis and computer intensive methods, Lecture Notes in Statistics 109, Springer, New York
Bianco A, Boente G, Martinez E (2006) Robust tests in semiparametric partly linear models. Scand J Stat 33:435–450
Bianco A, Boente G, Sombielle S (2011) Robust estimation for nonparametric generalized regression. Stat Probab Lett 81:1986–1994
Bianco A, Boente G, Gonzalez-Manteiga W, Perez A (2019) Plug-in marginal estimation under general regression model with missing responses and covariates. Test 28:106–146
Boente G, He X, Zhou J (2006) Robust estimates in generalized partially linear models. Ann Stat 34:2856–2878
Boente G, Gonzalez-Manteiga W, Perez-Gonzalez A (2009) Robust nonparametric estimation with missing data. J Stat Plan Inference 139:571–592
Bravo F (2015) Semiparametric estimation with missing covariates. J Multivar Anal 139:329–346
Bravo F, Jacho-Chavez D (2016) Semiparametric quasi-likelihood estimation with missing data. Commun Stat Theory Methods 46:1345–1369
Cai Z, Fan J, Li R (2000) Efficient estimation and inference for varying-coefficient models. J Am Stat Assoc 95:888–902
Cantoni E, Ronchetti E (2001) Robust inference for generalized linear models. J Am Stat Assoc 96:1022–1030
Carroll R, Ruppert D (1988) Transformation and weighting in regression. Chapman and Hall, London
Chen J, Fan J, Li K, Zhou H (2006) Local quasi-likelihood estimation with data missing at random. Statistica Sinica 16:1071–1100
Cheng P (1994) Nonparametric estimation of mean functionals with data missing at random. J Am Stat Assoc 89:81–87
Eubank R, Huang C, Munoz Maldonado Y, Wang N, Wang S, Buchanan R (2004) Smoothing spline estimation in varying-coefficient models. J R Stat Soc B 66:653–667
Fan J, Gijbels I (1996) Local polynomial modeling and its applications. Chapman and Hall, London
Fan J, Hu TC, Truong Y (1994) Robust non-parametric function estimation. Scand J Stat 21:433–446
Fan J, Heckman N, Wand M (1995) Local polynomial kernel regression for generalized linear models and quasilikelihood functions. J Am Stat Assoc 90:141–150
Fan J, Farmer M, Gijbels I (1998) Local maximum likelihood estimation and inference. J R Stat Soc B 60:591–608
Hahn J (1998) On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66:315–331
Hastie T, Tibshirani R (1993) Varying-coefficient models (with discussion). J R Stat Soc 55:757–796
He X, Fung W, Zhu Z (2005) Robust estimation in generalized partial linear models for clustered data. J Am Stat Assoc 100:1176–1184
Horvitz D, Thompson D (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47:663–685
Hu T, Cui H (2010) Robust estimates in generalised varying coefficient partially linear models. J Nonparametric Stat 22:737–754
Huang J, Wu C, Zhou L (2002) Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika 89:111–128
Ibrahim J, Molenberghs G (2009) Missing data methods in longitudinal studies: a review. Test 18:1–43
Kurum E, Li R, Senturk D, Wang Y (2013) Nonlinear varying coefficient models with application to a photosynthesis study. J Agric Biol Environ Stat 19:57–81
Lia L, Shen X, Li X, Robins J (2013) On weighting approaches for missing data. Stat Methods Med Res 22:14–30
Liang H (2008) Generalized partially linear models with missing covariates. J Multivar Anal 99:880–895
Liang K, Zeger S (1986) Longitudinal data analysis using generalised linear models. Biometrika 73:13–22
Liang H, Wang S, Robins J, Carroll R (2004) Estimation in partially linear models with missing covariates. J Am Stat Assoc 99:357–367
Parzen M (2009) A random effects model for simulating clustered binary data. Technical Report, Harvard University
Robins J, Gill R (1997) Non-response models for the analysis of non-monotone ignorable missing data. Stat Med 16:39–56
Robins J, Rotnitzky A (1995) Analysis of semiparametric models for repeated outcomes and missing data. J Am Stat Assoc 90:106–121
Robins J, Rotnitzky A, Zhao L (1994) Estimation of regression coefficients when some of the regressors are not always observed. J Am Stat Stat Assoc 89:846–866
Ruppert D, Wand P (1994) Multivariate locally weighted least squares regression. Ann Stat 22:1346–1370
Scharfstein D, Rotnitzky A, Robins J (1999) Adjusting for ignorable drop-out using semiparametric nonresponse models. J Am Stat Assoc 94:1096–1120
Verhasselt A (2014) Generalized varying coefficient models: a smooth variable selection technique. Statistica Sinica 24:147–171
Wedderburn R (1974) Quasi-likelihood functions, generalized linear models and the gauss-newton method. Biometrika 61:439–447
Acknowledgements
I would like to thank an associate editor and a referee for very useful comments and suggestions.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Bravo, F. Robust estimation and inference for general varying coefficient models with missing observations. TEST 29, 966–988 (2020). https://doi.org/10.1007/s11749-019-00692-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11749-019-00692-0