1 Introduction

This paper considers estimation and inference for a general class of varying coefficient models in which some of the responses, and possibly some of the covariates, are not always observed and outliers may be present. In the absence of outliers and when all the variables are observable, the estimation of the unknown infinite-dimensional parameter for specific examples of the model considered here can be carried out using a number of alternative methods, such as spline approximation [see, for example, Eubank et al. (2004) and Hastie and Tibshirani (1993) for varying coefficient models, and Verhasselt (2014) for generalized varying coefficient models], series estimation [see, for example, Huang et al. (2002)] or local smoothing [see, for example, Fan et al. (1998) for local maximum likelihood, Fan et al. (1995) for local quasi-likelihood and Ruppert and Wand (1994) for local least squares, among many others]. However, the influence function of the resulting estimators is unbounded, and thus outliers or large deviations of the response variable from its conditional mean can have a very negative effect on them. Furthermore, ignoring the fact that some of the data are not always observable, or simply excluding the missing observations (the so-called complete case analysis), may also negatively affect the estimators and cause a substantial loss of information. Clearly, these potential problems can also degrade the quality of any inference about the unknown infinite-dimensional parameter, since such inference is typically based on test statistics that rely on these estimators.

This paper uses local smoothing, which is flexible enough to accommodate the estimation of the unknown infinite-dimensional parameter of the general model considered here and makes the computation of the asymptotic covariance matrices required for inference relatively easy, especially when some of the observations in the sample are missing and/or outliers are present. The paper assumes that some of the responses and possibly some of the covariates are missing at random (MAR) and proposes estimators that combine the ideas of smoothing and robust estimation with the inverse probability weighting (IPW) method (Horvitz and Thompson 1952). Robust estimation in nonparametric and semiparametric models has been considered by Fan et al. (1994), Boente et al. (2006), Bianco et al. (2011) and Hu and Cui (2010), among many others. Nonparametric and semiparametric estimation with MAR observations has been considered by Cheng (1994), Liang et al. (2004), Chen et al. (2006), Liang (2008) and Bianco et al. (2019), among others. Robust nonparametric and semiparametric estimation with missing data has been considered recently by Boente et al. (2009) and Bravo (2015). The estimators of this paper use a real-valued function that downweights high-leverage covariates together with either a robustified loss function (M estimators) or a robustified set of estimating equations (Z estimators), both of which yield bounded influence functions. The unknown infinite-dimensional parameter is estimated using the local linear estimator (Fan and Gijbels 1996), whereas the probability of missingness (the so-called selection probability) is estimated using either a robust parametric or a robust nonparametric estimator.

For inference, the paper focuses on a robust Wald statistic. A similar Wald statistic was used by Bianco and Spano (2019) in the context of parametric nonlinear regression models with MAR responses and by Bianco et al. (2006) for the finite-dimensional parameter of a partially linear model with all the variables observable. The Wald statistics considered here differ from those of Bianco and Spano (2019) and Bianco et al. (2006) because they use IPW-based robust local estimators. Furthermore, one of the proposed Wald statistics is characterized by a nonstandard asymptotic distribution. Alternatively, a robust “distance” type of statistic, in the same spirit as the robust deviance statistic proposed by Cantoni and Ronchetti (2001) in the context of parametric quasi-likelihood estimation, could be used for inference. However, as opposed to the robust Wald statistics considered here, this statistic is not asymptotically distribution-free under the null hypothesis (that is, it depends on nuisance parameters) [see Remark 4 in Sect. 3.2], which makes it less attractive for inferential purposes.

The new results of the paper are the following: first, it establishes the asymptotic normality of the proposed robust local M and Z estimators and shows that the presence of MAR observations affects their asymptotic variance but not their asymptotic bias. This result is consistent with those obtained by Chen et al. (2006) and Bravo and Jacho-Chavez (2016) for semiparametric estimators in the presence of MAR responses and with some general results obtained by Robins and Rotnitzky (1995) (albeit for statistical models with finite-dimensional parameters). Second, it considers asymptotically equivalent one-step versions of the proposed estimators that are computationally attractive and seem to perform well in the simulations. These results are rather general, as they can be applied to both single and multiple equation models (i.e., models for longitudinal or repeated outcomes data) and can be used to robustify a number of estimators, including the local quasi-likelihood estimator for generalized linear models of Fan et al. (1995) and Chen et al. (2006), the local maximum likelihood estimator for varying coefficient models of Cai et al. (2000) and the local nonlinear least squares estimator for varying coefficient models of Kurum et al. (2013). They can also be applied to construct estimators for marginal parameters of the model, such as the marginal mean of the response; see Remark 3 in Sect. 3.1 for an example. Third, it considers two Wald statistics that can be used to test linear hypotheses about the infinite-dimensional parameter. Both statistics are shown to have power against a sequence of local alternatives and to be consistent when the local alternatives diverge. The second Wald statistic has a nonstandard asymptotic distribution, but it is asymptotically distribution-free under the null hypothesis, which makes it easy to simulate and therefore appealing for applied research.
Fourth, the paper considers three examples that have been previously considered in the literature, but not with outliers and MAR observations: estimation and inference for models defined by a quasi-likelihood function, for nonlinear regression models and for generalized estimating equation models. Finally, the paper presents Monte Carlo evidence on the finite sample performance of the proposed estimators and test statistics for the three examples and considers two real data applications that illustrate the applicability and usefulness of the proposed methods.

The rest of the paper is structured as follows: The next section introduces the statistical models and the estimators. Section 3 contains the main results of the paper; Sect. 4 presents one of the three examples (the robust local generalized estimating equations model) and reports the results of a simulation study. Section 5 contains one of the two real data applications, and Sect. 6 contains some concluding remarks. A supplemental appendix available online contains the other two examples (with related simulation studies), the other real data application and all the proofs of the results of the paper.

The following notation is used throughout the paper: “\(^{T}\)” and “\(\otimes \)” denote, respectively, transpose and the standard Kronecker product, \( \left\| \cdot \right\| \) is the Euclidean norm and, finally, for any vector \(v\), \(v^{\otimes 2}=vv^{T}\).

2 The statistical models and estimators

Let \(\left\{ Y_{i},X_{i},U_{i}\right\} _{i=1}^{n}\) denote a random sample from \(\left[ Y,X,U\right] \), where both Y and U are scalar random variables and \(X=\left[ X_{1}^{T},X_{2}^{T}\right] ^{T}\) is an \(\mathbb {R} ^{k}\) \(\left( k=k_{1}+k_{2}\right) \)-valued random vector.Footnote 1 Assume that the response variable Y is related to the covariates X and U through the semiparametric specification \(\eta \left( X,\alpha _{0}\left( U\right) \right) \), where \( \eta \left( \cdot \right) :\mathcal {X}\times \mathcal {U}\rightarrow \mathbb {R }\) is a known smooth function, that could represent, for example, a regression function or the link function in a generalized linear model, and \( \alpha _{0}\left( \cdot \right) \) is an \(\mathbb {R}^{k}\)-valued unknown infinite-dimensional parameter assumed to be sufficiently smooth.

To introduce the M-type estimator for \(\alpha _{0}\left( \cdot \right) \), let

$$\begin{aligned} \zeta \left( Y,\eta \left( X,\alpha \left( U\right) \right) \right) \omega \left( X\right) \end{aligned}$$
(1)

denote a loss function, where \(\omega \left( \cdot \right) \) is a real-valued function that downweights high leverage covariates. Examples of \( \zeta \left( \cdot \right) \) include Huber’s \(\rho \) and Tukey’s bisquare \( \rho \) functions and the loss functions used, for example, by Boente et al. (2006) and Bianco et al. (2011) to bound the deviances and/or the Pearson residuals. Let \(\delta ^{Y}\) and \(\delta ^{X_{1}}\) denote the binary indicators of missingness for Y and \(X_{1}\), respectively; for \( \delta =\delta ^{Y}\delta ^{X_{1}}\) let

$$\begin{aligned} \Pr \left( \delta =1|Y,X,U\right) =\Pr \left( \delta =1|X_{2},U\right) :=\pi \left( X_{2},U\right) >0\quad a.s., \end{aligned}$$
(2)

denote the selection probability, which allows the observed responses \( Y_{i}\) and the covariates in the vector \(X_{1i}\) to be MAR for the same units i (i.e., \(\delta _{i}^{Y}\delta _{i}^{X_{1}}=0\)) as well as for different units \(i\ne j\) (i.e., \(\delta _{i}^{Y}\delta _{j}^{X_{1}}=0\)) in the sample \(1\le i,j\le n\).Footnote 2 Note that by the law of iterated expectations

$$\begin{aligned}&E\left[ \frac{\delta }{\pi \left( X_{2},U\right) }\zeta \left( Y,\eta \left( X,\alpha _{0}\left( U\right) \right) \right) \omega \left( X\right) \right] \\&\quad = E\left\{ E\left[ E\left( \frac{\delta }{\pi \left( X_{2},U\right) }\zeta \left( Y,\eta \left( X,\alpha _{0}\left( U\right) \right) \right) \omega \left( X\right) |Y,X,U\right) |U\right] \right\} \\&\quad = E\left\{ E\left[ \zeta \left( Y,\eta \left( X,\alpha _{0}\left( U\right) \right) \right) \omega \left( X\right) |U\right] \right\} , \end{aligned}$$

which forms the basis for the estimators of this paper. Let

$$\begin{aligned} \alpha _{0}\left( U\right) =a_{1}+a_{2}\left( U-u\right) :=aW \end{aligned}$$

denote the linear approximation of \(\alpha _{0}\left( U\right) \) at the point u, where \(a_{1}=\alpha _{0}\left( u\right) ,\) \(a_{2}=h\partial \alpha _{0}\left( \cdot \right) /\partial u\), \(W=\left[ 1,\left( U-u\right) /h\right] ^{T}\), \(h\left( n\right) :=h\) is the bandwidth, and let \(\widehat{ \pi }\left( X_{2i},U_{i}\right) \) denote an estimator for \(\pi \left( X_{2i},U_{i}\right) \), which can be either parametric or nonparametric. For the former, let \(\pi \left( X_{2i},U_{i},\gamma _{0}\right) \) denote a parametric specification for \(\pi \left( X_{2i},U_{i}\right) \) (for example, a logit or a probit model), where \(\gamma _{0}\in \Gamma \) is a vector of unknown finite-dimensional parameters, and let \(\widehat{\gamma }\) denote a robust alternative to the maximum likelihood estimator such as the one suggested by Bianco and Yohai (1996) for the logistic regression; then, \( \widehat{\pi }\left( X_{2i},U_{i}\right) =\pi \left( X_{2i},U_{i},\widehat{ \gamma }\right) \). For the latter

$$\begin{aligned} \widehat{\pi }\left( X_{2i},U_{i}\right) =\frac{\sum _{j=1}^{n}\delta _{j}L_{b}\left( V_{j}-V_{i}\right) \omega \left( X_{2j}\right) }{ \sum _{j=1}^{n}L_{b}\left( V_{j}-V_{i}\right) \omega \left( X_{2j}\right) }, \end{aligned}$$

where \(\omega \left( \cdot \right) \) is the real-valued function given in (1), \(V_{i}=\left[ X_{2i}^{T},U_{i}\right] ^{T}\) and \( L_{b}\left( \cdot \right) =L\left( \cdot /b\right) /b^{k_{2}+1}\) is a product kernel function with bandwidth \(b\left( n\right) :=b\).
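For concreteness, the nonparametric estimator above is a kernel-weighted average of the observation indicators \(\delta _{j}\). The following minimal sketch implements it under illustrative choices that the paper does not prescribe: a Gaussian product kernel for L and a user-supplied vector of weights \(\omega \left( X_{2j}\right) \).

```python
import numpy as np

def gauss(t):
    # one-dimensional Gaussian kernel
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def pi_hat_nonparam(V, delta, omega, b):
    """Kernel estimator of the selection probability pi(V_i), where
    V_i = [X_{2i}^T, U_i]^T, delta_i is the observation indicator and
    omega_j = omega(X_{2j}) downweights high-leverage covariates."""
    n, d = V.shape
    diffs = (V[None, :, :] - V[:, None, :]) / b       # (i, j, coordinate)
    Lb = gauss(diffs).prod(axis=2) / b ** d           # product kernel L_b(V_j - V_i)
    w = Lb * omega[None, :]                           # kernel times omega weights
    return (w * delta[None, :]).sum(axis=1) / w.sum(axis=1)
```

Because each \(\widehat{\pi }\left( V_{i}\right) \) is a positively weighted average of the binary \(\delta _{j}\), it automatically lies in [0, 1]; in practice it is also trimmed away from zero so that the inverse weights \(\delta _{i}/\widehat{\pi }_{i}\) remain stable.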

The IPW-based robust local M estimator for \(\alpha _{0}\left( \cdot \right) \) evaluated at \(U=u\) is \(\widehat{a}_{\widehat{\pi }}^{M}=\left[ \widehat{a}_{1 \widehat{\pi }}^{M}{}^{T},\widehat{a}_{2\widehat{\pi }}^{M}{}^{T}\right] ^{T} \), where

$$\begin{aligned} \widehat{a}_{\widehat{\pi }}^{M}= & {} \underset{a_{1},a_{2}\in \mathbb {R}^{k}}{ \arg \min }\sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }\left( X_{2i},U_{i}\right) }\zeta \left( Y_{i},\eta \left( X_{i},aW_{i}\right) \right) \nonumber \\&\quad \omega \left( X_{i}\right) K_{h}\left( U_{i}-u\right) , \end{aligned}$$
(3)

and \(K_{h}\left( \cdot \right) =K\left( \cdot /h\right) /h\) is a kernel function with bandwidth \(h\left( n\right) :=h\). Alternatively, \(\widehat{a}_{ \widehat{\pi }}^{M}\) can be defined as the solution to the first-order conditions:

$$\begin{aligned} \sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }\left( X_{2i},U_{i}\right) } \frac{\partial \zeta \left( Y_{i},\eta \left( X_{i},\widehat{a}_{\widehat{ \pi }}^{M}W_{i}\right) \right) }{\partial a}\omega \left( X_{i}\right) K_{h}\left( U_{i}-u\right) =0. \end{aligned}$$
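To make the objective in (3) concrete, the sketch below computes the IPW-based robust local M estimate at a point u for the simple linear link \(\eta \left( X,aW\right) =X^{T}\left( aW\right) \), using a Huber loss solved by iteratively reweighted least squares. The link, the loss, the Gaussian kernel and the IRLS solver are all illustrative choices, not part of the paper's formal setup.

```python
import numpy as np

def huber_w(r, c=1.345):
    # IRLS weight psi(r)/r for the Huber loss
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def local_M_ipw(Y, X, U, delta, pi_hat, omega, u, h, n_iter=50):
    """Minimize (3) at the point u for the linear link eta(X, aW) = X^T(aW).
    Returns (a1, a2) with a1 estimating alpha_0(u)."""
    n, k = X.shape
    W = np.column_stack([np.ones(n), (U - u) / h])        # W_i = [1, (U_i - u)/h]
    Z = np.einsum('ni,nj->nij', W, X).reshape(n, 2 * k)   # regressors for (a1, a2)
    Kh = np.exp(-0.5 * ((U - u) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    c = (delta / pi_hat) * omega * Kh                     # IPW * omega * kernel weights
    Y0 = np.where(delta == 1, Y, 0.0)                     # missing Y carry zero weight anyway
    # weighted least squares pilot
    beta = np.linalg.lstsq(Z * np.sqrt(c)[:, None], Y0 * np.sqrt(c), rcond=None)[0]
    for _ in range(n_iter):
        r = Y0 - Z @ beta
        s = 1.4826 * np.median(np.abs(r[delta == 1])) + 1e-8  # robust residual scale
        w = c * huber_w(r / s)
        beta = np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (w * Y0))
    return beta[:k], beta[k:]
```

The returned a1 estimates \(\alpha _{0}\left( u\right) \) and a2 estimates \(h\,\partial \alpha _{0}\left( u\right) /\partial u\), in line with the local linear approximation above.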

The latter estimator suggests the second class of robust local estimators considered in this paper. Let

$$\begin{aligned} \mu \left( Y,\eta \left( X,\alpha \left( U\right) \right) \right) \omega \left( X\right) \end{aligned}$$
(4)

denote an \(\mathbb {R}^{k}\)-valued vector of robust estimating equations. Estimating equations arise naturally in statistics, with the derivative of a quasi-likelihood (the quasi-score) (Wedderburn 1974) being a prominent example. Other important examples include the generalized estimating equations of Liang and Zeger (1986), the variance function estimating equations of Carroll and Ruppert (1988) and the first-order conditions used in the Gauss–Newton method to solve nonlinear least squares problems. However, such estimating equations are not robust to outliers, hence the use of their robust analog (4). The vector of robust estimating equations \(\mu \left( \cdot \right) \) is such that \(E\left[ \mu _{G_{\omega }}\left( Y,\eta \left( X,\alpha \left( U\right) \right) \right) \omega \left( X\right) \right] =0\) for a unique \(\alpha \left( U\right) =\alpha _{0}\left( U\right) \), where \(\mu _{G_{\omega }}\left( \cdot \right) \) denotes the centered robust estimating equations, with the centering factor \( G_{\omega }\left( \cdot \right) \) used to achieve Fisher consistency; see Sect. 4.1 for an example of \(G_{\omega }\left( \cdot \right) \).

The IPW-based robust local Z estimator \(\widehat{a}{}^{Z}\) for \(\alpha _{0}\left( \cdot \right) \) evaluated at \(U=u\) is defined as the solution to

$$\begin{aligned} \sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }\left( X_{2i},U_{i}\right) } \mu _{G_{\omega }}\left( Y_{i},\eta \left( X_{i},\widehat{a}{} ^{Z}W_{i}\right) \right) \eta ^{\prime }\left( X_{i},\widehat{a}{} ^{Z}W_{i}\right) \omega \left( X_{i}\right) K_{h}\left( U_{i}-u\right) =0, \nonumber \\ \end{aligned}$$
(5)

where \(\eta ^{\prime }\left( \cdot \right) =\partial \eta \left( \cdot \right) /\partial a.\)

3 Asymptotic results

3.1 Estimation

Let \(\mathcal {U}\) denote the support of U and, for simplicity of notation, let \(\zeta \left( \cdot \right) \omega \left( \cdot \right) :=\zeta _{\omega }\left( \cdot \right) \), \(\mu _{G_{\omega }}\left( \cdot \right) \omega \left( \cdot \right) :=\mu _{\omega }\left( \cdot \right) \), \(H=diag\left[ 1,h\right] \otimes I_{k}\),

$$\begin{aligned}&\frac{\partial \zeta _{\omega }\left( Y,\eta \left( X,aW\right) \right) }{ \partial a}=\frac{\partial \zeta _{\omega }\left( Y,\eta \left( X,aW\right) \right) }{\partial \eta }\eta ^{\prime }\left( X,aW\right) :=\zeta _{\omega 1}\left( Y,\eta \left( X,aW\right) \right) , \\&\frac{\partial ^{2}\zeta _{\omega }\left( Y,\eta \left( X,aW\right) \right) }{\left( \partial a\right) ^{\otimes 2}}=\frac{\partial ^{2}\zeta _{\omega }\left( Y,\eta \left( X,aW\right) \right) }{\partial \eta ^{2}}\eta ^{\prime }\left( X,aW\right) ^{\otimes 2} \\&\quad +\, \frac{\partial \zeta _{\omega }\left( Y,\eta \left( X,aW\right) \right) }{ \partial \eta }\eta ^{\prime \prime }\left( X,aW\right) :=\zeta _{\omega 2}\left( Y,\eta \left( X,aW\right) \right) , \\&\frac{\partial \left( \mu _{\omega }\left( Y,\eta \left( X,aW\right) \right) \eta ^{\prime }\left( X,aW\right) \right) }{\partial a}=\frac{ \partial \mu _{\omega }\left( Y,\eta \left( X,aW\right) \right) }{\partial \eta }\eta ^{\prime }\left( X,aW\right) ^{\otimes 2} \\&\quad +\, \mu _{\omega }\left( Y,\eta \left( X,aW\right) \right) \eta ^{\prime \prime }\left( X,aW\right) :=\mu _{\omega 1}\left( Y,\eta \left( X,aW\right) \right) , \end{aligned}$$

where \(\eta ^{\prime \prime }\left( \cdot \right) =\partial ^{2}\eta \left( \cdot \right) /\left( \partial a\right) ^{\otimes 2}.\)

Let \(v_{1k}=\int u^{k}K\left( u\right) \mathrm{d}u\), \(v_{2k}=\int u^{k}K^{2}\left( u\right) \mathrm{d}u\),

$$\begin{aligned} \Gamma _{\zeta 0}\left( u\right)= & {} E\left[ \zeta _{\omega 2}\left( Y,\eta \left( X,\alpha _{0}\left( U\right) \right) \right) |U=u\right] , \\ \Sigma _{\pi \zeta 0}\left( u\right)= & {} E\left[ \frac{\zeta _{\omega 1}\left( Y,\eta \left( X,\alpha _{0}\left( U\right) \right) \right) ^{\otimes 2}}{\pi \left( X_{2},U\right) }|U=u\right] , \\ \Gamma _{\mu 0}\left( u\right)= & {} E\left[ \mu _{\omega 1}\left( Y,\eta \left( X,\alpha _{0}\left( U\right) \right) \right) |U=u\right] , \\ \Sigma _{\pi \mu 0}\left( u\right)= & {} E\left[ \frac{\left( \mu _{\omega }\left( Y,\eta \left( X,\alpha _{0}\left( U\right) \right) \right) \eta ^{\prime }\left( X,\alpha _{0}\left( U\right) \right) \right) ^{\otimes 2}}{\pi \left( X_{2},U\right) }|U=u\right] , \end{aligned}$$
$$\begin{aligned} \Gamma _{\times 0}^{v_{1}}\left( u\right)= & {} \left[ \begin{array}{cc} 1 &{} v_{11} \\ v_{11} &{} v_{12} \end{array} \right] \otimes \Gamma _{\times 0}\left( u\right) \text { for }\times =\zeta \text { or }\mu , \\ \Sigma _{\pi \times 0}^{v_{2}}\left( u\right)= & {} \left[ \begin{array}{cc} v_{20} &{} v_{21} \\ v_{21} &{} v_{22} \end{array} \right] \otimes \Sigma _{\pi \times 0}\left( u\right) \text { for }\times =\zeta \text { or }\mu . \end{aligned}$$
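The constants \(v_{1k}\) and \(v_{2k}\) are ordinary kernel moments and are easy to evaluate numerically. For instance, for the Epanechnikov kernel \(K\left( u\right) =0.75\left( 1-u^{2}\right) \) on \(\left[ -1,1\right] \) one has \(v_{11}=0\), \(v_{12}=1/5\) and \(v_{20}=3/5\); a short numerical check (an illustration only, since any kernel satisfying the assumptions may be used) confirms this:

```python
import numpy as np

def kernel_moment(K, j, squared=False, m=200001):
    """Numerically evaluate v_{1j} = int u^j K(u) du, or v_{2j} with K squared,
    on a fine grid over the kernel's support [-1, 1]."""
    u = np.linspace(-1.0, 1.0, m)
    f = K(u) ** 2 if squared else K(u)
    return np.sum(u ** j * f) * (u[1] - u[0])

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

v11 = kernel_moment(epanechnikov, 1)                  # 0 by symmetry
v12 = kernel_moment(epanechnikov, 2)                  # 1/5
v20 = kernel_moment(epanechnikov, 0, squared=True)    # 3/5
```

For a symmetric kernel all odd moments \(v_{11},v_{13},v_{21}\) vanish, which is what produces the simplified bias and variance expressions in the second parts of Theorems 1–4.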

Theorem 1

Under assumptions A1–A6 in the supplemental appendix

$$\begin{aligned}&\left( nh\right) ^{1/2}\left( H\left( \widehat{a}{}_{\widehat{\pi }}^{M}-a_{0}\right) -\frac{h^{2}f\left( u\right) }{2\left( v_{12}-v_{11}^{2}\right) }\left[ \begin{array}{c} \left( v_{12}^{2}-v_{11}v_{13}\right) \\ \left( v_{13}-v_{11}v_{12}\right) \end{array} \right] \otimes \frac{\partial ^{2}\alpha _{0}\left( u\right) }{\partial u^{2}}\right) \\&\quad \overset{d}{\rightarrow } N\left( 0,\Gamma _{\zeta 0}^{v_{1}}\left( u\right) ^{-1}\frac{\Sigma _{\pi \zeta 0}^{v_{2}}\left( u\right) }{f\left( u\right) }\Gamma _{\zeta 0}^{v_{1}}\left( u\right) ^{-1}\right) . \end{aligned}$$

Furthermore, if \(K\left( \cdot \right) \) is symmetric

$$\begin{aligned}&\left( nh\right) ^{1/2}\left( \widehat{a}{}_{1\widehat{\pi }}^{M}-\alpha _{0}\left( u\right) -\frac{h^{2}f\left( u\right) v_{12}}{2}\frac{\partial ^{2}\alpha _{0}\left( u\right) }{\partial u^{2}}\right) \\&\qquad \overset{d}{ \rightarrow }N\left( 0,\frac{\Gamma _{\zeta 0}\left( u\right) ^{-1}v_{20}\Sigma _{\pi \zeta 0}\left( u\right) \Gamma _{\zeta 0}\left( u\right) ^{-1}}{f\left( u\right) }\right) . \end{aligned}$$

To reduce the computational cost of \(\widehat{a}{}_{\widehat{\pi }}^{M}\), a one-step version of it is proposed. This procedure, which is effectively one iteration of the Newton–Raphson method, is appealing when the minimization of the loss function is difficult (or very time-consuming) to carry out to the desired degree of accuracy. If the initial estimator \(\widehat{a}{}_{\widehat{\pi }}^{M}=\left[ \widehat{a}_{1 \widehat{\pi }}^{M}{}^{T},\widehat{a}_{2\widehat{\pi }}^{M}{}^{T}\right] ^{T} \) is close enough to \(\alpha _{0}\left( U\right) \) for \(U \approx u\) (see Assumption A7 in the supplemental appendix), then the estimator obtained from one iteration has the same asymptotic distribution as the exact minimizer of the loss function. To be specific, the one-step IPW-based robust local M estimator has the form

$$\begin{aligned} \left[ \begin{array}{c} \widetilde{a}_{1\widehat{\pi }}^{M} \\ \widetilde{a}_{2\widehat{\pi }}^{M} \end{array} \right]= & {} \left[ \begin{array}{c} \widehat{a}_{1\widehat{\pi }}^{M} \\ \widehat{a}{}_{2\widehat{\pi }}^{M} \end{array} \right] -\left[ \sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }_{i}}\zeta _{\omega 2}\left( Y_{i},\eta \left( X_{i},\widehat{a}{}_{\widehat{\pi }}^{M}W_{i}\right) \right) K_{h}\left( U_{i}-u\right) \right] ^{-1} \nonumber \\&\quad \times \, \sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }_{i}}\zeta _{\omega 1}\left( Y_{i},\eta \left( X_{i},\widehat{a}{}_{\widehat{\pi }}^{M}W_{i}\right) \right) K_{h}\left( U_{i}-u\right) . \end{aligned}$$
(6)
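A one-step update of this kind is inexpensive: starting from a pilot value, it solves a single weighted linear system instead of iterating the minimization to convergence. The sketch below uses the same illustrative setting as before (linear link \(\eta \left( X,aW\right) =X^{T}\left( aW\right) \), Huber loss with unit residual scale, Gaussian kernel); these choices, and the weighted least squares pilot in the example, are assumptions made for the sketch only.

```python
import numpy as np

HUBER_C = 1.345

def one_step_M(Y, X, U, delta, pi_hat, omega, u, h, beta0):
    """One Newton-Raphson step from the pilot beta0 = [a1; a2],
    for the linear link eta(X, aW) = X^T(aW) and the Huber loss."""
    n, k = X.shape
    W = np.column_stack([np.ones(n), (U - u) / h])
    Z = np.einsum('ni,nj->nij', W, X).reshape(n, 2 * k)
    Kh = np.exp(-0.5 * ((U - u) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    c = (delta / pi_hat) * omega * Kh
    r = np.where(delta == 1, Y, 0.0) - Z @ beta0       # unit residual scale assumed
    psi = np.clip(r, -HUBER_C, HUBER_C)                # first derivative of Huber loss
    dpsi = (np.abs(r) <= HUBER_C).astype(float)        # second derivative of Huber loss
    grad = -(Z * (c * psi)[:, None]).sum(axis=0)       # gradient of the local loss
    hess = Z.T @ (Z * (c * dpsi)[:, None])             # Hessian of the local loss
    return beta0 - np.linalg.solve(hess, grad)
```

In practice the pilot could itself be a robust estimate computed on a coarse grid of u values and interpolated, which is what makes the one-step version computationally attractive.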

Theorem 2

Under the same assumptions of Theorem 1 and A7 in the supplemental appendix, the IPW-based one-step robust local M estimator given in (6) has the same asymptotic distribution as that given in Theorem 1. In particular, if \(K\left( \cdot \right) \) is symmetric

$$\begin{aligned}&\left( nh\right) ^{1/2}\left( \widetilde{a}{}_{1\widehat{\pi }}^{M}-\alpha _{0}\left( u\right) - \frac{h^{2}f\left( u\right) v_{12}}{2}\frac{\partial ^{2}\alpha _{0}\left( u\right) }{\partial u^{2}}\right) \\&\qquad \overset{d}{\rightarrow }N\left( 0,\frac{ \Gamma _{\zeta 0}\left( u\right) ^{-1}v_{20}\Sigma _{\pi \zeta 0}\left( u\right) \Gamma _{\zeta 0}\left( u\right) ^{-1}}{f\left( u\right) }\right) . \end{aligned}$$

The next theorem establishes the asymptotic normality of the IPW-based robust local Z estimator defined in (5).

Theorem 3

Under assumptions A1–A3, A4–A6 in the supplemental appendix

$$\begin{aligned}&\left( nh\right) ^{1/2}\left( H\left( \widehat{a}{}_{\widehat{\pi }}^{Z}-a_{0}\right) -\frac{h^{2}f\left( u\right) }{2\left( v_{12}-v_{11}^{2}\right) }\left[ \begin{array}{c} \left( v_{12}^{2}-v_{11}v_{13}\right) \\ \left( v_{13}-v_{11}v_{12}\right) \end{array} \right] \otimes \frac{\partial ^{2}\alpha _{0}\left( u\right) }{\partial u^{2}}\right) \\&\quad \overset{d}{\rightarrow } N\left( 0,\Gamma _{\mu 0}^{v_{1}}\left( u\right) ^{-1}\frac{\Sigma _{\pi \mu 0}^{v_{2}}\left( u\right) }{f\left( u\right) }\Gamma _{\mu 0}^{v_{1}}\left( u\right) ^{-1}\right) . \end{aligned}$$

Furthermore, if \(K\left( \cdot \right) \) is symmetric

$$\begin{aligned}&\left( nh\right) ^{1/2}\left( \widehat{a}{}_{1\widehat{\pi }}^{Z}-\alpha _{0}\left( u\right) -\frac{h^{2}f\left( u\right) v_{12}}{2}\frac{\partial ^{2}\alpha _{0}\left( u\right) }{\partial u^{2}}\right) \\&\qquad \overset{d}{ \rightarrow }N\left( 0,\frac{\Gamma _{\mu 0}\left( u\right) ^{-1}v_{20}\Sigma _{\pi \mu 0}\left( u\right) \Gamma _{\mu 0}\left( u\right) ^{-1}}{f\left( u\right) }\right) . \end{aligned}$$

As with the previous class of M estimators, it is possible to consider a one-step version of the IPW-based robust local Z estimator, which has the form

$$\begin{aligned} \left[ \begin{array}{c} \widetilde{a}{}_{1\widehat{\pi }}^{Z} \\ \widetilde{a}{}_{2\widehat{\pi }}^{Z} \end{array} \right]= & {} \left[ \begin{array}{c} \widehat{a}_{1\widehat{\pi }}^{Z} \\ \widehat{a}_{2\widehat{\pi }}^{Z} \end{array} \right] -\left[ \sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }\left( X_{2i},U_{i}\right) }\mu _{\omega 1}\left( Y_{i},\eta \left( X_{i},\widehat{a}{}{}^{Z}W_{i}\right) \right) K_{h}\left( U_{i}-u\right) \right] ^{-1} \nonumber \\&\quad \times \, \sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }\left( X_{2i},U_{i}\right) }\mu _{\omega }\left( Y_{i},\eta \left( X_{i},\widehat{a}{}^{Z}W_{i}\right) \right) \eta ^{\prime }\left( X_{i},\widehat{a}{}^{Z}W_{i}\right) K_{h}\left( U_{i}-u\right) . \end{aligned}$$
(7)

Theorem 4

Under the same assumptions of Theorem 3 and A7 (with \( \widehat{a}{}{}_{\widehat{\pi }}^{Z}\) replacing \(\widehat{a}{}{}_{\widehat{ \pi }}^{M}\)), the IPW-based one-step version of the robust local Z estimator given in (7) has the same asymptotic distribution as that given in Theorem 3. In particular, if \(K\left( \cdot \right) \) is symmetric

$$\begin{aligned}&\left( nh\right) ^{1/2}\left( \widetilde{a}_{1\widehat{\pi }}^{Z}-\alpha _{0}\left( u\right) -\frac{h^{2}f\left( u\right) v_{12}}{2}\frac{\partial ^{2}\alpha _{0}\left( u\right) }{\partial u^{2}}\right) \\&\qquad \overset{d}{ \rightarrow }N\left( 0,\frac{\Gamma _{\mu 0}\left( u\right) ^{-1}v_{20}\Sigma _{\pi \mu 0}\left( u\right) \Gamma _{\mu 0}\left( u\right) ^{-1}}{f\left( u\right) }\right) . \end{aligned}$$

Remark 1

In the case where all the variables are observable, the resulting robust local M and Z estimators have the same asymptotic distributions as those given in Theorems 1–4 without the selection probability \(\pi \left( X_{2},U\right) \).

Remark 2

In the case where all the X covariates are MAR, that is if the selection probability is \(\Pr \left( \delta =1|Y,X,U\right) =\Pr \left( \delta =1|U\right) :=\pi \left( U\right) \), the resulting IPW-based robust local M and Z estimators have the same asymptotic distributions as those given in Theorems 1–4 with \( \Sigma _{\pi \times 0}\left( u\right) =\Sigma _{\times 0}\left( u\right) /\pi \left( u\right) \) and \(\times =\zeta \) or \(\mu \).

Remark 3

The robust IPW estimation method of this paper can also be used to construct estimators for the unknown marginal mean \(\mu _{0}\) of the response Y. We first consider the case where only the responses \(Y_{i}\) are subject to missingness, that is, the selection probability is

$$\begin{aligned} \Pr \left( \delta ^{Y}=1|Y,X,U\right) =\Pr \left( \delta ^{Y}=1|X,U\right) :=\pi \left( X,U\right) >0\quad a.s. \end{aligned}$$
(8)

We consider two estimators: the first one is

$$\begin{aligned} \widehat{\mu }_{IPW}=\frac{1}{n}\sum _{i=1}^{n}\frac{\delta _{i}^{Y}Y_{i}}{ \widehat{\pi }\left( X_{i},U_{i}\right) }, \end{aligned}$$

whereas the second one is based on the assumption that \(E\left( Y|X,U\right) =\eta \left( X,\alpha _{0}\left( U\right) \right) \) and is

$$\begin{aligned} \widehat{\mu }_{DR}=\frac{1}{n}\sum _{i=1}^{n}\frac{\delta _{i}^{Y}Y_{i}}{ \widehat{\pi }\left( X_{i},U_{i}\right) }+\frac{1}{n}\sum _{i=1}^{n}\left( 1- \frac{\delta _{i}^{Y}}{\widehat{\pi }\left( X_{i},U_{i}\right) }\right) \eta \left( X_{i},\widehat{\alpha }\left( U_{i}\right) \right) , \end{aligned}$$

where \(\widehat{\pi }\left( X_{i},U_{i}\right) \) is either the robust parametric (for example, logit) or the nonparametric estimator of the selection probability (8) and \(\widehat{\alpha } \left( \cdot \right) \) can be either a robust M or Z estimator as discussed in Sect. 2. The first estimator is the standard IPW sample mean dating back to Horvitz and Thompson (1952); it is fully nonparametric in the sense that it does not use the additional information that the response Y is related to the covariates X and U through the function \(\eta \left( \cdot \right) \), and it is robust to the presence of outliers in the covariates X, but it is not robust to a possibly misspecified parametric model \(\pi \left( X_{i},U_{i},\gamma _{0}\right) \) for the selection probability (8). The second estimator is an imputation-type estimator, in the same spirit as the doubly robust estimators often used in the missing data literature; see, for example, Robins et al. (1994) and Scharfstein et al. (1999). By construction, it is robust to misspecification of either the regression function \(E\left( Y|X,U\right) \) or the parametric model \(\pi \left( X_{i},U_{i},\gamma _{0}\right) \) (i.e., it is doubly robust), but it is sensitive to the presence of outliers in the covariates X, although this sensitivity is mitigated by the fact that \( \widehat{\alpha }\left( \cdot \right) \) is a robust estimator. The following theorem establishes the asymptotic normality of \(n^{1/2}\left( \widehat{\mu } _{\bullet }-\mu _{0}\right) \), where \(\bullet \) is either IPW or DR.
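Both marginal mean estimators are simple sample averages once \(\widehat{\pi }\) and \(\widehat{\alpha }\) are available. A minimal sketch follows; the selection probabilities and the fitted values \(\eta \left( X_{i},\widehat{\alpha }\left( U_{i}\right) \right) \) are taken as inputs, so any of the robust estimators above can be plugged in.

```python
import numpy as np

def mu_ipw(Y, delta_Y, pi_hat):
    """Horvitz-Thompson IPW mean: observed responses reweighted by 1/pi_hat."""
    Y0 = np.where(delta_Y == 1, Y, 0.0)      # missing responses never enter
    return np.mean(delta_Y * Y0 / pi_hat)

def mu_dr(Y, delta_Y, pi_hat, eta_hat):
    """Imputation-type (doubly robust) mean; eta_hat[i] = eta(X_i, alpha_hat(U_i))."""
    Y0 = np.where(delta_Y == 1, Y, 0.0)
    return np.mean(delta_Y * Y0 / pi_hat + (1.0 - delta_Y / pi_hat) * eta_hat)
```

In the second estimator the augmentation term has mean zero whenever either \(\widehat{\pi }\) or \(\widehat{\eta }\) is consistent, which is the source of the double robustness discussed above.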

Theorem 5

Under Assumption A8

$$\begin{aligned}&n^{1/2}\left( \widehat{\mu }_{\bullet }^{j}-\mu _{0}\right) \overset{d}{ \rightarrow }N\left( 0,V_{j}\left( \mu _{0}\right) \right) ,\quad j=NP\text { or }P_{r}, \\&V_{NP}\left( \mu _{0}\right) =E\left[ \frac{\sigma ^{2}\left( X,U\right) }{ \pi \left( X,U\right) }+\left( E\left( Y|X,U\right) -\mu _{0}\right) ^{2} \right] , \\&V_{P_{r}}\left( \mu _{0}\right) =V_{NP}\left( \mu _{0}\right) +V_{1P_{r}}\left( \gamma _{0}\right) -2V_{2P_{r}}\left( \gamma _{0}\right) , \end{aligned}$$

where

$$\begin{aligned} V_{1P_{r}}\left( \gamma _{0}\right)= & {} E\left[ \frac{E\left( Y|X,U\right) }{ \pi \left( X,U,\gamma _{0}\right) }\frac{\partial \pi \left( X,U,\gamma _{0}\right) }{\partial \gamma ^{T}}\right] \Omega \left( \gamma _{0}\right) \\&E \left[ \frac{E\left( Y|X,U\right) }{\pi \left( X,U,\gamma _{0}\right) } \frac{\partial \pi \left( X,U,\gamma _{0}\right) }{\partial \gamma }\right] , \\ V_{2P_{r}}\left( \gamma _{0}\right)= & {} E\left( E\left( Y|X,U\right) E\left( \frac{E\left( Y|X,U\right) }{\pi }\frac{\partial \pi }{\partial \gamma ^{T}} \right) M^{-1}\left( \gamma _{0}\right) r_{\omega }\left( \gamma _{0}\right) \right) , \\ \Omega \left( \gamma _{0}\right)= & {} M^{-1}\left( \gamma _{0}\right) Var\left( r_{\omega }\left( \gamma _{0}\right) \right) M^{-1}\left( \gamma _{0}\right) . \end{aligned}$$

Theorem 5 shows that the two proposed estimators are asymptotically equivalent. The asymptotic variance \(V_{NP}\left( \mu _{0}\right) \) corresponds to the semiparametric efficiency bound of Hahn (1998); the asymptotic variance \(V_{P_{r}}\left( \mu _{0}\right) \) will be smaller than \(V_{NP}\left( \mu _{0}\right) \) if \(V_{1P_{r}}\left( \gamma _{0}\right) \le V_{2P_{r}}\left( \gamma _{0}\right) \). To see this, note that if the selection probability (8) were estimated by an ordinary parametric logit, the resulting asymptotic variance would be \( V_{P}\left( \mu _{0}\right) =V_{NP}\left( \mu _{0}\right) -V_{1P}\left( \mu _{0}\right) \), where

$$\begin{aligned} V_{1P}\left( \mu _{0}\right)= & {} E\left[ \frac{E\left( Y|X,U\right) }{\pi \left( X,U,\gamma _{0}\right) }\frac{\partial \pi \left( X,U,\gamma _{0}\right) }{\partial \gamma ^{T}}\right] I\left( \gamma _{0}\right) ^{-1}\\&E \left[ \frac{E\left( Y|X,U\right) }{\pi \left( X,U,\gamma _{0}\right) } \frac{\partial \pi \left( X,U,\gamma _{0}\right) }{\partial \gamma }\right] , \end{aligned}$$

and \(I\left( \gamma _{0}\right) \) is the information matrix for a logit estimator, which implies that \(\widehat{\mu }_{\bullet }^{P}\) would be more efficient than \(\widehat{\mu }_{\bullet }^{NP}\). Thus, the closer (numerically) the influence function \(M^{-1}\left( \gamma _{0}\right) r_{\omega }\left( \gamma _{0}\right) \) of the Bianco and Yohai (1996) estimator \(\widehat{\gamma }\) is to that of the ordinary logit estimator, the more likely \(\widehat{\mu }_{\bullet }^{P_{r}}\) will be more efficient than \(\widehat{\mu }_{\bullet }^{NP}.\)

We conclude this remark by briefly discussing the case where some of the covariates \(X_{i}\), say \(X_{1i}\), are also MAR. In this case, the IPW estimator can still be used, as it relies only on the always-observable covariates \(X_{2}\) and U; on the other hand, the imputation estimator becomes infeasible because of its dependence on the missing \(X_{1}\). To obtain a feasible imputation estimator, additional assumptions would be required, such as specifying the joint distribution of \(X_{1}\) and \(X_{2}\), or the existence of an additional set of covariates, say Z, that are related (parametrically or nonparametrically) to \(X_{1}\).

3.2 Inference

The results of the previous section can be used to construct Wald statistics to test local statistical hypotheses about \(\alpha \left( \cdot \right) \). To investigate the asymptotic properties of such statistics, we consider the following local hypothesis with a Pitman drift

$$\begin{aligned} H_{n}:R\alpha \left( u\right) =r_{0}\left( u\right) +\gamma _{n}\left( u\right) , \end{aligned}$$
(9)

where R is an \(l\times k\) matrix of constants and \(\gamma _{n}\left( \cdot \right) \) is a bounded continuous function that may depend on n. Let \(S= \left[ I_{k},O_{k}\right] \) denote a selection matrix, where \(O_{k}\) is a \( k\times k\) matrix of zeros, and let

$$\begin{aligned} W\left( u\right)= & {} \left( nh\right) \left( R\widetilde{\widehat{a}}{}_{1\widehat{\pi }}^{*}-r_{0}\left( u\right) \right) ^{T}\left( RS \widehat{\Gamma }_{\times }^{v_{1}}\left( u\right) ^{-1}\widehat{\Sigma }_{ \widehat{\pi }\times }^{v_{2}}\left( u\right) \widehat{\Gamma }_{\times }^{v_{1}}\left( u\right) ^{-1}S^{T}R^{T}\right) ^{-1} \\&\quad \times \, \left( R\widetilde{\widehat{a}}{}_{1\widehat{\pi }}^{*}-r_{0}\left( u\right) \right) \text { for }\times =\zeta \text { or }\mu , \end{aligned}$$

denote the Wald statistic, where, for \(*=M\) or Z, \(\widetilde{\widehat{a}}{}_{1\widehat{\pi }}^{*}{}\) can be either the IPW-based robust local M or Z estimator \(\widehat{a}{}_{1\widehat{\pi }}^{*}\) or its one-step version \(\widetilde{a}{}_{1\widehat{\pi }}^{*}\), \(\widehat{\Gamma } _{\times }^{v_{1}}\left( \cdot \right) \) and \(\widehat{\Sigma }_{\widehat{ \pi }\times }^{v_{2}}\left( \cdot \right) \) are consistent estimators of \(\Gamma _{\times 0}^{v_{1}}\left( \cdot \right) \) and \(\Sigma _{\pi \times 0}^{v_{2}}\left( \cdot \right) \), and \(\widehat{\pi }\left( \cdot \right) \) is either the parametric or the nonparametric estimator of \(\pi \left( \cdot \right) \).
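In code, \(W\left( u\right) \) is just a quadratic form. The sketch below is a generic helper (the function name and arguments are ours, not from the paper) that evaluates \(nh\,d^{T}V^{-1}d\) for a discrepancy vector \(d=R\widehat{a}-r_{0}\left( u\right) \) and an estimated covariance matrix V:

```python
import numpy as np

def wald_statistic(nh, d, V):
    """Generic Wald quadratic form W = nh * d' V^{-1} d.

    nh : scaling n*h from the local estimator's convergence rate
    d  : discrepancy vector R a_hat(u) - r0(u), shape (l,)
    V  : estimated covariance matrix of the restriction, shape (l, l)
    """
    d = np.asarray(d, dtype=float)
    # Solve V x = d instead of forming V^{-1} explicitly (better conditioned)
    return float(nh * d @ np.linalg.solve(V, d))
```

Under the null, comparing the returned value with a \(\chi ^{2}\left( l\right) \) critical value gives the pointwise test.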

Proposition 1

Under the assumptions of Theorems 1–4, if \( \mathrm{rank}\left( R\right) =l\) \(\left( l\le k\right) \) and \(nh^{5}\rightarrow 0\), then under (9) (i) for \(\left( nh\right) ^{1/2}\gamma _{n}\left( u\right) \rightarrow \gamma \left( u\right) >0\) (for some \(\left\| \gamma \left( u\right) \right\| <\infty \))

$$\begin{aligned} W\left( u\right) \overset{d}{\rightarrow }\chi ^{2}\left( \kappa ,l\right) , \end{aligned}$$

where \(\chi ^{2}\left( \kappa ,l\right) \) is a noncentral Chi-squared distribution with l degrees of freedom and noncentrality parameter

$$\begin{aligned} \kappa =f\left( u\right) \gamma \left( u\right) ^{T}\left( RS\Gamma _{\times 0}^{v_{1}}\left( u\right) ^{-1}\Sigma _{\pi \times 0}^{v_{2}}\left( u\right) \Gamma _{\times 0}^{v_{1}}\left( u\right) ^{-1}S^{T}R^{T}\right) ^{-1}\gamma \left( u\right) \text { }\left( \times =\zeta \text { or }\mu \right) ; \end{aligned}$$

(ii) for \(\left( nh\right) ^{1/2}\gamma _{n}\left( u\right) \rightarrow \infty \),

$$\begin{aligned} W\left( u\right) \overset{p}{\rightarrow }\infty . \end{aligned}$$

Proposition 1 shows that with undersmoothing the proposed test has power against local Pitman-type alternatives and is consistent against any fixed alternative of the form \(\gamma _{n}\left( \cdot \right) =\gamma \left( \cdot \right) .\) Under the null hypothesis \(H_{0}:\) \(R\alpha \left( u\right) =r_{0}\left( u\right) \), the proposition can also be used to construct confidence regions for \(R\alpha \left( u\right) \) with nominal coverage \( 1-\alpha \): with \(c_{\alpha }\) defined by \(\Pr \left( \chi ^{2}\left( l\right) \le c_{\alpha }\right) =1-\alpha \) and \(C_{\alpha }\left( u\right) =\left\{ r_{0}\left( u\right) :W\left( u\right) \le c_{\alpha }\right\} ,\)

$$\begin{aligned} \Pr \left( r_{0}\left( u\right) \in C_{\alpha }\left( u\right) \right) =1-\alpha +o\left( 1\right) . \end{aligned}$$
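As a minimal numerical illustration of this construction (function names are ours; the \(\chi ^{2}\left( l\right) \) quantile is approximated by Monte Carlo so that only NumPy is needed):

```python
import numpy as np

def chi2_quantile(df, alpha=0.05, B=200_000, seed=0):
    # Monte Carlo approximation of c_alpha with Pr(chi2(df) <= c_alpha) = 1 - alpha
    rng = np.random.default_rng(seed)
    return float(np.quantile(rng.chisquare(df, size=B), 1.0 - alpha))

def in_confidence_region(W_u, df, alpha=0.05):
    # r0(u) belongs to C_alpha(u) iff the Wald statistic does not exceed c_alpha
    return W_u <= chi2_quantile(df, alpha)
```

Sweeping \(r_{0}\left( u\right) \) over a grid and keeping the values for which `in_confidence_region` is true traces out the pointwise confidence region.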

Note that when \(K\left( \cdot \right) \) is symmetric, the Wald statistic \(W\left( u\right) \) simplifies to

$$\begin{aligned} W_{s}\left( u\right)= & {} \left( nh\right) \left( R\widetilde{\widehat{a}}{}_{1 \widehat{\pi }}^{*}\left( u\right) -r_{0}\left( u\right) \right) ^{T}\left( R\widehat{\Gamma }_{\times }\left( u\right) ^{-1}v_{20}\widehat{ \Sigma }_{\widehat{\pi }\times }\left( u\right) \widehat{\Gamma }_{\times }\left( u\right) ^{-1}R^{T}\right) ^{-1} \nonumber \\&\quad \times \, \left( R\widetilde{\widehat{a}}{}_{1\widehat{\pi }}^{*}\left( u\right) -r_{0}\left( u\right) \right) . \end{aligned}$$
(10)

Proposition 1 can also be used to test the important hypothesis of constancy of the varying coefficients \(\alpha \left( \cdot \right) \), corresponding to

$$\begin{aligned} H_{0}:\alpha _{0}\left( u\right) =\alpha _{0}. \end{aligned}$$
(11)

The test can be implemented using the finite-dimensional analog of the IPW-based robust local M and Z estimators defined in (3) and (5), that is

$$\begin{aligned} \widehat{\alpha }_{\widehat{\pi }}^{M}= & {} \arg \min _{\alpha \in A}\sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }\left( X_{2i},U_{i}\right) } \zeta \left( Y_{i},\eta \left( X_{i},\alpha \right) \right) \omega \left( X_{i}\right) , \nonumber \\ 0= & {} \sum _{i=1}^{n}\frac{\delta _{i}}{ \widehat{\pi }\left( X_{2i},U_{i}\right) }\mu \left( Y_{i},\eta \left( X_{i}, \widehat{\alpha }_{\widehat{\pi }}^{Z}\right) \right) \eta ^{\prime }\left( X_{i},\widehat{ \alpha }_{\widehat{\pi }}^{Z}\right) \omega \left( X_{i}\right) , \end{aligned}$$
(12)

where A is a compact set and \(\alpha _{0}\in int\left( A\right) \). Let \( \widehat{\alpha }_{\widehat{\pi }}^{*}\) \(\left( *=M\text { or } Z\right) \) denote either of the estimators defined in (12) and note that under the null hypothesis (11) and the assumption that \(n^{1/2}\left( \widehat{\alpha }_{\widehat{\pi } }^{*}-\alpha _{0}\right) =O_{p}\left( 1\right) ,\)

$$\begin{aligned} \left( nh\right) ^{1/2}\left( \widetilde{\widehat{a}}{}_{1\widehat{\pi } }^{*}-\widehat{\alpha }_{\widehat{\pi }}^{*}\right) =\left( nh\right) ^{1/2}\left( \widetilde{\widehat{a}}{}_{1\widehat{\pi }}^{*}-\alpha _{0}\right) +o_{p}\left( 1\right) , \end{aligned}$$

hence by Proposition 1

$$\begin{aligned} W_{c}\left( u\right)= & {} \left( nh\right) \left( \widetilde{\widehat{a}}{}_{1 \widehat{\pi }}^{*}-\alpha _{0}\right) ^{T}\left( S\widehat{\Gamma } _{\times }^{v_{1}}\left( u\right) ^{-1}\widehat{\Sigma }_{\widehat{\pi } \times }^{v_{2}}\left( u\right) \widehat{\Gamma }_{\times }^{v_{1}}\left( u\right) ^{-1}S^{T}\right) ^{-1} \nonumber \\&\quad \times \, \left( \widetilde{\widehat{a}}{}_{1\widehat{\pi }}^{*}-\alpha _{0}\right) \overset{d}{\rightarrow }\chi ^{2}\left( p\right) . \end{aligned}$$
(13)

It is important to note that the test statistics \(W\left( u\right) \), \( W_{s}\left( u\right) \) and \(W_{c}\left( u\right) \) are asymptotically valid at a single point u. To increase their power, one can consider them over a fixed range of values of u, say \(\left\{ u_{j}\right\} _{j=1}^{s}\), and use instead the test statistics \(\max _{j}W\left( u_{j}\right) \) and \( \max _{j}W_{c}\left( u_{j}\right) \) \(\left( j=1,\ldots ,s\right) \), as the following proposition shows.

Proposition 2

Under the assumptions of Proposition 1, (i) for \(\left( nh\right) ^{1/2}\gamma _{n}\left( u_{j}\right) \rightarrow \gamma \left( u_{j}\right) >0\) (for some \(\left\| \gamma \left( u_{j}\right) \right\| <\infty \))

$$\begin{aligned}&\max _{1\le j\le s}W\left( u_{j}\right) \overset{d}{\rightarrow } \max _{j}\chi _{j}^{2}\left( \kappa _{j},l\right) , \nonumber \\&\max _{1\le j\le s}W_{c}\left( u_{j}\right) \overset{d}{\rightarrow } \max _{j}\chi _{j}^{2}\left( \kappa _{jc},l\right) \end{aligned}$$
(14)

where

$$\begin{aligned} \kappa _{j}= & {} f\left( u_{j}\right) \gamma \left( u_{j}\right) ^{T}\left( RS\Gamma _{\times 0}^{v_{1}}\left( u_{j}\right) ^{-1}\Sigma _{\pi \times 0}^{v_{2}}\left( u_{j}\right) \Gamma _{\times 0}^{v_{1}}\left( u_{j}\right) ^{-1}S^{T}R^{T}\right) ^{-1}\gamma \left( u_{j}\right) , \\ \kappa _{jc}= & {} f\left( u_{j}\right) \gamma \left( u_{j}\right) ^{T}\left( S\Gamma _{\times 0}^{v_{1}}\left( u_{j}\right) ^{-1}\Sigma _{\pi \times 0}^{v_{2}}\left( u_{j}\right) \Gamma _{\times 0}^{v_{1}}\left( u_{j}\right) ^{-1}S^{T}\right) ^{-1}\gamma \left( u_{j}\right) ; \end{aligned}$$

(ii) for \(\left( nh\right) ^{1/2}\gamma _{n}\left( u_{j}\right) \rightarrow \infty \)

$$\begin{aligned} \max _{1\le j\le s}W\left( u_{j}\right) ,\text { }\max _{1\le j\le s}W_{c}\left( u_{j}\right) \overset{p}{\rightarrow }\infty . \end{aligned}$$

Note that the distribution of the test statistics in Proposition 2 is nonstandard, since it involves the maximum of s independent noncentral Chi-squared distributions. However, under the null hypothesis \(R\alpha \left( u\right) =r_{0}\left( u\right) \), the test statistic is asymptotically distribution-free; hence, its distribution can be evaluated numerically or easily simulated.
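For concreteness, the null distribution of the maximum of s independent central \(\chi ^{2}\left( l\right) \) variates can be simulated in a few lines; this is a sketch under our own naming, with s and the degrees of freedom passed in:

```python
import numpy as np

def max_chi2_critical_values(s, df, probs=(0.90, 0.95), B=100_000, seed=0):
    # Simulate the maximum of s independent chi2(df) draws and
    # return the requested upper quantiles as critical values
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(df, size=(B, s)).max(axis=1)
    return np.quantile(draws, probs)
```

With \(s=5\) points, quantiles of this simulated distribution play the role of the critical values used for \(\max _{j}W\left( u_{j}\right) \) in Sect. 4.2.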

Remark 4

As mentioned in the Introduction, a robust distance statistic can also be used to test (9); however, the resulting statistic would not be asymptotically distribution-free, as the following proposition shows for the simple null hypothesis \(H_{0}:\alpha \left( u\right) =\alpha _{0}\left( u\right) \) and \(K\left( \cdot \right) \) symmetric. Let

$$\begin{aligned} D= & {} -2\sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }\left( X_{2i},U_{i}\right) }\left( \zeta \left( Y_{i},\eta \left( X_{i},\widehat{a} _{1\widehat{\pi }}^{M}\right) \right) -\zeta \left( Y_{i},\eta \left( X_{i},\alpha _{0}\left( u\right) \right) \right) \right) \\&\omega \left( X_{i}\right) K_{h}\left( U_{i}-u\right) . \end{aligned}$$

Proposition 3

Under the same assumptions as Proposition 1,

$$\begin{aligned} D\overset{d}{\rightarrow }\sum _{j=1}^{k}\lambda _{j}\chi _{j}^{2}\left( 1\right) , \end{aligned}$$

where \(\lambda _{j}\) are the eigenvalues of \(\Sigma _{\pi \zeta 0}\left( u\right) \Gamma _{\zeta 0}\left( u\right) ^{-1}.\)
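Since the limit in Proposition 3 is a weighted sum of independent \(\chi _{j}^{2}\left( 1\right) \) variates, its critical values are not tabulated, but they are straightforward to simulate once the eigenvalues \(\lambda _{j}\) have been estimated (a sketch with our naming; the \(\lambda _{j}\) could come, e.g., from `np.linalg.eigvals` applied to estimates of the matrices above):

```python
import numpy as np

def weighted_chi2_quantile(lams, q=0.95, B=100_000, seed=0):
    # Monte Carlo quantile of sum_j lam_j * chi2_j(1)
    rng = np.random.default_rng(seed)
    lams = np.asarray(lams, dtype=float)
    draws = rng.chisquare(1, size=(B, lams.size)) @ lams
    return float(np.quantile(draws, q))
```

With a single unit weight the limit reduces to an ordinary \(\chi ^{2}\left( 1\right) \), which provides a quick sanity check of the simulation.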

4 Example and Monte Carlo

4.1 Robust generalized estimating equations (GEE) estimation

This section considers estimation of a varying coefficient GEE model for longitudinal data [see, for example, Liang and Zeger (1986) for GEE estimation with unknown finite-dimensional parameters]. Suppose that

$$\begin{aligned} E\left( Y|X,U\right) =\eta \left( X^{T}\alpha _{0}\left( U\right) \right) , \end{aligned}$$

where Y is an m-dimensional random vector (m is the number of subjects or clusters) and \(\eta \left( X^{T}\alpha \left( U\right) \right) =\left[ \eta \left( X_{1}^{T}\alpha \left( U_{1}\right) \right) ,\ldots ,\eta \left( X_{m}^{T}\alpha \left( U_{m}\right) \right) \right] ^{T}.\) In this example, the robust estimating equations \(\mu _{\omega }\left( \cdot \right) \) are

$$\begin{aligned}&\frac{\delta }{\pi }\mu _{\omega }\left( Y,\eta \left( X,\alpha \left( U\right) \right) \right) \\&\quad =\eta ^{\prime }\left( X^{T}\alpha \left( U\right) \right) ^{T}V\left( \eta ,\beta _{0}\right) ^{-1}\Omega \Pi \left( \delta ,X_{2},U\right) \\&\quad \quad \times \, \left[ \psi \left( A^{-1/2}\left( Y-\eta \left( X^{T}\alpha \left( U\right) \right) \right) \right) -G_{\omega }\left( Y,\eta \left( X^{T}\alpha \left( U\right) \right) \right) \right] , \end{aligned}$$

where \(V\left( \eta ,\beta _{0}\right) =R\left( \beta _{0}\right) A_{0}^{1/2} \), \(R\left( \beta _{0}\right) \) is the working correlation matrix indexed by the q-dimensional unknown parameter \(\beta _{0}\), and \(\psi \left( \cdot \right) \) is a robust function, such as the Huber function defined as

$$\begin{aligned} \psi _{c}\left( t\right) =\left\{ \begin{array}{l} t \qquad \qquad \quad \text { if }\left| t\right| \le c, \\ c\,\mathrm{sign}\left( t\right) \quad \text {if }\left| t\right| >c \end{array} \right. \end{aligned}$$
(15)

with tuning constant c,

$$\begin{aligned} A_{0}= & {} \phi _{0}\mathrm{diag}\left( \mathrm{Var}\left( \eta \left( X_{1}^{T}\alpha _{0}\left( U\right) \right) \right) ,\ldots ,\mathrm{Var}\left( \eta \left( X_{m}^{T}\alpha _{0}\left( U\right) \right) \right) \right) , \\ \Omega= & {} \mathrm{diag}\left( \omega \left( X_{1}\right) ,\ldots ,\omega \left( X_{m}\right) \right) , \\ \Pi \left( \delta ,X_{2},U\right)= & {} \mathrm{diag}\left( \delta _{1}/\pi \left( X_{12},U_{1}\right) ,\ldots ,\delta _{m}/\pi \left( X_{m2},U_{m}\right) \right) , \end{aligned}$$

\(\phi _{0}\) is the unknown dispersion parameter,

$$\begin{aligned} G_{\omega }\left( Y,\eta \left( X^{T}\alpha \left( u\right) \right) \right) =E\left[ \psi \left( A^{-1/2}\left( Y-\eta \left( X^{T}\alpha \left( U\right) \right) \right) \right) \right] \end{aligned}$$

is the correction factor used to achieve Fisher consistency, and a monotone missing data pattern is assumed, that is, \(\delta _{1}\ge \delta _{2}\ge \cdots \ge \delta _{m}\) with \(\delta _{1}=1,\)

$$\begin{aligned} \pi \left( X_{k2},U_{k}\right) =\Pr \left( \delta _{k}=1|X_{k2},U_{k}\right) =\prod \limits _{l=1}^{k}\Pr \left( \delta _{l}=1|\delta _{l-1},X_{l2},U_{l}\right) . \end{aligned}$$
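Two ingredients of this example are directly computable: the Huber function (15) and the monotone-missingness product defining \(\pi \left( X_{k2},U_{k}\right) \). A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def huber_psi(t, c=1.2):
    # psi_c(t) = t if |t| <= c, c*sign(t) otherwise; np.clip does exactly this
    return np.clip(t, -c, c)

def monotone_pi(cond_probs):
    # pi(X_k2, U_k) = prod_{l <= k} Pr(delta_l = 1 | delta_{l-1}, X_l2, U_l),
    # i.e., the cumulative product of the conditional selection probabilities
    return np.cumprod(np.asarray(cond_probs, dtype=float))
```

The clipped form of `huber_psi` makes the boundedness of the resulting influence function immediate.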

Calculations show that

$$\begin{aligned}&\mu _{\pi \omega 1}\left( Y,\eta \left( X,aW\right) \right) \\&\quad = \sum _{j=1}^{2k}\frac{\partial \eta ^{\prime }\left( X^{T}aW\right) ^{T}}{\partial a_{j}}V\left( \eta ,\beta \right) ^{-1}\Omega \Pi \left( \delta ,X_{2},U\right) \\&\quad \quad \times \, \left[ \psi \left( A^{-1/2}\left( Y-\eta \left( X^{T}aW\right) \right) \right) -G_{\omega }\left( Y,\eta \left( X^{T}aW\right) \right) \right] \\&\quad \quad +\, \eta ^{\prime }\left( X^{T}aW\right) ^{T}V\left( \eta ,\beta \right) ^{-1}\Omega \Pi \left( \delta ,X_{2},U\right) \\&\quad \quad \times \, \left[ \frac{\partial }{\partial a}\psi \left( A^{-1/2}\left( Y-\eta \left( X^{T}aW\right) \right) \right) -G_{\omega }^{\prime }\left( Y,\eta \left( X^{T}aW\right) \right) \right] , \\&\Gamma _{\mu 0}\left( u\right) \\&\quad = E\left[ \eta ^{\prime }\left( X^{T}\alpha _{0}\left( U\right) \right) ^{T}V\left( \eta ,\beta _{0}\right) ^{-1}\Omega s_{\alpha }\left( X,U\right) \right. \\&\quad \quad \times \, \left. \frac{\partial }{\partial \alpha }\psi \left( A^{-1/2}\left( Y-\eta \left( X^{T}\alpha _{0}\left( U\right) \right) \right) \right) |U=u\right] , \\&\Sigma _{\pi \mu 0}\left( u\right) \\&\quad = E\left\{ \left[ \eta ^{\prime }\left( X^{T}\alpha _{0}\left( U\right) \right) ^{T}V\left( \eta ,\beta _{0}\right) ^{-1}\Omega \Pi \left( \delta ,X_{2},U\right) \right. \right. \\&\quad \quad \times \, \left. \left. \left( \psi \left( A^{-1/2}\left( Y-\eta \left( X^{T}\alpha _{0}\left( U\right) \right) \right) \right) -G_{\omega }\left( Y,\eta \left( X^{T}\alpha _{0}\left( U\right) \right) \right) \right) \right] ^{\otimes 2}|U=u\right\} , \end{aligned}$$

where \(s_{\alpha }\left( X,U\right) =\partial \log f\left( Y|X,U\right) /\partial \alpha \) and \(f\left( Y|X,U\right) \) is the joint conditional density of the response Y. Consistent estimators for \(\Gamma _{\mu 0}\left( u\right) \) and \(\Sigma _{\pi \mu 0}\left( u\right) \) can be found in the supplemental appendix.

4.2 Monte Carlo results

This section investigates the finite sample performance of the estimator and test statistic \(\max _{j}W\left( u_{j}\right) \) given in Proposition 2 for the GEE model considered in the previous section using a varying coefficient logit regression

$$\begin{aligned} \mathrm{logit}\left( \Pr \left( Y=1|X,U\right) \right) =X_{1}\alpha _{10}\left( U\right) +X_{2}\alpha _{20}\left( U\right) , \end{aligned}$$

where Y is a three-dimensional binary response variable (i.e., \(m=3\) is the number of subjects or clusters), the covariates \(X=\left[ X_{k1},X_{k2} \right] ^{T}\) \(\left( k=1,2,3\right) \) are independently normally distributed with mean zero and unit variance, the three-dimensional covariate U is independent of X and uniformly distributed between 0 and 1, and \( \alpha _{0}\left( U\right) =\left[ \sin \left( \pi U/2\right) ,\cos \left( \pi U\right) \right] ^{T}\). To generate the responses with an exchangeable covariance structure (with correlation coefficient set equal to 0.3), the approach of Parzen (2009) is used, in which a random effect is added to the marginal probability of success

$$\begin{aligned} m_{ki}=\frac{1}{1+\exp \left( -X_{k1i}\alpha _{10}\left( U_{ki}\right) -X_{k2i}\alpha _{20}\left( U_{ki}\right) \right) }\text { }\left( k=1,2,3\right) . \end{aligned}$$
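As a small sketch of the simulation design (function names are ours; the marginal success probability is written in the standard logistic form \(1/\left( 1+e^{-\eta }\right) \)):

```python
import numpy as np

def alpha0(u):
    # true varying coefficients of the simulation design
    return np.array([np.sin(np.pi * u / 2.0), np.cos(np.pi * u)])

def marginal_prob(x1, x2, u):
    # logistic marginal probability of success for one subject
    a1, a2 = alpha0(u)
    return 1.0 / (1.0 + np.exp(-(x1 * a1 + x2 * a2)))
```

At \(u=0.5\) this gives \(\alpha _{10}\approx 0.707\) and \(\alpha _{20}=0\), matching the values used in hypothesis (18).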

The selection probabilities for subjects \(k=2\) and \(k=3\) are specified as

$$\begin{aligned} \Pr \left( \delta _{2}=1|X_{22},U_{2},Y_{1}\right)= & {} \frac{\exp \left( \gamma _{10}+\gamma _{20}X_{22}+\gamma _{30}U_{2}+\gamma _{40}Y_{1}\right) }{ 1+\exp \left( \gamma _{10}+\gamma _{20}X_{22}+\gamma _{30}U_{2}+\gamma _{40}Y_{1}\right) }, \nonumber \\ \Pr \left( \delta _{3}=1|X_{32},U_{3},Y_{2},Y_{1}\right)= & {} \frac{\exp \left( \gamma _{10}+\gamma _{20}X_{32}+\gamma _{30}U_{3}+\gamma _{40}Y_{1}+\gamma _{50}Y_{2}\right) }{1+\exp \left( \gamma _{10}+\gamma _{20}X_{32}+\gamma _{30}U_{3}+\gamma _{40}Y_{1}+\gamma _{50}Y_{2}\right) }\nonumber \\ \end{aligned}$$
(16)

with \(\left[ \gamma _{10},\gamma _{20},\gamma _{30},\gamma _{40},\gamma _{50} \right] ^{T}=\left[ 1,1.5,0.3,1,0.7\right] ^{T}\), which implies that the average percentages of missing observations are about 25% for \(k=2\) and 20% for \(k=3\). We consider three cases: the first (case 0) has neither missing values nor outliers; the second (case C0) has missing values but no outliers; the last (case C3) has both missing values and outliers, generated as

$$\begin{aligned} (C3)\text { }X_{k2j}\sim N\left( 10,1\right) \quad \left( k=1,2,3;\quad j=1,\ldots ,10\right) , \end{aligned}$$

that is, in (C3) the first ten elements of the three covariates \(X_{k2}\) are outliers. The computation of \(\widehat{a}_{\widehat{\pi }}^{Z}\) is carried out under the working independence assumption with \(\phi _{0}=1\), using the Huber function (15) with \(c=1.2\) and the Newton–Raphson algorithm. In the simulations, the Epanechnikov kernel is used, whereas the weight function is

$$\begin{aligned} \omega \left( X\right) =\left( 1+\left\| S^{-1}\left( X-M\right) \right\| ^{2}/2\right) ^{-1/2}, \end{aligned}$$
(17)

where M and S are robust location-scatter estimators, such as the minimum covariance determinant (MCD) estimator; see, for example, He et al. (2005). The MCD estimator is computed using the R routine CovNAMcd, which uses imputation to deal with missing observations. The bandwidth h is selected by a robust cross-validation procedure: in the first step, for a given h, the leave-one-out estimator \(\widehat{t}_{-i}\) solves \(\sum _{j\ne i}\Psi _{\widehat{\pi } c}\left( \widehat{t}_{j},u\right) /n=0\); in the second step, the robust bandwidth is chosen as \(\widehat{h}=\arg \min _{h}\sum _{i=1}^{n}\delta _{i}\zeta _{\omega }\left( \widehat{t}_{-i}\right) /\widehat{\pi }\left( X_{2i},U_{i}\right) ,\) that is, \(\widehat{h}\) is the minimizer of the robust quasi-deviance. Note that the selection probabilities (16) are estimated using only the robust parametric logit estimator.
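The leverage weight (17) is easy to compute once the robust location M and scatter S are available (here they are passed in precomputed, e.g., from an MCD fit; the function name is ours):

```python
import numpy as np

def leverage_weight(x, M, S):
    # omega(x) = (1 + ||S^{-1}(x - M)||^2 / 2)^{-1/2}
    d = np.linalg.solve(S, np.asarray(x, dtype=float) - M)
    return float((1.0 + d @ d / 2.0) ** -0.5)
```

Observations far from the robust center receive weights close to zero, so high leverage covariates are downweighted, while points at the center receive weight one.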

Table 1 reports the mean absolute bias (B)

$$\begin{aligned} B\left( \widehat{a}_{1kj}\right) =\frac{1}{n}\sum _{i=1}^{n}\left| \widehat{a}_{1kj}\left( U_{i}\right) -\alpha _{k0}\left( U_{i}\right) \right| \text { for }k=1,2 \end{aligned}$$

and standard deviation (SD) of four different estimators of \(\alpha _{k0}\left( \cdot \right) \): the robust local complete case estimator \( \widehat{a}_{1kc}\), that is, the estimator based on the sample in which all the missing observations are dropped; the robust local IPW estimator \( \widehat{a}_{1k\widehat{\pi }p}\) with robust logit estimation of \(\pi \left( X_{2i},U_{i}\right) \); and their nonrobust analogs, computed with the Huber function (15) with \(c=\infty \), no correction term \(G_{\omega }\left( \cdot \right) \) and an ordinary logit estimator for \(\pi \left( X_{2i},U_{i}\right) .\)
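The bias measure B above is simply an average of absolute deviations over the design points (a one-line sketch with our naming):

```python
import numpy as np

def mean_abs_bias(a_hat_at_U, a0_at_U):
    # B(a_hat) = (1/n) sum_i |a_hat(U_i) - alpha_k0(U_i)|
    return float(np.mean(np.abs(np.asarray(a_hat_at_U) - np.asarray(a0_at_U))))
```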

Table 1 Bias (B) and standard deviation (SD) for robust and nonrobust estimators in the GEE example

Inference is based on the statistic \(\max _{1\le j\le s}W\left( u_{j}\right) \) (with \(s=5\)) for the hypothesis

$$\begin{aligned} H_{\gamma k}:\left[ \begin{array}{c} \alpha _{1}\left( u_{kj}\right) \\ \alpha _{2}\left( u_{kj}\right) \end{array} \right] =\left[ \begin{array}{c} \alpha _{10}\left( u_{kj}\right) \\ \alpha _{20}\left( u_{kj}\right) \end{array} \right] +\gamma \left[ \begin{array}{c} \widehat{\alpha }_{1\widehat{\pi }}^{Z}-\alpha _{10}\left( u_{kj}\right) \\ \widehat{\alpha }_{2\widehat{\pi }}^{Z}-\alpha _{20}\left( u_{kj}\right) \end{array} \right] \text { }\left( k=1,2,3\right) , \end{aligned}$$
(18)

where \(u_{kj}=\left[ 0.1,0.3,0.5,0.7,0.9\right] \), \(\alpha _{10}\left( u_{kj}\right) =\left[ 0.156,0.453,0.707,0.891,0.987\right] \) and \(\alpha _{20}\left( u_{kj}\right) =\left[ 0.951,0.587,0,0.587,0.951\right] \) \(\left( k=1,2,3\right) \) (that is, there are six parameters), \(\widehat{\alpha }_{ \widehat{\pi }}^{Z}\) is the parametric estimator defined in (12) and \(\gamma \in \mathbb {R}\) indexes the departure from the null hypothesis (which corresponds to \(\gamma =0\)). The upper 10% and 5% critical values of the nonstandard distribution given in Proposition 2 are calculated using \(10^{5}\) simulations and are \(\left[ 11.307,13.452\right] \) for \(n=200\) and \(\left[ 11.077,13.168\right] \) for \(n=500\). Table 2 reports the finite sample size at the 0.10 and 0.05 nominal levels of \(\max _{1\le j\le 5}W\left( u_{kj}\right) \) based on the four estimators and the three cases (0, C0 and C3) used in Table 1.

Table 2 Finite sample size for robust and nonrobust tests in the GEE example

A full evaluation of the finite sample power of \(\max _{1\le j\le 5}W\left( u_{kj}\right) \) under (18) is not feasible, as it would have to be calculated over the hypergrid \(\gamma \times \cdots \times \gamma =\left[ -1,-0.75,\ldots ,0,\ldots ,0.75,1\right] ^{4}\) (6561 evaluation points); therefore, Fig. 1 reports the contour plots at the level 0.45 of the size-adjusted finite sample power curves for the test of \(H_{\gamma 2}\) (that is, only for the second subject or cluster) over the grid \(\gamma \times \gamma =\left[ -1,-0.75,\ldots ,0,\ldots ,0.75,1\right] \times \left[ -1,-0.75,\ldots ,0,\ldots ,0.75,1\right] \) for the C0 and C3 cases. Note that smaller contour plots indicate higher finite sample power.

Fig. 1

Finite sample size-adjusted power for the test statistic \(\max _{1\le j\le 5}W\left( u_{j}\right) \); solid lines indicate \(\max _{1\le j\le 5}W_{\widehat{\pi }p}\left( u_{j}\right) \) and dot-dashed lines indicate \(\max _{1\le j\le 5}W_{co}\left( u_{j}\right) \) for the GEE example

The results of Tables 1 and 2 (together with those of Tables 3–6 in the supplemental appendix) can be summarized as follows. For estimation (Table 1 here, and Tables 3 and 5 in the supplemental appendix): first, without outliers and missing observations (case 0), the nonrobust local estimators perform better than the robust ones in terms of both bias and standard deviation, with the bias and standard deviation up to, respectively, 16% and 8% smaller for the GEE example, 13% and 10% smaller for the Poisson regression example and 7% and 11% smaller for the nonlinear regression example in the supplemental appendix. Note that for both estimators the bias and standard deviation decrease as the sample size increases, in line with the asymptotic results of Sect. 3. Second, without outliers but with missing observations (case C0), the performance of the nonrobust and robust local estimators is fairly similar for the GEE example and the Poisson regression example (see Table 3 in the supplemental appendix), with biases between 1% and 8% smaller and standard deviations between 2% and 6% smaller. Third, when outliers are present (case C3, and cases C1–C2 in the supplemental appendix), the robust estimators clearly outperform the nonrobust ones, with biases up to 60% smaller and standard deviations up to 50% smaller.
Fourth, among the three robust local estimators, those based on inverse probability weighting perform better than those based on the complete case analysis in terms of bias (on average around 18% smaller for the GEE example, around 9% smaller for the Poisson regression example and around 10% smaller for the nonlinear regression example) but not in terms of standard deviation (on average around 2%, 1% and 2%, respectively), which is not surprising because their asymptotic variance is affected by the estimated selection probabilities in the denominator. Note, however, that in terms of the mean-squared error (MSE) the robust local estimators typically have a smaller MSE than the nonrobust estimators, the exceptions being the GEE and Poisson regression examples, both with 200 observations, where the MSEs are, respectively, 0.272 and 0.279, and 0.312 and 0.315 (for the IPW estimator with nonparametric estimation of the selection probabilities). Finally, between the two inverse probability weighting estimators, those based on the parametric estimator of the selection probabilities seem to have an edge over those based on the nonparametric estimator in terms of both bias and standard deviation, which is again not surprising, given that the selection probabilities are estimated using the (robust) maximum likelihood estimator of a correctly specified parametric model. For inference (Table 2 here, and Tables 4 and 6 in the supplemental appendix): first, without outliers and missing observations (case 0) or without outliers but with missing observations (case C0), the tests based on the nonrobust local estimators have a slightly better size (i.e., closer to the nominal level) than the tests based on the robust local estimators (up to 3% smaller for the GEE-based test with 200 observations).
Second, when outliers are present (cases C1–C3), the size distortion of the tests based on the nonrobust local estimators worsens significantly (up to 30% larger size distortions in the GEE case with 200 observations), whereas that of the tests based on the robust local estimators remains similar to that of case C0. Third, among the three test statistics, those based on inverse probability weighting are the most accurate (that is, they have the smallest size distortion), with those based on the parametric estimator of the selection probabilities being the most accurate overall. Finally, Fig. 1 (combined with Figures 3 and 4 in the supplemental appendix) shows that in terms of finite sample power the tests based on the robust local IPW estimators typically have larger power than those based on the complete case estimators, both without outliers (case C0) and with outliers (cases C2 and C3).

5 Empirical application

This section illustrates the applicability of the proposed estimation and testing methods using the New York air quality measurements data (from May to September 1973), available in the R package datasets, which consist of 153 daily observations of mean ozone parts (per billion), solar radiation, wind speed (in mph) and temperature (in degrees F), and contain 37 missing ozone observations and 7 missing solar radiation observations. The same data set was considered by Bianco and Spano (2019), who fitted an exponential growth regression model between the ozone parts and the temperature. Here, a linear varying coefficient specification is considered

$$\begin{aligned} Y=X^{T}\alpha _{0}\left( U\right) +\varepsilon , \end{aligned}$$

where Y represents the ozone parts, \(X=\left[ 1,X_{1},X_{2}\right] ^{T}\) with \(X_{1}\) and \(X_{2}\) representing, respectively, the solar radiation and the temperature, U is the wind speed and \(\varepsilon \) is a standard normal error. The same Huber function with \(c=1.2\) and weight function \(\omega \left( \cdot \right) \) as those given in (15) and (17) are used, with the computation of \(\widehat{\alpha }_{\widehat{\pi }}\left( \cdot \right) \) based on the Newton–Raphson algorithm applied to the local estimating equations

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\frac{\delta _{i}}{\widehat{\pi }\left( X_{2i},U_{i}\right) }\psi _{c}\left( \widehat{\varepsilon }_{i}\right) \left[ \begin{array}{c} X_{i} \\ X_{i}\left( U_{i}-u\right) /h \end{array} \right] \omega \left( X_{i}\right) K_{h}\left( U_{i}-u\right) =0, \end{aligned}$$
(19)

where \(\widehat{\varepsilon }_{i}=Y_{i}-X_{i}^{T}\widehat{a}W_{i}\).
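To make (19) concrete, the sketch below solves the local robust estimating equations at a single point u by iteratively reweighted least squares, a standard substitute for Newton–Raphson with the Huber function. All names are ours; for simplicity \(\omega \left( \cdot \right) \equiv 1\) and the estimated selection probabilities are passed in:

```python
import numpy as np

def local_robust_fit(Y, X, U, delta, pi_hat, u, h, c=1.2, n_iter=100, tol=1e-8):
    """Local linear robust IPW fit at u; returns alpha_hat(u)."""
    n, k = X.shape
    Z = np.hstack([X, X * ((U - u) / h)[:, None]])        # local linear design
    K = 0.75 * np.maximum(0.0, 1.0 - ((U - u) / h) ** 2)  # Epanechnikov kernel
    w0 = delta / pi_hat * K                               # IPW x kernel weights
    a = np.zeros(2 * k)
    for _ in range(n_iter):
        e = Y - Z @ a
        abs_e = np.abs(e)
        # Huber weights psi_c(e)/e: 1 inside [-c, c], c/|e| outside
        wr = np.where(abs_e <= c, 1.0, c / np.maximum(abs_e, 1e-12))
        w = w0 * wr
        a_new = np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (w * Y))
        if np.max(np.abs(a_new - a)) < tol:
            a = a_new
            break
        a = a_new
    return a[:k]
```

Looping this over a grid of u values traces out the estimated coefficient curves \(\widehat{\alpha }_{\widehat{\pi }}\left( \cdot \right) \).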

Figure 2 shows the three varying coefficients \(\widehat{\alpha }_{j}\left( u_{i}\right) \) estimated using (19) with the Bianco and Yohai (1996) robust logit estimator for the selection probabilities \(\pi \left( X_{2i},U_{i}\right) \) and their nonrobust analogs, together with their associated 95% confidence intervals.

Fig. 2

Estimated varying coefficients for the New York air quality data. The solid line represents the robust local estimator with 95% confidence interval (dashed lines). The dot-dashed line represents the nonrobust local estimator with 95% confidence interval (long dashed lines)

The first estimated coefficient (intercept) represents the direct effect of the wind speed on the ozone parts and is a decreasing function of it. (The same decreasing relationship was found by Bianco and Spano (2019).) The other two estimated varying coefficients represent the combined effect on the ozone parts of the solar radiation and of the temperature with the wind speed. The former shows a clear decreasing relationship between the ozone parts and the wind speed combined with the solar radiation up to a speed of around 12 mph, followed by a flatter relationship, whereas the latter shows an initial increasing relationship between the ozone parts and the wind speed combined with the temperature, followed by a less clear pattern. Note that the robust local estimators are characterized by a more regular (and therefore easier to interpret) pattern. In terms of the mean effect (i.e., \(\overline{\widehat{ \alpha }_{j}}=\sum _{i=1}^{n}\widehat{\alpha }_{j}\left( U_{i}\right) /n\) for \(j=1,2,3\)), the nonrobust estimation procedure yields \(\left[ \overline{ \widehat{\alpha }_{1}},\overline{\widehat{\alpha }_{2}},\overline{\widehat{ \alpha }_{3}}\right] ^{T}=\left[ -0.94,0.08,1.56\right] ^{T}\), whereas the robust one yields \(\left[ \overline{\widehat{\alpha }_{1}},\overline{\widehat{ \alpha }_{2}},\overline{\widehat{ \alpha }_{3}}\right] ^{T}=\left[ -0.84,0.07,1.36\right] ^{T}\). For inference, the null hypotheses of joint and individual constancy of the three varying coefficients are considered, that is,

$$\begin{aligned} H_{0}&:&\left[ \alpha _{1}\left( u_{j}\right) -\alpha _{10},\alpha _{2}\left( u_{j}\right) -\alpha _{20},\alpha _{3}\left( u_{j}\right) -\alpha _{30}\right] ^{T}=0^{T}\text { and } \\ H_{0}&:&\left[ \alpha _{k}\left( u_{j}\right) -\alpha _{k0}\right] =0\quad \left( k=1,2,3\right) \end{aligned}$$

are tested using the \(\max _{1\le j\le 25}W_{c}\left( u_{j}\right) \) statistic (13) evaluated at 25 points \(u_{j}\). The sample values of \(\max _{1\le j\le 25}W_{c}\left( u_{j}\right) \) for the null hypotheses of joint and individual constancy of the three varying coefficients are, respectively, 11.84, 7.84, 7.04 and 8.89, with corresponding p values of 0.028, 0.022, 0.028 and 0.016. Thus, the null hypotheses of joint and individual constancy of the varying coefficients are rejected at the 0.05 significance level.

6 Conclusions

This paper has considered robust local estimation and inference for a general class of varying coefficient models where some of the responses and covariates are missing at random and outliers might be present. The paper has proposed a general estimation method (and a computationally attractive one-step version of it) based on inverse probability weighting with estimated selection probabilities that can be used to obtain both M and Z estimators and can accommodate longitudinal data. The paper has also proposed two Wald statistics that can be used to test hypotheses on the infinite-dimensional parameter, including that of constancy. A Monte Carlo study shows that the proposed estimators and Wald statistics perform well in finite samples, while an empirical application illustrates their practical usefulness.