Abstract
Doubly protected methods are widely used for estimating the population mean of an outcome Y from a sample in which the response is missing for some individuals. To compensate for the missing responses, a vector \(\mathbf {X}\) of covariates is observed for each individual, and the missingness mechanism is assumed to be independent of the response conditionally on \(\mathbf {X}\) (missing at random). In recent years, many authors have turned from the estimation of the mean to that of the median, and more generally, doubly protected estimators of the quantiles have been proposed. In this work, we present doubly protected estimators for the quantiles in semiparametric models that are also robust, in the sense that they are resistant to the presence of outliers in the sample.
References
Agostinelli C, Bianco AM, Boente G (2017) Robust estimation in single index models when the errors have a unimodal density with unknown nuisance parameter. arXiv preprint arXiv:1709.05422
Bianco AM, Spano PM (2019) Robust inference for nonlinear regression models. TEST 28(2):369–398
Bianco A, Boente G (2004) Robust estimators in semiparametric partly linear regression models. J Stat Plan Inference 122(1–2):229–252
Bianco A, Boente G, González-Manteiga W, Pérez-González A (2010) Estimation of the marginal location under a partially linear model with missing responses. Comput Stat Data Anal 54(2):546–564
Bianco AM, Boente G, González-Manteiga W, Pérez-González A (2011) Asymptotic behavior of robust estimators in partially linear models with missing responses: the effect of estimating the missing probability on the simplified marginal estimators. TEST 20(3):524–548
Bianco AM, Boente G, González-Manteiga W, Pérez-González A (2018) Plug-in marginal estimation under a general regression model with missing responses and covariates. TEST 28(1):1–41
Boente G, Fraiman R (1989) Robust nonparametric regression estimation. J Multivar Anal 29(2):180–198
Boente G, Rodriguez D (2008) Robust bandwidth selection in semiparametric partly linear regression models: Monte Carlo study and influential analysis. Comput Stat Data Anal 52(5):2808–2828
Boente G, Rodriguez D (2012) Robust estimates in generalized partially linear single-index models. TEST 21(2):386–411
Cheng PE (1994) Nonparametric estimation of mean functionals with data missing at random. J Am Stat Assoc 89(425):81–87
Díaz I (2017) Efficient estimation of quantiles in missing data models. J Stat Plan Inference 141(2):711–724
Fasano MV, Maronna RA, Sued M, Yohai VJ (2012) Continuity and differentiability of regression M functionals. Bernoulli 18(4):1284–1309
Hirano K, Imbens GW, Ridder G (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4):1161–1189
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260):663–685
Kang JDY, Schafer JL (2007) Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 22(4):523–539
Little RJA, Rubin DB (2014) Statistical analysis with missing data. Wiley, New York
Marazzi A, Yohai VJ (2004) Adaptively truncated maximum likelihood regression with asymmetric errors. J Stat Plan Inference 122(1–2):271–291
Maronna RA, Martin RD, Yohai VJ, Salibián-Barrera M (2018) Robust statistics: theory and methods (with R). Wiley, New York
Porter KE, Gruber S, Van Der Laan MJ, Sekhon JS (2011) The relative performance of targeted maximum likelihood estimators. Int J Biostat 7(1):1–34
Robins JM, Rotnitzky A, Zhao LP (1994) Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 89(427):846–866
Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90(429):106–121
Rotnitzky A, Robins J, Babino L (2017) On the multiply robust estimation of the mean of the g-functional. arXiv preprint arXiv:1705.08582
Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592
Statti F, Sued M, Yohai VJ (2018) High breakdown point robust estimators with missing data. Commun Stat Theory Methods 47(21):5145–5162
Sued M, Yohai VJ (2013) Robust location estimation with missing data. Can J Stat 41(1):111–132
Van der Vaart AW (2000) Asymptotic statistics, vol 3. Cambridge University Press, Cambridge
Wang Q, Linton O, Härdle W (2004) Semiparametric regression analysis with missing response at random. J Am Stat Assoc 99(466):334–345
Yohai VJ (1987) High breakdown-point and high efficiency robust estimates for regression. Ann Stat 15(2):642–656
Zhang Z, Chen Z, Troendle JF, Zhang J (2012) Causal inference on quantiles with an obstetric application. Biometrics 68(3):697–706
Acknowledgements
We would like to thank Dr. Alfio Marazzi for the data set in the example and the editor and referees for their comments and suggestions which have helped us to improve this paper.
Funding
This work was supported by Secretaria de Ciencia y Tecnica, Universidad de Buenos Aires (Grant Nos. 20020150200110BA, 20020130100279BA), Agencia Nacional de Promoción Científica y Tecnológica (Grant No. pict 2014-0351).
Appendix
Proof of Theorem 1
If \(\pi (\mathbf {X})=\pi _{\infty }(\mathbf {X})\), then \(\pi (\mathbf {X})/\pi _{\infty }(\mathbf {X})=1\) and \(C_\infty =1\). Therefore, \(F_1=F_0\), \(F_{2a}=F_{3a}\) and \(F_{\infty }=F_0\).
If \(Y=g(\mathbf {X})+u\), with u independent of \((A, \mathbf {X})\) and \(g(\mathbf {X})=g_{\infty }(\mathbf {X})\), then \(F_{3a}\) is the distribution function of \(g(\mathbf {X})\) and G is the distribution function of u. Therefore, \(F_{3a}*G\) is the distribution function of \(g(\mathbf {X})+u=Y\), that is to say \(F_{3a}*G=F_0\). On the other hand, let Z be a random variable, independent of u, with distribution function \(F_{2a}\), then \(F_{2a}*G\) is the distribution function of \(Z+u\), which, by definition, is equal to
Thus, \(F_{\infty }=F_0\) also in this case. \(\square \)
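The convolution step in the second case can be checked numerically. The following Python sketch is only an illustration under assumptions that are not in the paper: a hypothetical regression function g(x) = 2x, uniform covariates and Gaussian errors. It compares the empirical distribution of Y with the convolution of the empirical distributions of g(X) and u, which is the empirical distribution of all pairwise sums.

```python
import random
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of `sample` as a function of y."""
    s = sorted(sample)
    n = len(s)
    return lambda y: bisect_right(s, y) / n

def convolve_samples(z, u):
    """Sample whose empirical CDF is the convolution of the ECDFs of z and u."""
    return [zi + uj for zi in z for uj in u]

rng = random.Random(0)
n = 300
g = lambda t: 2.0 * t                          # hypothetical regression function
x = [rng.uniform(0.0, 1.0) for _ in range(n)]  # covariates
u = [rng.gauss(0.0, 0.5) for _ in range(n)]    # errors, generated independently of x
y = [g(xi) + ui for xi, ui in zip(x, u)]       # responses Y = g(X) + u

F0_hat = ecdf(y)                                            # empirical version of F_0
conv_hat = ecdf(convolve_samples([g(xi) for xi in x], u))   # empirical version of F_3a * G

grid = [i / 10 for i in range(-10, 31)]
max_gap = max(abs(F0_hat(t) - conv_hat(t)) for t in grid)
print(round(max_gap, 3))
```

The two empirical distribution functions should agree up to sampling error, in line with the identity \(F_{3a}*G=F_0\) obtained in the proof.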
The following five lemmas will be used to prove Theorem 2. Recall that \(\widetilde{F}_{1}\) and \(\widetilde{F}_{2a}\), defined in (6), are random sequences of cumulative distribution functions based on a sample of size n (which we omit in the notation).
Lemma 1
Consider \(\widetilde{F}_1\) and \(F_1\), defined in (6) and (7), respectively. Under assumptions A1 and A2, \(\widetilde{F}_1\) converges to \(F_1\) uniformly, a.s., that is, \( \mathbb {P}\left( \sup _y\vert \widetilde{F}_1(y)- F_1(y)\vert \rightarrow 0 \right) =1 \).
Proof
We show first that \(C_n/n\rightarrow C_\infty \) a.s. To do so, note that we can write
By the law of large numbers, the second term in (13) converges a.s. to
It remains to prove that the first term in (13) converges to zero a.s. Now, under conditions A1 and A2, given \(\varepsilon \in (0,1)\) there exists \(n_0\) such that \(\left| \pi _\infty (\mathbf {X})-\widehat{\pi }_n(\mathbf {X})\right| <\varepsilon i_\infty \) for all \(n\ge n_0\), and therefore, \((1-\varepsilon )i_\infty \le \widehat{\pi }_n(\mathbf {X})\) for such n, implying that
and then we obtain the announced result.
Second, we prove that
To prove (15), notice that adding and subtracting \((nC_\infty )^{-1}\sum _{i=1}^n {{A_i\mathrm I_{\{Y_i\le y\}} }/{ \widehat{\pi }_n(\mathbf {X}_i)}}\), we get
Neither of the two terms in (16) depends on y, and both converge to zero under A1 and A2: the convergence of the first term follows from the a.s. convergence of \(C_n/n \) to \(C_\infty \), while the convergence of the second one has already been proved in (14). This proves (15).
Finally, using arguments similar to those in the proof of the Glivenko–Cantelli theorem (see, for instance, Theorem 19.1 in Van der Vaart 2000), it can be shown that
The result follows by combining (15) and (17). \(\square \)
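Lemma 1 can be illustrated by simulation. The sketch below assumes that \(\widetilde{F}_1\) has the normalized inverse-probability-weighted form suggested by the expression that is added and subtracted in the proof, with \(C_n=\sum _{i=1}^n A_i/\widehat{\pi }_n(\mathbf {X}_i)\); definition (6) itself is not reproduced in this appendix. The data-generating model, the use of the true propensity in place of an estimate, and all function names are hypothetical choices made for illustration.

```python
import math
import random
from bisect import bisect_right

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def ipw_cdf(y, a, pi_hat):
    """Normalized IPW empirical CDF: sum_i w_i 1{y_i <= t} / C_n with w_i = a_i / pi_hat_i.

    Entries of y with a_i = 0 receive weight zero, so unobserved responses are never used.
    """
    weights = [ai / pi for ai, pi in zip(a, pi_hat)]
    c_n = sum(weights)  # normalizing constant C_n
    def cdf(t):
        return sum(w for w, yi in zip(weights, y) if yi <= t) / c_n
    return cdf

def ecdf(sample):
    s = sorted(sample)
    return lambda t: bisect_right(s, t) / len(s)

rng = random.Random(1)
n = 4000
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]  # full responses (partly unobserved in practice)
pi = [expit(2.0 * xi) for xi in x]                # true propensity, used as pi_hat (assumption)
a = [1 if rng.random() < p else 0 for p in pi]    # response indicators, missing at random given x

full = ecdf(y)                                         # benchmark built from all responses
naive = ecdf([yi for yi, ai in zip(y, a) if ai == 1])  # complete-case CDF, ignores missingness
ipw = ipw_cdf(y, a, pi)

grid = [i / 10 for i in range(-30, 31)]
gap_ipw = max(abs(ipw(t) - full(t)) for t in grid)
gap_naive = max(abs(naive(t) - full(t)) for t in grid)
print(gap_ipw < gap_naive)
```

In this design, observation is more likely for large values of the covariate, so the complete-case distribution function is shifted, whereas the weighted version stays close to the full-data benchmark.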
Henceforth, we use \(G_n \xrightarrow {w} G\) to denote weak convergence of cumulative distribution functions.
Lemma 2
Consider \(\widetilde{F}_{2a}\) and \(F_{2a}\), defined in (6) and (7), respectively. Under assumptions A1–A3, it holds that \(\widetilde{F}_{2a}\) converges weakly to \(F_{2a}\) a.s., i.e., \(\mathbb {P}\left( \widetilde{F}_{2a} \xrightarrow {w} F_{2a}\right) =1\).
Proof
Let \(\mathcal {C}_{\text {buc}}\) denote the set of bounded and uniformly continuous functions \(f:\mathbb {R}\rightarrow \mathbb {R}\). In order to prove the lemma, we will show that
Let
Note that both \(\widetilde{F}_3\) and \(\widetilde{F}_4\) defined above are sequences of random functions; however, we omit n in the notation for simplicity.
Fix \( f \in \mathcal {C}_{\text {buc}}\). Defining \(I_1(f) = \left| \int f d \widetilde{F}_{2a} - \int f d \widetilde{F}_{3} \right| \), \(I_2(f) = \left| \int f d \widetilde{F}_{3} - \int f d \widetilde{F}_{4} \right| \) and \(I_3(f) = \left| \int f d \widetilde{F}_{4} - \int f d F_{2a} \right| \), the triangle inequality gives
\[ \left| \int f d \widetilde{F}_{2a} - \int f d F_{2a} \right| \le I_1(f)+I_2(f)+I_3(f). \qquad (20) \]
Let us now consider each of these three terms. Since f is bounded, using arguments similar to those in the proof of Lemma 1, we have that, under A1 and A2,
\[ I_1(f) \rightarrow 0 \quad \hbox {a.s.} \qquad (21) \]
To deal with \(I_2(f)\), notice that
Since f is uniformly continuous, given \(\varepsilon >0\), there exists \(\delta \) such that \(\vert u_1-u_2\vert <\delta \) implies \(\vert f(u_1)-f(u_2)\vert <\varepsilon \). Take K large and consider the compact set \(\mathcal {K}=\{\vert \vert \mathbf {X}\vert \vert \le K\}\). For n large enough, invoking now A3, we get that \(\sup _{\mathbf {X}\in \mathcal {K}} \vert \widehat{g}_n(\mathbf {X})-g_{\infty }(\mathbf {X})\vert <\delta \) and therefore, the right-hand side of (22) is smaller than
which implies that
\[ I_2(f) \rightarrow 0 \quad \hbox {a.s.} \qquad (23) \]
It remains to show that
\[ I_3(f) \rightarrow 0 \quad \hbox {a.s.} \qquad (24) \]
Notice that, as in Lemma 1, using arguments similar to those in the proof of the Glivenko–Cantelli theorem, we have that
and therefore
where \(\widetilde{C}_n=\sum _{i=1}^n A_i/\pi _{\infty }(\mathbf {X}_i)\). Both the sequence and the limit function in (26) are cumulative distribution functions. By the MAR assumption,
and, therefore, (26) implies that
Finally, since \(\widetilde{C}_n/C_n\rightarrow 1\), we conclude that (24) holds. The result stated in the lemma follows from combining (20), (21), (23) and (24). \(\square \)
The following lemma was proved in Sued and Yohai (2013), as a part of Theorem 1.
Lemma 3
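Consider \(\widetilde{F}_{3a}\) and \(\widetilde{G}\), defined in (6), and \(F_{3a}\) and G, defined in (8). Under assumption A3, \(\widetilde{F}_{3a}\) converges weakly to \(F_{3a}\) a.s. and \(\widetilde{G}\) converges weakly to G a.s., i.e., \(\mathbb {P}\left( \widetilde{F}_{3a} \xrightarrow {w} F_{3a}\right) =1\) and \(\mathbb {P}\left( \widetilde{G} \xrightarrow {w} G\right) =1\).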
As announced in Sect. 3, we will now show that the functional \(T_p\), presented in (2), can be defined over an enlarged family of functions, which includes cumulative distribution functions, preserving its continuity.
Lemma 4
Consider a distribution function \(F:\mathbb {R}\rightarrow [0,1]\) and \(p\in (0,1)\) such that there exists a unique value \(y_p\) with \(F(y_p)=p\), and so \(T_p(F)=y_p\), for \(T_p\) defined in (2). Let \(F_{n}:\mathbb {R} \rightarrow \mathbb {R}, n\ge 1\), be a sequence of functions such that
1. \(\lim _{y \rightarrow - \infty } F_{n}(y)=0\) and \(\lim _{y \rightarrow + \infty } F_{n}(y)=1\).
2. \(F_{n}\) converges uniformly to F.
Then \(T_p\) can be defined at \(F_n\) and \(\lim _{n \rightarrow \infty } T_p(F_n)=T_p(F).\)
Proof
Let \( A_{n,p}=\left\{ y \in \mathbb {R}: F_n(y) \ge p \right\} .\) By the assumptions of the lemma, \(\lim _{y \rightarrow + \infty } F_{n}(y)=1\), and therefore, \(A_{n,p}\) is not empty. Since \(\lim _{y \rightarrow - \infty } F_{n}(y)=0\) we conclude that \(A_{n,p}\) is bounded from below, and therefore \(T_p(F_n)=\inf A_{n,p}\) is well defined.
Given \(\varepsilon >0\), let \(\delta =\min \left\{ \left( F(y_{p}+\varepsilon )-F(y_{p})\right) /2, \left( F(y_{p})-F(y_{p}-\varepsilon )\right) /2\right\} \). Since F is nondecreasing and \(y_p\) is the unique value with \(F(y_p)=p\), we have \(\delta >0\). Now, the uniform convergence of \(F_n\) to F guarantees that there exists \(n_0\) such that \(\sup _{y \in \mathbb {R}}\vert F_n(y)-F(y)\vert \le \delta \) for all \(n\ge n_0\). In particular, the following inequalities hold
From (29) and (30) we conclude that for all \(n\ge n_{0}\) we have \(|y_{n}-y_p|\le \varepsilon \), with \(y_n=T_p(F_n)\), and therefore \(y_{n}\rightarrow y_{p}\). This concludes the proof. \(\square \)
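The functional used in this proof, \(T_p(F)=\inf \{y: F(y)\ge p\}\), is straightforward to evaluate when F is a weighted step function. The sketch below is one possible implementation; the function name `weighted_quantile` and the uniform example are illustrative choices, not taken from the paper.

```python
import random

def weighted_quantile(values, weights, p):
    """T_p(F) = inf{y : F(y) >= p} for the step CDF that puts mass
    weights[i] / sum(weights) on values[i]."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Points with zero weight carry no mass and cannot be the infimum.
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for v, w in pairs:
        cumulative += w
        if cumulative >= p * total:  # first point where F reaches level p
            return v
    return pairs[-1][0]  # guard against floating-point round-off

# Sample medians of uniform data for two sample sizes (population median 0.5).
rng = random.Random(3)
errors = []
for n in (100, 10000):
    sample = [rng.random() for _ in range(n)]
    q = weighted_quantile(sample, [1.0] * n, 0.5)
    errors.append(abs(q - 0.5))
print(errors)
```

The uniform distribution function is strictly increasing at its median, so the uniqueness condition of Lemma 4 holds in this example. Passing inverse-probability weights instead of equal weights gives the quantile of a weighted distribution function.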
Proof of Theorem 2
The continuity of G implies that \(F_{2a}*G\) and \(F_{3a}*G\) are both continuous cumulative distribution functions. By Lemmas 2 and 3, and because convolution preserves weak convergence, \(\widetilde{F}_{2a}*\widetilde{G}\) and \(\widetilde{F}_{3a}*\widetilde{G}\) converge weakly to \(F_{2a}*G\) and \(F_{3a}*G\), respectively, a.s. Since weak convergence to a continuous limit distribution function implies uniform convergence (see, for example, Lemma 2.11 in Van der Vaart (2000)), this convergence is in fact uniform.
Combining these results with Theorem 1, we obtain (9). From Lemma 4, we conclude that \(T_p(\widehat{F}_{\tiny {\hbox {RSDP}}})\) is well defined. Moreover, Lemma 4 and the uniform convergence proved above imply that \(T_p(\widehat{F}_{\tiny {\hbox {RSDP}}})\) converges to \(T_p(F_0)\) a.s. \(\square \)
Proof of Theorem 3
We will show that A1–A3 are satisfied, with \(\widehat{\pi }_n(\mathbf {X})=\hbox {expit}(\widehat{\varvec{\gamma }}_n^{\tiny {t}}\mathbf {X})\), \(\pi _{\infty }(\mathbf {X})=\hbox {expit}(\varvec{\gamma }_\infty ^{\tiny {t}}\mathbf {X})\), \(\widehat{g}_n (\mathbf {X})=\widehat{\varvec{\beta }}_n^{\tiny {t}} \mathbf {X}\) and \(g_{\infty }(\mathbf {X})= \varvec{\beta }_\infty ^{\tiny {t}}\mathbf {X}\). To prove A1, note that
where \(\widetilde{\varvec{\gamma }}_n\) is an intermediate point between \(\widehat{\varvec{\gamma }}_n\) and \(\varvec{\gamma }_\infty \). The convergence of \(\widehat{\varvec{\gamma }}_n\) to \(\varvec{\gamma }_\infty \) a.s., combined with the assumed compactness of the support of \(\mathbf {X}\), implies the validity of A1.
A2 is satisfied since \(\hbox {expit}(\varvec{\gamma }_\infty ^{\tiny {t}}\mathbf {X})\) is continuous and \(\mathbf {X}\) has a compact support.
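To prove the validity of A3, observe that \(\vert \widehat{g}_n(\mathbf {X})-g_{\infty }(\mathbf {X})\vert = \vert \{\widehat{\varvec{\beta }}_n-\varvec{\beta }_\infty \}^{\tiny {t}}\mathbf {X}\vert \). By the Cauchy–Schwarz inequality, the right-hand side is bounded by \(\Vert \widehat{\varvec{\beta }}_n-\varvec{\beta }_\infty \Vert \, \Vert \mathbf {X}\Vert \), so the convergence of \(\widehat{\varvec{\beta }}_n\) to \(\varvec{\beta }_\infty \) a.s. guarantees that A3 is also satisfied.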
Finally, note that if \(\mathbb {P}(A=1\mid \mathbf {X})=\hbox {expit}(\varvec{\gamma }_0^{\tiny {t}}\mathbf {X})\), then \(\varvec{\gamma }_\infty =\varvec{\gamma }_0\), and so \(\pi _{\infty }(\mathbf {X})=\mathbb {P}(A=1\mid \mathbf {X})\). Also, if \(g(\mathbf {X})=\varvec{\beta }_0^{\tiny {t}}\mathbf {X}\), then \(\varvec{\beta }_\infty =\varvec{\beta }_0\), implying that \(g_{\infty }(\mathbf {X})=g(\mathbf {X})\). We can now invoke Theorem 2 to conclude the proof of the theorem. \(\square \)
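The argument for A1 in this proof rests on the fact that the derivative of expit is bounded: \(\hbox {expit}'(t)=\hbox {expit}(t)\{1-\hbox {expit}(t)\}\le 1/4\). Together with the Cauchy–Schwarz inequality, this gives the uniform bound \(\Vert \widehat{\varvec{\gamma }}_n-\varvec{\gamma }_\infty \Vert K/4\) on a support contained in \(\{\Vert \mathbf {X}\Vert \le K\}\). The following sketch checks this bound numerically; the radius K, the dimension and the parameter values are hypothetical.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def sup_propensity_gap(gamma_hat, gamma_inf, xs):
    """Maximum over xs of |expit(gamma_hat . x) - expit(gamma_inf . x)|."""
    dot = lambda g, x: sum(gi * xi for gi, xi in zip(g, x))
    return max(abs(expit(dot(gamma_hat, x)) - expit(dot(gamma_inf, x))) for x in xs)

rng = random.Random(2)
K = 3.0  # hypothetical radius of the compact support of X
xs = []
while len(xs) < 2000:  # draw covariate points inside the ball of radius K
    x = [rng.uniform(-K, K), rng.uniform(-K, K)]
    if math.hypot(x[0], x[1]) <= K:
        xs.append(x)

gamma_inf = [0.5, -1.0]  # hypothetical limit parameter
checks = []
for eps in (1.0, 0.1, 0.01):
    gamma_hat = [g + eps for g in gamma_inf]    # perturbed parameter
    gap = sup_propensity_gap(gamma_hat, gamma_inf, xs)
    bound = 0.25 * math.hypot(eps, eps) * K     # (1/4) * ||gamma_hat - gamma_inf|| * K
    checks.append((gap, bound))
    print(eps, gap <= bound)
```

As the perturbation shrinks, so does the bound, which is the mechanism by which a.s. convergence of the parameter estimate yields uniform convergence of the fitted propensity on a compact support.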
Sued, M., Valdora, M. & Yohai, V. Robust doubly protected estimators for quantiles with missing data. TEST 29, 819–843 (2020). https://doi.org/10.1007/s11749-019-00689-9
DOI: https://doi.org/10.1007/s11749-019-00689-9