Robust statistical learning with Lipschitz and convex loss functions

Chinot, Geoffrey; Lecué, Guillaume; Lerasle, Matthieu

doi:10.1007/s00440-019-00931-3

Published: 02 July 2019

Volume 176, pages 897–940, (2020)

Probability Theory and Related Fields


Abstract

We obtain estimation and excess risk bounds for Empirical Risk Minimizers (ERM) and minmax Median-Of-Means (MOM) estimators based on loss functions that are both Lipschitz and convex. Results for the ERM are derived under weak assumptions on the outputs and subgaussian assumptions on the design, as in Alquier et al. (Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. arXiv:1702.01402, 2017). The difference with Alquier et al. (2017) is that the global Bernstein condition of that paper is relaxed here into a local assumption. We also obtain estimation and excess risk bounds for minmax MOM estimators under similar assumptions on the output and only moment assumptions on the design. Moreover, the dataset may also contain outliers in both input and output variables without deteriorating the performance of the minmax MOM estimators. Unlike alternatives based on MOM’s principle (Lecué and Lerasle in Ann Stat, 2017; Lugosi and Mendelson in JEMS, 2016), the analysis of minmax MOM estimators is not based on the small ball assumption (SBA) of Koltchinskii and Mendelson (Int Math Res Not IMRN 23:12991–13008, 2015). In particular, the basic example of nonparametric statistics where the learning class is the linear span of localized bases, which does not satisfy SBA (Saumard in Bernoulli 24(3):2176–2203, 2018), can now be handled. Finally, minmax MOM estimators are analysed in a setting where the local Bernstein condition is also dropped. They are shown to achieve excess risk bounds with exponentially large probability under minimal assumptions ensuring only the existence of all objects.


Notes

All figures can be reproduced from the code available at https://github.com/lecueguillaume/MOMpower.

References

Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1, part 2), 137–147 (1999). Twenty-eighth Annual ACM Symposium on the Theory of Computing (Philadelphia, PA, 1996)
Alquier, P., Cottet, V., Lecué, G.: Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions (2017). arXiv:1702.01402
Audibert, J.-Y., Catoni, O.: Robust linear least squares regression. Ann. Stat. 39(5), 2766–2794 (2011)
Bach, F., Jenatton, R., Mairal, J., Obozinski, G., et al.: Convex optimization with sparsity-inducing norms. Optim. Mach. Learn. 5, 19–53 (2011)
Baraud, Y., Birgé, L., Sart, M.: A new method for estimation and model selection: $\rho $-estimation. Invent. Math. 207(2), 425–517 (2017)
Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)
Bartlett, P.L., Bousquet, O., Mendelson, S., et al.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)
Bartlett, P.L., Mendelson, S.: Empirical minimization. Probab. Theory Relat. Fields 135(3), 311–334 (2006)
Birgé, L.: Stabilité et instabilité du risque minimax pour des variables indépendantes équidistribuées. Ann. Inst. H. Poincaré Probab. Stat. 20(3), 201–223 (1984)
Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification: a survey of some recent advances. ESAIM Probab. Stat. 9, 323–375 (2005)
Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities. Oxford University Press, Oxford (2013). A nonasymptotic theory of independence, with a foreword by Michel Ledoux
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends® Mach. Learn. 8(3–4), 231–357 (2015)
Catoni, O.: Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré Probab. Stat. 48(4), 1148–1185 (2012)
Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R.I., et al.: Sub-Gaussian mean estimators. Ann. Stat. 44(6), 2695–2725 (2016)
Elsener, A., van de Geer, S.: Robust low-rank matrix estimation (2016). arXiv:1603.09071
Han, Q., Wellner, J.A.: Convergence rates of least squares regression estimators with heavy-tailed errors. Ann. Stat. 47(4), 2286–2319 (2019)
Huber, P.J., Ronchetti, E.: Robust statistics. In: International Encyclopedia of Statistical Science, pp. 1248–1251. Springer, New York (2011)
Jerrum, M.R., Valiant, L.G., Vazirani, V.V.: Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci. 43(2–3), 169–188 (1986)
Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)
Koltchinskii, V.: Empirical and Rademacher processes. In: Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, pp. 17–32. Springer, New York (2011)
Koltchinskii, V.: Oracle inequalities in empirical risk minimization and sparse recovery problems, volume 2033 of Lecture Notes in Mathematics. Springer, Heidelberg (2011). Lectures from the 38th Probability Summer School held in Saint-Flour, 2008, École d’Été de Probabilités de Saint-Flour (Saint-Flour Probability Summer School)
Koltchinskii, V., Mendelson, S.: Bounding the smallest singular value of a random matrix without concentration. Int. Math. Res. Not. IMRN 23, 12991–13008 (2015)
Lecué, G., Lerasle, M.: Learning from MOM’s principles: Le Cam’s approach. Stochast. Process. Appl. (2018). arXiv:1701.01961
Lecué, G., Lerasle, M.: Robust machine learning by median-of-means: theory and practice. Ann. Stat. (2017). arXiv:1711.10306
Lecué, G., Mendelson, S.: Performance of empirical risk minimization in linear aggregation. Bernoulli 22(3), 1520–1534 (2016)
Lecué, G., Lerasle, M., Mathieu, T.: Robust classification via MOM minimization (2018). arXiv:1808.03106
Ledoux, M.: The Concentration of Measure Phenomenon, Volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence (2001)
Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York (2013)
Lugosi, G., Mendelson, S.: Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. (2019). arXiv:1608.00757
Lugosi, G., Mendelson, S.: Regularization, sparse recovery, and median-of-means tournaments (2017). arXiv:1701.04112
Lugosi, G., Mendelson, S.: Sub-gaussian estimators of the mean of a random vector (2017). To appear in Ann. Stat. arXiv:1702.00482
Mammen, E., Tsybakov, A.B.: Smooth discrimination analysis. Ann. Stat. 27(6), 1808–1829 (1999)
Mendelson, S.: Learning without concentration. In: Conference on Learning Theory, pp. 25–39 (2014)
Mendelson, S.: Learning without concentration. J. ACM 62(3), Art. 21, 25 (2015)
Mendelson, S.: On multiplier processes under weak moment assumptions. In: Geometric Aspects of Functional Analysis, Volume 2169 of Lecture Notes in Math., pp. 301–318. Springer, Cham (2017)
Mendelson, S., Pajor, A., Tomczak-Jaegermann, N.: Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal. 17(4), 1248–1282 (2007)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, New York (1983). Translated from the Russian and with a preface by E. R. Dawson, Wiley-Interscience Series in Discrete Mathematics
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Saumard, A.: On optimality of empirical risk minimization in linear aggregation. Bernoulli 24(3), 2176–2203 (2018)
Talagrand, M.: Upper and lower bounds for stochastic processes, volume 60 of Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics (Results in Mathematics and Related Areas. 3rd Series. A Series of Modern Surveys in Mathematics). Springer, Heidelberg (2014). Modern methods and classical problems
Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Ann. Stat. 32(1), 135–166 (2004)
van de Geer, S.: Estimation and Testing Under Sparsity, Volume 2159 of Lecture Notes in Mathematics. Springer, Cham (2016). Lecture notes from the 45th Probability Summer School held in Saint-Flour, 2015, École d’Été de Probabilités de Saint-Flour (Saint-Flour Probability Summer School)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Vapnik, V.: Statistical Learning Theory, vol. 1. Wiley, New York (1998)
Zhou, W.-X., Bose, K., Fan, J., Liu, H.: A new perspective on robust M-estimation: finite sample theory and applications to dependence-adjusted multiple testing. Ann. Stat. 46(5), 1904–1931 (2018). https://doi.org/10.1214/17-AOS1606


Author information

Authors and Affiliations

Palaiseau, France
Geoffrey Chinot, Guillaume Lecué & Matthieu Lerasle

Authors

Geoffrey Chinot
Guillaume Lecué
Matthieu Lerasle

Corresponding author

Correspondence to Guillaume Lecué.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proof of Theorems 1, 2, 3 and 4

1.1 Proof of Theorem 1

The proof is split into two parts. First, we identify an event on which the statistical behavior of the regularized estimator $\hat{f}^{ERM}$ can be controlled. Then, we prove that this event holds with probability at least (3). Introduce $\theta =1/(2A)$ and define the following event:

$$\begin{aligned} \Omega := \left\{ \forall f\in F \cap (f^{*} + r_2(\theta ) B_{L_2}), \quad \big |(P-P_N){{\mathcal {L}}}_f\big |\le \theta r_2^2(\theta ) \right\} \end{aligned}$$

where $\theta $ is a parameter appearing in the definition of $r_2$ in Definition 3.

Proposition 3

On the event $\Omega $, one has

$$\begin{aligned} \Vert \hat{f}^{ERM} - f^*\Vert _{L_2}&\le r_2(\theta ) \quad \text{ and }\quad P{{\mathcal {L}}}_{\hat{f}^{ERM}} \le \theta r_2^2(\theta ). \end{aligned}$$

Proof

By construction, $\hat{f}^{ERM}$ satisfies $P_N{{\mathcal {L}}}_{\hat{f}^{ERM}} \le 0 $. Therefore, it is sufficient to show that, on $\Omega $, if $\Vert f-f^{*}\Vert _{L_2} > r_2(\theta )$, then $P_N {{\mathcal {L}}}_f >0$. Let $f\in F$ be such that $\Vert f-f^{*}\Vert _{L_2 } > r_2(\theta )$. By convexity of F, there exists $f_0 \in F \cap (f^{*} + r_2(\theta )S_{L_2})$ and $\alpha > 1$ such that

$$\begin{aligned} f = f^{*} + \alpha (f_0 - f^{*}) . \end{aligned}$$

(19)

For all $i \in \{1,\ldots ,N \}$, let $\psi _i: {\mathbb {R}} \rightarrow {\mathbb {R}} $ be defined for all $u\in \mathbb {R}$ by

$$\begin{aligned} \psi _i(u) = \overline{\ell } (u + f^{*}(X_i), Y_i) - \overline{\ell } (f^{*}(X_i), Y_i). \end{aligned}$$

(20)

The functions $\psi _i$ are such that $\psi _i(0) = 0$, they are convex because $\overline{\ell }$ is, in particular $\alpha \psi _i(u) \le \psi _i(\alpha u)$ for all $u\in {\mathbb {R}}$ and $\alpha \ge 1$ and $\psi _i(f(X_i) - f^{*}(X_i) )= \overline{\ell } (f(X_i), Y_i) - \overline{\ell } (f^{*}(X_i), Y_i) $ so that the following holds:

$$\begin{aligned} P_N {{\mathcal {L}}}_f&= \frac{1}{N} \sum _{i=1}^{N} \psi _i \big ( f(X_i)- f^{*}(X_i) \big ) = \frac{1}{N} \sum _{i=1}^{N} \psi _i(\alpha ( f_0(X_i)- f^{*}(X_i) ))\nonumber \\&\ge \frac{\alpha }{N} \sum _{i=1}^{N} \psi _i(( f_0(X_i)- f^{*}(X_i))) = \alpha P_N {{\mathcal {L}}}_{f_0}. \end{aligned}$$

(21)
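The scaling inequality $\alpha \psi _i(u) \le \psi _i(\alpha u)$ used in (21) can be checked numerically. The sketch below is only an illustration: it assumes a loss of the form $\overline{\ell }(u,y)=h(u-y)$ with $h$ the Huber function, and the sample values `f_star_x = 0.3` and `y = 1.7` are hypothetical.

```python
def huber(u, delta=1.0):
    """Huber function: quadratic near zero, linear in the tails (convex and Lipschitz)."""
    a = abs(u)
    return 0.5 * u * u if a <= delta else delta * (a - 0.5 * delta)


def make_psi(loss, f_star_x, y):
    """Return psi(u) = loss(u + f*(x) - y) - loss(f*(x) - y); psi is convex and psi(0) = 0."""
    base = loss(f_star_x - y)
    return lambda u: loss(u + f_star_x - y) - base


def scaling_inequality_holds(psi, us, alphas, tol=1e-9):
    """Check alpha * psi(u) <= psi(alpha * u) for every u in us and alpha in alphas."""
    return all(alpha * psi(u) <= psi(alpha * u) + tol for u in us for alpha in alphas)


# Hypothetical sample point (assumption, not from the paper).
psi = make_psi(huber, f_star_x=0.3, y=1.7)
us = [k / 10.0 for k in range(-50, 51)]
alphas = [1.0, 1.5, 2.0, 5.0, 10.0]
print(scaling_inequality_holds(psi, us, alphas))
```

The check relies only on convexity and $\psi (0)=0$, so any other convex loss could be substituted for `huber`.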

Until the end of the proof, the event $\Omega $ is assumed to hold. Since $f_0 \in F \cap (f^{*}+ r_2(\theta ) S_{L_2})$, $P_N {{\mathcal {L}}}_{f_0} \ge P{{\mathcal {L}}}_{f_0} - \theta r_2^2(\theta )$. Moreover, by Assumption 4, $P{{\mathcal {L}}}_{f_0} \ge A^{-1} \Vert f_0-f^*\Vert _{L_2 }^2 = A^{-1}r_2^2(\theta ) $, thus

$$\begin{aligned} P_N {{\mathcal {L}}}_{f_0} \ge (A^{-1} - \theta ) r_2^2(\theta ). \end{aligned}$$

(22)

From Eqs. (21) and (22), $P_N {{\mathcal {L}}}_f > 0$ since $A^{-1}>\theta $. Therefore, $\Vert \hat{f}^{ERM}-f^{*}\Vert _{L_2 } \le r_2(\theta )$. This proves the $L_2$-bound.
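The strict positivity can be written out explicitly. With $\theta =1/(2A)$ as fixed at the start of the proof, and assuming $r_2(\theta )>0$, combining (21) and (22) gives on $\Omega $, for any $f\in F$ with $\Vert f-f^{*}\Vert _{L_2}>r_2(\theta )$,

$$\begin{aligned} P_N {{\mathcal {L}}}_f \ge \alpha P_N {{\mathcal {L}}}_{f_0} \ge \alpha (A^{-1}-\theta ) r_2^2(\theta ) = \frac{\alpha }{2A} r_2^2(\theta ) > 0. \end{aligned}$$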

Now, as $\Vert \hat{f}^{ERM}-f^{*}\Vert _{L_2 } \le r_2(\theta )$, $|(P-P_N){{\mathcal {L}}}_{\hat{f}^{ERM}}|\le \theta r_2^2(\theta )$. Since $P_N{{\mathcal {L}}}_{\hat{f}^{ERM}}\le 0$,

$$\begin{aligned} P{{\mathcal {L}}}_{\hat{f}^{ERM}} = P_N{{\mathcal {L}}}_{\hat{f}^{ERM}} + (P-P_N){{\mathcal {L}}}_{\hat{f}^{ERM}}\le \theta r_2^2(\theta ). \end{aligned}$$

This shows the excess risk bound. $\square $

Proposition 3 shows that $\hat{f}^{ERM}$ has the risk bounds given in Theorem 1 on the event $\Omega $. To show that $\Omega $ holds with probability at least (3), recall the following result from [2].

Lemma 2

[2, Lemma 8.1] Grant Assumptions 1 and 3. Let $F^\prime \subset F$ with finite $L_2$-diameter $d_{L_2}(F^\prime )$. For every $u>0$, with probability at least $1-2\exp (-u^2)$,

$$\begin{aligned} \sup _{f,g\in F^\prime }\left| (P-P_N)({{\mathcal {L}}}_f-{{\mathcal {L}}}_g)\right| \le \frac{16L}{\sqrt{N}} \left( w(F^\prime ) + u d_{L_2}(F^\prime )\right) . \end{aligned}$$

It follows from Lemma 2 that for any $u>0$, with probability larger than $1-2\exp (-u^2)$,

$$\begin{aligned}&\sup _{f \in F \cap (f^{*} + r_2(\theta ) B_{L_2})} \big | (P-P_N){{\mathcal {L}}}_f \big | \\&\quad \le \sup _{f,g \in F \cap (f^{*} + r_2(\theta ) B_{L_2})} \big | (P-P_N)({{\mathcal {L}}}_f-{{\mathcal {L}}}_g) \big | \\&\quad \le \frac{16L}{\sqrt{N}} \big ( w((F-f^*)\cap r_2(\theta )B_{L_2}) + ud_{L_2} ((F-f^*)\cap r_2(\theta )B_{L_2}) \big ) \end{aligned}$$

where $d_{L_2} ((F-f^*)\cap r_2(\theta )B_{L_2}) \le r_2(\theta )$. By definition of the complexity parameter (see Eq. (3)), for $u = \theta \sqrt{N} r_2(\theta )/(64L) $, with probability at least

$$\begin{aligned} 1-2\exp \big (-\theta ^2N r_2^2(\theta ) /(16^3L^2 ) \big ), \end{aligned}$$

(23)

for every f in $F\cap (f^*+ r_2(\theta )B_{L_2} )$,

$$\begin{aligned} \big | (P-P_N) {{\mathcal {L}}}_f \big | \le \theta r_2^2(\theta ). \end{aligned}$$

(24)
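The choice of $u$ can be checked by direct substitution. The two identities below are an arithmetic sketch that uses only $u = \theta \sqrt{N} r_2(\theta )/(64L)$ and the diameter bound $d_{L_2} ((F-f^*)\cap r_2(\theta )B_{L_2}) \le r_2(\theta )$ stated above:

$$\begin{aligned} \frac{16L}{\sqrt{N}}\, u\, r_2(\theta ) = \frac{16L}{\sqrt{N}} \cdot \frac{\theta \sqrt{N} r_2(\theta )}{64L}\, r_2(\theta ) = \frac{\theta r_2^2(\theta )}{4} \quad \text{ and }\quad u^2 = \frac{\theta ^2 N r_2^2(\theta )}{64^2 L^2} = \frac{\theta ^2 N r_2^2(\theta )}{16^3 L^2}. \end{aligned}$$

The deviation term therefore consumes a quarter of the budget $\theta r_2^2(\theta )$, and $1-2\exp (-u^2)$ coincides with the probability in (23).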

Together with Proposition 3, this concludes the proof of Theorem 1.

1.2 Proof of Theorem 2

The proof is split into two parts. First, we identify an event $\Omega _K$ on which the statistical properties of ${\hat{f}}$ from Theorem 2 can be established. Next, we prove that this event holds with probability at least (8). Let $\alpha , \theta $ and $\gamma $ be positive numbers to be chosen later. Define

$$\begin{aligned} C_{K,r} = \max \bigg (\frac{4L^2K}{\theta ^2 \alpha N},\tilde{r}_2^2(\gamma ) \bigg ) \end{aligned}$$

where the exact values of $\alpha , \theta $ and $\gamma $ are given at the end of the proof (Sect. 1.2.3). Set the event $\Omega _K$ to be such that

$$\begin{aligned} \Omega _K= & {} \bigg \{ \forall f \in F \cap \left( f^*+ \sqrt{C_{K,r}}B_{L_2}\right) , \exists J\subset \{1,\ldots ,K\}: |J|>K/2 \nonumber \\&\quad \text{ and } \forall k\in J, \left| (P_{B_k} - P){{\mathcal {L}}}_f \right| \le \theta C_{K,r} \bigg \}. \end{aligned}$$

(25)

1.2.1 Deterministic argument

The goal of this section is to show that, on the event $\Omega _K$, $\Vert {\hat{f}} - f^{*}\Vert _{L_2}^2 \le C_{K,r}$ and $P{{\mathcal {L}}}_{{\hat{f}}}\le 2 \theta C_{K,r}$.
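The event $\Omega _K$ is stated in terms of block averages $P_{B_k}$, and the estimator in terms of their median. As a reminder of the mechanics, here is a minimal sketch of the median-of-means operator; it assumes $K$ consecutive blocks of equal size and an odd $K$, and the function names and the toy data are illustrative only.

```python
import statistics


def block_means(values, K):
    """Means over K consecutive blocks of equal size (trailing points are dropped)."""
    n = len(values) // K  # block size
    return [sum(values[k * n:(k + 1) * n]) / n for k in range(K)]


def mom(values, K):
    """Median-of-means: the median of the K block means."""
    return statistics.median(block_means(values, K))


# Toy data (assumption): 20 inliers equal to 1.0 and one gross outlier in the first block.
data = [1000.0, 1.0, 1.0] + [1.0] * 18
empirical_mean = sum(data) / len(data)
mom_value = mom(data, K=7)
print(empirical_mean, mom_value)
```

Only one of the seven blocks is corrupted, so the median of the block means is unaffected, while the empirical mean is pulled far from 1.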

Lemma 3

If there exists $\eta >0$ such that

$$\begin{aligned}&\sup _{f \in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) } \,\, \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) < - \eta \quad \text{ and } \nonumber \\&\sup _{f \in F\cap \left( f^{*}+ \sqrt{C_{K,r}} B_{L_2}\right) } \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le \eta , \end{aligned}$$

(26)

then $\Vert {\hat{f}} - f^{*} \Vert _{L_2 }^2 \le C_{K,r}$.

Proof

Assume that (26) holds, then

$$\begin{aligned} \inf _{f\in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) } \text {MOM}_K[\ell _f-\ell _{f^*}]> \eta . \end{aligned}$$

(27)

Moreover, if $T_K(f)=\sup _{g\in F}\text {MOM}_K[\ell _f-\ell _g]$ for all $f\in F$, then

$$\begin{aligned} T_K(f^{*})&= \sup _{f\in F \cap \left( f^{*}+ \sqrt{C_{K,r}}B_{L_2}\right) }\text {MOM}_K[\ell _{f^*}-\ell _f]\nonumber \\&\quad \vee \sup _{f\in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) }\text {MOM}_K[\ell _{f^*}-\ell _f]\leqslant \eta . \end{aligned}$$

(28)

By definition of $\hat{f}$ and (28), $T_K(\hat{f})\leqslant T_K(f^*)\leqslant \eta $. Moreover, by (27), any $f\in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) $ satisfies $T_K(f)\geqslant \text {MOM}_K[\ell _f-\ell _{f^*}]> \eta $. Therefore $\hat{f} \in F\cap (f^{*} + \sqrt{C_{K,r}} B_{L_2})$. $\square $
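The quantities $T_K(f)=\sup _{g\in F}\text {MOM}_K[\ell _f-\ell _g]$ and its minimizer can be computed by brute force on a finite class. The sketch below is a toy illustration under explicit assumptions: the class is a small grid of constant predictors, the loss is the absolute loss, blocks are consecutive, and all names and data are hypothetical.

```python
import statistics


def mom(values, K):
    """Median of the means over K consecutive blocks of equal size."""
    n = len(values) // K
    return statistics.median([sum(values[k * n:(k + 1) * n]) / n for k in range(K)])


def abs_loss(c, y):
    """Absolute loss of the constant predictor c at output y."""
    return abs(y - c)


def T(c, candidates, ys, K):
    """T_K(c) = max over g of MOM_K[l_c - l_g] on the finite class `candidates`."""
    return max(mom([abs_loss(c, y) - abs_loss(g, y) for y in ys], K) for g in candidates)


def minmax_mom(candidates, ys, K):
    """Minimizer of T_K over the finite class."""
    return min(candidates, key=lambda c: T(c, candidates, ys, K))


# Toy data (assumption): one block made of outliers, four clean blocks.
ys = [100.0] * 3 + [0.0, 1.0, 2.0] * 4
candidates = [0.0, 1.0, 2.0, 50.0, 100.0]
f_hat = minmax_mom(candidates, ys, K=5)
print(f_hat, T(f_hat, candidates, ys, K=5))
```

Since four of the five blocks are clean, every median is attained on a clean block, and the selected predictor is the one that minimizes the risk on the clean blocks.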

Lemma 4

Grant Assumption 6 and assume that $\theta -A^{-1}<-\theta $. On the event $\Omega _K$, (26) holds with $\eta = \theta C_{K,r}$.

Proof

Let $f\in F$ be such that $\Vert f-f^{*}\Vert _{L_2 } > \sqrt{C_{K,r}}$. By convexity of $F$, there exists $f_0 \in F \cap \left( f^{*} + \sqrt{C_{K,r}} S_{L_2}\right) $ and $\alpha > 1$ such that $f = f^{*} + \alpha (f_0 - f^{*})$. For all $i \in \{1,\ldots ,N \}$, let $\psi _i: {\mathbb {R}} \rightarrow {\mathbb {R}} $ be defined for all $u\in \mathbb {R}$ by

$$\begin{aligned} \psi _i(u) = \overline{\ell } (u + f^{*}(X_i), Y_i) - \overline{\ell } (f^{*}(X_i), Y_i). \end{aligned}$$

(29)

The functions $\psi _i$ are convex because $\overline{\ell }$ is and such that $\psi _i(0) = 0$, so $\alpha \psi _i(u) \le \psi _i(\alpha u)$ for all $u\in {\mathbb {R}}$ and $\alpha \ge 1$. As $\psi _i(f(X_i) - f^{*}(X_i) )= \overline{\ell } (f(X_i), Y_i) - \overline{\ell } (f^{*}(X_i), Y_i)$, for any block $B_k$,

$$\begin{aligned} P_{B_k} {{\mathcal {L}}}_f&= \frac{1}{|B_k|} \sum _{i \in B_k} \psi _i \big ( f(X_i)- f^{*}(X_i) \big )= \frac{1}{|B_k|} \sum _{i \in B_k} \psi _i(\alpha ( f_0(X_i)- f^{*}(X_i) ))\nonumber \\&\ge \frac{\alpha }{|B_k|} \sum _{i \in B_k} \psi _i(( f_0(X_i)- f^{*}(X_i))) = \alpha P_{B_k} {{\mathcal {L}}}_{f_0}. \end{aligned}$$

(30)

As $f_0 \in F \cap (f^* + \sqrt{C_{K,r}} S_{L_2})$, on $\Omega _K$, there are strictly more than $K/2$ blocks $B_k$ where $P_{B_k} {{\mathcal {L}}}_{f_0} \ge P{{\mathcal {L}}}_{f_0} - \theta C_{K,r}$. Moreover, from Assumption 6, $P{{\mathcal {L}}}_{f_0} \ge A^{-1} \Vert f_0-f^*\Vert _{L_2 }^2 = A^{-1}C_{K,r} $. Therefore, on strictly more than $K/2$ blocks $B_k$,

$$\begin{aligned} P_{B_k} {{\mathcal {L}}}_{f_0} \ge (A^{-1} - \theta ) C_{K,r}. \end{aligned}$$

(31)

From Eqs. (30) and (31), there are strictly more than $K/2$ blocks $B_k$ where $P_{B_k} {{\mathcal {L}}}_f \ge (A^{-1}- \theta ) C_{K,r} $. Therefore, on $\Omega _K$, as $(\theta - A^{-1}) < - \theta $,

$$\begin{aligned} \sup _{f \in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) } \,\, \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big )< (\theta - A^{-1}) C_{K,r}<-\theta C_{K,r}. \end{aligned}$$

In addition, on the event $\Omega _K$, for all $f \in F \cap (f^{*} + \sqrt{C_{K,r}}B_{L_2})$, there are strictly more than $K/2$ blocks $B_k$ where $|(P_{B_k}-P) {{\mathcal {L}}}_f | \le \theta C_{K,r} $. Therefore

$$\begin{aligned} \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le \theta C_{K,r} - P{{\mathcal {L}}}_f \le \theta C_{K,r}. \end{aligned}$$

$\square $
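Both halves of the proof use the same elementary property of the median: if strictly more than $K/2$ of the block values are at least $c$, then so is their median, and the median of the negated values is at most $-c$. A small sketch with hypothetical block values and an odd $K$:

```python
import statistics


def strict_majority_at_least(values, c):
    """True when strictly more than half of the values are >= c."""
    return sum(1 for v in values if v >= c) > len(values) / 2


# Hypothetical block values for K = 5: four blocks are >= 0.4, one is corrupted.
block_values = [0.4, -7.0, 0.5, 0.9, 0.45]
c = 0.4

med = statistics.median(block_values)
med_negated = statistics.median([-v for v in block_values])
print(strict_majority_at_least(block_values, c), med, med_negated)
```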

Lemma 5

Grant Assumption 6 and assume that $\theta - A^{-1}<-\theta $. On the event $\Omega _K$, $P{{\mathcal {L}}}_{\hat{f}} \le 2\theta C_{K,r}$.

Proof

Assume that $\Omega _K$ holds. From Lemmas 3 and 4, $\Vert \hat{f}-f^{*}\Vert _{L_2 } \le \sqrt{C_{K,r}}$. Therefore, on strictly more than $K/2$ blocks $B_k$, $P {{\mathcal {L}}}_{\hat{f}} \le P_{B_k} {{\mathcal {L}}}_{\hat{f}} + \theta C_{K,r}$. In addition, by definition of ${\hat{f}}$ and (28) (for $\eta = \theta C_{K,r}$),

$$\begin{aligned} \text {MOM}_K(\ell _{{\hat{f}}} - \ell _{f^{*}}) \le \sup _{f \in F} \text {MOM}_K(\ell _{f^{*}} - \ell _{f}) \le \theta C_{K,r}. \end{aligned}$$

As a consequence, there exist at least $K/2$ blocks $B_k$ where $P_{B_k} {{\mathcal {L}}}_{\hat{f}} \le \theta C_{K,r}$. Therefore, there exists at least one block $B_k$ where both $P {{\mathcal {L}}}_{\hat{f}} \le P_{B_k} {{\mathcal {L}}}_{\hat{f}} + \theta C_{K,r}$ and $P_{B_k} {{\mathcal {L}}}_{\hat{f}} \le \theta C_{K,r}$. Hence $P{{\mathcal {L}}}_{\hat{f}} \le 2\theta C_{K,r}$. $\square $
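The existence of a common block in the proof of Lemma 5 is a counting argument. Write $J_1$ for the set of blocks where $P {{\mathcal {L}}}_{\hat{f}} \le P_{B_k} {{\mathcal {L}}}_{\hat{f}} + \theta C_{K,r}$ and $J_2$ for the set of blocks where $P_{B_k} {{\mathcal {L}}}_{\hat{f}} \le \theta C_{K,r}$ (the names $J_1$ and $J_2$ are introduced here for illustration only). Since $|J_1|>K/2$ and $|J_2|\ge K/2$,

$$\begin{aligned} |J_1\cap J_2| \ge |J_1| + |J_2| - K > 0, \end{aligned}$$

and any $k\in J_1\cap J_2$ yields $P {{\mathcal {L}}}_{\hat{f}} \le P_{B_k} {{\mathcal {L}}}_{\hat{f}} + \theta C_{K,r} \le 2\theta C_{K,r}$.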

1.2.2 Stochastic argument

This section shows that $\Omega _K$ holds with probability at least (8).

Proposition 4

Grant Assumptions 1, 2, 5 and 6 and assume that $(1-\beta )K\ge |{{\mathcal {O}}}|$. Let $x>0$ and assume that $\beta (1-\alpha -x-8\gamma L/\theta )>1/2$. Then $\Omega _K$ holds with probability larger than $1-\exp (-x^2 \beta K/2)$.

Proof

Let ${{\mathcal {F}}}= F \cap \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) $ and set $\phi :t\in \mathbb {R}\rightarrow I \{ t\ge 2 \} + (t-1) I \{1 \le t < 2 \}$ so that, for all $t \in \mathbb {R}$, $I \{ t\ge 2 \} \le \phi (t) \le I \{ t\ge 1 \}$. Let $W_k = ((X_i,Y_i))_{i \in B_k}$ and $G_f(W_k) = (P_{B_k} - P){{\mathcal {L}}}_f$. Let

$$\begin{aligned} z(f)&= \sum _{k =1}^K I \{|G_f(W_k)|\le \theta C_{K,r} \}. \end{aligned}$$
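The role of $\phi $ is to replace the indicator in $z(f)$ by a 1-Lipschitz function. The sketch below is a numerical illustration under stated assumptions: it writes $\phi $ as the clipped ramp $t\mapsto \min (1,\max (0,t-1))$, checks the sandwich and Lipschitz properties on a grid, and compares $z(f)$ with its surrogate lower bound on hypothetical values of $G_f(W_k)$, with `threshold` standing for $\theta C_{K,r}$.

```python
def phi(t):
    """Clipped ramp: 0 for t <= 1, t - 1 on [1, 2], 1 for t >= 2."""
    return min(1.0, max(0.0, t - 1.0))


def indicator(cond):
    """1.0 if cond holds, 0.0 otherwise."""
    return 1.0 if cond else 0.0


grid = [k / 100.0 for k in range(-100, 401)]
# I{t >= 2} <= phi(t) <= I{t >= 1} on the grid.
sandwich = all(indicator(t >= 2) <= phi(t) <= indicator(t >= 1) for t in grid)
# |phi(s) - phi(t)| <= |s - t| on the grid.
lipschitz = all(abs(phi(s) - phi(t)) <= abs(s - t) + 1e-12 for s in grid for t in grid)

# Hypothetical block deviations G_f(W_k) and threshold theta * C_{K,r}.
G = [0.0, 0.2, 0.5, 0.9, 1.0, 1.5]
threshold = 1.0
z = sum(1 for g in G if abs(g) <= threshold)
surrogate = sum(1.0 - phi(2.0 * abs(g) / threshold) for g in G)
print(sandwich, lipschitz, z, surrogate)
```

The comparison `z >= surrogate` reflects the pointwise bound $I\{|G|\le \theta C_{K,r}\}\ge 1-\phi (2|G|/(\theta C_{K,r}))$, which follows from the lower half of the sandwich.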

Let $\mathcal {K}$ denote the set of indices of blocks which have not been corrupted by outliers, $\mathcal {K} = \{k \in \{1,\ldots ,K \} : B_k \subset \mathcal {I}\}$ and let $f \in {{\mathcal {F}}}$. Basic algebraic manipulations show that

$$\begin{aligned} z(f)\ge & {} |\mathcal {K}| - \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ) \\&- \sum _{k \in \mathcal {K} } \mathbb {E}\phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) . \end{aligned}$$

By Assumptions 1 and 5, using that $C_{K,r}^2\ge \left\| f-f^*\right\| ^2_{L_2 }[(4L^2K)/(\theta ^2\alpha N)]$,

$$\begin{aligned}&\mathbb {E}\phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \\&\quad \le \mathbb {P} \bigg ( |G_f(W_k)| \ge \frac{\theta C_{K,r}}{2} \bigg ) \le \frac{4}{\theta ^2C_{K,r}^2} \mathbb {E}G_f(W_k)^2 = \frac{4}{\theta ^2 C_{K,r}^2} \mathbb {V}ar (P_{B_k}{{\mathcal {L}}}_f) \\&\quad \le \frac{4K^2}{\theta ^2C_{K,r}^2N^2} \sum _{i \in B_k} \mathbb {E} [{{\mathcal {L}}}_f^2(X_i,Y_i)] \le \frac{4L^2K}{\theta ^2C_{K,r}^2N}\Vert f-f^{*}\Vert ^2_{L_2 } \le \alpha . \end{aligned}$$

Therefore,

$$\begin{aligned} z(f) \ge |\mathcal {K}|(1-\alpha ) -\sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ). \end{aligned}$$

(32)

Using McDiarmid’s inequality [11, Theorem 6.2], for all $x>0$, with probability larger than $1-\exp (-x^2 |{{\mathcal {K}}}| /2)$,

$$\begin{aligned}&\sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ) \\&\quad \le x|\mathcal {K}| + \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ). \end{aligned}$$

Let $\epsilon _1, \ldots , \epsilon _K$ denote independent Rademacher variables independent of the $(X_i, Y_i), i\in {{\mathcal {I}}}$. By the Giné-Zinn symmetrization argument,

$$\begin{aligned}&\sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ) \\&\quad \le x|\mathcal {K}| + 2 \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \end{aligned}$$

As $\phi $ is 1-Lipschitz with $\phi (0)=0$, using the contraction lemma [28, Chapter 4],

$$\begin{aligned} \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|)&\le 2\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \frac{ G_f(W_k)}{\theta C_{K,r}} \\&= 2\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \frac{(P_{B_k}- P){{\mathcal {L}}}_f}{\theta C_{K,r}}. \end{aligned}$$

Let $(\sigma _i: i \in \cup _{k\in {{\mathcal {K}}}}B_k)$ be a family of independent Rademacher variables independent of $(\epsilon _k)_{k \in \mathcal {K}}$ and $(X_i, Y_i)_{i \in {{\mathcal {I}}}}$. It follows from the Giné-Zinn symmetrization argument that

$$\begin{aligned} \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \frac{(P_{B_k}- P){{\mathcal {L}}}_f}{ C_{K,r}}\le 2 \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \frac{K}{N}\sum _{i \in \cup _{k\in {{\mathcal {K}}}}B_k } \sigma _i \frac{{{\mathcal {L}}}_f(X_i,Y_i)}{ C_{K,r}}. \end{aligned}$$

By the Lipschitz property of the loss, the contraction principle applies and

$$\begin{aligned} \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{i \in \cup _{k\in {{\mathcal {K}}}}B_k } \sigma _i \frac{{{\mathcal {L}}}_f(X_i,Y_i)}{ C_{K,r}} \le L\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{C_{K,r}}. \end{aligned}$$

To bound from above the right-hand side in the last inequality, consider two cases: (1) $C_{K,r}= \tilde{r}_2^2(\gamma )$ or (2) $C_{K,r} = 4L^2K/(\alpha \theta ^2 N)$. In the first case, by definition of the complexity parameter $\tilde{r}_2(\gamma )$ in (6),

$$\begin{aligned}&\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{C_{K,r}} \\&\quad = \mathbb {E} \sup _{f \in F: \Vert f-f^{*}\Vert _{L_2 } \le {\tilde{r}}_2(\gamma ) } \frac{1}{\tilde{r}_2^2(\gamma )} \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i (f-f^{*})(X_i)\bigg | \\&\quad \le \frac{\gamma |{{\mathcal {K}}}| N}{K}. \end{aligned}$$

In the second case,

$$\begin{aligned}&\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \frac{\sigma _i(f-f^{*})(X_i)}{C_{K,r}} \\&\quad \le \mathbb {E} \bigg [ \sup _{\begin{array}{c} f \in F:\\ \Vert f-f^{*}\Vert _{L_2 } \le \tilde{r}_2(\gamma ) \end{array}} \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \frac{\sigma _i (f-f^{*})(X_i)}{\tilde{r}_2^2(\gamma )} \bigg | \\&\qquad \vee \sup _{\begin{array}{c} f \in F:\\ \tilde{r}_2(\gamma ) \le \Vert f-f^{*}\Vert _{L_2 } \le \sqrt{\frac{4L^2K}{\alpha \theta ^2 N}} \end{array} } \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{\frac{4L^2K}{\alpha \theta ^2 N}} \bigg | \bigg ] . \end{aligned}$$

Let $f\in F$ be such that $\tilde{r}_2(\gamma ) \le \left\| f-f^*\right\| _{L_2 }\le \sqrt{[4L^2K]/[\alpha \theta ^2 N]}$; by convexity of $F$, there exists $f_0\in F$ such that $\left\| f_0-f^*\right\| _{L_2 } = \tilde{r}_2(\gamma )$ and $f = f^*+\kappa (f_0-f^*)$ with $\kappa = \left\| f-f^*\right\| _{L_2 }/\tilde{r}_2(\gamma )\ge 1$. Therefore,

$$\begin{aligned} \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{\frac{4L^2K}{\alpha \theta ^2 N}} \bigg |&\le \frac{1}{\tilde{r}_2(\gamma ) } \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{\Vert f-f^{*}\Vert _{L_2 }} \bigg | \\&= \frac{1}{\tilde{r}_2^2(\gamma ) } \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i (f_0-f^{*})(X_i) \bigg | \end{aligned}$$

and so

$$\begin{aligned}&\sup _{\begin{array}{c} f \in F:\\ \tilde{r}_2(\gamma ) \le \Vert f-f^{*}\Vert _{L_2 } \le \sqrt{\frac{4L^2K}{\alpha \theta ^2 N}} \end{array}} \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{\frac{4L^2K}{\alpha \theta ^2 N}} \bigg | \\&\quad \le \frac{1}{\tilde{r}_2^2(\gamma ) } \sup _{\begin{array}{c} f \in F:\\ \Vert f-f^{*}\Vert _{L_2 } = \tilde{r}_2(\gamma ) \end{array} } \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i (f-f^{*})(X_i) \bigg |. \end{aligned}$$

By definition of $\tilde{r}_2(\gamma )$, it follows that

$$\begin{aligned} \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \bigg |&\sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{ C_{K,r}} \bigg | \le \frac{\gamma |{{\mathcal {K}}}| N}{K}. \end{aligned}$$

Therefore, as $|\mathcal {K}| \ge K-|\mathcal {O}| \ge \beta K$, with probability larger than $1-\exp (-x^2 \beta K/2)$, for all $f\in F$ such that $\left\| f-f^*\right\| _{L_2 }\le \sqrt{C_{K,r}}$,

$$\begin{aligned} z(f) \ge |\mathcal {K}|\left( 1-\alpha - x - \frac{8 \gamma L}{\theta }\right) > \frac{K}{2}. \end{aligned}$$

(33)

$\square $

1.2.3 End of the proof of Theorem 2

Theorem 2 follows from Lemmas 3, 4, 5 and Proposition 4 for the choice of constants

$$\begin{aligned} \theta = 1/(3A), \quad \alpha = 1/24, \quad x = 1/24, \quad \beta = 4/7 \quad \text{ and } \quad \gamma = 1/(575 AL). \end{aligned}$$
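Two consequences of this choice can be verified with exact rational arithmetic: the rate $x^2\beta /2$ in the probability bound of Proposition 4, and the number of blocks required by the condition $(1-\beta )K\ge |{{\mathcal {O}}}|$. The sketch below uses only the values of $x$ and $\beta $ given above; the variable names are illustrative.

```python
from fractions import Fraction

x = Fraction(1, 24)
beta = Fraction(4, 7)

# Rate in the bound 1 - exp(-x^2 * beta * K / 2) of Proposition 4.
exponent = x ** 2 * beta / 2

# (1 - beta) * K >= |O| is equivalent to K >= |O| / (1 - beta).
blocks_per_outlier = 1 / (1 - beta)

print(exponent, blocks_per_outlier)
```

These are the constants that reappear in the proof of Theorem 3, where $K\ge 7|{{\mathcal {O}}}|/3$ and the probability bound involves $\exp (-K/2016)$.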

1.3 Proof of Theorem 3

Let $K \in \big [ 7|{{\mathcal {O}}}|/3 , N \big ]$ and consider the event $\Omega _K$ defined in (25). It follows from the proof of Lemmas 3 and 4 that $T_K(f^*)\le \theta C_{K,r}$ on $\Omega _K$. Setting $\theta =1/(3A)$, on $\cap _{J=K}^N \Omega _J$, $f^*\in {\hat{R}}_J$ for all $J=K,\ldots , N$, so $\cap _{J=K}^N {\hat{R}}_J\ne \emptyset $. By definition of ${\hat{K}}$, it follows that ${\hat{K}}\le K$ and by definition of ${\tilde{f}}$, ${\tilde{f}} \in {\hat{R}}_K$ which means that $T_K({\tilde{f}})\le \theta C_{K,r}$. It is proved in Lemmas 3 and 4 that on $\Omega _K$, if $f\in F$ satisfies $\left\| f-f^*\right\| _{L_2 }\ge \sqrt{C_{K,r}}$ then $T_K(f)> \theta C_{K,r}$. Therefore, $\left\| {\tilde{f}}-f^*\right\| _{L_2 }\le \sqrt{C_{K,r}}$. On $\Omega _K$, since $\left\| {\tilde{f}}-f^*\right\| _{L_2 }\le \sqrt{C_{K,r}}$, $P{{\mathcal {L}}}_{{\tilde{f}}}\le 2 \theta C_{K,r}$. Hence, on $\cap _{J=K}^N \Omega _J$, the conclusions of Theorem 3 hold. Finally, by Proposition 4,

$$\begin{aligned} {\mathbb {P}}\left[ \cap _{J=K}^N \Omega _J\right] \ge 1-\sum _{J=K}^N \exp (-K/2016)\ge 1-4 \exp (-K/2016). \end{aligned}$$

1.4 Proof of Theorem 4

The proof of Theorem 4 follows the same lines as that of Theorem 2. We only sketch the arguments that differ because of the localization by the excess loss and the lack of a Bernstein condition.

Define the event $\Omega _K^\prime $ in the same way as $\Omega _K$ in (25) where $C_{K,r}$ is replaced by $\bar{r}_2^2(\gamma )$ and the $L_2$ localization is replaced by the “excess loss localization”:

$$\begin{aligned} \Omega ^\prime _K= & {} \bigg \{ \forall f \in ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}, \exists J\subset \{1,\ldots ,K\}: |J|>K/2 \nonumber \\&\quad \text{ and } \forall k\in J, \left| (P_{B_k} - P){{\mathcal {L}}}_f \right| \le (1/4) \bar{r}_2^2(\gamma ) \bigg \} \end{aligned}$$

(34)

where $({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )} =\{f\in F: P{{\mathcal {L}}}_{f}\le \bar{r}_2^2(\gamma )\}$. Our first goal is to show that on the event $\Omega _K^\prime $, $P{{\mathcal {L}}}_{{\hat{f}}}\le \bar{r}_2^2(\gamma )$ (this is Lemma 7 below). We will then bound ${\mathbb {P}}[\Omega _K^\prime ]$ from below.

Lemma 6

Grant Assumptions 1 and 2. For every $r\ge 0$, the set $({{\mathcal {L}}}_F)_r:=\{f\in F:P{{\mathcal {L}}}_f\le r\}$ is convex and relatively closed to F in $L_1(\mu )$. Moreover, if $f\in F$ is such that $P{{\mathcal {L}}}_f>r$ then there exist $f_0\in F$ and $\alpha $ with $1<\alpha \le P{{\mathcal {L}}}_f/r$ such that $(f-f^*)=\alpha (f_0-f^*)$ and $P{{\mathcal {L}}}_{f_0} = r$.

Proof

Let f and g be in $({{\mathcal {L}}}_F)_r$ and $0\le \alpha \le 1$. We have $\alpha f + (1-\alpha )g\in F$ because F is convex and for all $x\in {{\mathcal {X}}}$ and $y\in \mathbb {R}$, using the convexity of $u\rightarrow {\bar{\ell }}(u+f^*(x), y)$, we have

$$\begin{aligned}&\ell _{\alpha f + (1-\alpha )g}(x,y) - \ell _{f^*}(x,y) \\&\quad = {\bar{\ell }}(\alpha (f-f^*)(x) + (1-\alpha )(g-f^*)(x) + f^*(x), y) - {\bar{\ell }}(f^*(x),y)\\&\quad \le \alpha \big ({\bar{\ell }}((f-f^*)(x)+ f^*(x), y) - {\bar{\ell }}(f^*(x),y)\big ) \\&\qquad + (1-\alpha )\big ({\bar{\ell }}((g-f^*)(x) + f^*(x), y) - {\bar{\ell }}(f^*(x),y)\big )\\&\quad =\alpha (\ell _f - \ell _{f^*}) + (1-\alpha )(\ell _g-\ell _{f^*}) \end{aligned}$$

and so $P{{\mathcal {L}}}_{\alpha f + (1-\alpha )g}\le \alpha P{{\mathcal {L}}}_f + (1-\alpha )P{{\mathcal {L}}}_g$. Given that $P{{\mathcal {L}}}_f, P{{\mathcal {L}}}_g\le r$ we also have $P{{\mathcal {L}}}_{\alpha f + (1-\alpha )g}\le r$. Therefore, $\alpha f + (1-\alpha )g\in ({{\mathcal {L}}}_F)_r$ and $({{\mathcal {L}}}_F)_r$ is convex.

For all $f,g\in F$, the Lipschitz property of the loss gives $|P{{\mathcal {L}}}_f - P{{\mathcal {L}}}_g|\le L\left\| f-g\right\| _{L_1(\mu )}$, so that $f\in F\rightarrow P{{\mathcal {L}}}_f$ is continuous on F in $L_1(\mu )$ and therefore its level sets, such as $({{\mathcal {L}}}_F)_r$, are relatively closed to F in $L_1(\mu )$.

Finally, let $f\in F$ be such that $P{{\mathcal {L}}}_f >r$. Define $\alpha _0 = \sup \{\alpha \ge 0: f^*+\alpha (f-f^*)\in ({{\mathcal {L}}}_F)_r\}$. Note that $P{{\mathcal {L}}}_{f^*+\alpha (f-f^*)}\le \alpha P{{\mathcal {L}}}_f= r$ for $\alpha = r/P{{\mathcal {L}}}_f$ so that $\alpha _0\ge r/P{{\mathcal {L}}}_f$. Since $({{\mathcal {L}}}_F)_r$ is relatively closed to F in $L_1(\mu )$, we have $f^*+\alpha _0(f-f^*)\in ({{\mathcal {L}}}_F)_r$ and in particular $\alpha _0<1$; otherwise, by convexity of $({{\mathcal {L}}}_F)_r$, we would have $f\in ({{\mathcal {L}}}_F)_r$. Moreover, by maximality of $\alpha _0$, $f_0 = f^*+\alpha _0(f-f^*)$ is such that $P{{\mathcal {L}}}_{f_0}=r$ and the result follows for $\alpha = \alpha _0^{-1}$. $\square $
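The construction of $f_0$ in the proof above is a one-dimensional search along the segment from $f^*$ to $f$. The following Python sketch illustrates it under assumptions that are ours and not part of the paper: the excess risk along the segment, $\alpha \mapsto P{{\mathcal {L}}}_{f^*+\alpha (f-f^*)}$, is available as a hypothetical callable `phi` (convex, continuous, with `phi(0) = 0` and `phi(1) > r`), and $\alpha _0$ is located by bisection.

```python
def rescale_to_level(phi, r, tol=1e-12):
    """Return alpha_0 = sup{a in [0, 1] : phi(a) <= r}.

    `phi` stands for a -> P L_{f* + a (f - f*)} (hypothetical callable).
    It is assumed convex and continuous with phi(0) <= r < phi(1), so the
    set {a in [0, 1] : phi(a) <= r} is an interval [0, alpha_0].
    """
    if not (phi(0.0) <= r < phi(1.0)):
        raise ValueError("need phi(0) <= r < phi(1)")
    lo, hi = 0.0, 1.0  # invariant: phi(lo) <= r < phi(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) <= r:
            lo = mid
        else:
            hi = mid
    return lo


# Toy excess risk along the segment (illustrative only): phi(a) = 2 a^2.
def phi(a):
    return 2.0 * a * a


r = 0.5
alpha_0 = rescale_to_level(phi, r)  # point where the excess risk equals r
alpha = 1.0 / alpha_0               # scaling factor of Lemma 6
```

In this toy case $\alpha _0 = 1/2$ and $\alpha = 2$, which lies in $(1, P{{\mathcal {L}}}_f/r] = (1,4]$ as the lemma requires.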

Lemma 7

Grant Assumptions 1 and 2. On the event $\Omega _K^\prime $, $P{{\mathcal {L}}}_{{\hat{f}}}\le \bar{r}_2^2(\gamma )$.

Proof

Let $f\in F$ be such that $P{{\mathcal {L}}}_f > \bar{r}_2^2(\gamma )$. It follows from Lemma 6 that there exists $\alpha \ge 1$ and $f_0\in F$ such that $P{{\mathcal {L}}}_{f_0} = \bar{r}_2^2(\gamma )$ and $f-f^* = \alpha (f_0-f^*)$. According to (30), we have for every $k\in \{1, \ldots , K\}$, $P_{B_k} {{\mathcal {L}}}_f\ge \alpha P_{B_k}{{\mathcal {L}}}_{f_0}$. Since $f_0\in ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}$, on the event $\Omega _K^\prime $, there are strictly more than K / 2 blocks $B_k$ such that $P_{B_k}{{\mathcal {L}}}_{f_0}\ge P{{\mathcal {L}}}_{f_0}- (1/4) \bar{r}_2^2(\gamma ) = (3/4)\bar{r}_2^2(\gamma )$ and so $P_{B_k}{{\mathcal {L}}}_{f}\ge (3/4)\bar{r}_2^2(\gamma )$. As a consequence, we have

$$\begin{aligned} \sup _{f \in F \backslash ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}} \,\, \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le (-3/4) \bar{r}_2^2(\gamma ). \end{aligned}$$

(35)

Moreover, on the event $\Omega _K^\prime $, for all $f\in ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}$, there are strictly more than K / 2 blocks $B_k$ such that $P_{B_k}(-{{\mathcal {L}}}_f)\le (1/4) \bar{r}_2^2(\gamma ) - P{{\mathcal {L}}}_f\le (1/4) \bar{r}_2^2(\gamma )$. Therefore,

$$\begin{aligned} \sup _{f\in ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}}\text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le (1/4) \bar{r}_2^2(\gamma ) . \end{aligned}$$

(36)

We conclude from (35) and (36) that $\sup _{f\in F} \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le (1/4) \bar{r}_2^2(\gamma )$ and that every $f\in F$ such that $P{{\mathcal {L}}}_{f}> \bar{r}_2^2(\gamma )$ satisfies $\text {MOM}_{K}\big (\ell _{f}-\ell _{f^*}\big )\ge (3/4)\bar{r}_2^2(\gamma )$. But, by definition of ${\hat{f}}$, we have

$$\begin{aligned} \text {MOM}_{K}\big (\ell _{{\hat{f}}}-\ell _{f^*}\big )\le \sup _{f\in F} \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le (1/4) \bar{r}_2^2(\gamma ) . \end{aligned}$$

Therefore, we necessarily have $P{{\mathcal {L}}}_{{\hat{f}}}\le \bar{r}_2^2(\gamma )$. $\square $
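The argument above only uses the fact that a median of block means is controlled as soon as strictly more than K / 2 blocks behave well. A minimal Python sketch of the $\text {MOM}_{K}$ operator makes this concrete. The function name `median_of_means`, the use of consecutive equal-size blocks and the toy data are our assumptions for illustration, not the paper's implementation.

```python
import statistics


def median_of_means(values, K):
    """MOM_K: split `values` into K consecutive blocks of equal size,
    average each block and return the median of the K block means.
    Points left over when K does not divide len(values) are dropped."""
    n = len(values)
    if K < 1 or K > n:
        raise ValueError("K must satisfy 1 <= K <= len(values)")
    m = n // K  # common block size
    block_means = [sum(values[k * m:(k + 1) * m]) / m for k in range(K)]
    return statistics.median(block_means)


# Toy data: 27 inliers equal to 1.0 and 3 outliers equal to 1000.0.
N, K = 30, 10
data = [1.0] * N
for i in (0, 10, 20):  # the outliers land in blocks 0, 3 and 6
    data[i] = 1000.0

mom_value = median_of_means(data, K)  # driven by the 7 clean blocks
mean_value = sum(data) / N            # dragged away by the outliers
```

Here only 3 of the 10 blocks are corrupted, so the median is computed among clean block means, whereas the plain empirical mean is not.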

Now, we prove that $\Omega _K^\prime $ is an exponentially large event, using an argument similar to the proof of Proposition 4.

Proposition 5

Grant Assumptions 1, 2 and 7 and assume that $(1-\beta )K\ge |{{\mathcal {O}}}|$ and $\beta (1-1/12-32\gamma L)>1/2$. Then $\Omega _K^\prime $ holds with probability larger than $1-\exp (-\beta K/1152)$.

Sketch of proof

The proof of Proposition 5 follows the same lines as that of Proposition 4. Let us point out the main differences. We set ${{\mathcal {F}}}^\prime = ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}$ and for all $f\in {{\mathcal {F}}}^\prime $, $z^\prime (f) = \sum _{k=1}^K I\{|G_f(W_k)|\le (1/4) \bar{r}_2^2(\gamma )\}$ where $G_f(W_k)$ is the same quantity as in the proof of Proposition 4. Let us consider the contraction $\phi $ introduced in the proof of Proposition 4. By definition of $\bar{r}_2^2(\gamma )$ and $V_K(\cdot )$, we have

$$\begin{aligned}&\mathbb {E}\phi (8 (\bar{r}_2^2(\gamma ))^{-1} | G_f(W_k)|)\\&\quad \le \mathbb {P} \bigg ( |G_f(W_k)| \ge \frac{\bar{r}_2^2(\gamma )}{8} \bigg ) \le \frac{64}{(\bar{r}_2^2(\gamma ))^2} \mathbb {E}G_f(W_k)^2 = \frac{64}{ (\bar{r}_2^2(\gamma ))^2} \mathbb {V}ar (P_{B_k}{{\mathcal {L}}}_f) \\&\quad \le \frac{64K^2}{(\bar{r}_2^2(\gamma ))^2N^2} \sum _{i \in B_k} \mathbb {V}ar_{P_i}({{\mathcal {L}}}_f) \le \frac{64K}{(\bar{r}_2^2(\gamma ))^2N} \sup \{\mathbb {V}ar_{P_i}({{\mathcal {L}}}_f):f\in {{\mathcal {F}}}^\prime , i\in {{\mathcal {I}}}\} \\&\quad \le \frac{64K}{(\bar{r}_2^2(\gamma ))^2N} \sup \{\mathbb {V}ar_{P_i}({{\mathcal {L}}}_f):P{{\mathcal {L}}}_f \le \bar{r}_2^2(\gamma ), i\in {{\mathcal {I}}}\} \le \frac{1}{24} . \end{aligned}$$

Using McDiarmid’s inequality, the Giné-Zinn symmetrization argument, the contraction lemma twice and the Lipschitz property of the loss function, as in the proof of Proposition 4, we obtain with probability larger than $1-\exp (-|{{\mathcal {K}}}|/1152)$, for all $f\in {{\mathcal {F}}}^\prime $,

$$\begin{aligned} z^\prime (f)\ge |{{\mathcal {K}}}|(1-1/12) -\frac{32LK}{ N} \mathbb {E}\sup _{f\in {{\mathcal {F}}}^\prime } \frac{1}{\bar{r}_2^2(\gamma )}\left| \sum _{i\in \cup _{k\in {{\mathcal {K}}}}B_k} \sigma _i (f-f^*)(X_i)\right| . \end{aligned}$$

(37)

Now, it remains to use the definition of $\bar{r}_2^2(\gamma )$ to bound the expected supremum in the right-hand side of (37) to get

$$\begin{aligned} \mathbb {E}\sup _{f\in {{\mathcal {F}}}^\prime } \frac{1}{\bar{r}_2(\gamma )^2}\left| \sum _{i\in \cup _{k\in {{\mathcal {K}}}}B_k} \sigma _i (f-f^*)(X_i)\right| \le \frac{\gamma |{{\mathcal {K}}}|N}{K}. \end{aligned}$$

(38)

$\square $
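For the reader's convenience, here is how (37) and (38) combine. The only ingredient not restated in this sketch is the lower bound $|{{\mathcal {K}}}|\ge \beta K$, which we assume is obtained from $(1-\beta )K\ge |{{\mathcal {O}}}|$ as in the proof of Proposition 4. Plugging (38) into (37), the count on the left-hand side of (37) is bounded from below by

$$\begin{aligned} |{{\mathcal {K}}}|(1-1/12) - \frac{32LK}{N}\cdot \frac{\gamma |{{\mathcal {K}}}|N}{K} = |{{\mathcal {K}}}|\left( 1-\frac{1}{12}-32\gamma L\right) \ge \beta K\left( 1-\frac{1}{12}-32\gamma L\right) > \frac{K}{2}, \end{aligned}$$

where the last inequality is the assumption $\beta (1-1/12-32\gamma L)>1/2$ of Proposition 5. Hence, for every $f\in {{\mathcal {F}}}^\prime $, strictly more than K / 2 of the indicators in the definition of $z^\prime (f)$ are equal to one.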

Proof of Theorem 4

The proof of Theorem 4 follows from Lemma 7 and Proposition 5 for $\beta =4/7$ and $\gamma = 1/(768L)$. $\square $

Proof of Lemma 1

Proof

We have

$$\begin{aligned} \frac{1}{\sqrt{N}} \mathbb {E} \sup _{ f \in F: \Vert f-f^{*}\Vert _{L_2 } \le r} \sum _{i=1}^N \sigma _i (f-f^{*})(X_i)&= \mathbb {E} \sup _{ t \in \mathbb {R}^d: \mathbb {E} \langle t,X \rangle ^2 \le r^2 } \langle t, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle . \end{aligned}$$

Let $\Sigma =\mathbb {E} X^TX$ denote the covariance matrix of X and consider its SVD, $\Sigma = QDQ^T$ where $Q = [Q_1|\ldots |Q_d]\in {\mathbb {R}}^{d\times d}$ is an orthogonal matrix and $D = \mathrm{diag}(d_1,\ldots ,d_d)$ is a diagonal $d\times d$ matrix with non-negative entries. For all $t\in \mathbb {R}^d$, we have $\mathbb {E}\langle X,t \rangle ^2 = t^T \Sigma t = \sum _{j=1}^d d_j \langle t,Q_j \rangle ^2$. Then

$$\begin{aligned}&\mathbb {E} \sup _{ t \in \mathbb {R}^d: \sqrt{\mathbb {E} \langle t,X \rangle ^2 } \le r } \langle t, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle \\&\quad = \mathbb {E} \sup _{ t \in \mathbb {R}^d: \sqrt{\mathbb {E} \langle t ,X \rangle ^2 } \le r } \langle \sum _{j=1}^d \langle t,Q_j \rangle Q_j , \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle \\&\quad = \mathbb {E} \sup _{ t \in \mathbb {R}^d: \sqrt{ \sum _{j=1}^d d_j \langle t,Q_j \rangle ^2 } \le r } \sum _{j=1:d_j\ne 0}^d \sqrt{d_j} \langle t,Q_j \rangle \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle \\&\quad \le r \mathbb {E} \sqrt{ \sum _{j=1:d_j\ne 0}^d \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle ^2} \le r \sqrt{ \mathbb {E} \sum _{j=1:d_j\ne 0}^d \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle ^2} . \end{aligned}$$

Moreover, for any j such that $d_j\ne 0$,

$$\begin{aligned}&\mathbb {E} \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle ^2 \\&\quad = \mathbb {E} \frac{1}{N} \sum _{k,l=1}^N \sigma _l \sigma _k \langle \frac{Q_j}{\sqrt{d_j}}, X_k \rangle \langle \frac{Q_j}{\sqrt{d_j}}, X_l \rangle = \frac{1}{N} \sum _{k=1}^N \mathbb {E} \langle \frac{Q_j}{\sqrt{d_j}}, X_k \rangle ^2 \\&\quad = \frac{1}{N} \sum _{k=1}^N \bigg (\frac{Q_j}{\sqrt{d_j}}\bigg )^T \mathbb {E} X_k^TX_k \bigg (\frac{Q_j}{\sqrt{d_j}}\bigg ) = \frac{1}{N} \sum _{k=1}^N \bigg (\frac{Q_j}{\sqrt{d_j}}\bigg )^T \Sigma \bigg (\frac{Q_j}{\sqrt{d_j}}\bigg ) \end{aligned}$$

By orthonormality, $Q^TQ_j = e_j$ and $Q_j^TQ = e_j^T$, then, for any j such that $d_j\ne 0$,

$$\begin{aligned} \mathbb {E} \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle ^2 = \frac{1}{N} \sum _{k=1}^N \frac{1}{d_j} e_j^T D e_j = 1. \end{aligned}$$

Finally, we obtain

$$\begin{aligned} \frac{1}{\sqrt{N}} \mathbb {E} \sup _{ f \in F: \Vert f-f^{*}\Vert _{L_2 } \le r} \sum _{i=1}^N \sigma _i (f-f^{*})(X_i) \le r \sqrt{\sum _{j=1}^d \mathbf{1}_{\{d_j\ne 0\}}} = r \sqrt{\text {Rank}(\Sigma )} \end{aligned}$$

and therefore the fixed point $\tilde{r}_2(\gamma )$ is such that

$$\begin{aligned} \tilde{r}_2(\gamma )&= \inf \bigg \{ r> 0, \forall J \in \mathcal {I}: |J| \ge N/2, \\&\quad \mathbb {E}\sup _{t \in \mathbb {R}^d : \sqrt{\mathbb {E} \langle t-t^{*},X \rangle ^2 } \le r } \sum _{i \in J } \sigma _i \langle X_i,t-t^{*} \rangle \, \le r^2 |J| \gamma \bigg \}\\&\le \inf \bigg \{ r > 0, \forall J \in \mathcal {I}: |J| \ge N/2, \quad r\sqrt{\text {Rank}(\Sigma )} \le r^2 \sqrt{|J|} \gamma \bigg \} \\&\le \sqrt{\frac{\text {Rank}(\Sigma )}{2\gamma ^2N}}. \end{aligned}$$

$\square $
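The bound $r\sqrt{\text {Rank}(\Sigma )}$ on the localized Rademacher complexity can be checked by simulation. The sketch below is a Monte Carlo illustration under simplifying assumptions of ours: $\Sigma $ is diagonal (so that Q is the identity), the $X_i$ are centered Gaussian, and the helper name `rademacher_sup` and all numerical values are hypothetical. In this setting the supremum over $\{t: t^T\Sigma t\le r^2\}$ has a closed form by the Cauchy-Schwarz inequality.

```python
import math
import random


def rademacher_sup(d_diag, N, r, rng):
    """One draw of sup_{t : t^T Sigma t <= r^2} <t, N^{-1/2} sum_i sigma_i X_i>
    for Sigma = diag(d_diag) and X_i ~ N(0, Sigma).

    Coordinates with d_j = 0 vanish almost surely, and on the remaining
    ones Cauchy-Schwarz gives the value r * ||(v_j / sqrt(d_j))_j||_2.
    """
    support = [j for j, dj in enumerate(d_diag) if dj != 0.0]
    v = {j: 0.0 for j in support}
    for _ in range(N):
        sigma = rng.choice((-1.0, 1.0))  # Rademacher sign
        for j in support:
            v[j] += sigma * math.sqrt(d_diag[j]) * rng.gauss(0.0, 1.0)
    sq = sum((v[j] / math.sqrt(N)) ** 2 / d_diag[j] for j in support)
    return r * math.sqrt(sq)


rng = random.Random(0)
d_diag = [2.0, 0.5, 0.0, 0.0]  # rank(Sigma) = 2 in dimension 4
N, r, reps = 50, 1.5, 2000

draws = [rademacher_sup(d_diag, N, r, rng) for _ in range(reps)]
mc_mean = sum(draws) / reps                             # estimates the expected supremum
mc_second_moment = sum(x * x for x in draws) / reps     # should be close to r^2 * rank
rank = sum(1 for dj in d_diag if dj != 0.0)
bound = r * math.sqrt(rank)                             # bound from the proof above
```

The second moment matches $r^2\,\text {Rank}(\Sigma )$ up to simulation error, and the gap between `mc_mean` and `bound` is exactly the Jensen step in the proof.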

Proofs of the results of Sect. 5

We begin this section with a simple lemma that follows from the convexity of F.

Lemma 8

For any $f \in F$,

$$\begin{aligned} \lim _{t \rightarrow 0^+} \frac{R(f^*+t(f-f^*)) - R(f^*)}{t} \ge 0 \end{aligned}$$

where we recall that $R(f) = \mathbb {E}_{(X,Y) \sim P} [\ell _f(X,Y)]$.

Proof

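Let $t \in (0,1)$. By convexity of F, $f^* + t(f-f^*) \in F$ and $R(f^*+t(f-f^*)) - R(f^*) \ge 0$ because $f^*$ minimizes the risk over F. Dividing by $t>0$ and letting $t\rightarrow 0^+$ gives the result; the limit exists because $t\mapsto R(f^*+t(f-f^*))$ is convex, so that the difference quotient is non-decreasing in t. $\square $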

1.1 Proof of Theorem 6

Let $r >0$. Let $f\in F$ be such that $\left\| f-f^*\right\| _{L_2}\le r$. For all $x\in {{\mathcal {X}}}$ denote by $F_{Y|X=x}$ the conditional c.d.f. of Y given $X=x$. We have

$$\begin{aligned} \mathbb {E} \bigg [ \ell _f(X,Y) | X = x \bigg ]&= (\tau -1) \int \mathbf{1}_{y \le f(x)} (y-f(x))F_{Y|X=x}(\mathrm{d}y) \\&\quad + \tau \int \mathbf{1}_{y> f(x)} (y-f(x))F_{Y|X=x}(\mathrm{d}y) \\&= \int \mathbf{1}_{y > f(x)} (y-f(x))F_{Y|X=x}(\mathrm{d}y) \\&\quad + (\tau -1) \int \mathbf{1}_{\mathbb {R}} (y-f(x))F_{Y|X=x}(\mathrm{d}y). \end{aligned}$$

By Fubini’s theorem,

$$\begin{aligned} \int \mathbf{1}_{z \ge f(x)} (1-F_{Y|X=x}(z))\text {d}z&= \int \mathbf{1}_{z \ge f(x)}\bigg ( 1 - {\mathbb {P}}(Y \le z |X=x ) \bigg ) \mathrm{d}z \\&= \int \mathbf{1}_{z \ge f(x)} \mathbb {E} [ \mathbf{1}_{ Y> z }|X=x ] \text {d}z\\&= \int \int \mathbf{1}_{ y> z \ge f(x) } f_{Y|X=x}(y)\text {d}y \text {d}z \\&= \int \mathbf{1}_{ y> f(x) }(y-f(x)) f_{Y|X=x}(y) \text {d}y\\&= \int \mathbf{1}_{y > f(x)} (y-f(x)) F_{Y|X=x}(\mathrm{d}y). \end{aligned}$$

Therefore,

$$\begin{aligned}&\mathbb {E} \bigg [ \ell _f(X,Y) | X=x\bigg ] \\&\quad = \int \mathbf{1}_{y \ge f(x)} (1-F_{Y|X=x}(y))\text {d}y + (\tau -1) \bigg ( \int _{\mathbb {R}} yF_{Y|X=x}(\mathrm{d}y) - f(x) \bigg ) \\&\quad = g(x,f(x)) + (\tau -1 ) \int _{\mathbb {R}} yF_{Y|X=x}(\mathrm{d}y) \end{aligned}$$

where $g:(x,a)\in {{\mathcal {X}}}\times {\mathbb {R}}\rightarrow \int \mathbf{1}_{y \ge a} (1 - F_{Y|X=x}(y))\text {d}y + (1-\tau )a$. It follows that

$$\begin{aligned} P\mathcal {L}_f=\mathbb {E}[g(X,f(X))-g(X,f^*(X))]. \end{aligned}$$

(39)

Since for all $x \in {{\mathcal {X}}}$, $a \mapsto g(x,a)$ is twice differentiable, from a second order Taylor expansion we get

$$\begin{aligned} P{{\mathcal {L}}}_f&= \mathbb {E}\bigg [ g(X,f(X)) - g(X,f^*(X)) \bigg ] \\&= \mathbb {E}\bigg [ \frac{\partial g(X,a)}{\partial a}(f^*(X)) (f(X)-f^*(X)) \bigg ] \\&\quad + \frac{1}{2} \int _{x \in {{\mathcal {X}}}} \frac{\partial ^2 g(x,a)}{\partial a^2}(z_x) (f(x)-f^*(x))^2 dP_X(x) \end{aligned}$$

where for all $x\in {{\mathcal {X}}}$, $z_x$ is some point in $\big [\min (f(x), f^{*}(x)),\max (f(x), f^{*}(x)) \big ]$. For the first order term, we have

$$\begin{aligned}&\mathbb {E}\bigg [ \frac{\partial g(X,a)}{\partial a}(f^*(X)) (f(X)-f^*(X)) \bigg ] \\&\quad = \mathbb {E}\lim _{t \rightarrow 0^+} \frac{g(X,f^*(X)+ t(f(X)-f^*(X))) - g(X,f^*(X))}{t}. \end{aligned}$$

For all $x \in {{\mathcal {X}}}$ and $t\in (0,1)$, we have $\big |g(x,f^*(x)+t(f(x)-f^*(x))) - g(x,f^*(x))\big |/t \le (2-\tau ) |f^*(x)- f(x)|$ which is integrable with respect to $P_X$. Thus, by the dominated convergence theorem, it is possible to interchange integral and limit and therefore using Lemma 8, we obtain

$$\begin{aligned}&\mathbb {E}\bigg [ \frac{\partial g(X,a)}{\partial a}(f^*(X)) (f(X)-f^*(X)) \bigg ] \\&\quad = \lim _{t \rightarrow 0^+} {\mathbb {E}}\frac{g(X,f^*(X)+ t(f(X)-f^*(X))) - g(X,f^*(X))}{t} \\&\quad = \lim _{t \rightarrow 0^+} \frac{R(f^*+ t(f-f^*)) - R(f^*)}{t} \ge 0. \end{aligned}$$

Given that for all $x\in {{\mathcal {X}}}$, $\frac{\partial ^2 g (x,a)}{\partial a^2} (z)= f_{Y|X=x}(z)$ for all $z\in {\mathbb {R}}$ it follows that

$$\begin{aligned} P{{\mathcal {L}}}_f \ge \frac{1}{2} \int _{x \in {{\mathcal {X}}}} f_{Y|X=x}(z_x) (f(x)-f^*(x))^2 dP_X(x). \end{aligned}$$

Consider $A = \{ x \in \mathcal {X}, |f(x)-f^{*}(x)| \le (\sqrt{2}C')^{(2+\varepsilon )/\varepsilon } r \}$. Given that $\Vert f-f^{*}\Vert _{L_2 }\le r$, by Markov’s inequality, $P(X \in A) \geqslant 1-1/(\sqrt{2}C')^{(4+2\varepsilon )/\varepsilon }$. From Assumption 10 we get

$$\begin{aligned} \frac{2P\mathcal {L}_f}{\alpha }&\geqslant \mathbb {E} [ I_A(X) (f(X)-f^{*}(X))^2 ] =\Vert f-f^*\Vert _{L_2 }^2-\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]. \end{aligned}$$

(40)

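By Hölder’s inequality (with conjugate exponents $(2+\varepsilon )/\varepsilon $ and $(2+\varepsilon )/2$) and the bound on $P(X\notin A)$ obtained above from Markov’s inequality,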

$$\begin{aligned} \mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant & {} \big ( \mathbb {E} [ I_{A^c}(X)] \big )^{\varepsilon /(2+\varepsilon )} \big ( \mathbb {E} [ (f(X)-f^{*}(X))^{2+\varepsilon } ] \big )^{2/(2+\varepsilon )} \\\leqslant & {} \frac{ \Vert f-f^{*}\Vert _{L_{2+\varepsilon }}^2}{2(C')^2}. \end{aligned}$$

By Assumption 9, it follows that $\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant \Vert f-f^{*}\Vert _{L_2}^2/2$ and we conclude with (40).
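The two derivatives of $a\mapsto g(x,a)$ used in this proof, namely $F_{Y|X=x}(a)-\tau $ and $f_{Y|X=x}(a)$, can be checked numerically in a simple case. The sketch below assumes, purely for illustration, that $Y$ given $X=x$ is standard Gaussian, so that $\int y F_{Y|X=x}(\mathrm{d}y)=0$ and $g(x,a)$ coincides with the conditional risk $\mathbb {E}[\rho _\tau (Y-a)]$, where $\rho _\tau (u)=u(\tau - \mathbf{1}_{u\le 0})$. The closed form and the helper names are ours.

```python
import math


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def pinball_risk(a, tau):
    """E[rho_tau(Y - a)] for Y ~ N(0, 1); closed form phi(a) + a*Phi(a) - tau*a."""
    return phi(a) + a * Phi(a) - tau * a


def first_difference(fun, a, h=1e-5):
    """Central finite-difference approximation of fun'(a)."""
    return (fun(a + h) - fun(a - h)) / (2.0 * h)


def second_difference(fun, a, h=1e-3):
    """Central finite-difference approximation of fun''(a)."""
    return (fun(a + h) - 2.0 * fun(a) + fun(a - h)) / (h * h)


tau = 0.3
checks = []
for a in (-1.0, 0.0, 0.7):
    risk = lambda u: pinball_risk(u, tau)
    d1_err = abs(first_difference(risk, a) - (Phi(a) - tau))  # g' = F - tau
    d2_err = abs(second_difference(risk, a) - phi(a))         # g'' = density
    checks.append((a, d1_err, d2_err))
```

Both errors are at the level of the finite-difference scheme, which is consistent with the curvature of the conditional risk being the conditional density, the quantity that Assumption 10 bounds from below.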

1.2 Proof of Theorem 7

Let $r>0$. Let $f\in F$ be such that $\left\| f-f^*\right\| _{L_2}\le r$. We have

$$\begin{aligned} P{{\mathcal {L}}}_f&= \mathbb {E}_X \mathbb {E} \bigg [ \rho _H(Y-f(x)) - \rho _H(Y-f^*(x)) | X= x \bigg ] \\&= \mathbb {E}\big [g(X, f(X)) - g(X, f^*(X))\big ] \end{aligned}$$

where $g:(x,a)\in {{\mathcal {X}}}\times {\mathbb {R}}\mapsto \mathbb {E}[\rho _H(Y-a)|X=x]$. Let $F_{Y|X=x}$ denote the c.d.f. of Y given $X=x$. Since for all $x \in {{\mathcal {X}}}$, $a \mapsto g(x,a)$ is twice differentiable (see Lemma 2.1 in [15]), a second order Taylor expansion yields

$$\begin{aligned} P{{\mathcal {L}}}_f&= \mathbb {E}\bigg [ \frac{\partial g(X,a)}{\partial a}(f^*(X))(f(X)-f^*(X)) \bigg ] \\&\quad + \frac{1}{2} \int _{x \in {{\mathcal {X}}}} (f(x)-f^{*}(x))^2 \frac{\partial ^2 g(x,a)}{\partial a^2}(z_x) dP_X(x) \end{aligned}$$

where for all $x\in {{\mathcal {X}}}$, $z_x$ is some point in $[\min (f(x), f^{*}(x)), \max (f(x), f^{*}(x))]$. By Lemma 8, with the same reasoning as the one in Sect. C.1, we get

$$\begin{aligned} P{{\mathcal {L}}}_f \ge \frac{1}{2} \int _{x \in {{\mathcal {X}}}} (f(x)-f^{*}(x))^2 \frac{\partial ^2 g(x,a)}{\partial a^2}(z_x) dP_X(x) . \end{aligned}$$

Moreover, for all $z \in {\mathbb {R}}$,

$$\begin{aligned} \frac{\partial ^2 g(x,a)}{\partial a^2}(z)&= F_{Y|X=x}(z + \delta ) - F_{Y|X=x}(z-\delta ). \end{aligned}$$

Now, let $A = \{ x \in {{\mathcal {X}}}: |f(x)-f^{*}(x)| \le (\sqrt{2}C')^{(2+\varepsilon )/\varepsilon } r \} $. It follows from Assumption 10 that $P{{\mathcal {L}}}_f \ge (\alpha /2) \mathbb {E} [(f(X)-f^{*}(X))^2 I_A(X)]$. Since $\Vert f-f^{*}\Vert _{L_2 }\le r$, by Markov’s inequality, $P(X \in A) \geqslant 1-1/(\sqrt{2}C')^{(4+2\varepsilon )/\varepsilon }$. By Hölder and Markov’s inequalities,

$$\begin{aligned} \mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant & {} \big ( \mathbb {E} [ I_{A^c}(X)] \big )^{\varepsilon /(2+\varepsilon )} \big ( \mathbb {E} [ (f(X)-f^{*}(X))^{2+\varepsilon } ] \big )^{2/(2+\varepsilon )} \\\leqslant & {} \frac{ \Vert f-f^{*}\Vert _{L_{2+\varepsilon }}^2}{2(C')^2}. \end{aligned}$$

By Assumption 9, it follows that $\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant \frac{\Vert f-f^{*}\Vert _{L_2}^2}{2}$, which concludes the proof.
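The identity $\frac{\partial ^2 g(x,a)}{\partial a^2}(z) = F_{Y|X=x}(z+\delta ) - F_{Y|X=x}(z-\delta )$ can be illustrated numerically. The sketch below rests on assumptions of ours: the Huber function is taken in its usual form $\rho _H(u)=u^2/2$ for $|u|\le \delta $ and $\delta |u|-\delta ^2/2$ otherwise, $Y$ given $X=x$ is standard Gaussian, and the expectation is approximated by a midpoint rule. The helper names are hypothetical.

```python
import math


def huber(u, delta):
    """Huber function (assumed form): quadratic on [-delta, delta], linear outside."""
    if abs(u) <= delta:
        return 0.5 * u * u
    return delta * abs(u) - 0.5 * delta * delta


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def huber_risk(a, delta, n=20000, lim=10.0):
    """g(a) = E[rho_H(Y - a)] for Y ~ N(0, 1), midpoint rule on [-lim, lim]."""
    h = 2.0 * lim / n
    total = 0.0
    for i in range(n):
        y = -lim + (i + 0.5) * h
        total += huber(y - a, delta) * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2.0 * math.pi)


def second_difference(fun, a, step=0.05):
    """Central finite-difference approximation of fun''(a)."""
    return (fun(a + step) - 2.0 * fun(a) + fun(a - step)) / (step * step)


delta = 1.0
errors = []
for a in (0.0, 0.5):
    numeric = second_difference(lambda u: huber_risk(u, delta), a)
    exact = Phi(a + delta) - Phi(a - delta)  # F(a + delta) - F(a - delta)
    errors.append(abs(numeric - exact))
```

The curvature is therefore the conditional probability that $Y$ falls within $\delta $ of the evaluation point, which is positive even when $Y$ has no density.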

1.3 Proof of Theorem 8

Let $r>0$. Let $f\in F$ be such that $\left\| f-f^*\right\| _{L_2}\le r$. Let $\eta (x) = P(Y=1|X=x)$. Write first that $P{{\mathcal {L}}}_f = \mathbb {E} \bigg [ g(X, f(X)) - g(X, f^*(X))\bigg ]$ where for all $x\in {{\mathcal {X}}}$ and $a\in {\mathbb {R}}$, $g(x,a) = \eta (x) \log (1+\exp (-a)) + (1-\eta (x))\log (1+\exp (a))$. From Lemma 8 and the same reasoning as in Sects. C.1 and C.2 we get

$$\begin{aligned} P{{\mathcal {L}}}_f&\ge \int _{x \in {{\mathcal {X}}}} \frac{\partial ^2 g(x,a)}{\partial a^2}(z_x)\frac{(f(x)-f^{*}(x))^2 }{2} dP_X(x) \\&= \int _{x \in {{\mathcal {X}}}} \frac{e^{z_x}}{(1+e^{z_x})^2}\frac{(f(x)-f^{*}(x))^2 }{2} dP_X(x) \end{aligned}$$

for some $z_x \in [\min (f(x), f^{*}(x)), \max (f(x), f^{*}(x))]$. Now, let

$$\begin{aligned} A = \left\{ x \in \mathcal {X}: |f^{*}(x)|\le c_0, |f(x)-f^{*}(x)|\le (2C')^{(2+\varepsilon )/\varepsilon } r \right\} . \end{aligned}$$

On the event A we have

$$\begin{aligned} P\mathcal {L}_f \ge \frac{e^{- c_0 -(2C')^{(2+\varepsilon )/\varepsilon } r } }{2\big ( 1+ e^{c_0 + (2C')^{(2+\varepsilon )/\varepsilon } r } \big )^2 } \mathbb {E} [I_A(X) (f(X)-f^{*}(X))^2] \end{aligned}$$

Using the fact that $P(X \notin A) \le P(|f^*(X) |> c_0) + P(|f(X)-f^*(X)| > (2C')^{(2+\varepsilon )/\varepsilon } r ) \le 2/(2C')^{(4+2\varepsilon )/\varepsilon } $, we conclude with Assumption 9 and the same analysis as in the two previous proofs.
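Two facts about the logistic case are used above: the second derivative of $a\mapsto g(x,a)$ equals $e^{a}/(1+e^{a})^2$ whatever the value of $\eta (x)$, and this curvature is at least $e^{-c}/(1+e^{c})^2$ when $|a|\le c$. The following Python sketch checks both on a grid. The grid, the value of `c` (which plays the role of $c_0 + (2C')^{(2+\varepsilon )/\varepsilon } r$) and the helper names are illustrative assumptions.

```python
import math


def g(eta, a):
    """Conditional logistic risk eta*log(1+e^{-a}) + (1-eta)*log(1+e^{a})."""
    return eta * math.log1p(math.exp(-a)) + (1.0 - eta) * math.log1p(math.exp(a))


def curvature(z):
    """Claimed second derivative of a -> g(eta, a); it does not involve eta."""
    return math.exp(z) / (1.0 + math.exp(z)) ** 2


def curvature_floor(c):
    """Lower bound e^{-c} / (1 + e^{c})^2 used for |z| <= c."""
    return math.exp(-c) / (1.0 + math.exp(c)) ** 2


def second_difference(fun, a, h=1e-3):
    """Central finite-difference approximation of fun''(a)."""
    return (fun(a + h) - 2.0 * fun(a) + fun(a - h)) / (h * h)


c = 2.0
grid = [-c + k * (2.0 * c / 40) for k in range(41)]  # 41 points in [-c, c]

# Largest gap between the finite-difference curvature and the closed form.
max_fd_error = max(
    abs(second_difference(lambda u: g(eta, u), z) - curvature(z))
    for eta in (0.1, 0.5, 0.9)
    for z in grid
)
# Smallest curvature on the grid, to be compared with the floor.
min_curvature = min(curvature(z) for z in grid)
```

On $[-c,c]$ the curvature is smallest at the endpoints, where it equals $e^{c}/(1+e^{c})^2$, so the floor $e^{-c}/(1+e^{c})^2$ used in the proof is valid but not tight.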

1.4 Proof of Theorem 9

Let $r>0$ be such that $ r(\sqrt{2} C')^{(2+\varepsilon )/\varepsilon } \le 1$. Let f be in F such that $\Vert f-f^*\Vert _{L_2} \le r$. For all x in $\mathcal {X}$ let us denote $\eta (x) = {\mathbb {P}}(Y=1 | X=x) $. It is easy to verify that the Bayes estimator (which is equal to the oracle) is defined as $f^*(x) = \text{ sign }(2\eta (x)-1)$. Consider the set $A= \{ x \in {\mathcal {X}}, |f(x)-f^*(x)| \le r(\sqrt{2} C')^{(2+\varepsilon )/\varepsilon } \}$. Since $\Vert f-f^*\Vert _{L_2} \le r$, by Markov’s inequality ${\mathbb {P}}(X \in A) \ge 1-1/(\sqrt{2} C')^{(4+ 2\varepsilon )/\varepsilon }$. Let x be in A. If $f^*(x) = -1$ (i.e., $2\eta (x) \le 1$) and $f(x) \le f^*(x) = -1$ we obtain

$$\begin{aligned} {\mathbb {E}} \big [ \ell _f(X,Y) | X= x \big ]- {\mathbb {E}} \big [ \ell _{f^*}(X,Y) | X= x \big ]= & {} \eta (x)(1-f(x)) - \eta (x) (1-f^*(x))\\\ge & {} \eta (x) \big (f(x)-f^*(x) \big )^2 \end{aligned}$$

where we used the fact that on A, $|f(x)-f^*(x)| \le r(\sqrt{2} C')^{(2+\varepsilon )/\varepsilon } \le 1$. Using the same analysis for the other cases we get that

$$\begin{aligned}&{\mathbb {E}} \big [ \ell _f(X,Y) | X= x \big ]- {\mathbb {E}} \big [ \ell _{f^*}(X,Y) | X= x \big ] \\&\quad \ge \min \big (\eta (x),1-\eta (x), |1-2\eta (x)| \big ) \big (f(x)-f^*(x) \big )^2 \\&\quad \ge \alpha \big (f(x)-f^*(x) \big )^2 \end{aligned}$$

Therefore,

$$\begin{aligned} \frac{P\mathcal {L}_f}{\alpha }&\geqslant \mathbb {E} [ I_A(X) (f(X)-f^{*}(X))^2 ] =\Vert f-f^*\Vert _{L_2 }^2-\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]. \end{aligned}$$

(41)

By Hölder and Markov’s inequalities,

$$\begin{aligned} \mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant & {} \big ( \mathbb {E} [ I_{A^c}(X)] \big )^{\varepsilon /(2+\varepsilon )} \big ( \mathbb {E} [ (f(X)-f^{*}(X))^{2+\varepsilon } ] \big )^{2/(2+\varepsilon )} \\\leqslant & {} \frac{ \Vert f-f^{*}\Vert _{L_{2+\varepsilon }}^2}{2(C')^2}. \end{aligned}$$

By Assumption 9, it follows that $\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant \frac{\Vert f-f^{*}\Vert _{L_2}^2}{2}$ and we conclude with (41).
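The case analysis behind the pointwise bound with constant $\min (\eta (x),1-\eta (x),|1-2\eta (x)|)$ can be verified by brute force. The Python sketch below evaluates the conditional hinge risk on a grid of values of $\eta (x)$ and $f(x)$ with $|f(x)-f^*(x)|\le 1$. The grid, the tolerance, the convention $f^*(x)=-1$ when $\eta (x)=1/2$ and the helper names are our assumptions.

```python
def hinge_cond_risk(f, eta):
    """E[max(0, 1 - Y f) | X = x] with P(Y = 1 | X = x) = eta and Y in {-1, +1}."""
    return eta * max(0.0, 1.0 - f) + (1.0 - eta) * max(0.0, 1.0 + f)


def oracle(eta):
    """sign(2*eta - 1); the tie eta = 1/2 is sent to -1 (convention)."""
    return 1.0 if 2.0 * eta - 1.0 > 0 else -1.0


def local_bound_holds(eta, f, tol=1e-12):
    """Check excess risk >= min(eta, 1-eta, |1-2 eta|) * (f - f*)^2 at one point."""
    f_star = oracle(eta)
    excess = hinge_cond_risk(f, eta) - hinge_cond_risk(f_star, eta)
    curv = min(eta, 1.0 - eta, abs(1.0 - 2.0 * eta))
    return excess >= curv * (f - f_star) ** 2 - tol


# Grid over eta in [0, 1] and f within distance 1 of the oracle.
all_ok = all(
    local_bound_holds(i / 20.0, oracle(i / 20.0) + k / 20.0)
    for i in range(21)
    for k in range(-20, 21)
)

# Outside the localization |f - f*| <= 1 the quadratic bound can fail.
fails_far_away = not local_bound_holds(0.4, -4.0)
```

The second check shows why the restriction $r(\sqrt{2} C')^{(2+\varepsilon )/\varepsilon } \le 1$ matters: the hinge excess risk grows only linearly in $|f(x)-f^*(x)|$, so a quadratic lower bound can only hold locally.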

About this article

Cite this article

Chinot, G., Lecué, G. & Lerasle, M. Robust statistical learning with Lipschitz and convex loss functions. Probab. Theory Relat. Fields 176, 897–940 (2020). https://doi.org/10.1007/s00440-019-00931-3

Received: 03 April 2019
Revised: 17 June 2019
Published: 02 July 2019
Issue Date: April 2020
DOI: https://doi.org/10.1007/s00440-019-00931-3

Mathematics Subject Classification

62G35
