
Robust statistical learning with Lipschitz and convex loss functions


Abstract

We obtain estimation and excess risk bounds for Empirical Risk Minimizers (ERM) and minmax Median-Of-Means (MOM) estimators based on loss functions that are both Lipschitz and convex. Results for the ERM are derived under weak assumptions on the outputs and subgaussian assumptions on the design, as in Alquier et al. (Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. arXiv:1702.01402, 2017). The difference with Alquier et al. (2017) is that the global Bernstein condition of that paper is relaxed here into a local assumption. We also obtain estimation and excess risk bounds for minmax MOM estimators under similar assumptions on the output and only moment assumptions on the design. Moreover, the dataset may contain outliers in both input and output variables without deteriorating the performance of the minmax MOM estimators. Unlike alternatives based on MOM's principle (Lecué and Lerasle in Ann Stat, 2017; Lugosi and Mendelson in JEMS, 2016), the analysis of minmax MOM estimators is not based on the small ball assumption (SBA) of Koltchinskii and Mendelson (Int Math Res Not IMRN 23:12991–13008, 2015). In particular, the basic example of nonparametric statistics in which the learning class is the linear span of localized bases, which does not satisfy the SBA (Saumard in Bernoulli 24(3):2176–2203, 2018), can now be handled. Finally, minmax MOM estimators are analysed in a setting where the local Bernstein condition is also dropped. They are shown to achieve excess risk bounds with exponentially large probability under minimal assumptions ensuring only the existence of all objects.


Notes

  1. All figures can be reproduced from the code available at https://github.com/lecueguillaume/MOMpower.
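
The median-of-means principle behind these procedures [1, 18, 37] is simple to implement. The following is a minimal, self-contained Python sketch of the MOM estimator of a univariate mean; it is not the authors' implementation (their code lives in the repository above), and the function name and parameters are ours:

```python
import numpy as np

def mom_mean(x, K, seed=None):
    """Median-of-means: shuffle the N points, split them into K
    equal-size blocks, average within each block and return the
    median of the K block means."""
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x, dtype=float))
    return float(np.median([b.mean() for b in np.array_split(x, K)]))

# Heavy-tailed data with a few gross outliers: the empirical mean is
# destroyed, while the MOM estimate stays close to the true mean 0 as
# long as more than half of the K blocks contain no corrupted point.
rng = np.random.default_rng(0)
x = rng.standard_t(df=2.1, size=10_000)
x[:10] = 1e6
print(np.mean(x))          # of order 1e3: ruined by the 10 outliers
print(mom_mean(x, K=30))   # close to 0
```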

References

  1. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1, part 2), 137–147 (1999). Twenty-eighth Annual ACM Symposium on the Theory of Computing (Philadelphia, PA, 1996)

  2. Alquier, P., Cottet, V., Lecué, G.: Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions (2017). arXiv:1702.01402

  3. Audibert, J.-Y., Catoni, O.: Robust linear least squares regression. Ann. Stat. 39(5), 2766–2794 (2011)

  4. Bach, F., Jenatton, R., Mairal, J., Obozinski, G., et al.: Convex optimization with sparsity-inducing norms. Optim. Mach. Learn. 5, 19–53 (2011)

  5. Baraud, Y., Birgé, L., Sart, M.: A new method for estimation and model selection: \(\rho \)-estimation. Invent. Math. 207(2), 425–517 (2017)

  6. Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)

  7. Bartlett, P.L., Bousquet, O., Mendelson, S., et al.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)

  8. Bartlett, P.L., Mendelson, S.: Empirical minimization. Probab. Theory Relat. Fields 135(3), 311–334 (2006)

  9. Birgé, L.: Stabilité et instabilité du risque minimax pour des variables indépendantes équidistribuées. Ann. Inst. H. Poincaré Probab. Stat. 20(3), 201–223 (1984)

  10. Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification: a survey of some recent advances. ESAIM Probab. Stat. 9, 323–375 (2005)

  11. Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities. Oxford University Press, Oxford (2013). A nonasymptotic theory of independence, with a foreword by Michel Ledoux

  12. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends® Mach. Learn. 8(3–4), 231–357 (2015)

  13. Catoni, O.: Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré Probab. Stat. 48(4), 1148–1185 (2012)

  14. Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R.I., et al.: Sub-Gaussian mean estimators. Ann. Stat. 44(6), 2695–2725 (2016)

  15. Elsener, A., van de Geer, S.: Robust low-rank matrix estimation (2016). arXiv:1603.09071

  16. Han, Q., Wellner, J.A.: Convergence rates of least squares regression estimators with heavy-tailed errors. Ann. Statist. 47(4), 2286–2319 (2019)

  17. Huber, P.J., Ronchetti, E.: Robust statistics. In: International Encyclopedia of Statistical Science, pp. 1248–1251. Springer, New York (2011)

  18. Jerrum, M.R., Valiant, L.G., Vazirani, V.V.: Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci. 43(2–3), 169–188 (1986)

  19. Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)

  20. Koltchinskii, V.: Empirical and Rademacher processes. In: Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, pp. 17–32. Springer, New York (2011)

  21. Koltchinskii, V.: Oracle inequalities in empirical risk minimization and sparse recovery problems, volume 2033 of Lecture Notes in Mathematics. Springer, Heidelberg (2011). Lectures from the 38th Probability Summer School held in Saint-Flour, 2008, École d’Été de Probabilités de Saint-Flour (Saint-Flour Probability Summer School)

  22. Koltchinskii, V., Mendelson, S.: Bounding the smallest singular value of a random matrix without concentration. Int. Math. Res. Not. IMRN 23, 12991–13008 (2015)

  23. Lecué, G., Lerasle, M.: Learning from MOM's principles: Le Cam's approach. Stochast. Process. Appl. (2018). arXiv:1701.01961

  24. Lecué, G., Lerasle, M.: Robust machine learning by median-of-means: theory and practice. Ann. Stat. (2017). arXiv:1711.10306

  25. Lecué, G., Mendelson, S.: Performance of empirical risk minimization in linear aggregation. Bernoulli 22(3), 1520–1534 (2016)

  26. Lecué, G., Lerasle, M., Mathieu, T.: Robust classification via MOM minimization (2018). arXiv:1808.03106

  27. Ledoux, M.: The Concentration of Measure Phenomenon, Volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence (2001)

  28. Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York (2013)

  29. Lugosi, G., Mendelson, S.: Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. (2019). arXiv:1608.00757

  30. Lugosi, G., Mendelson, S.: Regularization, sparse recovery, and median-of-means tournaments (2017). arXiv:1701.04112

  31. Lugosi, G., Mendelson, S.: Sub-Gaussian estimators of the mean of a random vector (2017). To appear in Ann. Stat. arXiv:1702.00482

  32. Mammen, E., Tsybakov, A.B.: Smooth discrimination analysis. Ann. Stat. 27(6), 1808–1829 (1999)

  33. Mendelson, S.: Learning without concentration. In: Conference on Learning Theory, pp. 25–39 (2014)

  34. Mendelson, S.: Learning without concentration. J. ACM 62(3), Art. 21, 25 (2015)

  35. Mendelson, S.: On multiplier processes under weak moment assumptions. In: Geometric Aspects of Functional Analysis, Volume 2169 of Lecture Notes in Math., pp. 301–318. Springer, Cham (2017)

  36. Mendelson, S., Pajor, A., Tomczak-Jaegermann, N.: Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal. 17(4), 1248–1282 (2007)

  37. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, New York (1983). Translated from the Russian and with a preface by E. R. Dawson, Wiley-Interscience Series in Discrete Mathematics

  38. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  39. Saumard, A.: On optimality of empirical risk minimization in linear aggregation. Bernoulli 24(3), 2176–2203 (2018)

  40. Talagrand, M.: Upper and lower bounds for stochastic processes, volume 60 of Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics (Results in Mathematics and Related Areas. 3rd Series. A Series of Modern Surveys in Mathematics). Springer, Heidelberg (2014). Modern methods and classical problems

  41. Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Ann. Stat. 32(1), 135–166 (2004)

  42. van de Geer, S.: Estimation and Testing Under Sparsity, Volume 2159 of Lecture Notes in Mathematics. Springer, Cham (2016). Lecture notes from the 45th Probability Summer School held in Saint-Flour, 2015, École d'Été de Probabilités de Saint-Flour (Saint-Flour Probability Summer School)

  43. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)

  44. Vapnik, V.: Statistical Learning Theory, vol. 1. Wiley, New York (1998)

  45. Zhou, W.-X., Bose, K., Fan, J., Liu, H.: A new perspective on robust M-estimation: finite sample theory and applications to dependence-adjusted multiple testing. Ann. Stat. 46(5), 1904–1931 (2018). https://doi.org/10.1214/17-AOS1606

Download references

Author information

Corresponding author

Correspondence to Guillaume Lecué.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proof of Theorems 1, 2, 3 and 4

1.1 Proof of Theorem 1

The proof is split into two parts. First, we identify an event on which the statistical behavior of the estimator \(\hat{f}^{ERM}\) can be controlled. Then, we prove that this event holds with probability at least (3). Introduce \(\theta =1/(2A)\) and define the following event:

$$\begin{aligned} \Omega := \left\{ \forall f\in F \cap (f^{*} + r_2(\theta ) B_{L_2}), \quad \big |(P-P_N){{\mathcal {L}}}_f\big |\le \theta r_2^2(\theta ) \right\} \end{aligned}$$

where \(\theta \) is a parameter appearing in the definition of \(r_2\) in Definition 3.

Proposition 3

On the event \(\Omega \), one has

$$\begin{aligned} \Vert \hat{f}^{ERM} - f^*\Vert _{L_2}&\le r_2(\theta ) \quad \text{ and }\quad P{{\mathcal {L}}}_{\hat{f}^{ERM}} \le \theta r_2^2(\theta ). \end{aligned}$$

Proof

By construction, \(\hat{f}^{ERM}\) satisfies \(P_N{{\mathcal {L}}}_{\hat{f}^{ERM}} \le 0 \). Therefore, it is sufficient to show that, on \(\Omega \), if \(\Vert f-f^{*}\Vert _{L_2} > r_2(\theta )\), then \(P_N {{\mathcal {L}}}_f >0\). Let \(f\in F\) be such that \(\Vert f-f^{*}\Vert _{L_2 } > r_2(\theta )\). By convexity of F, there exist \(f_0 \in F \cap (f^{*} + r_2(\theta )S_{L_2})\) and \(\alpha > 1\) such that

$$\begin{aligned} f = f^{*} + \alpha (f_0 - f^{*}) . \end{aligned}$$
(19)

For all \(i \in \{1,\ldots ,N \}\), let \(\psi _i: {\mathbb {R}} \rightarrow {\mathbb {R}} \) be defined for all \(u\in \mathbb {R}\) by

$$\begin{aligned} \psi _i(u) = \overline{\ell } (u + f^{*}(X_i), Y_i) - \overline{\ell } (f^{*}(X_i), Y_i). \end{aligned}$$
(20)

The functions \(\psi _i\) satisfy \(\psi _i(0) = 0\) and are convex because \(\overline{\ell }\) is; in particular, for all \(u\in {\mathbb {R}}\) and \(\alpha \ge 1\), \(\psi _i(u)=\psi _i\big (\alpha u/\alpha + (1-1/\alpha )\cdot 0\big )\le \psi _i(\alpha u)/\alpha \), that is, \(\alpha \psi _i(u) \le \psi _i(\alpha u)\). Moreover, \(\psi _i(f(X_i) - f^{*}(X_i) )= \overline{\ell } (f(X_i), Y_i) - \overline{\ell } (f^{*}(X_i), Y_i) \), so that the following holds:

$$\begin{aligned} P_N {{\mathcal {L}}}_f&= \frac{1}{N} \sum _{i=1}^{N} \psi _i \big ( f(X_i)- f^{*}(X_i) \big ) = \frac{1}{N} \sum _{i=1}^{N} \psi _i(\alpha ( f_0(X_i)- f^{*}(X_i) ))\nonumber \\&\ge \frac{\alpha }{N} \sum _{i=1}^{N} \psi _i(( f_0(X_i)- f^{*}(X_i))) = \alpha P_N {{\mathcal {L}}}_{f_0}. \end{aligned}$$
(21)

Until the end of the proof, the event \(\Omega \) is assumed to hold. Since \(f_0 \in F \cap (f^{*}+ r_2(\theta ) S_{L_2})\), \(P_N {{\mathcal {L}}}_{f_0} \ge P{{\mathcal {L}}}_{f_0} - \theta r_2^2(\theta )\). Moreover, by Assumption 4, \(P{{\mathcal {L}}}_{f_0} \ge A^{-1} \Vert f_0-f^*\Vert _{L_2 }^2 = A^{-1}r_2^2(\theta ) \), thus

$$\begin{aligned} P_N {{\mathcal {L}}}_{f_0} \ge (A^{-1} - \theta ) r_2^2(\theta ). \end{aligned}$$
(22)

From Eqs. (21) and (22), \(P_N {{\mathcal {L}}}_f > 0\) since \(A^{-1}>\theta \). Therefore, \(\Vert \hat{f}^{ERM}-f^{*}\Vert _{L_2 } \le r_2(\theta )\). This proves the \(L_2\)-bound.

Now, as \(\Vert \hat{f}^{ERM}-f^{*}\Vert _{L_2 } \le r_2(\theta )\), on \(\Omega \) we have \(|(P-P_N){{\mathcal {L}}}_{\hat{f}^{ERM}}|\le \theta r_2^2(\theta )\). Since \(P_N{{\mathcal {L}}}_{\hat{f}^{ERM}}\le 0\),

$$\begin{aligned} P{{\mathcal {L}}}_{\hat{f}^{ERM}} = P_N{{\mathcal {L}}}_{\hat{f}^{ERM}} + (P-P_N){{\mathcal {L}}}_{\hat{f}^{ERM}}\le \theta r_2^2(\theta ). \end{aligned}$$

This shows the excess risk bound. \(\square \)

Proposition 3 shows that \(\hat{f}^{ERM}\) satisfies the risk bounds of Theorem 1 on the event \(\Omega \). To show that \(\Omega \) holds with probability at least (3), recall the following result from [2].

Lemma 2

[Lemma 8.1 in [2]] Grant Assumptions 1 and 3. Let \(F^\prime \subset F\) have finite \(L_2\)-diameter \(d_{L_2}(F^\prime )\). For every \(u>0\), with probability at least \(1-2\exp (-u^2)\),

$$\begin{aligned} \sup _{f,g\in F^\prime }\left| (P-P_N)({{\mathcal {L}}}_f-{{\mathcal {L}}}_g)\right| \le \frac{16L}{\sqrt{N}} \left( w(F^\prime ) + u d_{L_2}(F^\prime )\right) . \end{aligned}$$

It follows from Lemma 2 that for any \(u>0\), with probability larger than \(1-2\exp (-u^2)\),

$$\begin{aligned}&\sup _{f \in F \cap (f^{*} + r_2(\theta ) B_{L_2})} \big | (P-P_N){{\mathcal {L}}}_f \big | \\&\quad \le \sup _{f,g \in F \cap (f^{*} + r_2(\theta ) B_{L_2})} \big | (P-P_N)({{\mathcal {L}}}_f-{{\mathcal {L}}}_g) \big | \\&\quad \le \frac{16L}{\sqrt{N}} \big ( w((F-f^*)\cap r_2(\theta )B_{L_2}) + ud_{L_2} ((F-f^*)\cap r_2(\theta )B_{L_2}) \big ) \end{aligned}$$

where \(d_{L_2} ((F-f^*)\cap r_2(\theta )B_{L_2}) \le r_2(\theta )\). By definition of the complexity parameter (see Eq. (3)), for \(u = \theta \sqrt{N} r_2(\theta )/(64L) \), with probability at least

$$\begin{aligned} 1-2\exp \big (-\theta ^2N r_2^2(\theta ) /(16^3L^2 ) \big ), \end{aligned}$$
(23)

for every f in \(F\cap (f^*+ r_2(\theta )B_{L_2} )\),

$$\begin{aligned} \big | (P-P_N) {{\mathcal {L}}}_f \big | \le \theta r_2^2(\theta ). \end{aligned}$$
(24)

Together with Proposition 3, this concludes the proof of Theorem 1.

1.2 Proof of Theorem 2

The proof is split into two parts. First, we identify an event \(\Omega _K\) on which the statistical properties of \({\hat{f}}\) from Theorem 2 can be established. Next, we prove that this event holds with probability at least (8). Let \(\alpha , \theta \) and \(\gamma \) be positive numbers to be chosen later. Define

$$\begin{aligned} C_{K,r} = \max \bigg (\frac{4L^2K}{\theta ^2 \alpha N},\tilde{r}_2^2(\gamma ) \bigg ) \end{aligned}$$

where the exact values of \(\alpha , \theta \) and \(\gamma \) are given at the end of the proof, in Sect. 1.2.3. Define the event \(\Omega _K\) by

$$\begin{aligned} \Omega _K= & {} \bigg \{ \forall f \in F \cap \left( f^*+ \sqrt{C_{K,r}}B_{L_2}\right) , \exists J\subset \{1,\ldots ,K\}: |J|>K/2 \nonumber \\&\quad \text{ and } \forall k\in J, \left| (P_{B_k} - P){{\mathcal {L}}}_f \right| \le \theta C_{K,r} \bigg \}. \end{aligned}$$
(25)

1.2.1 Deterministic argument

The goal of this section is to show that, on the event \(\Omega _K\), \(\Vert {\hat{f}} - f^{*}\Vert _{L_2}^2 \le C_{K,r}\) and \(P{{\mathcal {L}}}_{{\hat{f}}}\le 2 \theta C_{K,r}\).

Lemma 3

If there exists \(\eta >0\) such that

$$\begin{aligned}&\sup _{f \in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) } \,\, \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) < - \eta \quad \text{ and } \nonumber \\&\sup _{f \in F\cap \left( f^{*}+ \sqrt{C_{K,r}} B_{L_2}\right) } \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le \eta , \end{aligned}$$
(26)

then \(\Vert {\hat{f}} - f^{*} \Vert _{L_2 }^2 \le C_{K,r}\).

Proof

Assume that (26) holds, then

$$\begin{aligned} \inf _{f\in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) } \text {MOM}_K[\ell _f-\ell _{f^*}]> \eta . \end{aligned}$$
(27)

Moreover, if \(T_K(f)=\sup _{g\in F}\text {MOM}_K[\ell _f-\ell _g]\) for all \(f\in F\), then

$$\begin{aligned} T_K(f^{*})= & {} \sup _{f\in F \cap \left( f^{*}+ \sqrt{C_{K,r}}B_{L_2}\right) }\text {MOM}_K[\ell _{f^*}-\ell _f]\nonumber \\&\vee&\sup _{f\in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) }\text {MOM}_K[\ell _{f^*}-\ell _f]\leqslant \eta . \end{aligned}$$
(28)

By definition of \(\hat{f}\) and (28), \(T_K(\hat{f})\leqslant T_K(f^*)\leqslant \eta \). Moreover, by (27), any \(f\in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) \) satisfies \(T_K(f)\geqslant \text {MOM}_K[\ell _f-\ell _{f^*}]> \eta \). Therefore \(\hat{f} \in F\cap (f^{*} + \sqrt{C_{K,r}} B_{L_2})\). \(\square \)
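
To fix ideas, here is a hedged numpy sketch of the criterion \(T_K(f)=\sup _{g\in F}\text {MOM}_K[\ell _f-\ell _g]\) and of the resulting minmax MOM estimator over a small finite class; brute force over a grid of candidates stands in for whatever optimization scheme one would use in practice, and all names below are ours:

```python
import numpy as np

def mom_k(values, K, seed=0):
    """MOM_K: median of the K block means of `values`."""
    rng = np.random.default_rng(seed)
    v = rng.permutation(np.asarray(values, dtype=float))
    return float(np.median([b.mean() for b in np.array_split(v, K)]))

def minmax_mom(candidates, loss, X, y, K):
    """argmin_f T_K(f), with T_K(f) = max_g MOM_K[loss(f) - loss(g)],
    over a finite list `candidates` of prediction functions."""
    L = np.stack([loss(f(X), y) for f in candidates])   # (n_cand, N) losses
    T = [max(mom_k(L[i] - L[j], K) for j in range(len(L)))
         for i in range(len(L))]
    return candidates[int(np.argmin(T))], T

# Robust scalar regression with the absolute loss (Lipschitz and convex).
rng = np.random.default_rng(1)
X = rng.normal(size=500)
y = 2.0 * X + rng.standard_t(df=2.1, size=500)
y[:20] = 1e4                                   # corrupted outputs (outliers)
grid = [lambda X, s=s: s * X for s in np.linspace(0.0, 4.0, 41)]
f_hat, _ = minmax_mom(grid, lambda p, y: np.abs(y - p), X, y, K=50)
```

With K = 50 blocks and 20 corrupted outputs, strictly more than half of the blocks are outlier-free, which is precisely the counting mechanism used in the lemmas below.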

Lemma 4

Grant Assumption 6 and assume that \(\theta -A^{-1}<-\theta \). On the event \(\Omega _K\), (26) holds with \(\eta = \theta C_{K,r}\).

Proof

Let \(f\in F\) be such that \(\Vert f-f^{*}\Vert _{L_2 } > \sqrt{C_{K,r}}\). By convexity of F, there exist \(f_0 \in F \cap \left( f^{*} + \sqrt{C_{K,r}} S_{L_2}\right) \) and \(\alpha > 1\) such that \(f = f^{*} + \alpha (f_0 - f^{*})\). For all \(i \in \{1,\ldots ,N \}\), let \(\psi _i: {\mathbb {R}} \rightarrow {\mathbb {R}} \) be defined for all \(u\in \mathbb {R}\) by

$$\begin{aligned} \psi _i(u) = \overline{\ell } (u + f^{*}(X_i), Y_i) - \overline{\ell } (f^{*}(X_i), Y_i). \end{aligned}$$
(29)

The functions \(\psi _i\) are convex because \(\overline{\ell }\) is and satisfy \(\psi _i(0) = 0\), so, as in the proof of Proposition 3, \(\alpha \psi _i(u) \le \psi _i(\alpha u)\) for all \(u\in {\mathbb {R}}\) and \(\alpha \ge 1\). As \(\psi _i(f(X_i) - f^{*}(X_i) )= \overline{\ell } (f(X_i), Y_i) - \overline{\ell } (f^{*}(X_i), Y_i)\), for any block \(B_k\),

$$\begin{aligned} P_{B_k} {{\mathcal {L}}}_f&= \frac{1}{|B_k|} \sum _{i \in B_k} \psi _i \big ( f(X_i)- f^{*}(X_i) \big )= \frac{1}{|B_k|} \sum _{i \in B_k} \psi _i(\alpha ( f_0(X_i)- f^{*}(X_i) ))\nonumber \\&\ge \frac{\alpha }{|B_k|} \sum _{i \in B_k} \psi _i(( f_0(X_i)- f^{*}(X_i))) = \alpha P_{B_k} {{\mathcal {L}}}_{f_0}. \end{aligned}$$
(30)

As \(f_0 \in F \cap (f^* + \sqrt{C_{K,r}} S_{L_2})\), on \(\Omega _K\), there are strictly more than K / 2 blocks \(B_k\) where \(P_{B_k} {{\mathcal {L}}}_{f_0} \ge P{{\mathcal {L}}}_{f_0} - \theta C_{K,r}\). Moreover, from Assumption 6, \(P{{\mathcal {L}}}_{f_0} \ge A^{-1} \Vert f_0-f^*\Vert _{L_2 }^2 = A^{-1}C_{K,r} \). Therefore, on strictly more than K / 2 blocks \(B_k\),

$$\begin{aligned} P_{B_k} {{\mathcal {L}}}_{f_0} \ge (A^{-1} - \theta ) C_{K,r}. \end{aligned}$$
(31)

From Eqs. (30) and  (31), there are strictly more than K / 2 blocks \(B_k\) where \(P_{B_k} {{\mathcal {L}}}_f \ge (A^{-1}- \theta ) C_{K,r} \). Therefore, on \(\Omega _K\), as \((\theta - A^{-1}) < - \theta \),

$$\begin{aligned} \sup _{f \in F \backslash \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) } \,\, \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big )< (\theta - A^{-1}) C_{K,r}<-\theta C_{K,r}. \end{aligned}$$

In addition, on the event \(\Omega _K\), for all \(f \in F \cap (f^{*} + \sqrt{C_{K,r}}B_{L_2})\), there are strictly more than K / 2 blocks \(B_k\) where \(|(P_{B_k}-P) {{\mathcal {L}}}_f | \le \theta C_{K,r} \). Therefore

$$\begin{aligned} \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le \theta C_{K,r} - P{{\mathcal {L}}}_f \le \theta C_{K,r}. \end{aligned}$$

\(\square \)

Lemma 5

Grant Assumption 6 and assume that \(\theta - A^{-1}<-\theta \). On the event \(\Omega _K\), \(P{{\mathcal {L}}}_{\hat{f}} \le 2\theta C_{K,r}\).

Proof

Assume that \(\Omega _K\) holds. From Lemmas 3 and 4, \(\Vert \hat{f}-f^{*}\Vert _{L_2 } \le \sqrt{C_{K,r}}\). Therefore, on strictly more than K/2 blocks \(B_k\), \(P {{\mathcal {L}}}_{\hat{f}} \le P_{B_k} {{\mathcal {L}}}_{\hat{f}} + \theta C_{K,r}\). In addition, by definition of \({\hat{f}}\) and (28) (for \(\eta = \theta C_{K,r}\)),

$$\begin{aligned} \text {MOM}_K(\ell _{{\hat{f}}} - \ell _{f^{*}}) \le \sup _{f \in F} \text {MOM}_K(\ell _{f^{*}} - \ell _{f}) \le \theta C_{K,r}. \end{aligned}$$

As a consequence, there exist at least K / 2 blocks \(B_k\) where \(P_{B_k} {{\mathcal {L}}}_{\hat{f}} \le \theta C_{K,r}\). Therefore, there exists at least one block \(B_k\) where both \(P {{\mathcal {L}}}_{\hat{f}} \le P_{B_k} {{\mathcal {L}}}_{\hat{f}} + \theta C_{K,r}\) and \(P_{B_k} {{\mathcal {L}}}_{\hat{f}} \le \theta C_{K,r}\). Hence \(P{{\mathcal {L}}}_{\hat{f}} \le 2\theta C_{K,r}\). \(\square \)

1.2.2 Stochastic argument

This section shows that \(\Omega _K\) holds with probability at least (8).

Proposition 4

Grant Assumptions 1, 2, 5 and 6 and assume that \((1-\beta )K\ge |{{\mathcal {O}}}|\). Let \(x>0\) and assume that \(\beta (1-\alpha -x-8\gamma L/\theta )>1/2\). Then \(\Omega _K\) holds with probability larger than \(1-\exp (-x^2 \beta K/2)\).

Proof

Let \({{\mathcal {F}}}= F \cap \left( f^{*} + \sqrt{C_{K,r}}B_{L_2}\right) \) and set \(\phi :t\in \mathbb {R}\mapsto I \{ t\ge 2 \} + (t-1) I \{1 \le t < 2 \}\), so that, for all \(t \in \mathbb {R}\), \(I \{ t\ge 2 \} \le \phi (t) \le I \{ t\ge 1 \}\). Let \(W_k = ((X_i,Y_i))_{i \in B_k}\) and \(G_f(W_k) = (P_{B_k} - P){{\mathcal {L}}}_f\). Let

$$\begin{aligned} z(f)&= \sum _{k =1}^K I \{|G_f(W_k)|\le \theta C_{K,r} \}. \end{aligned}$$

Let \(\mathcal {K}\) denote the set of indices of blocks which have not been corrupted by outliers, \(\mathcal {K} = \{k \in \{1,\ldots ,K \} : B_k \subset \mathcal {I}\}\) and let \(f \in {{\mathcal {F}}}\). Basic algebraic manipulations show that

$$\begin{aligned} z(f)\ge & {} |\mathcal {K}| - \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ) \\&- \sum _{k \in \mathcal {K} } \mathbb {E}\phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) . \end{aligned}$$

By Assumptions 1 and 5, using that \(C_{K,r}^2\ge \left\| f-f^*\right\| ^2_{L_2 }[(4L^2K)/(\theta ^2\alpha N)]\),

$$\begin{aligned}&\mathbb {E}\phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \\&\quad \le \mathbb {P} \bigg ( |G_f(W_k)| \ge \frac{\theta C_{K,r}}{2} \bigg ) \le \frac{4}{\theta ^2C_{K,r}^2} \mathbb {E}G_f(W_k)^2 = \frac{4}{\theta ^2 C_{K,r}^2} \mathbb {V}ar (P_{B_k}{{\mathcal {L}}}_f) \\&\quad \le \frac{4K^2}{\theta ^2C_{K,r}^2N^2} \sum _{i \in B_k} \mathbb {E} [{{\mathcal {L}}}_f^2(X_i,Y_i)] \le \frac{4L^2K}{\theta ^2C_{K,r}^2N}\Vert f-f^{*}\Vert ^2_{L_2 } \le \alpha . \end{aligned}$$

Therefore,

$$\begin{aligned} z(f) \ge |\mathcal {K}|(1-\alpha ) -\sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ). \end{aligned}$$
(32)

Using McDiarmid's inequality [11, Theorem 6.2], for all \(x>0\), with probability larger than \(1-\exp (-x^2 |{{\mathcal {K}}}| /2)\),

$$\begin{aligned}&\sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ) \\&\quad \le x|\mathcal {K}| + \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ). \end{aligned}$$

Let \(\epsilon _1, \ldots , \epsilon _K\) denote independent Rademacher variables, independent of the \((X_i, Y_i), i\in {{\mathcal {I}}}\). By the Giné-Zinn symmetrization argument,

$$\begin{aligned}&\sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \bigg ( \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) - \mathbb {E} \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \bigg ) \\&\quad \le x|\mathcal {K}| + 2 \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|) \end{aligned}$$

As \(\phi \) is 1-Lipschitz with \(\phi (0)=0\), using the contraction lemma [28, Chapter 4],

$$\begin{aligned} \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \phi (2\theta ^{-1}C_{K,r}^{-1} | G_f(W_k)|)&\le 2\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \frac{ G_f(W_k)}{\theta C_{K,r}} \\&= 2\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \frac{(P_{B_k}- P){{\mathcal {L}}}_f}{\theta C_{K,r}}. \end{aligned}$$

Let \((\sigma _i: i \in \cup _{k\in {{\mathcal {K}}}}B_k)\) be a family of independent Rademacher variables independent of \((\epsilon _k)_{k \in \mathcal {K}}\) and \((X_i, Y_i)_{i \in {{\mathcal {I}}}}\). It follows from the Giné-Zinn symmetrization argument that

$$\begin{aligned} \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{k \in \mathcal {K}} \epsilon _k \frac{(P_{B_k}- P){{\mathcal {L}}}_f}{ C_{K,r}}\le 2 \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \frac{K}{N}\sum _{i \in \cup _{k\in {{\mathcal {K}}}}B_k } \sigma _i \frac{{{\mathcal {L}}}_f(X_i,Y_i)}{ C_{K,r}}. \end{aligned}$$

By the Lipschitz property of the loss, the contraction principle applies and

$$\begin{aligned} \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{i \in \cup _{k\in {{\mathcal {K}}}}B_k } \sigma _i \frac{{{\mathcal {L}}}_f(X_i,Y_i)}{ C_{K,r}} \le L\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{C_{K,r}}. \end{aligned}$$

To upper bound the right-hand side of the last inequality, consider two cases: either \(C_{K,r}= \tilde{r}_2^2(\gamma )\) or \(C_{K,r} = 4L^2K/(\alpha \theta ^2 N)\). In the first case, by definition of the complexity parameter \(\tilde{r}_2(\gamma )\) in (6),

$$\begin{aligned}&\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{C_{K,r}} \\&\quad = \mathbb {E} \sup _{f \in F: \Vert f-f^{*}\Vert _{L_2 } \le {\tilde{r}}_2(\gamma ) } \frac{1}{\tilde{r}_2^2(\gamma )} \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i (f-f^{*})(X_i)\bigg | \\&\quad \le \frac{\gamma |{{\mathcal {K}}}| N}{K}. \end{aligned}$$

In the second case,

$$\begin{aligned}&\mathbb {E} \sup _{f \in {{\mathcal {F}}}} \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \frac{\sigma _i(f-f^{*})(X_i)}{C_{K,r}} \\&\quad \le \mathbb {E} \bigg [ \sup _{\begin{array}{c} f \in F:\\ \Vert f-f^{*}\Vert _{L_2 } \le \tilde{r}_2(\gamma ) \end{array}} \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \frac{\sigma _i (f-f^{*})(X_i)}{\tilde{r}_2^2(\gamma )} \bigg | \\&\qquad \vee \sup _{\begin{array}{c} f \in F:\\ \tilde{r}_2(\gamma ) \le \Vert f-f^{*}\Vert _{L_2 } \le \sqrt{\frac{4L^2K}{\alpha \theta ^2 N}} \end{array} } \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{\frac{4L^2K}{\alpha \theta ^2 N}} \bigg | \bigg ] . \end{aligned}$$

Let \(f\in F\) be such that \(\tilde{r}_2(\gamma ) \le \left\| f-f^*\right\| _{L_2 }\le \sqrt{[4L^2K]/[\alpha \theta ^2 N]}\); by convexity of F, there exists \(f_0\in F\) such that \(\left\| f_0-f^*\right\| _{L_2 } = \tilde{r}_2(\gamma )\) and \(f = f^*+\lambda (f_0-f^*)\), where \(\lambda = \left\| f-f^*\right\| _{L_2 }/\tilde{r}_2(\gamma )\ge 1\). Therefore,

$$\begin{aligned} \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{\frac{4L^2K}{\alpha \theta ^2 N}} \bigg |&\le \frac{1}{\tilde{r}_2(\gamma ) } \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{\Vert f-f^{*}\Vert _{L_2 }} \bigg | \\&= \frac{1}{\tilde{r}_2^2(\gamma ) } \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i (f_0-f^{*})(X_i) \bigg | \end{aligned}$$

and so

$$\begin{aligned}&\sup _{\begin{array}{c} f \in F:\\ \tilde{r}_2(\gamma ) \le \Vert f-f^{*}\Vert _{L_2 } \le \sqrt{\frac{4L^2K}{\alpha \theta ^2 N}} \end{array}} \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{\frac{4L^2K}{\alpha \theta ^2 N}} \bigg | \\&\quad \le \frac{1}{\tilde{r}_2^2(\gamma ) } \sup _{\begin{array}{c} f \in F:\\ \Vert f-f^{*}\Vert _{L_2 } = \tilde{r}_2(\gamma ) \end{array} } \bigg | \sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i (f-f^{*})(X_i) \bigg |. \end{aligned}$$

By definition of \(\tilde{r}_2(\gamma )\), it follows that

$$\begin{aligned} \mathbb {E} \sup _{f \in {{\mathcal {F}}}} \bigg |&\sum _{i \in \cup _{k \in \mathcal {K}} B_k} \sigma _i \frac{(f-f^{*})(X_i)}{ C_{K,r}} \bigg | \le \frac{\gamma |{{\mathcal {K}}}| N}{K}. \end{aligned}$$

Therefore, as \(|\mathcal {K}| \ge K-|\mathcal {O}| \ge \beta K\), with probability larger than \(1-\exp (-x^2 \beta K/2)\), for all \(f\in F\) such that \(\left\| f-f^*\right\| _{L_2 }\le \sqrt{C_{K,r}}\),

$$\begin{aligned} z(f) \ge |\mathcal {K}|\left( 1-\alpha - x - \frac{8 \gamma L}{\theta }\right) > \frac{K}{2}. \end{aligned}$$
(33)

\(\square \)

1.2.3 End of the proof of Theorem 2

Theorem 2 follows from Lemmas 3, 4, 5 and Proposition 4 with the choice of constants

$$\begin{aligned} \theta = 1/(3A), \quad \alpha = 1/24, \quad x = 1/24, \quad \beta = 4/7 \quad \text{ and } \quad \gamma = 1/(575 AL). \end{aligned}$$

1.3 Proof of Theorem 3

Let \(K \in \big [ 7|{{\mathcal {O}}}|/3 , N \big ]\) and consider the event \(\Omega _K\) defined in (25). It follows from the proof of Lemmas 3 and 4 that \(T_K(f^*)\le \theta C_{K,r}\) on \(\Omega _K\). Setting \(\theta =1/(3A)\), on \(\cap _{J=K}^N \Omega _J\), \(f^*\in {\hat{R}}_J\) for all \(J=K,\ldots , N\), so \(\cap _{J=K}^N {\hat{R}}_J\ne \emptyset \). By definition of \({\hat{K}}\), it follows that \({\hat{K}}\le K\) and by definition of \({\tilde{f}}\), \({\tilde{f}} \in {\hat{R}}_K\) which means that \(T_K({\tilde{f}})\le \theta C_{K,r}\). It is proved in Lemmas 3 and 4 that on \(\Omega _K\), if \(f\in F\) satisfies \(\left\| f-f^*\right\| _{L_2 }\ge \sqrt{C_{K,r}}\) then \(T_K(f)> \theta C_{K,r}\). Therefore, \(\left\| {\tilde{f}}-f^*\right\| _{L_2 }\le \sqrt{C_{K,r}}\). On \(\Omega _K\), since \(\left\| {\tilde{f}}-f^*\right\| _{L_2 }\le \sqrt{C_{K,r}}\), \(P{{\mathcal {L}}}_{{\tilde{f}}}\le 2 \theta C_{K,r}\). Hence, on \(\cap _{J=K}^N \Omega _J\), the conclusions of Theorem 3 hold. Finally, by Proposition 4,

$$\begin{aligned} {\mathbb {P}}\left[ \cap _{J=K}^N \Omega _J\right] \ge 1-\sum _{J=K}^N \exp (-J/2016)\ge 1-4 \exp (-K/2016). \end{aligned}$$
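
The adaptive rule analysed above is also easy to sketch. Below, a hedged Python illustration of the Lepski-type choice of \({\hat{K}}\), reusing `mom_k` from the earlier sketch; the function `radius(J)`, standing in for \(\theta C_{J,r}\), is an assumed user input, since \(C_{J,r}\) is not available in practice:

```python
import numpy as np

def adaptive_minmax_mom(candidates, loss, X, y, radius, K_grid):
    """Lepski-type selection: with R_J = {f : T_J(f) <= radius(J)}, keep
    the smallest K in K_grid such that the regions R_J, J >= K, have a
    common element, and return one element of that intersection."""
    n = len(candidates)
    L = np.stack([loss(f(X), y) for f in candidates])
    def T(i, K):   # T_K(f_i) = max_g MOM_K[l_{f_i} - l_g]
        return max(mom_k(L[i] - L[j], K) for j in range(n))
    live, chosen = set(range(n)), None
    for K in sorted(K_grid, reverse=True):    # intersect from largest K down
        region = {i for i in range(n) if T(i, K) <= radius(K)}
        if not live & region:
            break                             # intersection would become empty
        live &= region
        chosen = K
    return (candidates[min(live)], chosen) if chosen is not None else (None, None)
```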

1.4 Proof of Theorem 4

The proof of Theorem 4 follows the same path as that of Theorem 2. We only sketch the arguments that differ, owing to the localization with the excess loss and the absence of a Bernstein condition.

Define the event \(\Omega _K^\prime \) in the same way as \(\Omega _K\) in (25) where \(C_{K,r}\) is replaced by \(\bar{r}_2^2(\gamma )\) and the \(L_2\) localization is replaced by the “excess loss localization”:

$$\begin{aligned} \Omega ^\prime _K= & {} \bigg \{ \forall f \in ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}, \exists J\subset \{1,\ldots ,K\}: |J|>K/2 \nonumber \\&\quad \text{ and } \forall k\in J, \left| (P_{B_k} - P){{\mathcal {L}}}_f \right| \le (1/4) \bar{r}_2^2(\gamma ) \bigg \} \end{aligned}$$
(34)

where \(({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )} =\{f\in F: P{{\mathcal {L}}}_{f}\le \bar{r}_2^2(\gamma )\}\). Our first goal is to show that, on the event \(\Omega _K^\prime \), \(P{{\mathcal {L}}}_{{\hat{f}}}\le \bar{r}_2^2(\gamma )\). We then lower bound \({\mathbb {P}}[\Omega _K^\prime ]\).

Lemma 6

Grant Assumptions 1 and 2. For every \(r\ge 0\), the set \(({{\mathcal {L}}}_F)_r:=\{f\in F:P{{\mathcal {L}}}_f\le r\}\) is convex and closed relative to F in \(L_1(\mu )\). Moreover, if \(f\in F\) is such that \(P{{\mathcal {L}}}_f>r\), then there exist \(f_0\in F\) and \((P{{\mathcal {L}}}_f/r)\ge \alpha >1\) such that \(f-f^*=\alpha (f_0-f^*)\) and \(P{{\mathcal {L}}}_{f_0} = r\).

Proof

Let f and g be in \(({{\mathcal {L}}}_F)_r\) and \(0\le \alpha \le 1\). We have \(\alpha f + (1-\alpha )g\in F\) because F is convex, and for all \(x\in {{\mathcal {X}}}\) and \(y\in \mathbb {R}\), using the convexity of \(u\mapsto {\bar{\ell }}(u+f^*(x), y)\), we have

$$\begin{aligned}&\ell _{\alpha f + (1-\alpha )g}(x,y) - \ell _{f^*}(x,y) \\&\quad = {\bar{\ell }}(\alpha (f-f^*)(x) + (1-\alpha )(g-f^*)(x) + f^*(x), y) - {\bar{\ell }}(f^*(x),y)\\&\quad \le \alpha \big ({\bar{\ell }}((f-f^*)(x)+ f^*(x), y) - {\bar{\ell }}(f^*(x),y)\big ) \\&\qquad + (1-\alpha )\big ({\bar{\ell }}((g-f^*)(x) + f^*(x), y) - {\bar{\ell }}(f^*(x),y)\big )\\&\quad =\alpha (\ell _f - \ell _{f^*})(x,y) + (1-\alpha )(\ell _g-\ell _{f^*})(x,y) \end{aligned}$$

and so \(P{{\mathcal {L}}}_{\alpha f + (1-\alpha )g}\le \alpha P{{\mathcal {L}}}_f + (1-\alpha )P{{\mathcal {L}}}_g\). Given that \(P{{\mathcal {L}}}_f, P{{\mathcal {L}}}_g\le r\) we also have \(P{{\mathcal {L}}}_{\alpha f + (1-\alpha )g}\le r\). Therefore, \(\alpha f + (1-\alpha )g\in ({{\mathcal {L}}}_F)_r\) and \(({{\mathcal {L}}}_F)_r\) is convex.

For all \(f,g\in F\), \(|P{{\mathcal {L}}}_f - P{{\mathcal {L}}}_g|\le L\left\| f-g\right\| _{L_1(\mu )}\) by the Lipschitz property of the loss, so that \(f\mapsto P{{\mathcal {L}}}_f\) is continuous on F in \(L_1(\mu )\); its level sets, such as \(({{\mathcal {L}}}_F)_r\), are therefore closed relative to F in \(L_1(\mu )\).

Finally, let \(f\in F\) be such that \(P{{\mathcal {L}}}_f >r\). Define \(\alpha _0 = \sup \{\alpha \ge 0: f^*+\alpha (f-f^*)\in ({{\mathcal {L}}}_F)_r\}\). Note that \(P{{\mathcal {L}}}_{f^*+\alpha (f-f^*)}\le \alpha P{{\mathcal {L}}}_f= r\) for \(\alpha = r/P{{\mathcal {L}}}_f\), so that \(\alpha _0\ge r/P{{\mathcal {L}}}_f\). Since \(({{\mathcal {L}}}_F)_r\) is closed relative to F in \(L_1(\mu )\), we have \(f^*+\alpha _0(f-f^*)\in ({{\mathcal {L}}}_F)_r\); in particular, \(\alpha _0<1\), since otherwise, by convexity of \(({{\mathcal {L}}}_F)_r\), we would have \(f\in ({{\mathcal {L}}}_F)_r\). Moreover, by maximality of \(\alpha _0\), \(f_0 = f^*+\alpha _0(f-f^*)\) satisfies \(P{{\mathcal {L}}}_{f_0}=r\), and the result follows with \(\alpha = \alpha _0^{-1}\). \(\square \)

Lemma 7

Grant Assumptions 1 and 2. On the event \(\Omega _K^\prime \), \(P{{\mathcal {L}}}_{{\hat{f}}}\le \bar{r}_2^2(\gamma )\).

Proof

Let \(f\in F\) be such that \(P{{\mathcal {L}}}_f > \bar{r}_2^2(\gamma )\). It follows from Lemma 6 that there exists \(\alpha \ge 1\) and \(f_0\in F\) such that \(P{{\mathcal {L}}}_{f_0} = \bar{r}_2^2(\gamma )\) and \(f-f^* = \alpha (f_0-f^*)\). According to (30), we have for every \(k\in \{1, \ldots , K\}\), \(P_{B_k} {{\mathcal {L}}}_f\ge \alpha P_{B_k}{{\mathcal {L}}}_{f_0}\). Since \(f_0\in ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}\), on the event \(\Omega _K^\prime \), there are strictly more than K / 2 blocks \(B_k\) such that \(P_{B_k}{{\mathcal {L}}}_{f_0}\ge P{{\mathcal {L}}}_{f_0}- (1/4) \bar{r}_2^2(\gamma ) = (3/4)\bar{r}_2^2(\gamma )\) and so \(P_{B_k}{{\mathcal {L}}}_{f}\ge (3/4)\bar{r}_2^2(\gamma )\). As a consequence, we have

$$\begin{aligned} \sup _{f \in F \backslash ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}} \,\, \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le (-3/4) \bar{r}_2^2(\gamma ). \end{aligned}$$
(35)

Moreover, on the event \(\Omega _K^\prime \), for all \(f\in ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}\), there are strictly more than K / 2 blocks \(B_k\) such that \(P_{B_k}(-{{\mathcal {L}}}_f)\le (1/4) \bar{r}_2^2(\gamma ) - P{{\mathcal {L}}}_f\le (1/4) \bar{r}_2^2(\gamma )\). Therefore,

$$\begin{aligned} \sup _{f\in ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}}\text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le (1/4) \bar{r}_2^2(\gamma ) . \end{aligned}$$
(36)

We conclude from (35) and (36) that \(\sup _{f\in F} \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le (1/4) \bar{r}_2^2(\gamma )\) and that every \(f\in F\) such that \(P{{\mathcal {L}}}_{f}> \bar{r}_2^2(\gamma )\) satisfies \(\text {MOM}_{K}\big (\ell _{f}-\ell _{f^*}\big )\ge (3/4)\bar{r}_2^2(\gamma )\). But, by definition of \({\hat{f}}\), we have

$$\begin{aligned} \text {MOM}_{K}\big (\ell _{{\hat{f}}}-\ell _{f^*}\big )\le \sup _{f\in F} \text {MOM}_{K}\big (\ell _{f^{*}}-\ell _f\big ) \le (1/4) \bar{r}_2^2(\gamma ) . \end{aligned}$$

Therefore, we necessarily have \(P{{\mathcal {L}}}_{{\hat{f}}}\le \bar{r}_2^2(\gamma )\). \(\square \)

Now, we prove that \(\Omega _K^\prime \) is an event of exponentially large probability, using arguments similar to those in the proof of Proposition 4.

Proposition 5

Grant Assumptions 1, 2 and 7 and assume that \((1-\beta )K\ge |{{\mathcal {O}}}|\) and \(\beta (1-1/12-32\gamma L)>1/2\). Then \(\Omega _K^\prime \) holds with probability larger than \(1-\exp (-\beta K/1152)\).

Sketch of proof

The proof of Proposition 5 follows the same lines as that of Proposition 4; let us specify the main differences. We set \({{\mathcal {F}}}^\prime = ({{\mathcal {L}}}_F)_{\bar{r}_2^2(\gamma )}\) and, for all \(f\in {{\mathcal {F}}}^\prime \), \(z^\prime (f) = \sum _{k=1}^K I\{|G_f(W_k)|\le (1/4) \bar{r}_2^2(\gamma )\}\), where \(G_f(W_k)\) is the same quantity as in the proof of Proposition 4. Consider also the contraction \(\phi \) introduced in the proof of Proposition 4. By definition of \(\bar{r}_2^2(\gamma )\) and \(V_K(\cdot )\), we have

$$\begin{aligned}&\mathbb {E}\phi (8 (\bar{r}_2^2(\gamma ))^{-1} | G_f(W_k)|)\\&\quad \le \mathbb {P} \bigg ( |G_f(W_k)| \ge \frac{\bar{r}_2^2(\gamma )}{8} \bigg ) \le \frac{64}{(\bar{r}_2^2(\gamma ))^2} \mathbb {E}G_f(W_k)^2 = \frac{64}{ (\bar{r}_2^2(\gamma ))^2} \mathbb {V}ar (P_{B_k}{{\mathcal {L}}}_f) \\&\quad \le \frac{64K^2}{(\bar{r}_2^2(\gamma ))^2N^2} \sum _{i \in B_k} \mathbb {V}ar_{P_i}({{\mathcal {L}}}_f) \le \frac{64K}{(\bar{r}_2^2(\gamma ))^2N} \sup \{\mathbb {V}ar_{P_i}({{\mathcal {L}}}_f):f\in {{\mathcal {F}}}^\prime , i\in {{\mathcal {I}}}\} \\&\quad \le \frac{64K}{(\bar{r}_2^2(\gamma ))^2N} \sup \{\mathbb {V}ar_{P_i}({{\mathcal {L}}}_f):P{{\mathcal {L}}}_f \le \bar{r}_2^2(\gamma ), i\in {{\mathcal {I}}}\} \le \frac{1}{24} . \end{aligned}$$

Using McDiarmid's inequality, the Giné-Zinn symmetrization argument, the contraction lemma (twice) and the Lipschitz property of the loss function, as in the proof of Proposition 4, we obtain that, with probability larger than \(1-\exp (-|{{\mathcal {K}}}|/1152)\), for all \(f\in {{\mathcal {F}}}^\prime \),

$$\begin{aligned} z^\prime (f)\ge |{{\mathcal {K}}}|(1-1/12) -\frac{32LK}{ N} \mathbb {E}\sup _{f\in {{\mathcal {F}}}^\prime } \frac{1}{\bar{r}_2^2(\gamma )}\left| \sum _{i\in \cup _{k\in {{\mathcal {K}}}}B_k} \sigma _i (f-f^*)(X_i)\right| . \end{aligned}$$
(37)

Now, it remains to use the definition of \(\bar{r}_2^2(\gamma )\) to bound the expected supremum on the right-hand side of (37) and get

$$\begin{aligned} \mathbb {E}\sup _{f\in {{\mathcal {F}}}^\prime } \frac{1}{\bar{r}_2(\gamma )^2}\left| \sum _{i\in \cup _{k\in {{\mathcal {K}}}}B_k} \sigma _i (f-f^*)(X_i)\right| \le \frac{\gamma |{{\mathcal {K}}}|N}{K}. \end{aligned}$$
(38)

\(\square \)

Proof of Theorem 4

The proof of Theorem 4 follows from Lemma 7 and Proposition 5 for \(\beta =4/7\) and \(\gamma = 1/(768L)\). \(\square \)

Proof of Lemma 1

Proof

We have

$$\begin{aligned} \frac{1}{\sqrt{N}} \mathbb {E} \sup _{ f \in F: \Vert f-f^{*}\Vert _{L_2 } \le r} \sum _{i=1}^N \sigma _i (f-f^{*})(X_i)&= \mathbb {E} \sup _{ t \in \mathbb {R}^d: \mathbb {E} \langle t,X \rangle ^2 \le r^2 } \langle t, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle . \end{aligned}$$

Let \(\Sigma =\mathbb {E}\big [XX^T\big ]\) denote the covariance matrix of X and consider its spectral decomposition \(\Sigma = QDQ^T\), where \(Q = [Q_1|\ldots |Q_d]\in {\mathbb {R}}^{d\times d}\) is an orthogonal matrix and D is a diagonal \(d\times d\) matrix with non-negative diagonal entries \(d_1,\ldots ,d_d\). For all \(t\in \mathbb {R}^d\), we have \(\mathbb {E}\langle X,t \rangle ^2 = t^T \Sigma t = \sum _{j=1}^d d_j \langle t,Q_j \rangle ^2\). Then

$$\begin{aligned}&\mathbb {E} \sup _{ t \in \mathbb {R}^d: \sqrt{\mathbb {E} \langle t,X \rangle ^2 } \le r } \langle t, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle \\&\quad = \mathbb {E} \sup _{ t \in \mathbb {R}^d: \sqrt{\mathbb {E} \langle t ,X \rangle ^2 } \le r } \langle \sum _{j=1}^d \langle t,Q_j \rangle Q_j , \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle \\&\quad = \mathbb {E} \sup _{ t \in \mathbb {R}^d: \sqrt{ \sum _{j=1}^d d_j \langle t,Q_j \rangle ^2 } \le r } \sum _{j=1:d_j\ne 0}^d \sqrt{d_j} \langle t,Q_j \rangle \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle \\&\quad \le r \mathbb {E} \sqrt{ \sum _{j=1:d_j\ne 0}^d \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle ^2} \le r \sqrt{ \mathbb {E} \sum _{j=1:d_j\ne 0}^d \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle ^2} . \end{aligned}$$

Moreover, for any j such that \(d_j\ne 0\),

$$\begin{aligned}&\mathbb {E} \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle ^2 \\&\quad = \mathbb {E} \frac{1}{N} \sum _{k,l=1}^N \sigma _l \sigma _k \langle \frac{Q_j}{\sqrt{d_j}}, X_k \rangle \langle \frac{Q_j}{\sqrt{d_j}}, X_l \rangle = \frac{1}{N} \sum _{k=1}^N \mathbb {E} \langle \frac{Q_j}{\sqrt{d_j}}, X_k \rangle ^2 \\&\quad = \frac{1}{N} \sum _{k=1}^N \bigg (\frac{Q_j}{\sqrt{d_j}}\bigg )^T \mathbb {E}\big [ X_kX_k^T\big ] \bigg (\frac{Q_j}{\sqrt{d_j}}\bigg ) = \frac{1}{N} \sum _{k=1}^N \bigg (\frac{Q_j}{\sqrt{d_j}}\bigg )^T \Sigma \bigg (\frac{Q_j}{\sqrt{d_j}}\bigg ) \end{aligned}$$

By orthonormality, \(Q^TQ_j = e_j\) and \(Q_j^TQ = e_j^T\); hence, for any j such that \(d_j\ne 0\),

$$\begin{aligned} \mathbb {E} \langle \frac{Q_j}{\sqrt{d_j}}, \frac{1}{\sqrt{N}} \sum _{i=1}^N \sigma _i X_i \rangle ^2 = \frac{1}{N} \sum _{k=1}^N \frac{1}{d_j} e_j^T D e_j = 1. \end{aligned}$$

Finally, we obtain

$$\begin{aligned} \frac{1}{\sqrt{N}} \mathbb {E} \sup _{ f \in F: \Vert f-f^{*}\Vert _{L_2 } \le r} \sum _{i=1}^N \sigma _i (f-f^{*})(X_i) \le r \sqrt{\sum _{j=1}^d \mathbf{1}_{\{d_j\ne 0\}}} = r \sqrt{\text {Rank}(\Sigma )} \end{aligned}$$

and therefore the fixed point \(\tilde{r}_2(\gamma )\) is such that

$$\begin{aligned} \tilde{r}_2(\gamma )&= \inf \bigg \{ r> 0: \forall J \subset \mathcal {I}, |J| \ge N/2, \\&\quad \mathbb {E}\sup _{t \in \mathbb {R}^d : \sqrt{\mathbb {E} \langle t-t^{*},X \rangle ^2 } \le r } \sum _{i \in J } \sigma _i \langle X_i,t-t^{*} \rangle \, \le r^2 |J| \gamma \bigg \}\\&\le \inf \bigg \{ r > 0: \forall J \subset \mathcal {I}, |J| \ge N/2, \quad r\sqrt{\text {Rank}(\Sigma )} \le r^2 \sqrt{|J|} \gamma \bigg \} \\&\le \sqrt{\frac{2\,\text {Rank}(\Sigma )}{\gamma ^2N}}. \end{aligned}$$

\(\square \)
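
The closed-form supremum computed in this proof is easy to check by simulation. A small numpy sketch, under the assumption \(X\sim N(0,\Sigma )\) with a rank-deficient \(\Sigma \) (our own toy set-up): the supremum equals \(r\) times the Euclidean norm of \(\Sigma ^{\dagger /2}N^{-1/2}\sum _i \sigma _iX_i\), where \(\Sigma ^{\dagger /2}\) is the square root of the pseudo-inverse, and its Monte Carlo mean should not exceed \(r\sqrt{\text {Rank}(\Sigma )}\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, N, r, reps = 20, 5, 200, 1.0, 2000

B = rng.normal(size=(d, rank))
Sigma = B @ B.T                        # covariance of rank 5 in dimension 20
w, Q = np.linalg.eigh(Sigma)
keep = w > 1e-10 * w.max()
Pinv_sqrt = (Q[:, keep] * w[keep] ** -0.5) @ Q[:, keep].T   # Sigma^{+1/2}

sups = []
for _ in range(reps):
    X = rng.normal(size=(N, rank)) @ B.T       # rows X_i ~ N(0, Sigma)
    sigma = rng.choice([-1.0, 1.0], size=N)    # Rademacher signs
    Z = sigma @ X / np.sqrt(N)
    # sup over {t : t^T Sigma t <= r^2} of <t, Z>, in closed form:
    sups.append(r * np.linalg.norm(Pinv_sqrt @ Z))
print(np.mean(sups), r * np.sqrt(rank))        # ~2.13 versus the bound ~2.24
```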

Proofs of the results of Sect. 5

We begin this section with a simple lemma that follows from the convexity of F.

Lemma 8

For any \(f \in F\),

$$\begin{aligned} \lim _{t \rightarrow 0^+} \frac{R(f^*+t(f-f^*)) - R(f^*)}{t} \ge 0 \end{aligned}$$

where we recall that \(R(f) = \mathbb {E}_{(X,Y) \sim P} [\ell _f(X,Y)]\).

Proof

Let \(t \in (0,1)\). By convexity of F, \(f^* + t(f-f^*) \in F\), and \(R(f^*+t(f-f^*)) - R(f^*) \ge 0\) because \(f^*\) minimizes the risk over F. The difference quotient in the statement is therefore non-negative for every \(t\in (0,1)\), and so is its limit as \(t \rightarrow 0^+\). \(\square \)

1.1 Proof of Theorem 6

Let \(r >0\). Let \(f\in F\) be such that \(\left\| f-f^*\right\| _{L_2}\le r\). For all \(x\in {{\mathcal {X}}}\) denote by \(F_{Y|X=x}\) the conditional c.d.f. of Y given \(X=x\). We have

$$\begin{aligned} \mathbb {E} \bigg [ \ell _f(X,Y) | X = x \bigg ]&= (\tau -1) \int \mathbf{1}_{y \le f(x)} (y-f(x))F_{Y|X=x}(\mathrm{d}y) \\&\quad + \tau \int \mathbf{1}_{y> f(x)} (y-f(x))F_{Y|X=x}(\mathrm{d}y) \\&= \int \mathbf{1}_{y > f(x)} (y-f(x))F_{Y|X=x}(\mathrm{d}y) \\&\quad + (\tau -1) \int _{\mathbb {R}} (y-f(x))F_{Y|X=x}(\mathrm{d}y). \end{aligned}$$

By Fubini’s theorem,

$$\begin{aligned} \int \mathbf{1}_{z \ge f(x)} (1-F_{Y|X=x}(z))\text {d}z&= \int \mathbf{1}_{z \ge f(x)}\bigg ( 1 - {\mathbb {P}}(Y \le z |X=x ) \bigg ) \mathrm{d}z \\&= \int \mathbf{1}_{z \ge f(x)} \mathbb {E} [ \mathbf{1}_{ Y> z }|X=x ] \text {d}z\\&= \int \int \mathbf{1}_{ y> z \ge f(x) } f_{Y|X=x}(y)\text {d}y \text {d}z \\&= \int \mathbf{1}_{ y> f(x) }(y-f(x)) f_{Y|X=x}(y) \text {d}y\\&= \int \mathbf{1}_{y > f(x)} (y-f(x)) F_{Y|X=x}(\mathrm{d}y). \end{aligned}$$

Therefore,

$$\begin{aligned}&\mathbb {E} \bigg [ \ell _f(X,Y) | X=x\bigg ] \\&\quad = \int \mathbf{1}_{y \ge f(x)} (1-F_{Y|X=x}(y))\text {d}y + (\tau -1) \bigg ( \int _{\mathbb {R}} yF_{Y|X=x}(\mathrm{d}y) - f(x) \bigg ) \\&\quad = g(x,f(x)) + (\tau -1 ) \int _{\mathbb {R}} yF_{Y|X=x}(\mathrm{d}y) \end{aligned}$$

where \(g:(x,a)\in {{\mathcal {X}}}\times {\mathbb {R}}\mapsto \int \mathbf{1}_{y \ge a} (1 - F_{Y|X=x}(y))\text {d}y + (1-\tau )a\). It follows that

$$\begin{aligned} P\mathcal {L}_f=\mathbb {E}[g(X,f(X))-g(X,f^*(X))]. \end{aligned}$$
(39)

Since for all \(x \in {{\mathcal {X}}}\), \(a \mapsto g(x,a)\) is twice differentiable, from a second order Taylor expansion we get

$$\begin{aligned} P{{\mathcal {L}}}_f&= \mathbb {E}\bigg [ g(X,f(X)) - g(X,f^*(X)) \bigg ] \\&= \mathbb {E}\bigg [ \frac{\partial g(X,a)}{\partial a}(f^*(X)) (f(X)-f^*(X)) \bigg ] \\&\quad + \frac{1}{2} \int _{x \in {{\mathcal {X}}}} \frac{\partial ^2 g(x,a)}{\partial a^2}(z_x) (f(x)-f^*(x))^2 dP_X(x) \end{aligned}$$

where for all \(x\in {{\mathcal {X}}}\), \(z_x\) is some point in \(\big [\min (f(x), f^{*}(x)),\max (f(x), f^{*}(x)) \big ]\). For the first order term, we have

$$\begin{aligned}&\mathbb {E}\bigg [ \frac{\partial g(X,a)}{\partial a}(f^*(X)) (f(X)-f^*(X)) \bigg ] \\&\quad = \mathbb {E}\lim _{t \rightarrow 0^+} \frac{g(X,f^*(X)+ t(f(X)-f^*(X)) - g(X,f^*(X))}{t}. \end{aligned}$$

For all \(x \in {{\mathcal {X}}}\), we have \(\big |g(x,f^*(x)+t(f(x)-f^*(x))) - g(x,f^*(x))\big |/t \le (2-\tau ) |f(x)-f^*(x)|\), which is integrable with respect to \(P_X\). Thus, by the dominated convergence theorem, we may interchange limit and integral, and using Lemma 8, we obtain

$$\begin{aligned}&\mathbb {E}\bigg [ \frac{\partial g(X,a)}{\partial a}(f^*(X)) (f(X)-f^*(X)) \bigg ] \\&\quad = \lim _{t \rightarrow 0^+} {\mathbb {E}}\frac{g(X,f^*(X)+ t(f(X)-f^*(X)) - g(X,f^*(X))}{t} \\&\quad = \lim _{t \rightarrow 0^+} \frac{R(f^*+ t(f-f^*)) - R(f^*)}{t} \ge 0. \end{aligned}$$

Given that, for all \(x\in {{\mathcal {X}}}\) and \(z\in {\mathbb {R}}\), \(\frac{\partial ^2 g (x,a)}{\partial a^2} (z)= f_{Y|X=x}(z)\), it follows that

$$\begin{aligned} P{{\mathcal {L}}}_f \ge \frac{1}{2} \int _{x \in {{\mathcal {X}}}} f_{Y|X=x}(z_x) (f(x)-f^*(x))^2 dP_X(x). \end{aligned}$$

Consider \(A = \{ x \in \mathcal {X}: |f(x)-f^{*}(x)| \le (\sqrt{2}C')^{(2+\varepsilon )/\varepsilon } r \}\). Given that \(\Vert f-f^{*}\Vert _{L_2 }\le r\), by Markov's inequality, \(P(X \in A) \geqslant 1-1/(\sqrt{2}C')^{(4+2\varepsilon )/\varepsilon }\). From Assumption 10 we get

$$\begin{aligned} \frac{2P\mathcal {L}_f}{\alpha }&\geqslant \mathbb {E} [ I_A(X) (f(X)-f^{*}(X))^2 ] =\Vert f-f^*\Vert _{L_2 }^2-\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]. \end{aligned}$$
(40)

By Hölder's and Markov's inequalities,

$$\begin{aligned} \mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant & {} \big ( \mathbb {E} [ I_{A^c}(X)] \big )^{\varepsilon /(2+\varepsilon )} \big ( \mathbb {E} [ (f(X)-f^{*}(X))^{2+\varepsilon } ] \big )^{2/(2+\varepsilon )} \\\leqslant & {} \frac{ \Vert f-f^{*}\Vert _{L_{2+\varepsilon }}^2}{2(C')^2}. \end{aligned}$$

By Assumption 9, it follows that \(\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant \Vert f-f^{*}\Vert _{L_2}^2/2\), and we conclude with (40).
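
The identity \(\partial ^2 g(x,a)/\partial a^2 = f_{Y|X=x}(a)\) used above is easy to check numerically. A short numpy sketch under the assumption \(Y|X=x\sim N(m,1)\) (the conditional law, the grid and the step sizes are ours): a second-order finite difference of a quadrature approximation of \(g(x,\cdot )\) is compared with the Gaussian density:

```python
import numpy as np

tau, m, h = 0.7, 0.3, 1e-2            # quantile level, E[Y|X=x], FD step
ys, dy = np.linspace(-12.0, 12.0, 400_001, retstep=True)
pdf = np.exp(-0.5 * (ys - m) ** 2) / np.sqrt(2 * np.pi)  # density of Y|X=x

def G(a):
    """g(x, a) = E[rho_tau(Y - a) | X = x] by Riemann-sum quadrature."""
    u = ys - a
    return float(np.sum(np.where(u > 0, tau * u, (tau - 1) * u) * pdf) * dy)

a = 0.5
print((G(a + h) - 2 * G(a) + G(a - h)) / h ** 2)         # second difference
print(np.exp(-0.5 * (a - m) ** 2) / np.sqrt(2 * np.pi))  # f_{Y|X=x}(a) = 0.3910...
```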

1.2 Proof of Theorem 7

Let \(r>0\). Let \(f\in F\) be such that \(\left\| f-f^*\right\| _{L_2}\le r\). We have

$$\begin{aligned} P{{\mathcal {L}}}_f&= \mathbb {E}_X \mathbb {E} \bigg [ \rho _H(Y-f(x)) - \rho _H(Y-f^*(x)) | X= x \bigg ] \\&= \mathbb {E}\big [g(X, f(X)) - g(X, f^*(X))\big ] \end{aligned}$$

where \(g:(x,a)\in {{\mathcal {X}}}\times {\mathbb {R}}\mapsto \mathbb {E}[\rho _H(Y-a)|X=x]\). Let \(F_{Y|X=x}\) denote the c.d.f. of Y given \(X=x\). Since, for all \(x \in {{\mathcal {X}}}\), \(a \mapsto g(x,a)\) is twice differentiable (see Lemma 2.1 in [15]), a second order Taylor expansion yields

$$\begin{aligned} P{{\mathcal {L}}}_f&= \mathbb {E}\bigg [ \frac{\partial g(X,a)}{\partial a}(f^*(X))(f(X)-f^*(X)) \bigg ] \\&\quad + \frac{1}{2} \int _{x \in {{\mathcal {X}}}} (f(x)-f^{*}(x))^2 \frac{\partial ^2 g(x,a)}{\partial a^2}(z_x) dP_X(x) \end{aligned}$$

where for all \(x\in {{\mathcal {X}}}\), \(z_x\) is some point in \([\min (f(x), f^{*}(x)), \max (f(x), f^{*}(x))]\). By Lemma 8, with the same reasoning as the one in Sect. C.1, we get

$$\begin{aligned} P{{\mathcal {L}}}_f \ge \frac{1}{2} \int _{x \in {{\mathcal {X}}}} (f(x)-f^{*}(x))^2 \frac{\partial ^2 g(x,a)}{\partial a^2}(z_x) dP_X(x) . \end{aligned}$$

Moreover, for all \(z \in {\mathbb {R}}\),

$$\begin{aligned} \frac{\partial ^2 g(x,a)}{\partial a^2}(z)&= F_{Y|X=x}(z + \delta ) - F_{Y|X=x}(z-\delta ). \end{aligned}$$

Now, let \(A = \{ x \in {{\mathcal {X}}}: |f(x)-f^{*}(x)| \le (\sqrt{2}C')^{(2+\varepsilon )/\varepsilon } r \} \). It follows from Assumption 10 that \(P{{\mathcal {L}}}_f \ge (\alpha /2) \mathbb {E} [(f(X)-f^{*}(X))^2 I_A(X)]\). Since \(\Vert f-f^{*}\Vert _{L_2 }\le r\), by Markov's inequality, \(P(X \in A) \geqslant 1-1/(\sqrt{2}C')^{(4+2\varepsilon )/\varepsilon }\). By Hölder's and Markov's inequalities,

$$\begin{aligned} \mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant & {} \big ( \mathbb {E} [ I_{A^c}(X)] \big )^{\varepsilon /(2+\varepsilon )} \big ( \mathbb {E} [ (f(X)-f^{*}(X))^{2+\varepsilon } ] \big )^{2/(2+\varepsilon )} \\\leqslant & {} \frac{ \Vert f-f^{*}\Vert _{L_{2+\varepsilon }}^2}{2(C')^2}. \end{aligned}$$

By Assumption 9, it follows that \(\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant \frac{\Vert f-f^{*}\Vert _{L_2}^2}{2}\), which concludes the proof.
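
As in the previous proof, the curvature formula \(\partial ^2 g(x,a)/\partial a^2 = F_{Y|X=x}(a+\delta )-F_{Y|X=x}(a-\delta )\) can be verified by finite differences, here assuming \(Y|X=x\sim N(m,1)\) and writing \(\delta \) for the Huber parameter; the numerical set-up is ours:

```python
import numpy as np
from math import erf

delta, m, h = 0.8, 0.3, 1e-2          # Huber parameter, E[Y|X=x], FD step
ys, dy = np.linspace(-12.0, 12.0, 400_001, retstep=True)
pdf = np.exp(-0.5 * (ys - m) ** 2) / np.sqrt(2 * np.pi)  # density of Y|X=x

def G(a):
    """g(x, a) = E[rho_H(Y - a) | X = x], rho_H the Huber function."""
    u = np.abs(ys - a)
    rho = np.where(u <= delta, 0.5 * u ** 2, delta * u - 0.5 * delta ** 2)
    return float(np.sum(rho * pdf) * dy)

F = lambda z: 0.5 * (1 + erf((z - m) / np.sqrt(2)))      # c.d.f. of Y|X=x
a = 0.5
print((G(a + h) - 2 * G(a) + G(a - h)) / h ** 2)         # second difference
print(F(a + delta) - F(a - delta))                       # = 0.5670...
```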

1.3 Proof of Theorem 8

Let \(r>0\). Let \(f\in F\) be such that \(\left\| f-f^*\right\| _{L_2}\le r\). Let \(\eta (x) = P(Y=1|X=x)\). Write first that \(P{{\mathcal {L}}}_f = \mathbb {E} \big [ g(X, f(X)) - g(X, f^*(X))\big ]\), where, for all \(x\in {{\mathcal {X}}}\) and \(a\in {\mathbb {R}}\), \(g(x,a) = \eta (x) \log (1+\exp (-a)) + (1-\eta (x))\log (1+\exp (a))\). From Lemma 8 and the same reasoning as in Sects. C.1 and C.2, we get

$$\begin{aligned} P{{\mathcal {L}}}_f&\ge \int _{x \in {{\mathcal {X}}}} \frac{\partial ^2 g(x,a)}{\partial a^2}(z_x)\frac{(f(x)-f^{*}(x))^2 }{2} dP_X(x) \\&= \int _{x \in {{\mathcal {X}}}} \frac{e^{z_x}}{(1+e^{z_x})^2}\frac{(f(x)-f^{*}(x))^2 }{2} dP_X(x) \end{aligned}$$

for some \(z_x \in [\min (f(x), f^{*}(x)), \max (f(x), f^{*}(x))]\). Now, let

$$\begin{aligned} A = \left\{ x \in \mathcal {X}: |f^{*}(x)|\le c_0, |f(x)-f^{*}(x)|\le (2C')^{(2+\varepsilon )/\varepsilon } r \right\} . \end{aligned}$$

On the event A, \(|z_x| \le c_0 + (2C')^{(2+\varepsilon )/\varepsilon } r\), so that

$$\begin{aligned} P\mathcal {L}_f \ge \frac{e^{- c_0 -(2C')^{(2+\varepsilon )/\varepsilon } r } }{2\big ( 1+ e^{c_0 + (2C')^{(2+\varepsilon )/\varepsilon } r } \big )^2 } \mathbb {E} [I_A(X) (f(X)-f^{*}(X))^2]. \end{aligned}$$

Using the fact that \(P(X \notin A) \le P(|f^*(X)|> c_0) + P(|f(X)-f^*(X)| > (2C')^{(2+\varepsilon )/\varepsilon } r ) \le 2/(2C')^{(4+2\varepsilon )/\varepsilon } \), we conclude with Assumption 9 and the same analysis as in the two previous proofs.

1.4 Proof of Theorem 9

Let \(r>0\) be such that \( r(\sqrt{2} C')^{(2+\varepsilon )/\varepsilon } \le 1\) and let \(f\in F\) be such that \(\Vert f-f^*\Vert _{L_2} \le r\). For all x in \(\mathcal {X}\), denote \(\eta (x) = {\mathbb {P}}(Y=1 | X=x) \). It is easy to verify that the Bayes estimator (which is equal to the oracle here) is given by \(f^*(x) = \text{ sign }(2\eta (x)-1)\). Consider the set \(A= \{ x \in {\mathcal {X}}: |f(x)-f^*(x)| \le r(\sqrt{2} C')^{(2+\varepsilon )/\varepsilon } \}\). Since \(\Vert f-f^*\Vert _{L_2} \le r\), by Markov's inequality, \({\mathbb {P}}(X \in A) \ge 1-1/(\sqrt{2} C')^{(4+ 2\varepsilon )/\varepsilon }\). Let x be in A. If \(f^*(x) = -1\) (i.e. \(2\eta (x) \le 1\)) and \(f(x) \le f^*(x) = -1\), we obtain

$$\begin{aligned} {\mathbb {E}} \big [ \ell _f(X,Y) | X= x \big ]- {\mathbb {E}} \big [ \ell _{f^*}(X,Y) | X= x \big ]= & {} \eta (x)(1-f(x)) - \eta (x) (1-f^*(x))\\\ge & {} \eta (x) \big (f(x)-f^*(x) \big )^2 \end{aligned}$$

where we used the fact that on A, \(|f(x)-f^*(x)| \le r(\sqrt{2} C')^{(2+\varepsilon )/\varepsilon } \le 1\). Using the same analysis for the other cases, we get

$$\begin{aligned}&{\mathbb {E}} \big [ \ell _f(X,Y) | X= x \big ]- {\mathbb {E}} \big [ \ell _{f^*}(X,Y) | X= x \big ] \\&\quad \ge \min \big (\eta (x),1-\eta (x), |1-2\eta (x)| \big ) \big (f(x)-f^*(x) \big )^2 \\&\quad \ge \alpha \big (f(x)-f^*(x) \big )^2. \end{aligned}$$

Therefore,

$$\begin{aligned} \frac{P\mathcal {L}_f}{\alpha }&\geqslant \mathbb {E} [ I_A(X) (f(X)-f^{*}(X))^2 ] =\Vert f-f^*\Vert _{L_2 }^2-\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]. \end{aligned}$$
(41)

By Hölder's and Markov's inequalities,

$$\begin{aligned} \mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant & {} \big ( \mathbb {E} [ I_{A^c}(X)] \big )^{\varepsilon /(2+\varepsilon )} \big ( \mathbb {E} [ (f(X)-f^{*}(X))^{2+\varepsilon } ] \big )^{2/(2+\varepsilon )} \\\leqslant & {} \frac{ \Vert f-f^{*}\Vert _{L_{2+\varepsilon }}^2}{2(C')^2}. \end{aligned}$$

By Assumption 9, it follows that \(\mathbb {E} [ I_{A^c}(X) (f(X)-f^{*}(X))^2 ]\leqslant \frac{\Vert f-f^{*}\Vert _{L_2}^2}{2}\) and we conclude with (41).


About this article

Cite this article

Chinot, G., Lecué, G. & Lerasle, M. Robust statistical learning with Lipschitz and convex loss functions. Probab. Theory Relat. Fields 176, 897–940 (2020). https://doi.org/10.1007/s00440-019-00931-3
