Statistical learning based on Markovian data: maximal deviation inequalities and learning rates


Abstract

In statistical learning theory, numerous works have established non-asymptotic bounds assessing the generalization capacity of empirical risk minimizers under a large variety of complexity assumptions on the class of decision rules over which optimization is performed. These bounds are obtained through sharp control of the uniform deviation of i.i.d. averages from their expectations, and thus ignore any possible dependence across the training data. It is the purpose of this paper to show that similar results can be obtained when statistical learning is based on a data sequence drawn from a (Harris positive) Markov chain X, through the running example of the estimation of minimum volume sets (MV-sets) related to X’s stationary distribution, an unsupervised statistical learning approach to anomaly/novelty detection. Based on novel maximal deviation inequalities that we establish using the regenerative method, learning rate bounds are derived that depend not only on the complexity of the class of candidate sets but also on the ergodicity rate of the chain X, expressed in terms of tail conditions on the length of the regenerative cycles. In particular, this approach, fully tailored to Markovian data, permits the rate bounds to be interpreted in frequentist terms, in contrast to alternative coupling techniques based on mixing conditions: the larger the expected number of regenerative cycles over a trajectory of finite length, the more accurate the MV-set estimates. Beyond the theoretical analysis, this phenomenon is supported by illustrative numerical experiments.

References

1. Adamczak, R., Bednorz, W.: Exponential concentration inequalities for additive functionals of Markov chains. ESAIM: PS 19, 440–481 (2015)

2. Adams, T.M., Nobel, A.B.: Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. Ann. Probab. 38, 1345–1367 (2010)

3. Agarwal, A., Duchi, J.: The generalization ability of online algorithms for dependent data. IEEE Trans. Inf. Theory 59(1), 573–587 (2013)

4. Alquier, P., Wintenberger, O.: Model selection for weakly dependent time series forecasting. Bernoulli 18, 883–913 (2012)

5. Asmussen, S.: Applied Probability and Queues. Springer, New York (2003)

6. Bertail, P., Ciołek, G.: New Bernstein and Hoeffding type inequalities for regenerative Markov chains. ALEA Lat. Am. J. Probab. Math. Stat. 16, 259–277 (2019)

7. Bertail, P., Clémençon, S.: Edgeworth expansions for suitably normalized sample mean statistics of atomic Markov chains. Probab. Theory Relat. Fields 130(3), 388–414 (2004)

8. Bertail, P., Clémençon, S.: A renewal approach to Markovian U-statistics. Math. Methods Statist. 20(2), 79–105 (2011)

9. Bertail, P., Clémençon, S.: Regenerative block bootstrap for Markov chains. Bernoulli 12(4), 689–712 (2006)

10. Bertail, P., Clémençon, S.: Sharp bounds for the tails of functionals of Markov chains. Theory Probab. Appl. 54(3), 505–515 (2010)

11. Ciołek, G.: Bootstrap uniform central limit theorems for Harris recurrent Markov chains. Electron. J. Stat. 10, 2157–2178 (2016)

12. Clémençon, S., Bertail, P., Papa, G.: Learning from survey training samples: rate bounds for Horvitz-Thompson risk minimizers. In: Proceedings of ACML’16 (2016)

13. de la Peña, V., Giné, E.: Decoupling: From Dependence to Independence. Springer, Berlin (1999)

14. Di, J., Kolaczyk, E.: Complexity-penalized estimation of minimum volume sets for dependent data. J. Multivar. Anal. 101(9), 1910–1926 (2010)

15. Einmahl, J.H.J., Mason, D.M.: Generalized quantile processes. Ann. Stat. 20, 1062–1078 (1992)

16. Giné, E., Zinn, J.: Some limit theorems for empirical processes (with discussion). Ann. Probab. 12(4), 929–998 (1984)

17. Hairer, M., Mattingly, J.C.: Yet another look at Harris’ ergodic theorem for Markov chains. In: Seminar on Stochastic Analysis, Random Fields and Applications VI, Progr. Probab. 63, 109–117 (2011)

18. Hanneke, S.: Learning whenever learning is possible: universal learning under general stochastic processes. arXiv:1706.01418 (2017)

19. Jain, N., Jamison, B.: Contributions to Doeblin’s theory of Markov processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 8, 19–40 (1967)

20. Koltchinskii, V.: Rademacher penalties and structural risk minimization. IEEE Trans. Inf. Theory 47, 1902–1914 (2001)

21. Kuznetsov, V., Mohri, M.: Generalization bounds for time series prediction with non-stationary processes. In: Proceedings of ALT’14 (2014)

22. Massart, P.: Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse 9, 245–303 (2000)

23. McGoff, K., Nobel, A.B.: Empirical risk minimization and complexity of dynamical models. Submitted (2018)

24. Merlevède, F., Peligrad, M.: Rosenthal-type inequalities for the maximum of partial sums of stationary processes and examples. Ann. Probab. 41, 914–960 (2013)

25. Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Springer, Berlin (1996)

26. Montgomery-Smith, S.J.: Comparison of sums of independent identically distributed random vectors. Probab. Math. Statist. 14, 281–285 (1993)

27. Nummelin, E.: A splitting technique for Harris recurrent Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 43, 309–318 (1978)

28. Peligrad, M.: The r-quick version of the strong law for stationary ϕ-mixing sequences. In: Almost Everywhere Convergence (Columbus, OH, 1988). Academic Press, Boston (1989)

29. Petrov, V.V.: Limit Theorems of Probability Theory: Sequences of Independent Random Variables. Oxford Studies in Probability. Clarendon Press, Oxford (1995)

30. Polonik, W.: Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications 69(1), 1–24 (1997)

31. Revuz, D.: Markov Chains, 2nd edn. North-Holland, Amsterdam (1984)

32. Rosenthal, H.P.: On the subspaces of L^p (p > 2) spanned by sequences of independent random variables. Israel J. Math. 8, 273–303 (1970)

33. Scott, C., Nowak, R.: Learning minimum volume sets. J. Mach. Learn. Res. 7, 665–704 (2006)

34. Shao, Q.: Maximal inequalities for partial sums of ρ-mixing sequences. Ann. Probab. 23, 948–965 (1995)

35. Steinwart, I., Christmann, A.: Fast learning from non-i.i.d. observations. In: Advances in Neural Information Processing Systems 22, pp. 1768–1776 (2009)

36. Steinwart, I., Hush, D., Scovel, C.: Learning from dependent observations. J. Multivar. Anal. 100(1), 175–194 (2009)

37. Thorisson, H.: Coupling, Stationarity and Regeneration. Springer, Berlin (2000)

38. Tuominen, P., Tweedie, R.L.: Subgeometric rates of convergence of f-ergodic Markov chains. Adv. Appl. Probab. 26, 775–798 (1994)

39. Utev, S.A.: Sums of random variables with ϕ-mixing. Sib. Adv. Math. 1, 124–155 (1991)

40. van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer, Berlin (1996)

41. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)

42. Viennet, G.: Inequalities for absolutely regular sequences: application to density estimation. Probab. Theory Relat. Fields 107, 467–492 (1997)

Acknowledgements

This research was supported by a public grant as part of the Investissement d’avenir project, reference ANR-11-LABX-0056-LMH. Gabriela Ciołek was also supported by the Polish National Science Centre NCN (grant No. UMO2016/23/N/ST1/01355) and (partly) by the Ministry of Science and Higher Education. This research has also been conducted as part of the project Labex MME-DII (ANR11-LBX-0023-01). Part of this research was carried out during a stay of Gabriela Ciołek at the Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan.

Corresponding author

Correspondence to Stephan Clémençon.

Appendix: Technical proofs

1.1 Moment and probability inequalities in the i.i.d. setup

Since the main probabilistic results of the paper are established by means of the regenerative approach (see Section 2.1), their proofs partly rely on certain moment/probability inequalities for the i.i.d. case, which we recall below for clarity. Rosenthal’s inequality for i.i.d. random variables can be found in [32]. The version stated below (see Theorem 2.10 in [29]) seems more appropriate for the statistical learning applications considered in this paper.

Theorem 6.1

Let X1,⋯ ,Xn be integrable centered i.i.d. random variables and p ≥ 2. Assume that \(\mathbb {E}|X_{i}|^{p}<\infty \). Then, for all 𝜖 > 0, we have:

$$ \mathbb{P}\left( \left\vert \frac{1}{n}\sum\limits_{i=1}^{n}X_{i}\right\vert \geq \epsilon \right) \leq\frac{c_{p} \mathbb{E}|X_{1}|^{p}}{\epsilon^{p}n^{p/2}}, $$

where \(c_{p} = 2\max \nolimits \left (p^{p}, p^{p/2 + 1}e^{p} {\int \nolimits }_{0}^{\infty } x^{p/2-1}(1-x)^{-p}dx \right )\).
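The bound is a direct consequence of Rosenthal’s moment inequality combined with Markov’s inequality; for the reader’s convenience, a sketch of this standard two-step argument (writing \(c^{\prime}_{p}\) for the constant of the moment inequality, so that \(c_{p}=2c^{\prime}_{p}\)) reads:

$$ \mathbb{P}\left( \left\vert \frac{1}{n}\sum\limits_{i=1}^{n}X_{i}\right\vert \geq \epsilon \right) \leq \frac{\mathbb{E}\left\vert {\sum}_{i=1}^{n}X_{i}\right\vert^{p}}{n^{p}\epsilon^{p}} \leq \frac{c^{\prime}_{p}\left( n \mathbb{E}|X_{1}|^{p} + \left( n \mathbb{E}[{X_{1}^{2}}]\right)^{p/2}\right)}{n^{p}\epsilon^{p}} \leq \frac{2 c^{\prime}_{p}\, \mathbb{E}|X_{1}|^{p}}{\epsilon^{p}n^{p/2}}, $$

where the last step uses \(n \leq n^{p/2}\) and \(\mathbb{E}[{X_{1}^{2}}]^{p/2} \leq \mathbb{E}|X_{1}|^{p}\) (Jensen’s inequality, since p ≥ 2).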

The constant \(c_{p}\) given above is due to [32]. The second result recalled here is Montgomery-Smith’s inequality; see [26].

Theorem 6.2 (Montgomery-Smith’s inequality)

Let X1,⋯ ,Xn be integrable centered i.i.d. random variables. Then, for \(1 \leq k \leq n < \infty \) and all t > 0, we have

$$ \mathbb{P}\left( \max_{1 \leq k \leq n} \left|\sum\limits_{i=1}^{k} X_{i}\right| > t\right) \leq 9 \mathbb{P}\left( \left|\sum\limits_{i=1}^{n} X_{i}\right| > t/30\right). $$
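Since the inequality is distribution-free with explicit constants, it is easy to check numerically. A minimal Monte Carlo sketch (the Rademacher variables and the values of n and t below are arbitrary illustrative choices):

```python
import numpy as np

# Monte Carlo sanity check of Montgomery-Smith's inequality (Theorem 6.2):
# P( max_k |S_k| > t ) <= 9 P( |S_n| > t/30 ) for centered i.i.d. X_i.
# Rademacher X_i and the values of n, t are illustrative choices only.
rng = np.random.default_rng(0)
n, n_mc, t = 200, 20_000, 10.0

X = rng.choice([-1.0, 1.0], size=(n_mc, n))     # centered i.i.d. signs
S = np.cumsum(X, axis=1)                        # partial sums S_1, ..., S_n

lhs = np.mean(np.abs(S).max(axis=1) > t)        # P( max_k |S_k| > t )
rhs = 9 * np.mean(np.abs(S[:, -1]) > t / 30)    # 9 P( |S_n| > t/30 )
print(f"lhs = {lhs:.4f} <= rhs = {rhs:.4f}")    # the bound is quite loose
```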

Lemma 6.1

Suppose that Assumption 3.3 holds. Then we have

$$ \mathbb{P}_{\nu}\left( n^{1/2}\left( \frac{l_{n}}{n}-\frac{1}{\mathbb{E}_{A}[\tau_{A}]}\right)\geq N\right) \leq \frac{4^{p} (2^{p} + 1) }{\mathbb{E}_{A}[\tau_{A}]^{p} N^{p}} + \frac{4^{p} (2^{p} + 1) }{ N^{p/2} n^{p/4}}. $$

The proof is a simple generalization of Lemma 3.6 in [6] and thus omitted.
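To make the regenerative quantities at play concrete, here is a minimal simulation sketch; the toy chain below is an assumption made purely for illustration: it has atom A = {0}, from which it jumps to J ~ Geometric(q) − 1 and then moves down by one unit per step, so that \(\mathbb{E}_{A}[\tau_{A}]=1/q\).

```python
import numpy as np

# Toy regenerative Markov chain with atom A = {0} (illustrative assumption):
# from 0 jump to J ~ Geometric(q) - 1, then decrease by one per step,
# so the return time to the atom satisfies E_A[tau_A] = 1/q.
rng = np.random.default_rng(1)
q, n = 0.25, 50_000

X = np.zeros(n, dtype=int)
for t in range(1, n):
    X[t] = X[t - 1] - 1 if X[t - 1] > 0 else rng.geometric(q) - 1

visits = np.flatnonzero(X == 0)       # successive visits to the atom
blocks = [X[s + 1 : e + 1] for s, e in zip(visits[:-1], visits[1:])]
l_n = len(blocks)                     # number of complete regenerative cycles

# Lemma 6.1 controls the deviation of l_n / n from 1 / E_A[tau_A] = q.
print(f"l_n / n = {l_n / n:.4f}   vs   1 / E_A[tau_A] = {q:.4f}")
```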

1.2 Proof of Theorem 3.5

We prove the more specific version of the result, stated below.

Theorem 6.3 (Polynomial tail maximal inequality for regenerative Markov chains)

Assume that Assumptions 3.2, 3.3 and 3.4 are satisfied by the chain \(X= (X_{n})_{n \in \mathbb {N}}\). Then, for any x > 0, any 0 < 𝜖 < x/2, any N > 0 and all n ≥ 1, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\nu}\left( \sup_{f \in \mathcal{F}} \left| \frac{1}{n} {\sum}_{i=1}^{n} \bar{f}(\mathcal{B}_{i})\right| \geq x\right) &\leq& \mathcal{N}_{1}\left( \epsilon, \mathcal{F}\right) \left[ \frac{3^{p}\left( \mathbb{E}_{\nu}\left[\left( \tilde{H}(\mathcal{B}_{1})\right)^{p} \right]+\mathbb{E}_{A}\left[\left( \tilde{H}(\mathcal{B}_{1})\right)^{p} \right]\right)}{ n^{p} (x-2\epsilon)^{p}} \right.\\ && \left. + \frac{18 \times 90^{p} C_{p} \mathbb{E}_{A}\left[(\tilde{H}(\mathcal{B}_{1}))^{p}\right] }{n^{p/2}(x-2\epsilon)^{p} }\right.\\ &&\left.+\frac{6^{p} C_{p} \mathbb{E}_{A}\left[(\tilde{H}(\mathcal{B}_{1}))^{p}\right] N^{p}}{n^{3p/4}(x-2\epsilon)^{p}}\right.\\ &&\left.+ \mathbb{P}_{\nu}\left( n^{1/2}\left( \frac{l_{n}}{n}-\frac{1}{\mathbb{E}_{A}[\tau_{A}]}\right)\geq N\right)\right], \end{array} $$

where\(C_{p} = 24\max \nolimits \left (p^{p}, p^{p/2 + 1}e^{p} {\int \nolimits }_{0}^{\infty } x^{p/2-1}(1-x)^{-p}dx \right )\)and\(\tilde {H}=H+\mu (H)\).

Proof

The techniques used in the proof are similar to those of the proof of Theorem 3.14 in [6]. Uniform covering. We choose functions g1,g2,…,gM in the class \(\mathcal {F}\) defining an 𝜖-covering of \(\mathcal {F}\), where \(M= \mathcal {N}_{1}(\epsilon , \mathcal {F})\), such that

$$ \min_{j} \vert|f-\mu(f)-g_{j}+ \mu(g_{j})|\vert_{L_{1}(Q)} \leq 2\epsilon \textit{ for all } f \in \mathcal{F}, $$

where Q is any discrete probability measure. We also assume that g1,g2,…,gM satisfy Assumptions 3.2, 3.3 and 3.4. By \(f^{*}\) we mean the gj that achieves the minimum. Next, by definition of uniform covering numbers, we obtain

$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}_{\nu} \left( \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum\limits_{i=1}^{n} (f(X_{i}) - \mu(f))\right| \geq x \right) \\ &&\quad\leq \mathbb{P}_{\nu}\left( \sup_{f \in \mathcal{F}} \left[ \left|\frac{1}{n} \sum\limits_{i=1}^{n}\left( f(X_{i}) - \mu(f) -f^{*}(X_{i}) + \mu(f^{*})\right)\right| + \left|\frac{1}{n}\sum\limits_{i=1}^{n}\left( f^{*}(X_{i}) - \mu(f^{*})\right)\right| \right] \geq x\right)\\ &&\quad \leq \mathbb{P}_{\nu} \left( \max_{j \in \{1, \ldots, \mathcal{N}_{1}(\epsilon, \mathcal{F})\}} \left| \frac{1}{n} \sum\limits_{i=1}^{n} g_{j}(X_{i}) - \mu(g_{j})\right| \geq x - 2\epsilon\right)\\ &&\quad \leq \mathcal{N}_{1}\left( \epsilon, \mathcal{F}\right) \max_{j \in \{1, \ldots, \mathcal{N}_{1}(\epsilon, \mathcal{F})\}}\mathbb{P}_{\nu}\left( \frac{1}{n} \left|\sum\limits_{i=1}^{n} g_{j}(X_{i}) - \mu(g_{j})\right| \geq x-2\epsilon\right). \end{array} $$
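To make the covering step concrete, consider the toy class of half-line indicators \(\{x \mapsto \mathbb{I}\{x \leq t\}\,:\, t \in \mathbb{R}\}\) (an illustrative assumption, not the class studied in the paper): an L1(Q) 𝜖-net with roughly 1/𝜖 centers can be built by placing thresholds at every 𝜖-quantile of Q, as in the following sketch.

```python
import numpy as np

# A minimal sketch of an L1(Q) eps-net for the toy class of half-line
# indicators { x -> 1{x <= t} }, Q being the uniform measure on `points`.
# This class is an illustrative assumption; it stands in for the covering
# numbers N_1(eps, F) appearing in the proof.
def l1_net(points: np.ndarray, eps: float) -> np.ndarray:
    """Thresholds g_1 < ... < g_M such that any 1{x <= t} is within about
    eps of some 1{x <= g_j} in L1(Q)-distance."""
    srt = np.sort(points)
    m = max(int(np.ceil(eps * len(points))), 1)   # ~eps mass between centers
    return srt[m - 1 :: m]

pts = np.random.default_rng(2).normal(size=1000)
net = l1_net(pts, eps=0.05)
print(f"number of centers: {len(net)}")           # about 1/eps = 20 here
```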

We introduce the notation

$$ \overline{g}_{j} = g_{j} - \mu(g_{j}), j\in\{1, \ldots, M \}. $$

Hence, rather than considering any \(f \in \mathcal {F}\), we may work with the functions \(g_{j} \in \mathcal {F}\) only. Decomposition. Consider now the following decomposition:

$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}_{\nu}\left( \frac{1}{n} \left|\sum\limits_{i=1}^{n} \bar{g}_{j}(X_{i})\right| \geq x-2\epsilon\right) \leq \mathbb{P}_{\nu}\left( \frac{1}{n} \left|\sum\limits_{i=1}^{\tau_{A}} \bar{g}_{j}(X_{i})\right| \geq (x-2\epsilon)/3\right) \\ && + \mathbb{P}_{A}\left( \frac{1}{n}\left|\sum\limits_{i=1}^{l_{n}} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq (x-2\epsilon) /3 \right) + \mathbb{P}_{\nu}\left( \frac{1}{n}\left|\sum\limits_{i=1+\tau_{A}(l_{n})}^{n}\bar{g}_{j}(X_{i}) \right| \geq (x-2\epsilon)/3\right). \end{array} $$

We control each term on the right-hand side of the above inequality separately. Bounds for the first and the last non-regenerative blocks are easily obtained using Markov’s inequality:

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\nu}\left( \frac{1}{n} \left|\sum\limits_{i=1}^{\tau_{A}} \bar{g}_{j}(X_{i})\right| \geq \frac{x-2\epsilon}{3}\right) \leq \frac{3^{p}\mathbb{E}_{\nu}\left[\left|{\sum}_{i=1}^{\tau_{A}} \bar{g}_{j}(X_{i})\right|^{p} \right]}{ n^{p} (x-2\epsilon)^{p}}\leq \frac{3^{p}\mathbb{E}_{\nu}\left[\left( {\sum}_{i=1}^{\tau_{A}} \tilde{H}(X_{i})\right)^{p} \right]}{ n^{p} (x-2\epsilon)^{p}}. \end{array} $$

We deal in a similar fashion with the last non-regenerative block:

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\nu}\left( \left|\sum\limits_{i=1+\tau_{A}(l_{n})}^{n} \bar{g}_{j}(X_{i}) \right|\geq \frac{x-2\epsilon}{3}\right) &\leq& \mathbb{P}_{\nu}\left( \sum\limits_{i=1+\tau_{A}(l_{n})}^{n} \left|\bar{g}_{j}\right|(X_{i}) \geq \frac{x-2\epsilon}{3}\right)\\ &\leq& \mathbb{P}_{\nu}\left( \sum\limits_{i=1+\tau_{A}(l_{n})}^{\tau_{A}(l_{n}+1)} \left|\bar{g}_{j}\right|(X_{i}) \geq \frac{x-2\epsilon}{3}\right)\\ &\leq& \frac{3^{p}\mathbb{E}_{A}\left[ \left( \tilde{H}(\mathcal{B}_{1})\right)^{p}\right]}{ n^{p} (x-2\epsilon)^{p}}. \end{array} $$
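To fix ideas before handling the middle term, the three-way decomposition above can be made explicit in a few lines; the sketch below assumes a toy integer-valued trajectory with atom A = {0}, and g is an arbitrary illustrative function:

```python
import numpy as np

# Split sum_i g(X_i) into: the first (non-regenerative) segment up to tau_A,
# the complete regenerative blocks B_1, B_2, ..., and the last incomplete
# segment after the final regeneration. Atom A = {0} is a toy assumption.
def decompose(X: np.ndarray, g):
    visits = np.flatnonzero(X == 0)               # visit times to the atom
    first = g(X[: visits[0] + 1]).sum()           # X_1, ..., X_{tau_A}
    blocks = [g(X[s + 1 : e + 1]).sum()           # complete blocks
              for s, e in zip(visits[:-1], visits[1:])]
    last = g(X[visits[-1] + 1 :]).sum()           # after the last regeneration
    return first, blocks, last

X = np.array([1, 2, 1, 0, 3, 2, 1, 0, 1, 0, 4, 3])   # toy trajectory
first, blocks, last = decompose(X, g=lambda x: x.astype(float))
assert first + sum(blocks) + last == X.sum()          # the decomposition is exact
```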

The control of the term in the middle is more challenging. Note that

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{A}\left( \frac{1}{n}\left|\sum\limits_{i=1}^{l_{n}} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq (x-2\epsilon) /3 \right) & \leq& \mathbb{P}_{A} \left( \frac{1}{n}\left|\sum\limits_{i=1}^{\left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq (x-2\epsilon)/6 \right) \\ && +\mathbb{P}_{A}\left( \frac{1}{n}\left| \sum\limits_{i=l_{n_{1}}}^{l_{n_{2}}} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq (x-2\epsilon)/6 \right), \end{array} $$

where \(l_{n_{1}} = \min \nolimits (\lfloor {n/ \mathbb {E}_{A}[\tau _{A}]}\rfloor , l_{n})\) and \(l_{n_{2}}= \max \nolimits (\lfloor {n/\mathbb {E}_{A}[\tau _{A}]}\rfloor , l_{n})\).

Polynomial tail inequality for i.i.d. random variables. We may apply Theorem 6.1 in order to obtain

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{A} \left( \frac{1}{n}\left|{\sum}_{i=1}^{\left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq (x-2\epsilon)/6 \right) \leq \mathcal{N}_{1}\left( \epsilon, \mathcal{F}\right) \frac{6^{p} C_{p} \mathbb{E}_{A}\left[ |\tilde{H}(\mathcal{B}_{1})|^{p}\right] }{n^{p/2}(x-2\epsilon)^{p}}. \end{array} $$
(13)

Truncation. The control of \({\sum }_{i=l_{n_{1}}}^{l_{n_{2}}} \bar {g}_{j}(\mathcal{B}_{i})\) is slightly more challenging, due to the fact that ln is random and correlated with the blocks. Observe first that, since we expect the number of terms in this sum to be at most of order \(\sqrt {n}\), it should be much smaller than the leading term (13) and thus be asymptotically negligible. We have

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{A}\left( \left| {\sum}_{i=l_{n_{1}}}^{l_{n_{2}}} \bar{g}_{j}(\mathcal{B}_{i})\right| \!\geq\! \frac{x-2\epsilon}{6} \right) &\!\leq\!& \mathbb{P}_{A}\left( \left| {\sum}_{i=l_{n_{1}}}^{l_{n_{2}}} \bar{g}_{j}(\mathcal{B}_{i})\right| \!\geq\! \frac{x - 2\epsilon}{6}, \sqrt{n}\left[\frac{l_{n}}{n} - \frac{1}{\mathbb{E}_{A}[\tau_{A}]} \right] \!\leq\! N \right) \\ && + \mathbb{P}_{\nu}\left( \sqrt{n}\left[\frac{l_{n}}{n} - \frac{1}{\mathbb{E}_{A}[\tau_{A}]} \right]>N \right) = I + II. \end{array} $$
(14)

First, we bound term I in (14) using Montgomery-Smith’s inequality and the fact that if

$$\sqrt{n}\left[ \frac{l_{n}}{n}- \frac{1}{\mathbb{E}_{A}[\tau_{A}]} \right] \leq N, \textit{ then } l_{n_{2}} - l_{n_{1}} \leq \sqrt{n}N.$$

Note that it is sufficient to consider the case where \(\lfloor {n/\mathbb {E}_{A}[\tau _{A}]}\rfloor < l_{n}\) only. In what follows we rely on the following observation:

$$ l_{n} = \sup\left\{s: \sum\limits_{i=1}^{s} l(\mathcal{B}_{i}) \leq n\right\}. $$

Thus,

$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}_{A}\left( \left| \sum\limits_{i=l_{n_{1}}}^{l_{n_{2}}} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq \frac{x-2\epsilon}{6}, \sqrt{n}\left[\frac{l_{n}}{n} - \frac{1}{\mathbb{E}_{A}[\tau_{A}]} \right] \leq N \right) \\ \\&&\quad = \sum\limits_{k=1}^{N\sqrt{n}}\mathbb{P}_{A}\left( \left| \sum\limits_{i=\left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor}^{\left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor+k} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq \frac{x-2\epsilon}{6}, l_{n} = \left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor +k\right)\\ &&\quad= \sum\limits_{k=1}^{N\sqrt{n}}\mathbb{P}_{A}\left( \left| {\sum}_{i=1}^{k} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq \frac{x-2\epsilon}{6}, \sum\limits_{i=1}^{\left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor +k}l(\mathcal{B}_{i}) \leq n < {\sum}_{i=1}^{\left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor +k +1}l(\mathcal{B}_{i}) \right) \end{array} $$

and by exchangeability of the blocks we have

$$ \begin{array}{@{}rcl@{}} &&\sum\limits_{k=1}^{N\sqrt{n}}\mathbb{P}_{A}\left( \left| \sum\limits_{i=1}^{k} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq \frac{x-2\epsilon}{6}, \sum\limits_{i=1}^{\left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor +k}l(\mathcal{B}_{i}) \leq n < \sum\limits_{i=1}^{\left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor +k +1}l(\mathcal{B}_{i}) \right)\\ &&\quad= \sum\limits_{k=1}^{N\sqrt{n}}\mathbb{P}_{A}\left( \left|\sum\limits_{i=1}^{k} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq \frac{x-2\epsilon}{6}, l_{n}= \left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor + k \right) \\&&\quad =\sum\limits_{k=1}^{N\sqrt{n}}\mathbb{P}_{A}\left( \left|\sum\limits_{i=1}^{l_{n} - \left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq \frac{x-2\epsilon}{6}, l_{n} - \left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor = k \right) \\ &&\quad = \mathbb{P}_{A} \left( \left|\sum\limits_{i=1}^{l_{n} - \left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor} \bar{g}_{j}(\mathcal{B}_{i})\right|\geq \frac{x-2\epsilon}{6}, l_{n} - \left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor \leq N\sqrt{n}\right). \end{array} $$

Montgomery-Smith’s inequality. Now, we use Montgomery-Smith’s inequality to get

$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}_{A} \left( \left|\sum\limits_{i=1}^{l_{n} - \left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor} \bar{g}_{j}(\mathcal{B}_{i})\right|\geq \frac{x-2\epsilon}{6}, l_{n} - \left\lfloor{n/\mathbb{E}_{A}[\tau_{A}]}\right\rfloor \leq N\sqrt{n}\right) \\ &&\quad\leq \mathbb{P}_{A}\left( \max_{1 \leq k \leq N\sqrt{n}} \left| \sum\limits_{i=1}^{k} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq \frac{x-2\epsilon}{6} \right) \\&&\quad \leq 9 \mathbb{P}_{A}\left( \left| \sum\limits_{i=1}^{N\sqrt{n}} \bar{g}_{j}(\mathcal{B}_{i})\right| \geq \frac{x-2\epsilon}{180}\right) \\ &&\quad \leq \frac{18 \times 90^{p} C_{p} \mathbb{E}_{A} \left[(\tilde{H}(\mathcal{B}_{1}))^{p}\right] \times N^{p}}{(x-2\epsilon)^{p} n^{3p/4}}. \end{array} $$

Finally, term II is directly controlled by means of Lemma 6.1. □

1.3 Proof of Theorem 3.6

Before detailing the proof, we recall Massart’s finite class lemma (see Lemma 5.2, page 300 in [22]), which is involved in our argument.

Lemma 6.2

Let \(\mathcal {A}\) be some finite subset of \(\mathbb {R}^{n}\), let N denote the cardinality of \(\mathcal {A}\) and let \(R= \sup _{a \in \mathcal {A}}\left [{\sum }_{i=1}^{n} {a_{i}^{2}} \right ]^{1/2}\). Let 𝜖1,…,𝜖n be independent Rademacher variables. Then,

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\sup_{a \in \mathcal{A}} \sum\limits_{i=1}^{n} a_{i}\epsilon_{i}\right] \leq R \sqrt{2\log N}. \end{array} $$
(15)
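A quick numerical illustration of the lemma (the finite set \(\mathcal{A}\) below is randomly generated, an arbitrary illustrative choice):

```python
import numpy as np

# Numerical illustration of Massart's finite class lemma (Lemma 6.2):
# E[ sup_{a in A} sum_i a_i eps_i ] <= R sqrt(2 log N).
# The finite set A of N vectors in R^n is an arbitrary illustrative choice.
rng = np.random.default_rng(3)
n, N, n_mc = 50, 40, 20_000

A = rng.normal(size=(N, n))                     # the finite class
R = np.sqrt((A ** 2).sum(axis=1)).max()         # sup of Euclidean norms

eps = rng.choice([-1.0, 1.0], size=(n_mc, n))   # Rademacher draws
lhs = (eps @ A.T).max(axis=1).mean()            # Monte Carlo E[ sup ... ]
rhs = R * np.sqrt(2 * np.log(N))
print(f"E[sup] ~ {lhs:.2f} <= R sqrt(2 log N) = {rhs:.2f}")
```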

Montgomery-Smith’s inequality

In order to deal with the random character of the number of blocks ln − 1, we apply Montgomery-Smith’s inequality: for all t > 0,

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\nu}\left( \sup_{f\in\mathcal{F}}\left\vert \frac{1}{n}\sum\limits_{j=1}^{l_{n}-1}\bar{f}(\mathcal{B}_{j})\right\vert \geq t\right) &\leq&\mathbb{P}_{A}\left( \max_{k\leq n}\sup_{f\in\mathcal{F}}\left\vert \frac{1}{n}\sum\limits_{j=1}^{k}\bar{f}(\mathcal{B}_{j})\right\vert \geq t\right) \\ &\leq& 9 \mathbb{P}_{A}\left( \sup_{f\in\mathcal{F}}\left\vert \frac{1}{n} \sum\limits_{j=1}^{n}\bar{f}(\mathcal{B}_{j})\right\vert \geq\frac{t}{30}\right). \end{array} $$

Integrating over t > 0 then yields:

$$ \mathbb{E}_{A}\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum\limits_{j=1}^{l_{n}-1}\bar{f}(\mathcal{B}_{j})\right| \right] \leq 270 \mathbb{E}_{A}\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum\limits_{j=1}^{n}\bar{f}(\mathcal{B}_{j})\right| \right]. $$
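The factor 270 = 9 × 30 comes from the usual tail-integration identity \(\mathbb{E}[Z]={\int}_{0}^{\infty}\mathbb{P}(Z \geq t)dt\) for nonnegative Z; writing Z and W for the suprema on the left- and right-hand sides respectively (and stating things loosely, ignoring as above the distinction between \(\mathbb{P}_{\nu}\) and \(\mathbb{P}_{A}\)), one gets:

$$ \mathbb{E}_{A}[Z] = {\int}_{0}^{\infty} \mathbb{P}_{A}\left( Z \geq t\right) dt \leq 9 {\int}_{0}^{\infty} \mathbb{P}_{A}\left( W \geq t/30\right) dt = 9 \times 30 {\int}_{0}^{\infty} \mathbb{P}_{A}\left( W \geq u\right) du = 270\, \mathbb{E}_{A}[W]. $$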

Ghost sample of regeneration blocks and randomization

In the following, we consider \(\mathcal{B}^{\prime}=(\mathcal{B}_{1}^{\prime}, \ldots , \mathcal{B}_{n}^{\prime})\), an independent copy of \(\mathcal{B}=(\mathcal{B}_{1}, \ldots , \mathcal{B}_{n})\) (a ’ghost’ sample). Let (𝜖1,…,𝜖n) be independent Rademacher variables. Let

$$ \|l\|_{P_{\mathcal{B}}}= \frac{{\sum}_{i=1}^{n}l(\mathcal{B}_{i})}{n \mathbb{E}_{A}[\tau_{A}]}. $$

Note that, for any M > 0, we have

$$ \begin{array}{@{}rcl@{}} 270 \mathbb{E}_{A}\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}{\sum}_{i=1}^{n}f(\mathcal{B}_{i})-\mu(f(\mathcal{B}_{1}))\right| \right] &\leq& 540 \mathbb{E}_{\epsilon}\mathbb{E}_{\mathcal{B}}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n}{\sum}_{i=1}^{n}f(\mathcal{B}_{i})\epsilon_{i}\right|\right] \\ & \leq& 540 \mathbb{E}_{\epsilon}\mathbb{E}_{\mathcal{B}}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n}{\sum}_{i=1}^{n}f(\mathcal{B}_{i})\epsilon_{i}\right|\mathbb{I}\left\{\|l\|_{P_{\mathcal{B}}}\leq M\mathbb{E}_{A}[\tau_{A}]\right\}\right] \\& +& 540 \mathbb{E}_{\epsilon}\mathbb{E}_{\mathcal{B}}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n}{\sum}_{i=1}^{n}f(\mathcal{B}_{i})\epsilon_{i}\right|\mathbb{I}\left\{\|l\|_{P_{\mathcal{B}}}> M\mathbb{E}_{A}[\tau_{A}]\right\}\right] \\& =& I + II. \end{array} $$
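The first inequality above, whence the factor 540 = 2 × 270, is the classical symmetrization step; this is precisely where the ghost sample \(\mathcal{B}^{\prime}\) enters. By Jensen’s inequality and the fact that \(f(\mathcal{B}_{i})-f(\mathcal{B}_{i}^{\prime})\) is a symmetric random variable:

$$ \mathbb{E}_{\mathcal{B}}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{n}{\sum}_{i=1}^{n}\left( f(\mathcal{B}_{i})-\mu(f(\mathcal{B}_{1}))\right)\right|\right] \leq \mathbb{E}_{\mathcal{B}, \mathcal{B}^{\prime}}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{n}{\sum}_{i=1}^{n}\left( f(\mathcal{B}_{i})-f(\mathcal{B}_{i}^{\prime})\right)\right|\right] = \mathbb{E}_{\epsilon, \mathcal{B}, \mathcal{B}^{\prime}}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{n}{\sum}_{i=1}^{n}\epsilon_{i}\left( f(\mathcal{B}_{i})-f(\mathcal{B}_{i}^{\prime})\right)\right|\right] \leq 2 \mathbb{E}_{\epsilon, \mathcal{B}}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{n}{\sum}_{i=1}^{n}\epsilon_{i} f(\mathcal{B}_{i})\right|\right]. $$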

Uniform covering for \(\mathcal {F}\)

We consider a uniform 𝜖-covering g1,…,gW, where \(W= \mathcal {N}_{1}(\epsilon/(M\mathbb {E}_{A}[\tau _{A}]), \mathcal {F})\) and

$$ \min_{j} \vert|f-\mu(f)-g_{j}+ \mu(g_{j})|\vert_{L_{1}(Q)} \leq \epsilon \textit{ for all } f \in \mathcal{F} $$

and Q is any discrete probability measure. We also assume that g1,g2,…,gW belong to \(\mathcal {F}\) and satisfy Assumption 3.3. By \(f^{*}\) we mean the gj achieving the minimum. Then,

$$ \begin{array}{@{}rcl@{}} I &\leq& 540 \mathbb{E}_{\epsilon}\mathbb{E}_{\mathcal{B}}\left\{ \sup_{f \in \mathcal{F}}\left[\left|\frac{1}{n}{\sum}_{i=1}^{n} \left( f(\mathcal{B}_{i}) - \mu(f) - f^{*}(\mathcal{B}_{i}) + \mu(f^{*})\right)\epsilon_{i} \right| + \left|\frac{1}{n}{\sum}_{i=1}^{n} \left( f^{*}(\mathcal{B}_{i}) - \mu(f^{*})\right)\epsilon_{i} \right|\right]\right\}\\ &\leq& 540 \left[\epsilon + \mathbb{E}_{\epsilon}\mathbb{E}_{\mathcal{B}}\left[\mathcal{N}_{1}\left( \epsilon/(M\mathbb{E}_{A}[\tau_{A}]), \mathcal{F}\right) \max_{1\leq j\leq \mathcal{N}_{1}\left( \epsilon/(M\mathbb{E}_{A}[\tau_{A}]), \mathcal{F}\right)} \left| \frac{1}{n}{\sum}_{i=1}^{n}g_{j}(\mathcal{B}_{i})\epsilon_{i}\right|\mathbb{I}\left\{\|l\|_{P_{\mathcal{B}}}\leq M\mathbb{E}_{A}[\tau_{A}]\right\}\right]\right]. \end{array} $$
(16)

Massart’s finite class lemma

In what follows we use Massart’s finite class lemma (Lemma 6.2). We bound (16) by applying (15) directly:

$$ \begin{array}{@{}rcl@{}} (16) &\leq & 540 \mathbb{E}_{A} \left[\epsilon + \max_{1\leq j\leq \mathcal{N}_{1}\left( \epsilon/(M\mathbb{E}_{A}[\tau_{A}]), \mathcal{F}\right)}\left( \frac{1}{n} {\sum}_{i=1}^{n}g_{j}(\mathcal{B}_{i})^{2} \right)^{1/2} \times \sqrt{\frac{2\log \mathcal{N}_{1}\left( \epsilon/(M\mathbb{E}_{A}[\tau_{A}]), \mathcal{F}\right)}{n} } \right] \\ &\leq &540 \left[ \epsilon + \mathcal{N}_{1}\left( \epsilon/(M\mathbb{E}_{A}[\tau_{A}]), \mathcal{F}\right)\times \mathbb{E}_{A}[F(\mathcal{B}_{1})^{2}]^{1/2} \sqrt{\frac{2\log \mathcal{N}_{1}\left( \epsilon/(M\mathbb{E}_{A}[\tau_{A}]), \mathcal{F}\right)}{n} }\right]. \end{array} $$

We now derive an upper bound for II.

$$ \begin{array}{@{}rcl@{}} II &\leq &540 \mathbb{E}_{\epsilon}\mathbb{E}_{\mathcal{B}}\left[ \left( \frac{1}{n}{\sum}_{i=1}^{n}H(\mathcal{B}_{i})\right)^{2}\right]^{1/2} \left[\mathbb{P}\left( \|l\|_{P_{\mathcal{B}}}> M\mathbb{E}_{A}[\tau_{A}]\right)\right]^{1/2} \\&\leq& 540 \mathbb{E}_{A}[H(\mathcal{B}_{1})^{2}]^{1/2} \times \left[\mathbb{P}\left( \|l\|_{P_{\mathcal{B}}}> M\mathbb{E}_{A}[\tau_{A}]\right)\right]^{1/2}. \end{array} $$

Since we have

$$ \|l\|_{P_{\mathcal{B}}}= \frac{{\sum}_{i=1}^{n}l(\mathcal{B}_{i})}{n \mathbb{E}_{A}[\tau_{A}]}, $$

one may write

$$ \begin{array}{@{}rcl@{}} \left( \mathbb{P}\left[ \|l\|_{P_{\mathcal{B}}} -1 \geq M-1\right]\right)^{1/2}&=& \left[\mathbb{P}\left( \frac{\frac{1}{n}{\sum}_{i=1}^{n}l(\mathcal{B}_{i})-\mathbb{E}_{A}[\tau_{A}] }{\mathbb{E}_{A}[\tau_{A}]} \geq M-1\right) \right]^{1/2}\\ & \leq& \frac{\left[ \text{Var}\left( l(\mathcal{B}_{1})\right)\right]^{1/2}}{n^{1/2}\mathbb{E}_{A}[\tau_{A}](M-1)} \end{array} $$

by virtue of Chebyshev’s inequality, combined with the fact that \(\mathbb {E}_{A}[l(\mathcal{B}_{1})^{2}]<\infty \).

Cite this article

Clémençon, S., Bertail, P. & Ciołek, G. Statistical learning based on Markovian data: maximal deviation inequalities and learning rates. Ann Math Artif Intell 88, 735–757 (2020). https://doi.org/10.1007/s10472-019-09670-6
