Abstract
Monte Carlo integration with variance reduction by means of control variates can be implemented as the ordinary least squares estimator of the intercept in a multiple linear regression model with the integrand as response and the control variates as covariates. Even without special knowledge of the integrand, significant efficiency gains can be obtained if the control variate space is sufficiently large. Incorporating a large number of control variates in the ordinary least squares procedure may, however, result in (i) a certain instability of the ordinary least squares estimator and (ii) a possibly prohibitive computation time. Regularizing the ordinary least squares estimator by preselecting appropriate control variates via the Lasso turns out to increase the accuracy without additional computational cost. The findings in the numerical experiments are confirmed by concentration inequalities for the integration error.
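The estimator described in the abstract can be sketched in a few lines. The integrand \(\exp \) on [0, 1] and the centered polynomial control variates \(h_k(x) = x^k - 1/(k+1)\) below are hypothetical choices for illustration only, not the paper's experimental setup; the fitted OLS intercept is the control-variate estimate of the integral.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
truth = np.e - 1.0  # integral of exp over [0, 1]

# Hypothetical example: integrand exp(x), uniform sampling on [0, 1], and
# centered polynomial control variates h_k(x) = x^k - 1/(k+1) with P(h_k) = 0.
x = rng.uniform(size=n)
f = np.exp(x)
H = np.column_stack([x**k - 1.0 / (k + 1) for k in range(1, 9)])

# OLS of f on (1, h_1, ..., h_8): the fitted intercept estimates P(f).
X = np.column_stack([np.ones(n), H])
coef, *_ = np.linalg.lstsq(X, f, rcond=None)
ols_cv, vanilla = coef[0], f.mean()
print(abs(vanilla - truth), abs(ols_cv - truth))
```

Because the polynomial span approximates \(\exp \) very well here, the intercept estimate is orders of magnitude more accurate than the vanilla Monte Carlo average.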
Notes
The convex parabola \(x \mapsto a x^2 - b x - c\) has zeroes at \(x_{\pm } = (b \pm \sqrt{b^2 + 4ac}) / (2a)\), and \(x_-< 0< x_+ < \sqrt{b^2 + 4ac}/a\).
References
Asuncion, A., Newman, D.: UCI machine learning repository (2007). https://archive.ics.uci.edu/ml/index.php
Avramidis, A.N., Wilson, J.R.: A splitting scheme for control variates. Oper. Res. Lett. 14(4), 187–198 (1993)
Belloni, A., Chernozhukov, V.: Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547 (2013)
Belomestny, D., Iosipoi, L., Moulines, E., Naumov, A., Samsonov, S.: Variance reduction for Markov chains with application to MCMC. Stat. Comput. 30, 973–997 (2020)
Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)
Brooks, S.P., Catchpole, E.A., Morgan, B.J.T.: Bayesian animal survival estimation. Stat. Sci. 15(4), 357–376 (2000)
Brosse, N., Durmus, A., Meyn, S., Moulines, É., Radhakrishnan, A.: Diffusion approximations and control variates for MCMC. arXiv preprint arXiv:1808.01665 (2018)
Caflisch, R.E.: Monte Carlo and quasi-Monte Carlo methods. Acta Numer. 7, 1–49 (1998)
Davis, R.A., do Rêgo Sousa, T., Klüppelberg, C.: Indirect inference for time series using the empirical characteristic function and control variates. J. Time Ser. Anal. (2019)
Glasserman, P.: Monte Carlo Methods in Financial Engineering, vol. 53. Springer, Berlin (2013)
Glynn, P.W., Szechtman, R.: Some new perspectives on the method of control variates. In: Monte Carlo and Quasi-Monte Carlo Methods 2000, Springer, pp 27–49 (2002)
Gobet, E., Labart, C.: Solving BSDE with adaptive control variate. SIAM J. Numer. Anal. 48(1), 257–277 (2010)
Gorman, R.P., Sejnowski, T.J.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1(1), 75–89 (1988)
Gower, R., Le Roux, N., Bach, F.: Tracking the gradients using the hessian: A new look at variance reducing stochastic methods. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 707–715 (2018)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)
Hsu, D., Kakade, S.M., Zhang, T.: Random design analysis of ridge regression. In: Conference on Learning Theory, JMLR Workshop and Conference Proceedings, pp 9.1–9.24 (2012)
Hsu, D., Kakade, S.M., Zhang, T.: Random design analysis of ridge regression. Found. Comput. Math. 14(3), 569–600 (2014)
Javanmard, A., Montanari, A.: Debiasing the lasso: optimal sample size for Gaussian designs. Ann. Stat. 46(6A), 2593–2622 (2018)
Jie, T., Abbeel, P.: On a connection between importance sampling and the likelihood ratio policy gradient. In: Advances in Neural Information Processing Systems, pp 1000–1008 (2010)
Jin, C., Netrapalli, P., Ge, R., Kakade, S.M., Jordan, M.: A short note on concentration inequalities for random vectors with subGaussian norm. arXiv preprint arXiv:1902.03736 (2019)
Lebreton, J.D., Burnham, K.P., Clobert, J., Anderson, D.R.: Modeling survival and testing biological hypotheses using marked animals: a unified approach with case studies. Ecol. Monogr. 62(1), 67–118 (1992)
Liu, H., Feng, Y., Mao, Y., Zhou, D., Peng, J., Liu, Q.: Action-dependent control variates for policy optimization via stein identity. In: ICLR 2018 Conference (2018)
Marzolin, G.: Polygynie du Cincle plongeur (Cinclus cinclus) dans les côtes de Lorraine. Oiseau et la Revue Francaise d’Ornithologie 58(4), 277–286 (1988)
Newey, W.K.: Convergence rates and asymptotic normality for series estimators. J. Econom. 79(1), 147–168 (1997)
Nott, D.J., Drovandi, C.C., Mengersen, K., Evans, M., et al.: Approximation of Bayesian predictive \(p\)-values with regression ABC. Bayesian Anal. 13(1), 59–83 (2018)
Oates, C.J., Girolami, M., Chopin, N.: Control functionals for Monte Carlo integration. J. R. Stat. Soc. Ser. B (Statistical Methodology) 79(3), 695–718 (2017)
Owen, A., Zhou, Y.: Safe and effective importance sampling. J. Am. Stat. Assoc. 95(449), 135–143 (2000)
Owen, A.B.: Monte Carlo Theory, Methods and Examples (2013)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Portier, F., Delyon, B.: Asymptotic optimality of adaptive importance sampling. Adv. Neural Inf. Process. Syst. 31, 3134–3144 (2018)
Portier, F., Segers, J.: Monte Carlo integration with a growing number of control variates. J. Appl. Probab. 56, 1168–1186 (2019)
Ranganath, R., Gerrish, S., Blei, D.: Black box variational inference. In: Artificial Intelligence and Statistics, pp 814–822 (2014)
Rudin, W.: Real and Complex Analysis. Tata McGraw-Hill Education, New York (2006)
South, L.F., Oates, C.J., Mira, A., Drovandi, C.: Regularised zero-variance control variates for high-dimensional variance reduction. arXiv preprint arXiv:1811.05073 (2018)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58(1), 267–288 (1996)
Tibshirani, R., Wainwright, M., Hastie, T.: Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, Boca Raton (2015)
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015). arXiv:1501.01571
van de Geer, S.A., Bühlmann, P.: On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392 (2009)
Wang, C., Chen, X., Smola, A.J., Xing, E.P.: Variance reduction for stochastic gradient optimization. In: Advances in Neural Information Processing Systems, pp 181–189 (2013)
Zhang, A., Brown, L.D., Cai, T.T.: Semi-supervised inference: general theory and estimation of means. Ann. Stat. 47(5), 2538–2566 (2019)
Acknowledgements
The authors are grateful to the Associate Editor and two anonymous Reviewers for their many valuable comments and interesting suggestions. Further, the authors gratefully acknowledge the exceptionally careful proofreading done by Aigerim Zhuman (UCLouvain).
Appendices
Auxiliary results
Lemma 1
(Sub-Gaussian) Let \(X_1,\ldots ,X_n \) be independent and identically distributed random variables in \(({\mathscr {X}}, {\mathscr {A}})\) with distribution P. Let \(\varphi _1,\ldots ,\varphi _p \) be real-valued functions on \({\mathscr {X}}\) such that \(P(\varphi _k) = 0\) and \(\varphi _k \in {\mathscr {G}}(\tau ^2)\) for all \(k=1,\ldots ,p\). Then for all \(\delta > 0\), we have with probability at least \(1-\delta \),
Proof
For each \(k=1,\ldots ,p\), the centered random variable \(\sum _{i=1}^n \varphi _k(X_i)\) is sub-Gaussian with variance factor \(n \tau ^2\). By the union bound and by Chernoff’s inequality, we have, for each \(t > 0\),
Set \(t = \sqrt{2n\tau ^2\log (2p/\delta )}\) to find the result. \(\square \)
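As a sanity check of Lemma 1, one can simulate a concrete sub-Gaussian family (our choice: independent random signs, which lie in \({\mathscr {G}}(1)\) by Hoeffding's lemma, so \(\tau ^2 = 1\)) and verify that the maximal inequality holds in at least a \(1-\delta \) fraction of replications:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, delta = 200, 50, 0.05
tau2 = 1.0  # variables taking values in [-1, 1] have variance factor 1 (Hoeffding's lemma)
bound = np.sqrt(2 * n * tau2 * np.log(2 * p / delta))

reps = 2000
hits = 0
for _ in range(reps):
    # p centered sums of n independent random signs
    S = rng.choice([-1.0, 1.0], size=(n, p)).sum(axis=0)
    hits += int(np.max(np.abs(S)) <= bound)
coverage = hits / reps
print(coverage)
```

The empirical coverage is well above \(1-\delta \), reflecting the slack in the union bound.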
Lemma 2
(Smallest eigenvalue lower bound) Let \(X_1,\ldots ,X_n\) be independent and identically distributed random variables in \(({\mathscr {X}}, {\mathscr {A}})\) with distribution P. Let \(g = (g_1, \ldots , g_p)^T\) in \(L_2(P)^p\) be such that the \(p \times p\) Gram matrix \(G = P(g g^T)\) satisfies \(\lambda _{\min }(G) > 0\). Define the transformation \({{\tilde{g}}} = G^{-1/2} g\) and put \(B_{{{\tilde{g}}}} := \sup _{x \in {\mathscr {X}}} \left\Vert {\textstyle {{\tilde{g}}}(x)}\right\Vert _2^2\). For \(\delta \in (0, 1)\), the empirical Gram matrix \({\hat{G}}_n = P_n(g g^T)\) satisfies, with probability at least \(1-\delta \),
Proof
Suppose that the result is true in the special case that G is the identity matrix. In case of a general Gram matrix G, we could then apply the result for the special case to the vector of functions \({{\tilde{g}}} = G^{-1/2} g\), whose Gram matrix is the identity matrix. We would get that \(\lambda _{\min }( P_n({{\tilde{g}}} {{\tilde{g}}}^T)) > 1-\eta \) with probability at least \(1-\delta \). Since \(P_n({{\tilde{g}}} {{\tilde{g}}}^T) = G^{-1/2} {\hat{G}}_n G^{-1/2}\) and since \(u^T G^{-1} u \leqslant 1 / \lambda _{\min }(G)\) for every unit vector \(u \in {\mathbb {R}}^p\), we would have
It would then follow that
as required. Hence we only need to show the result for \(G = I\), in which case \({{\tilde{g}}} = g\).
We apply the matrix Chernoff inequality in (Tropp 2015, Theorem 5.1.1) to the random matrices \(n^{-1} g(X_i)g(X_i)^T\). These matrices are independent and symmetric with dimension \(p \times p\). Their minimum and maximum eigenvalues are between 0 and \(L = B_g/n\), with \(B_g = \sup _{x \in {\mathscr {X}}} \lambda _{\max } (g(x) g(x)^T) = \sup _{x \in {\mathscr {X}}} \left\Vert {\textstyle g(x)}\right\Vert ^2_2\). Their sum is equal to \(P_n(g g^T) = {\hat{G}}_n\), whose expectation is \(G = I\) by assumption. In the notation of the cited theorem, we have \(\mu _{\min } = \lambda _{\min }(G) = 1\), and thus, by Eq. (5.1.5) in that theorem, we have, for \(\eta \in [0, 1)\),
The term in square brackets is bounded above by \(\exp (-\eta ^2/2)\). Indeed, we have, for \(\eta \in [0, 1)\),
and
It follows that
Solving \(p \exp \left( - \frac{\eta ^2 n}{2 B_g} \right) = \delta \) in \(\eta \), we find that, with probability at least \(1-\delta \),
\(\square \)
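Lemma 2 can be illustrated numerically. The sketch below uses the orthonormal cosine system \(g_k(x) = \sqrt{2} \cos (k \pi x)\) on [0, 1] (a hypothetical choice with \(G = I\), so that \({\tilde{g}} = g\)) and checks on one draw the eigenvalue bound \(\lambda _{\min }({\hat{G}}_n) \geqslant 1 - \eta \) with \(\eta \) solving \(p \exp (-\eta ^2 n / (2 B_g)) = \delta \), as in the last step of the proof:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, delta = 5000, 5, 0.05

# Orthonormal system on [0, 1]: g_k(x) = sqrt(2) cos(k pi x), so G = P(g g^T) = I.
x = rng.uniform(size=n)
g = np.column_stack([np.sqrt(2) * np.cos(k * np.pi * x) for k in range(1, p + 1)])
G_hat = g.T @ g / n  # empirical Gram matrix

B_g = 2 * p  # sup_x ||g(x)||_2^2 <= 2p for this system
eta = np.sqrt(2 * B_g * np.log(p / delta) / n)  # solves p exp(-eta^2 n / (2 B_g)) = delta
lam_min = np.linalg.eigvalsh(G_hat).min()
print(lam_min, 1 - eta)
```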
Lemma 3
(Upper bound of moments) Let X be a random variable such that \({{\mathbb {E}}}(|X|^{2p}) \leqslant 2^{p+1} p!\) for every integer \(p \geqslant 1\). Then
in which it is implicitly understood that the series on the left-hand side converges.
Proof
Let \(\lambda \in {\mathbb {R}}\). We split the series in terms with even and odd indices k, leading to
We will bound the series on the odd indices in terms of the series on the even indices.
Since the geometric mean of two nonnegative numbers is bounded by their arithmetic mean, we have, for all \(x \geqslant 0\) and all \(a > 0\),
Applying the previous inequality to \(x = \lambda X\) and scalars \(a_p > 0\) to be chosen later,
Here, \(\mathbb {1}\) denotes an indicator function. We obtain
Define \(b_p = a_p (2p+1)\) and use the hypothesis on \({{\mathbb {E}}}(|X|^{2p})\) to see that, for any constants \(b_p > 0\),
The objective is to find a constant \(c > 0\) as small as possible and such that the right-hand side is bounded by \(\exp (c \lambda ^2) = 1 + \sum _{p=1}^\infty c^p \lambda ^{2p} / p!\). Comparing coefficients, this means that we need to determine scalars \(b_p > 0\) and \(c > 0\) in such a way that for all \(p = 1, 2,\ldots \)
or, equivalently,
The case \(p = 1\) gives
showing that, with this proof technique, we will always find \(c > 2\). Setting \(b_p \equiv b > 0\) for all integer \(p \geqslant 1\) and \(c = 2 + 1/b\), inequality (18) is automatically satisfied, so it remains to find \(b > 0\) such that, for all \(p = 2, 3, \ldots \),
The left-hand side is decreasing in p whereas the right-hand side is increasing in p. It is thus sufficient to have the inequality satisfied for \(p = 2\), i.e.,
Equating both sides leads to a nonlinear equation in b that can be solved numerically, giving the root \(b \approx 4.006156\). With \(b = 4\), inequality (19) is satisfied, as can be checked directly (\(91/72 \leqslant 81/64\)). We conclude that \(c = 2 + 1/4 = 9/4\) is a valid choice. \(\square \)
Note that the series in (17) starts at \(k = 2\). If also \({{\mathbb {E}}}(X) = 0\), the left-hand side in (17) is an upper bound for \({{\mathbb {E}}}(\exp (\lambda X))\), and we obtain the following corollary.
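The proof's numerical claims admit a mechanical check. The sketch below verifies the inequality \(91/72 \leqslant 81/64\) used with \(b = 4\), and also checks, for a concrete instance of our own choosing (not used in the paper), that the standard Gaussian satisfies the moment hypothesis of Lemma 3, since \({{\mathbb {E}}}(|X|^{2p}) = (2p-1)!! \leqslant 2^{p+1} p!\):

```python
from math import factorial

# The standard Gaussian has E|X|^{2p} = (2p-1)!! = C(2p,p) p! / 2^p <= 2^{p+1} p!,
# so it satisfies the hypothesis of Lemma 3.
def gaussian_abs_moment_2p(p):
    out = 1
    for j in range(1, 2 * p, 2):  # (2p-1)!! = 1 * 3 * ... * (2p-1)
        out *= j
    return out

moment_ok = all(
    gaussian_abs_moment_2p(p) <= 2 ** (p + 1) * factorial(p) for p in range(1, 30)
)

# The arithmetic claim used to justify b = 4 in the proof: 91/72 <= 81/64.
claim_ok = 91 * 64 <= 81 * 72
print(moment_ok, claim_ok)
```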
Corollary 1
Let Z be a centered random variable such that
Then \(\log {{\mathbb {E}}}(\exp (\lambda Z)) \leqslant 9 \lambda ^2/4\) for all \(\lambda \in {\mathbb {R}}\), i.e., \(Z \in {\mathscr {G}}(9/2)\).
Lemma 4
Let (X, Y) be a pair of uncorrelated random variables. If \(X \in {\mathscr {G}}(\nu )\) and \(|Y| \leqslant \kappa \) for some \(\nu > 0\) and \(\kappa > 0\), then \(XY \in {\mathscr {G}}((9/2)\kappa ^2 \nu )\).
Proof
The random variable \(X/\sqrt{\nu }\) is sub-Gaussian with variance factor 1. As on page 25 in Boucheron et al. (2013), this implies that \({{\mathbb {P}}}(|X/\sqrt{\nu }| > t) \leqslant 2 \exp (-t^2/2)\) for all \(t \geqslant 0\) and thus \({{\mathbb {E}}}[|X/\sqrt{\nu }|^{2p}] \leqslant 2^{p+1} p!\) for all integer \(p \geqslant 1\) (see (Boucheron et al. 2013, Theorem 2.1)).
Let \(Z = XY / (\sqrt{\nu } \kappa )\). Since X is centered and X and Y are uncorrelated, XY is centered too, and therefore so is Z. From the previous paragraph, we have \({{\mathbb {E}}}(|Z|^{2p}) \leqslant {{\mathbb {E}}}(|X/\sqrt{\nu }|^{2p}) \leqslant 2^{p+1} p!\) for all integer \(p \geqslant 1\). Corollary 1 gives for all \(\lambda \in {\mathbb {R}}\) that \(\log {{\mathbb {E}}}(\exp (\lambda Z)) \leqslant 9\lambda ^2/4\), from which
\(\square \)
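A concrete instance of Lemma 4 (our choice: \(X \sim N(0,1) \in {\mathscr {G}}(1)\) and Y uniform on \([-1, 1]\), independent of X, so that \(\kappa = \nu = 1\)) can be checked by integrating out X exactly and averaging over Y:

```python
import numpy as np

rng = np.random.default_rng(5)
# X ~ N(0,1) is in G(1); Y ~ Uniform(-1,1), independent of X, is bounded by kappa = 1.
# Lemma 4 then asserts XY is in G(9/2), i.e. log E exp(lam * X * Y) <= (9/4) lam^2.
y = rng.uniform(-1.0, 1.0, size=100_000)

def log_mgf(lam):
    # Integrate out X exactly: E[exp(lam * X * Y) | Y] = exp(lam^2 Y^2 / 2).
    return np.log(np.mean(np.exp(lam**2 * y**2 / 2.0)))

vals = {lam: log_mgf(lam) for lam in (0.5, 1.0, 2.0, 4.0)}
print(vals)
```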
Lemma 5
(Upper bound for norm-subGaussian random vector) Let X be a d-dimensional random vector with zero mean and such that \({{\mathbb {P}}}(\left\Vert {\textstyle X}\right\Vert _2 \geqslant t) \leqslant 2 \exp \left( -t^2/(2\sigma ^2) \right) \) for all \(t \geqslant 0\). Then the random matrix Y defined by
satisfies \({{\mathbb {E}}}(\exp (\theta Y)) \preceq \exp (c\theta ^2\sigma ^2) I\) for any \(\theta \in {\mathbb {R}}\), with \(c = 9/4\), where I denotes the identity matrix.
Proof
The non-zero eigenvalues of Y are \(\left\Vert {\textstyle X}\right\Vert \) and \(-\left\Vert {\textstyle X}\right\Vert \). The non-zero eigenvalues of \(Y^k\) are thus \(\left\Vert {\textstyle X}\right\Vert ^k\) and \((-\left\Vert {\textstyle X}\right\Vert )^k\) for integer \(k \geqslant 1\). It follows that \(Y^k \preceq \left\Vert {\textstyle X}\right\Vert ^k I\) for all integer \(k \geqslant 1\), and therefore also \({{\mathbb {E}}}(Y^k) \preceq {{\mathbb {E}}}(\left\Vert {\textstyle X}\right\Vert ^k) I\) for all integer \(k \geqslant 1\). Furthermore, the operator norm of \(Y^k\) is bounded by \(\left\Vert {\textstyle Y^k}\right\Vert \leqslant \left\Vert {\textstyle X}\right\Vert ^k\).
Since \({{\mathbb {E}}}(Y) = 0\), we get, for any \(\theta \in {\mathbb {R}}\),
where \(\xi = \left\Vert {\textstyle X}\right\Vert /\sigma \). The first series converges in operator norm since \(\left\Vert {\textstyle {{\mathbb {E}}}(Y^k) }\right\Vert \leqslant {{\mathbb {E}}}(\left\Vert {\textstyle Y^k}\right\Vert ) \leqslant {{\mathbb {E}}}(\left\Vert {\textstyle X}\right\Vert ^k)\).
By assumption, \({{\mathbb {P}}}(\xi > t) = {{\mathbb {P}}}(\left\Vert {\textstyle X}\right\Vert \geqslant \sigma t) \leqslant 2e^{-t^2/2}\) for all \(t \geqslant 0\) and thus \({{\mathbb {E}}}(|\xi |^{2p}) \leqslant 2^{p+1} p!\) for all integer \(p \geqslant 1\). But then we can apply Lemma 3 with \(\lambda = \theta \sigma \) and \(X = \xi \), completing the proof. \(\square \)
The following result is a special case of Jin et al. (2019, Corollary 7). Our contribution is to make the constant c in the cited result explicit. In passing, we correct an inaccuracy in the proof of Jin et al. (2019, Lemma 4), in which it was incorrectly claimed that the odd moments of a certain random matrix Y as in our Lemma 5 are all zero.
Lemma 6
(Hoeffding inequality for norm-subGaussian random vectors) Let the d-dimensional random vectors \(Z_1,\ldots ,Z_n\) be independent, have mean zero, and satisfy
for some \(\sigma > 0\). Then for any \(\delta > 0\), with probability at least \(1-\delta \), we have
Proof
Given Corollary 7 in Jin et al. (2019), the only thing to prove is that their constant can be set equal to 3. Their Corollary 7 follows from their Lemma 6 in which it is shown that when the matrix Y defined in (20) satisfies
then we have for any \(\theta >0\), with probability at least \((1-\delta )\),
Taking \(\theta = \sqrt{\log (2d/\delta )/(cn\sigma ^2)}\) yields
and we conclude with Lemma 5 (\(c = 9/4\), so that \(2\sqrt{c} = 3\)). \(\square \)
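Lemma 6 with the explicit constant 3 can be illustrated on a simple norm-subGaussian family of our own choosing: unit vectors uniform on the sphere satisfy the tail condition with \(\sigma ^2 = 1/(2\log 2)\), since \(2\exp (-t^2/(2\sigma ^2)) \geqslant 1\) for \(t \leqslant 1\) and the norm never exceeds 1. A simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, delta = 100, 3, 0.05
sigma2 = 1.0 / (2 * np.log(2.0))  # ||Z_i|| = 1 a.s. satisfies the tail bound with this sigma^2
bound = 3 * np.sqrt(n * sigma2 * np.log(2 * d / delta))  # Lemma 6 with c = 3

reps = 2000
hits = 0
for _ in range(reps):
    Z = rng.normal(size=(n, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)  # uniform on the unit sphere, mean zero
    hits += int(np.linalg.norm(Z.sum(axis=0)) <= bound)
coverage = hits / reps
print(coverage)
```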
Proof of Theorem 1
The proof is organized as follows. We first provide an upper bound on the error (Step 1). This bound involves the norm of the error made on the rescaled coefficients and is controlled in Step 2. Then (Step 3), we construct an event that has probability at least \(1-\delta \) on which we can control the terms that appear in the upper bound of Step 2. Collecting all the inequalities, we will arrive at the stated bound (Step 4).
Step 1. — Since \(f = P(f) + \beta ^\star (f)^T h + \epsilon \), the oracle estimate of P(f), which uses the unknown, optimal coefficient vector \(\beta ^\star (f)\), is
The difference between the OLS and oracle estimates is
Let \(G = P(h h^\top )\) be the \(m \times m\) Gram matrix. By assumption, G is positive definite. Write
The estimation error of the OLS estimator can thus be decomposed as
By the triangle and Cauchy–Schwarz inequalities,
Step 2. — We will show that, if \(\lambda _{\min }(P_n(\hbar \hbar ^\top )) > \left\Vert {\textstyle P_n(\hbar )}\right\Vert _2^2\), then
and thus, by (22),
Step 2.1. — Consider the column-centered \(n \times m\) design matrices
Since \({\bar{H}}^\top \mathbb {1}_n = n P_n(\hbar )\), we have
As a consequence, for \(u \in {\mathbb {R}}^m\),
by the Cauchy–Schwarz inequality. In particular, \(u^\top {\bar{H}}_c^\top {\bar{H}}_c u\) is non-zero for non-zero \(u \in {\mathbb {R}}^m\), so that \({\bar{H}}_c^\top {\bar{H}}_c\) is invertible, and so is the matrix
Also, the smallest eigenvalue of \({\bar{H}}_c^\top {\bar{H}}_c\) is bounded from below by
The largest eigenvalue of the inverse matrix \(({\bar{H}}_c^\top {\bar{H}}_c)^{-1}\) is then bounded from above by
Step 2.2. — Write \(\epsilon _c^{(n)} = (\epsilon (X_i) - P_n(\epsilon ))_{i=1}^n\) for the centered vector of error terms. Recall \(f_c^{(n)} = (f(X_i) - P_n(f))_{i=1}^n\), the centered vector of samples from the integrand. As \(f = P(f) + h^\top \beta ^\star (f) + \epsilon \), we have
From the characterization (4) of the OLS estimate of the coefficient vector and since \(H_c^\top H_c\) is invertible,
We obtain
Step 2.3. — We combine the results from Steps 2.1 and 2.2. From the upper bound (25) and the identity (26), we obtain
Finally, as \({\bar{H}}_c = (\hbar _j(X_i) - P_n(\hbar _j))_{i,j}\), we find
Equation (23) follows.
Step 3. — In view of (24), we need to ensure that \(\left|{\textstyle P_n(\epsilon )}\right|\), \(\left\Vert {\textstyle P_n(\hbar )}\right\Vert \) and \(\left\Vert {\textstyle P_n(\hbar \epsilon )}\right\Vert _2\) are small and that \(\lambda _{\min }(P_n(\hbar \hbar ^\top ))\) is large. Let \(\delta >0\). We construct an event with probability at least \(1-\delta \) on which four inequalities hold simultaneously. Recall \(B = \sup _{x \in {\mathscr {X}}} \left\Vert {\textstyle \hbar (x)}\right\Vert _2^2\), defined in (7).
Step 3.1. — Because \(\epsilon \in {\mathscr {G}}(\tau ^2)\), Chernoff’s inequality (or Lemma 1 with \(p = 1\)) implies that with probability at least \(1 - \delta /4\),
Step 3.2. — For the term \(\left\Vert {\textstyle \sum _{i=1}^n \hbar (X_i)}\right\Vert _2\), we apply the vector Bernstein bound in (Hsu et al. 2014, Lemma 9). On the one hand \(\sup _{x \in {\mathscr {X}}} \left\Vert {\textstyle \hbar (x)}\right\Vert _2 \leqslant \sqrt{B}\) and on the other hand
The cited vector Bernstein bound gives
Setting \(t = \log (4/\delta )\), we find that, with probability at least \(1-\delta /4\), we have
Since \(\log (4/\delta ) \geqslant \log (4)\), we have
and thus
The condition on n easily implies that
and thus
Step 3.3. — To control \(\left\Vert {\textstyle \sum _{i=1}^n \hbar (X_i) \epsilon (X_i)}\right\Vert _2\), we apply Lemma 6 with \(Z_i = \hbar (X_i) \epsilon (X_i)\). The random vectors \(\hbar (X_i) \epsilon (X_i)\) for \(i = 1, \ldots , n\) are independent and identically distributed and have mean zero. Since \(\left\Vert {\textstyle \hbar (X_i)}\right\Vert _2 \leqslant \sqrt{B}\) by (7) and since \(\epsilon \in {\mathscr {G}}(\tau ^2)\) by Assumption 1, we have, for all \(t > 0\),
and (21) holds with \(\sigma ^2 = B \tau ^2\). Lemma 6, with constant \(c = 3\), then implies that, with probability at least \(1-\delta /4\),
Step 3.4. — Recall the \(n \times m\) matrix \(H = (h_j(X_i))_{i,j}\) and put
The empirical Gram matrix of the vector \(\hbar = (\hbar _1, \ldots , \hbar _m)^\top \in L_2(P)^m\) based on the sample \(X_1, \ldots , X_n\) is
We apply Lemma 2 with \(g = {\tilde{g}} = \hbar \), \(p = m\), and \(\delta \) replaced by \(\delta /4\). We find that, with probability at least \(1-\delta /4\),
Since \(P_n(\hbar \hbar ^\top ) = n^{-1} {\bar{H}}^\top {\bar{H}}\), it follows that
as the assumption on n implies that \(2 B n^{-1} \log (4m/\delta ) \leqslant 1/9\).
By the union bound, the inequalities (27), (28), (29), and (30) hold simultaneously on an event with probability at least \(1-\delta \). For the remainder of the proof, we work on this event, denoted by E.
Step 4. — We combine the bound (24) on the estimation error with the bounds valid on the event E constructed in Step 3. By (31), we have
since the assumption on n implies that \(25 m n^{-1} \log (4/\delta ) \leqslant 1/3\). As \(B \geqslant m \geqslant 1\), we have
since, by assumption, \(n \geqslant 75m \log (4/\delta )\) which implies that \(\sqrt{n^{-1} \log (4/\delta )} \leqslant 1/(5\sqrt{3})\). We find
and the value \(c=3\) gives \(15 (c+\sqrt{2/3}) \approx 57.2 <58\) which is the bound stated in Theorem 1. \(\square \)
Proof of Theorem 2
For a vector \(\beta \in {\mathbb {R}}^m\) and for a non-empty set \(S \subset \{1, \ldots , m\}\), write \(\beta _S = (\beta _k)_{k \in S}\). For any matrix \(A \in {\mathbb {R}}^{n\times m}\) and \(k\in \{1,\ldots , m\}\), let \(A_k \) denote its k-th column and if \(S = \{k_1,\ldots ,k_\ell \} \subset \{1,\ldots ,m\}\) with \(k_1< \ldots < k_\ell \), write \(A_S = (A_{k_1},\ldots ,A_{ k_\ell }) \in {\mathbb {R}}^{n \times \ell }\).
The proof is organized in a similar way as the one of Theorem 1. We first provide an initial upper bound on the error (Step 1). Then we construct an event that (Step 2) has probability at least \(1-\delta \) and (Steps 3, 4, 5) on which we can control each of the terms of the previous upper bound. The combination of all steps to deduce the final statement is made clear in Step 6.
Step 1. — As in the proof of Theorem 1, with \({\hat{\beta }}_{n}^{\mathrm {ols}}(f)\) replaced by \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\), the estimation error of the LASSO estimator can be decomposed as
Writing \({\hat{u}} = {\hat{\beta }}_{n}^{\mathrm {lasso}}(f) - \beta ^\star (f)\), we get, by the triangle and Hölder inequalities,
Step 2. — Let \(\delta > 0\). We construct an event, E, with probability at least \(1-\delta \) on which four inequalities, namely (33), (34), (35) and (36), hold simultaneously.
Since \(\epsilon \in {\mathscr {G}}( \tau ^2)\), we can apply Lemma 1 with \(p=1\) to get that, with probability at least \(1 - \delta /4\),
$$\begin{aligned} \left| \sum _{i=1}^n \epsilon (X_i) \right| \leqslant \sqrt{2n \tau ^2\log (8/\delta )}. \end{aligned}$$(33)
In view of (Boucheron et al. 2013, Lemma 2.2) and Assumption 2, we have \(h_k\in {\mathscr {G}}(U_h^2)\) for all \(k=1,\ldots , m\). Hence we can apply Lemma 1 with \(p=m\) to get that, with probability at least \(1-\delta /4\),
$$\begin{aligned} \max _{k=1,\ldots , m} \left| \sum _{i=1}^n h_k(X_i) \right| \leqslant \sqrt{2n U_h^2\log (8m/\delta )}. \end{aligned}$$(34)
By virtue of Assumptions 1 and 2, we can apply Lemma 4 to find \(h_k \epsilon \in {\mathscr {G}}(C \tau ^2 U_h^2)\) with \(C = 9/2\). Hence we can apply Lemma 1 to get that, with probability at least \(1-\delta /4\),
$$\begin{aligned} \max _{k=1,\ldots , m} \left| \sum _{i=1}^n h_k(X_i) \epsilon (X_i) \right| \leqslant \sqrt{2nC\tau ^2 U_h^2\log (8m/\delta )}. \end{aligned}$$(35)
In view of (Boucheron et al. 2013, Lemma 2.2) and Assumptions 2 and 6, we have \(h_{k} h_{l} - P (h_k h_l) \in {\mathscr {G}}(U_h^4)\) for all \(k,l \in \{1,\ldots , m \}\). Hence we can apply Lemma 1 with \(p=m^2\) to get that, with probability at least \(1-\delta /4\),
$$\begin{aligned} \max _{\begin{array}{c} 1\leqslant k \leqslant m \\ 1\leqslant l \leqslant m \end{array}} \left| \sum _{i=1}^n \{ h_k(X_i)h_l(X_i) - P (h_k h_l)\} \right| \leqslant \sqrt{2n U_h^4\log (8m^2/\delta )}. \end{aligned}$$Write \(\varDelta = (P_n- P) \{h h^T\} \). Because by assumption \(2 (\ell ^\star / \gamma ^\star ) \sqrt{2 U_h^4\log (8m^2/\delta )} \leqslant \sqrt{n}\), we have
$$\begin{aligned} (\ell ^\star / \gamma ^\star ) \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\leqslant 1 / 2. \end{aligned}$$Remark that
$$\begin{aligned} \forall u\in {\mathbb {R}}^{m}, \quad n^{-1} \left\Vert {\textstyle Hu}\right\Vert _2^2 - u^TG u&= u^T \varDelta u. \end{aligned}$$Then, following (Bickel et al. 2009, equation (3.3)), use the inequality \( |u^T \varDelta u|\leqslant \left\Vert {\textstyle u}\right\Vert _1^2 \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\), to obtain that, with probability \(1-\delta /4\), for all \(u\in {\mathscr {C}} (S^\star ; 3)\),
$$\begin{aligned} \left\Vert {\textstyle Hu}\right\Vert _2^2 / n&\geqslant u^TG u - \left\Vert {\textstyle u}\right\Vert _1^2 \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\\&\geqslant u^TG u - \left\Vert {\textstyle u}\right\Vert _2^2 \ell ^\star \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\\&\geqslant u^TG u - (u^TG u) (\ell ^\star / \gamma ^\star ) \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\\&\geqslant (u^TG u ) / 2. \end{aligned}$$It follows that with probability at least \(1-\delta /4\),
$$\begin{aligned} \left\Vert {\textstyle H u}\right\Vert _2^2 \geqslant ( n\gamma ^\star /2)\left\Vert {\textstyle u}\right\Vert _2^2 . \end{aligned}$$(36)
Step 3. — We claim that, on the event E, we have
We have
and thus,
We treat both terms on the right-hand side. On the one hand, we just have obtained a lower bound for the first term. On the other hand, in view of (34) and because \(\left\Vert {\textstyle u}\right\Vert _1^2\leqslant 16 \left\Vert {\textstyle u_{S^\star }}\right\Vert _1^2\leqslant 16 \ell ^\star \left\Vert {\textstyle u}\right\Vert _2^2\), we have
as \(n \geqslant (16 \times 8) \ell ^\star (U_h^{2}/\gamma ^\star ) \log (8m/\delta )\) by assumption. In combination with (36), we find
Step 4. — We claim that, on the event E, we have
Indeed, on the left-hand side in (38) we have in virtue of (33), (34) and (35),
Since \(\ell ^\star \geqslant 1\) and \(\ell ^\star U_h^{2} \geqslant \sum _{k\in S^\star } P(h_k^2) \geqslant \gamma ^\star \), the assumed lower bound on n implies that \(n \geqslant 128 \log (8/\delta )\). As \(C = 9/2\), the factor \(\sqrt{2C}(1 + \sqrt{2 \log (8/\delta ) / (Cn)})\) is bounded by \(3 + \sqrt{2}/8\) and we get (38).
Step 5. — Recall \({\hat{u}} = {\hat{\beta }}_{n}^{\mathrm {lasso}}(f) - \beta ^{\star }(f)\). We claim that, on the event E, we have
To prove this result, we shall rely on the following lemma.
Lemma 7
If \(n \lambda \geqslant 2 \left\Vert {\textstyle H_c^T \epsilon _c^{(n)}}\right\Vert _{\infty }\) then, writing \({\hat{u}} = {\hat{\beta }}_{n}^{\mathrm {lasso}}(f) - \beta ^{\star }(f)\), we have \({\hat{u}} \in {\mathscr {C}}(S^\star ; 3)\) and
Proof
This is just a reformulation of the reasoning on p. 298 in (Tibshirani et al. 2015) with a slightly sharper upper bound. The vector \({\hat{\nu }}\) on the right-hand side of their Eq. (11.23) can be replaced by \({\hat{\nu }}_S\). For the sake of completeness, we provide the details.
In the proof we use the short-cuts \(\beta ^\star = \beta ^\star (f)\) and \({\hat{\beta }}_{n}^{\mathrm {lasso}}= {\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\). Recall \(\epsilon _c^{(n)} = f_c^{(n)} - H_c \beta ^\star (f)\) and define
Because \(G({{\hat{u}}} )\leqslant G(0)\), we have
From the triangle inequality
implying that
From Hölder’s inequality, we get
which leads to
Consequently, because \(\left\Vert {\textstyle H_c^T\epsilon _c^{(n)}}\right\Vert _\infty / n \leqslant \lambda /2\) by assumption, we obtain
The right-hand side must be nonnegative, whence \(\left\Vert {\textstyle {\hat{u}}_{\overline{S^\star }}}\right\Vert _1 \leqslant 3 \left\Vert {\textstyle {\hat{u}}_{S^\star }}\right\Vert _1\), i.e., \({\hat{u}} \in {\mathscr {C}}(S^\star ; 3)\). The bound in (40) follows as well. \(\square \)
On the event E, the conclusion of Lemma 7 is valid because the bound on \(\left\Vert {\textstyle H_c^T \epsilon _c^{(n)}}\right\Vert _\infty \) in (38) and the assumption on \(\lambda \) in Theorem 2 together imply that \(\lambda \geqslant 2 \left\Vert {\textstyle H_c^T \epsilon _c^{(n)}}\right\Vert _\infty / n\). The cone property of Lemma 7 yields \({\hat{u}} \in {\mathscr {C}}(S^\star ; 3)\) so that
Thanks to (37) and Lemma 7, and since \(\left|{\textstyle S^\star }\right| = \ell ^\star \), we get
It follows that \(\left\Vert {\textstyle {\hat{u}}_{S^\star }}\right\Vert _1 \leqslant 12 \ell ^\star \lambda /\gamma ^\star \). In combination with (41), we find (39).
Step 6. — Equation (32) gave a bound on the estimation error involving three terms. On the event E, these terms were shown to be bounded in (33), (34), and (39). It follows that, on E, we finally have
Divide by n and use \(48\sqrt{2} < 68\) to obtain (9). \(\square \)
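The estimator analyzed in Theorem 2 can be sketched end to end. The integrand, the polynomial control variate basis, the penalty level and the solver (ISTA, standing in for any Lasso solver) below are all hypothetical illustrative choices of ours, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2000, 40
truth = np.e - 1.0  # integral of exp over [0, 1]

# Hypothetical setup: integrand exp(x) on [0, 1] with centered polynomial
# controls h_k(x) = x^k - 1/(k+1), so that P(h_k) = 0.
x = rng.uniform(size=n)
f = np.exp(x)
H = np.column_stack([x**k - 1.0 / (k + 1) for k in range(1, m + 1)])

# Centered quantities, matching the notation H_c and f_c of the proofs.
Hc = H - H.mean(axis=0)
fc = f - f.mean()

# Lasso coefficients by ISTA (proximal gradient), a stand-in for any solver;
# lam plays the role of the penalty level in (5).
lam = 0.01
step = 1.0 / np.linalg.eigvalsh(Hc.T @ Hc / n).max()
beta = np.zeros(m)
for _ in range(5000):
    z = beta - step * (Hc.T @ (Hc @ beta - fc)) / n
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding

# LASSO control-variate estimate of P(f) = e - 1.
est = np.mean(f - H @ beta)
print(abs(f.mean() - truth), abs(est - truth))
```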
Proof of Theorem 3
Recall that \(S^\star = \{ j = 1, \ldots , m : \beta _j^\star (f) \ne 0\}\) with \(\ell ^\star = \left|{\textstyle S^\star }\right|\) and that \(\overline{S^\star } = \{1,\ldots ,m\} \setminus S^\star \). Further, \(H_{c,S^\star }\) is the \(n \times \ell ^\star \) matrix having columns \(H_{c,k}\) for \(k \in S^\star \), where \(H_{c,k}\) is the k-th column of \(H_c\).
Step 1. — We first establish some (non-probabilistic) properties of \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\). To this end, we consider the linear regression of the non-active control variates on the active ones: for \(k \in \overline{S^\star } = \{ j = 1, \ldots , m : \beta _j^\star (f) = 0 \}\), this produces the coefficient vector
Further, we consider the OLS oracle estimate \({\hat{\beta }}_{n}^{\star }\), which is the OLS estimator based upon the active control variates only, i.e.,
Our assumptions will imply that, with large probability, \(H_{c,S^{\star }}\) has rank \(\ell ^\star \), in which case
The following lemma provides a number of (non-probabilistic) properties of \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\), given certain conditions on \(H_c\) and \(\epsilon _c^{(n)}\). Recall that a norm \(\left\Vert {\textstyle \,\cdot \,}\right\Vert \) on \({\mathbb {R}}^p\) induces a matrix norm on \({\mathbb {R}}^{p \times p}\) via \(\left\Vert {\textstyle A}\right\Vert = \sup \{ \left\Vert {\textstyle A u}\right\Vert : u \in {\mathbb {R}}^p, \left\Vert {\textstyle u}\right\Vert = 1 \}\) for \(A \in {\mathbb {R}}^{p \times p}\).
Lemma 8
If \(H_{c,S^{\star }}\) has rank \(\ell ^\star \) and if there exists \(\kappa \in (0, 1]\) such that
then the minimizer \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\) in (5) is unique, with support \({\text {supp}}({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)) \subset S^\star \), and it satisfies
Proof
The proof of this result is contained in Tibshirani et al. (2015). The uniqueness of the LASSO solution and the property that it does not select inactive covariates follow directly from the proof of their Theorem 11.3. The only difference is that, in our case, the inequality (43) is an assumption, whereas in Tibshirani et al. (2015) it is a property of the Gaussian fixed design model. The approach in Tibshirani et al. (2015) is based upon checking the strict dual feasibility condition. The bound (44) is Eq. (11.37) in Tibshirani et al. (2015). \(\square \)
We slightly modify Lemma 8 to make the conditions (42) and (43) easier to check and to make the bound (44) easier to use.
Lemma 9
If there exists \(\nu > 0\) such that
and if there exists \(\kappa \in (0,1]\) such that
then the minimizer \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\) in (5) is unique, with support satisfying \({\text {supp}}({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)) \subset S^\star \), and it holds that
Proof
By (45), the smallest eigenvalue of the \(\ell ^\star \times \ell ^\star \) matrix \(H_{c,S^\star }^T H_{c,S^\star }\) is positive, so that it is invertible and \(H_{c,S^\star }\) has rank \(\ell ^\star \).
We show that (46) implies (42). For each \(k \in \overline{S^\star }\), the vector \({\hat{\theta }}_n^{(k)}\) has \(\ell ^\star \) components, so that
Because \({{\hat{\theta }}}_{n}^{(k)} \) is an OLS estimate, and because the largest eigenvalue of \((H_{c,S^{\star }}^T H_{c,S^{\star }})^{-1}\) is bounded from above by \((n \nu )^{-1}\), we obtain
Since \(\left\Vert {\textstyle x}\right\Vert _2 \leqslant \sqrt{m} \, \left\Vert {\textstyle x}\right\Vert _\infty \) for \(x \in {\mathbb {R}}^m\), we can conclude that
Combining the two bounds, we find that (46) indeed implies (42).
Next we show that (47) implies (43). For \(k \in \overline{S^\star }\), we have
Using (42) and (47) we deduce (43).
The conditions of Lemma 8 have been verified, and so its conclusion holds. We simplify the two terms in the upper bound (44). First, we use that
Second, for any matrix \(A \in {\mathbb {R}}^{p\times p}\), we have \(\left\Vert {\textstyle A}\right\Vert _\infty \leqslant \sqrt{p} \left\Vert {\textstyle A}\right\Vert _2\) (see, e.g., Horn and Johnson 2012, page 365), and we apply this to \((H_{c,S^{\star }}^T H_{c,S^{\star }})^{-1}\). In this way, the upper bound in (44) is dominated by
since the largest eigenvalue of \((H_{c,S^{\star }}^T H_{c,S^{\star }})^{-1}\) is at most \((n \nu )^{-1}\). Use (47) to further simplify the right-hand side, yielding (48). \(\square \)
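The two linear-algebra facts invoked in this proof, namely \(\left\Vert A\right\Vert _\infty \leqslant \sqrt{p}\, \left\Vert A\right\Vert _2\) and the identity \(\left\Vert (H^T H)^{-1}\right\Vert _2 = 1/\lambda _{\min }(H^T H)\) for a positive definite Gram matrix, can be sanity-checked numerically. The random matrix below is an arbitrary illustration, not tied to the model of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
M = rng.standard_normal((40, p))
A = M.T @ M                                 # random Gram matrix, a.s. positive definite
A_inv = np.linalg.inv(A)

norm_inf = np.abs(A_inv).sum(axis=1).max()  # induced infinity-norm: max absolute row sum
norm_2 = np.linalg.norm(A_inv, 2)           # spectral norm: largest singular value
lam_min = np.linalg.eigvalsh(A).min()       # smallest eigenvalue of A
```

For symmetric positive definite \(A\), the spectral norm of \(A^{-1}\) equals the reciprocal of the smallest eigenvalue of \(A\), which is exactly how the bound \((n\nu )^{-1}\) enters the proof.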
Step 2. — Let \(\delta \in (0, 1)\) and \(n = 1, 2, \ldots \). In a similar way as in the proof of Theorem 1, we construct an event of probability at least \(1-\delta \). This time, we need five inequalities to hold simultaneously.
-
Because \(\epsilon \in {\mathscr {G}}( \tau ^2)\), with probability at least \(1-\delta /5\),
$$\begin{aligned} \left| \sum _{i=1}^n \epsilon (X_i) \right| \leqslant \sqrt{2n \tau ^2 \log (10/\delta )}. \end{aligned}$$(49)
-
In view of (Boucheron et al. 2013, Lemma 2.2) and Assumption 2, we have \(h_k\in {\mathscr {G}}(U_h^2)\) for all \(k=1,\ldots , m\). Hence we can apply Lemma 1 with \(p=m\) to get that, with probability at least \(1-\delta /5\),
$$\begin{aligned} \max _{k=1,\ldots , m} \left| \sum _{i=1}^n h_k(X_i) \right| \leqslant \sqrt{2n U_h^2\log (10m/\delta )}. \end{aligned}$$(50)
-
By virtue of Assumptions 1 and 2, we can apply Lemma 4 to have \(h_k \epsilon \in {\mathscr {G}}(C U_h^2 \tau ^2 )\), where \(C = 9/2\). Hence we can apply Lemma 1 to get that, with probability at least \(1-\delta /5\),
$$\begin{aligned} \max _{k=1,\ldots , m} \left| \sum _{i=1}^n h_k(X_i) \epsilon (X_i) \right| \leqslant \sqrt{2C n \tau ^2 U_h^2\log (10m/\delta )}. \end{aligned}$$(51)
-
Recall that \(B^\star = \sup _{x \in {\mathscr {X}}} h_{S^\star }^T(x) G_{S^\star }^{-1} h_{S^\star }(x)\), so that
$$\begin{aligned} B^\star \leqslant \lambda _{\max }(G_{S^\star }^{-1}) \sup _{x \in {\mathscr {X}}} h_{S^\star }^T(x) h_{S^\star }(x) \leqslant \ell ^\star U_h^2 / \gamma ^{\star \star }. \end{aligned}$$
The assumption on n easily implies that \(n \geqslant 8 B^{\star } \log (5\ell ^\star /\delta )\). Applying Lemma 2 with \(p=\ell ^\star \), \(g = h_{S^\star }\), and \(\delta \) replaced by \(\delta /5\), we find that, with probability at least \(1-\delta /5\),
$$\begin{aligned} \left\Vert {\textstyle H_{S^{\star }} u}\right\Vert _2^2 \geqslant n\gamma ^{\star \star } \left\Vert {\textstyle u}\right\Vert ^2_2/2, \qquad \forall u\in {\mathbb {R}}^{\ell ^\star }. \end{aligned}$$(52)
-
Finally, because \(\left|{\textstyle h_j(x)}\right| \leqslant U_h\) for all \(x \in {\mathscr {X}}\) and \(j \in \{1,\ldots ,m\}\) and because \(P(h_k h_j) = 0\) for all \((k, j) \in \overline{S^\star } \times S^\star \), we have \(h_kh_j \in {\mathscr {G}}(U_h^4)\) for such k and j, and thus, with probability at least \(1-\delta /5\),
$$\begin{aligned} \max _{k\in \overline{S^\star }} \max _{j\in S^{\star }} \left| \sum _{i=1}^n h_k(X_i) h_j(X_i) \right| \leqslant \sqrt{2nU_h^4 \log (10 \ell ^\star m /\delta ) }. \end{aligned}$$(53)
By the union bound, the event, say E, on which (49), (50), (51), (52) and (53) are satisfied simultaneously has probability at least \(1-\delta \). We work on the event E for the rest of the proof.
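Maximal inequalities such as (50) can be illustrated empirically. The sketch below uses hypothetical control variates \(h_k(x) = \cos (2\pi k x)\), which are bounded by \(U_h = 1\) and have mean zero under the uniform distribution on \((0,1)\); over many replications, the bound in (50) should be violated with frequency below \(\delta /5\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, U_h, delta, reps = 400, 30, 1.0, 0.1, 500
# Right-hand side of (50):
thresh = np.sqrt(2 * n * U_h**2 * np.log(10 * m / delta))

k = np.arange(1, m + 1)
X = rng.uniform(size=(reps, n))
# sums[r, j] = sum_i h_{j+1}(X_i) in replication r
sums = np.cos(2 * np.pi * X[:, :, None] * k).sum(axis=1)
violation_rate = (np.abs(sums).max(axis=1) > thresh).mean()
```

In practice the empirical violation rate is essentially zero here, well below \(\delta /5 = 0.02\), reflecting the slack in sub-Gaussian tail bounds.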
Step 3. — On the event E, we have
where \(\alpha \in (0, 1/2)\) is an absolute constant whose value will be fixed in Step 6(ii). We have
and thus, by the Cauchy–Schwarz inequality and by (52),
In view of (50), we have
We thus get
A sufficient condition for (54) is thus that the term in square brackets is at least \(\alpha \), i.e.,
Since \(\ell ^\star \geqslant 1\) and \(U_{h}^2 \geqslant \gamma ^{\star \star }\), a condition of the form
is thus sufficient, with much to spare, provided \(\rho > 2/(1/2-\alpha )\). In Step 6(ii), we will choose \(\alpha \) in such a way that the constant \(\rho = 70\) appearing in the statement of the theorem is sufficient.
Step 4. — On the event E, we have
Indeed, writing \(A = \overline{S^\star } \times S^\star \), by virtue of (50) and (53), the left-hand side is bounded by
which is (56).
Step 5. — On the event E, we have
The proof is the same as that of the first part of (38).
Step 6. — We will verify that on the event E, the three assumptions of Lemma 9 are satisfied with \(\kappa = 1/2\) and \(\nu = \alpha \gamma ^{\star \star }\), with \(\alpha \) as in Step 3.
-
(i)
Eq. (45) with \(\nu = \alpha \gamma ^{\star \star }\) is just (54).
-
(ii)
Eq. (46) with \(\nu = \alpha \gamma ^{\star \star }\) and \(\kappa = 1/2\) follows from (56) provided we have
$$\begin{aligned}&\frac{\ell ^\star }{\alpha \gamma ^{\star \star } n} \left( \sqrt{2nU_h^4 \log (10 \ell ^\star m /\delta ) } + 2 U_h^2\log (10m/\delta ) \right) \\&\quad \leqslant 1 - \frac{1}{2}. \end{aligned}$$
To check whether this is satisfied, we will make use of the elementary inequality (see footnote 2)
$$\begin{aligned} \forall (a,b,c) \in (0, \infty )^3, \, \forall x \geqslant \sqrt{b^2 + 4ac}/a, \quad a x^2 \geqslant bx + c. \end{aligned}$$with \(x = \sqrt{n}\) and
$$\begin{aligned} a= & {} \alpha \gamma ^{\star \star } / (2\ell ^\star ), \\ b= & {} \sqrt{2 U_h^4 \log (10 \ell ^\star m / \delta )}, \\ c= & {} 2U_h^2 \log (10m/\delta ). \end{aligned}$$
It suffices that \(n = x^2\) is bounded from below by \((b^2 + 4ac)/a^2 = (b/a)^2 + 4c/a\), which is
$$\begin{aligned}&\frac{2 U_h^4 \log (10 \ell ^\star m / \delta )}{(\alpha \gamma ^{\star \star } / (2\ell ^\star ))^2} + 4 \frac{2U_h^2 \log (10m/\delta )}{\alpha \gamma ^{\star \star } / (2\ell ^\star )} \\&\quad =\frac{8}{\alpha ^2} \log (10 \ell ^\star m/\delta ) \left( \frac{\ell ^\star U_h^2}{ \gamma ^{\star \star }}\right) ^2\\&+\frac{16}{\alpha } \log (10 m/\delta ) \left( \frac{\ell ^\star U_h^2}{ \gamma ^{\star \star }}\right) . \end{aligned}$$But \(\ell ^\star \geqslant 1\) and \(\gamma ^{\star \star } \leqslant (1/\ell ^\star ) \sum _{j \in S^\star } P(h_j^2) \leqslant U_h^2\), so that a sufficient condition is that
$$\begin{aligned} n \geqslant \left( \frac{8}{\alpha ^2} + \frac{16}{\alpha }\right) \log (10 \ell ^\star m/\delta ) [\ell ^\star (U_h^2 / \gamma ^{\star \star })]^2. \end{aligned}$$The constant \(\rho \) in (55) must thus be such that
$$\begin{aligned} \rho \geqslant \max \left( \frac{2}{1/2 - \alpha }, \frac{8}{\alpha ^2} + \frac{16}{\alpha } \right) . \end{aligned}$$The minimum of the right-hand side as a function of \(\alpha \in (0, 1/2)\) occurs at \(\alpha = \sqrt{2}/3\) and is equal to \(2/(1/2 - \sqrt{2}/3) \approx 69.9\). Taking \(\rho = 70\) as in the assumption on n is thus sufficient.
-
(iii)
Eq. (47) with \(\kappa = 1/2\) follows from (57), since
$$\begin{aligned} \sqrt{n \tau ^2 U_h^2\log (10m/\delta )} \left( \sqrt{2C} + 2\sqrt{ \log (10/\delta ) / n} \right) \leqslant \lambda n /4 \end{aligned}$$by the assumed lower bound on \(\lambda \). Indeed, since \(\ell ^\star \geqslant 1\) and \(U_{h}^{2} \geqslant \gamma ^{\star \star }\), the assumed lower bounds on n imply that \(n \geqslant 70 \log (10/\delta )\), so that \( 2\sqrt{ \log (10/\delta ) / n}\) is bounded by \(2/\sqrt{70}\); recall \(C = 9/2\). Since \(4 \cdot (3 + 2/\sqrt{70}) \approx 12.9\), the assumed lower bound for \(\lambda \) suffices.
Step 7. — By the previous step, the conclusions of Lemma 9 with \(\kappa = 1/2\) and \(\nu = \alpha \gamma ^{\star \star }\) hold on the event E, where \(\alpha = \sqrt{2}/3\) was specified in Step 6(ii). The minimizer \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\) in (5) is thus unique and we have \({\text {supp}}({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)) \subset S^\star \).
To show the reverse inclusion, we need to verify that \(\left|{\textstyle {\hat{\beta }}_{n,k}^{\mathrm {lasso}}(f)}\right| > 0\) for all \(k \in S^\star \). To this end, we apply (48) with \(\kappa = 1/2\) and \(\nu = \alpha \gamma ^{\star \star }\), which becomes
For any \(k\in S^\star \), we thus have
But for \(\alpha = \sqrt{2}/3\), we have \(5/(4\alpha ) \approx 2.65\). Since \(\min _{j \in S^\star } \left|{\textstyle \beta _j^\star (f)}\right| > 3 \sqrt{\ell ^{\star }} \lambda / \gamma ^{\star \star }\) by the assumed upper bound for \(\lambda \), we find \(\left|{\textstyle {\hat{\beta }}_{n,k}^{\mathrm {lasso}}(f)}\right| > 0\), as required. \(\square \)
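The numerical constants appearing in Steps 6 and 7 of the proof can be verified in a few lines (a sanity check only):

```python
import math

alpha = math.sqrt(2) / 3                    # the choice made in Step 6(ii)

# Step 6(ii): the two constraints on rho as functions of alpha in (0, 1/2).
# The first increases and the second decreases, so max(g1, g2) is minimized
# where the curves cross; they cross exactly at alpha = sqrt(2)/3, where
# both equal 36 + 24*sqrt(2) ≈ 69.94 < 70.
g1 = 2.0 / (0.5 - alpha)
g2 = 8.0 / alpha**2 + 16.0 / alpha

# Step 6(iii): with C = 9/2, 4*(sqrt(2*C) + 2/sqrt(70)) ≈ 12.96 < 13.
c_iii = 4 * (math.sqrt(2 * 4.5) + 2 / math.sqrt(70))

# Step 7: 5/(4*alpha) ≈ 2.652 < 3.
c_vii = 5 / (4 * alpha)
```

That \(g_1\) and \(g_2\) agree exactly at \(\alpha = \sqrt{2}/3\) follows from \(12/(3 - 2\sqrt{2}) = 12(3 + 2\sqrt{2}) = 36 + 24\sqrt{2}\), which is why taking \(\rho = 70\) in the statement of the theorem suffices.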
Proof of Theorem 4
Recall that the LSLASSO estimator is defined as an OLS estimate computed on the active variables selected by the LASSO based on a subsample of size \(N \in \{1,\ldots ,n\}\). Let \({\hat{\beta }}_N^{\mathrm {lasso}}(f)\) denote the LASSO coefficient vector in (5) based on the subsample \(X_1, \ldots , X_N\) and let
denote the estimated active set of \({\hat{\ell }} = |{\hat{S}}_N|\) control variates. The LSLASSO estimate \({\hat{\alpha }}_n^{\mathrm {lslasso}}(f)\) based on the full sample \(X_1, \ldots , X_n\) is defined as the OLS estimator based on the control variates \(h_k\) for \(k \in {\hat{S}}_N\): writing \(H_{{\hat{S}}_N}\) for the \(n \times {\hat{\ell }}\) matrix with columns \((h_k(X_i))_{i=1}^n\) for \(k \in {\hat{S}}_N\), we have
Therefore, we can derive a concentration inequality by combining the support recovery property (Theorem 3) with the concentration inequality for the OLS estimate (Theorem 1) applied to the active control variates only.
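The two-stage LSLASSO scheme just described, Lasso selection on a subsample of size N followed by OLS with the selected control variates on the full sample, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the plain coordinate-descent Lasso, the test integrand, and the cosine control variates are all assumptions made for the example.

```python
import numpy as np

def lasso_cd(H, y, lam, n_sweeps=200):
    """Coordinate-descent Lasso for (1/(2n))||y - c - H b||^2 + lam*||b||_1;
    the intercept c is handled by centering y and the columns of H."""
    n, m = H.shape
    Hc = H - H.mean(axis=0)
    yc = y - y.mean()
    col_sq = (Hc ** 2).sum(axis=0) / n
    b = np.zeros(m)
    for _ in range(n_sweeps):
        for j in range(m):
            r = yc - Hc @ b + Hc[:, j] * b[j]          # partial residual
            rho = Hc[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def lslasso(f_vals, H, N, lam):
    """LSLASSO: Lasso on the first N samples selects the active control
    variates; the OLS intercept on the full sample then estimates P(f)."""
    S = np.flatnonzero(np.abs(lasso_cd(H[:N], f_vals[:N], lam)) > 1e-10)
    Z = np.column_stack([np.ones(len(f_vals)), H[:, S]])
    coef, *_ = np.linalg.lstsq(Z, f_vals, rcond=None)
    return coef[0], S                                  # estimate of P(f), selected set

rng = np.random.default_rng(0)
n, N, m = 4000, 1000, 20
X = rng.uniform(size=n)
# Hypothetical control variates: mean zero under Uniform(0, 1), bounded by 1.
H = np.column_stack([np.cos(2 * np.pi * k * X) for k in range(1, m + 1)])
f_vals = 1.0 + H[:, 0] + 0.5 * H[:, 1]                 # P(f) = 1; only two active controls
est, S = lslasso(f_vals, H, N, lam=0.05)
```

With only two of the twenty control variates truly active, the first-stage Lasso keeps the second-stage OLS problem small, which is exactly the computational rationale for LSLASSO discussed in the paper.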
Let \(\delta >0\) and \(n \geqslant 1\). We construct an event with probability at least \(1-\delta \) on which the support recovery property and the concentration inequality for the OLS estimate hold simultaneously. Recall that \(S^\star = {\text {supp}}(\beta ^\star (f))\) is the true set of \(\ell ^\star = \left|{\textstyle S^\star }\right|\) active control variables.
-
Thanks to Theorem 3, with probability at least \(1-\delta /2\),
$$\begin{aligned} {{\hat{S}}}_N = S^\star . \end{aligned}$$(58)
Indeed, the conditions on N and \(\lambda \) in Theorem 4 are such that we can apply Theorem 3 with n and \(\delta \) replaced by N and \(\delta /2\), respectively.
-
Thanks to Theorem 1, with probability at least \(1-\delta /2\),
$$\begin{aligned} \left|{\textstyle {\hat{\alpha }}_{n}^{\mathrm {ols}}(f,h_{S^\star }) - P(f)}\right|\leqslant & {} \sqrt{2 \log (16/\delta )} \frac{\tau }{\sqrt{n}} \nonumber \\&+ 58 \sqrt{B^\star \ell ^\star \log (16 \ell ^\star /\delta ) \log (8/\delta )}\frac{\tau }{n}, \end{aligned}$$(59)
where, for any \(S\subset \{1,\ldots , m\}\), \( {\hat{\alpha }}_{n}^{\mathrm {ols}}(f,h_{S})\) denotes the OLS estimate of P(f) based on the control variates \(h_{S}\). Indeed, we apply Theorem 1 with h and \(\delta \) replaced by \(h_{S^\star }\) and \(\delta /2\), respectively. The required lower bound on n is now \(n \geqslant \max \left( 18 B^\star \log (8 \ell ^\star /\delta ), 75 \ell ^\star \log (8/\delta ) \right) \). By assumption, \(N \geqslant 75 [\ell ^\star (U_h^2/\gamma ^{\star \star })]^2 \log (20 \ell ^\star /\delta )\), so the required lower bound is already satisfied by N, and hence certainly by \(n \geqslant N\).
By the union bound, the event on which (58) and (59) are satisfied simultaneously has probability at least \(1-\delta \). On this event, we can, by definition of \({\hat{\alpha }}_n^\mathrm {lslasso}(f)\) and by (58), write the integration error as
But the right-hand side is bounded by (59), yielding (13), as required.
Cite this article
Leluc, R., Portier, F. & Segers, J. Control variate selection for Monte Carlo integration. Stat Comput 31, 50 (2021). https://doi.org/10.1007/s11222-021-10011-z