Abstract
Monte Carlo integration with variance reduction by means of control variates can be implemented as the ordinary least squares estimator of the intercept in a multiple linear regression model with the integrand as response and the control variates as covariates. Even without special knowledge of the integrand, significant efficiency gains can be obtained if the control variate space is sufficiently large. Incorporating a large number of control variates in the ordinary least squares procedure may, however, result in (i) a certain instability of the ordinary least squares estimator and (ii) a possibly prohibitive computation time. Regularizing the ordinary least squares estimator by preselecting appropriate control variates via the Lasso turns out to increase the accuracy without additional computational cost. The findings in the numerical experiments are confirmed by concentration inequalities for the integration error.
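The estimator described in the abstract can be sketched in a few lines. The integrand \(\exp \) on [0, 1] and the centered polynomial control variates \(h_k(x) = x^k - 1/(k+1)\) below are hypothetical choices for illustration only, not the paper's experimental setup; the fitted OLS intercept is the control-variate estimate of the integral.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
truth = np.e - 1.0  # integral of exp over [0, 1]

# Hypothetical example: integrand exp(x), uniform sampling on [0, 1], and
# centered polynomial control variates h_k(x) = x^k - 1/(k+1) with P(h_k) = 0.
x = rng.uniform(size=n)
f = np.exp(x)
H = np.column_stack([x**k - 1.0 / (k + 1) for k in range(1, 9)])

# OLS of f on (1, h_1, ..., h_8): the fitted intercept estimates P(f).
X = np.column_stack([np.ones(n), H])
coef, *_ = np.linalg.lstsq(X, f, rcond=None)
ols_cv, vanilla = coef[0], f.mean()
print(abs(vanilla - truth), abs(ols_cv - truth))
```

Because the polynomial span approximates \(\exp \) very well here, the intercept estimate is orders of magnitude more accurate than the vanilla Monte Carlo average.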
Notes
The convex parabola \(x \mapsto a x^2 - b x - c\) has zeroes at \(x_{\pm } = (b \pm \sqrt{b^2 + 4ac}) / (2a)\), and \(x_-< 0< x_+ < \sqrt{b^2 + 4ac}/a\).
References
Asuncion, A., Newman, D.: UCI machine learning repository (2007). https://archive.ics.uci.edu/ml/index.php
Avramidis, A.N., Wilson, J.R.: A splitting scheme for control variates. Oper. Res. Lett. 14(4), 187–198 (1993)
Belloni, A., Chernozhukov, V.: Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547 (2013)
Belomestny, D., Iosipoi, L., Moulines, E., Naumov, A., Samsonov, S.: Variance reduction for Markov chains with application to MCMC. Stat. Comput. 30, 973–997 (2020)
Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)
Brooks, S.P., Catchpole, E.A., Morgan, B.J.T.: Bayesian animal survival estimation. Stat. Sci. 15(4), 357–376 (2000)
Brosse, N., Durmus, A., Meyn, S., Moulines, É., Radhakrishnan, A.: Diffusion approximations and control variates for MCMC. arXiv preprint arXiv:1808.01665 (2018)
Caflisch, R.E.: Monte Carlo and quasi-Monte Carlo methods. Acta Numer. 7, 1–49 (1998)
Davis, R.A., do Rêgo Sousa, T., Klüppelberg, C.: Indirect inference for time series using the empirical characteristic function and control variates. J. Time Ser. Anal. (2019)
Glasserman, P.: Monte Carlo Methods in Financial Engineering, vol. 53. Springer, Berlin (2013)
Glynn, P.W., Szechtman, R.: Some new perspectives on the method of control variates. In: Monte Carlo and Quasi-Monte Carlo Methods 2000, Springer, pp 27–49 (2002)
Gobet, E., Labart, C.: Solving BSDE with adaptive control variate. SIAM J. Numer. Anal. 48(1), 257–277 (2010)
Gorman, R.P., Sejnowski, T.J.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1(1), 75–89 (1988)
Gower, R., Le Roux, N., Bach, F.: Tracking the gradients using the hessian: A new look at variance reducing stochastic methods. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 707–715 (2018)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)
Hsu, D., Kakade, S.M., Zhang, T.: Random design analysis of ridge regression. In: Conference on Learning Theory, JMLR Workshop and Conference Proceedings, pp 9.1–9.24 (2012)
Hsu, D., Kakade, S.M., Zhang, T.: Random design analysis of ridge regression. Found. Comput. Math. 14(3), 569–600 (2014)
Javanmard, A., Montanari, A.: Debiasing the lasso: optimal sample size for Gaussian designs. Ann. Stat. 46(6A), 2593–2622 (2018)
Jie, T., Abbeel, P.: On a connection between importance sampling and the likelihood ratio policy gradient. In: Advances in Neural Information Processing Systems, pp 1000–1008 (2010)
Jin, C., Netrapalli, P., Ge, R., Kakade, S.M., Jordan, M.: A short note on concentration inequalities for random vectors with subGaussian norm. arXiv preprint arXiv:1902.03736 (2019)
Lebreton, J.D., Burnham, K.P., Clobert, J., Anderson, D.R.: Modeling survival and testing biological hypotheses using marked animals: a unified approach with case studies. Ecol. Monogr. 62(1), 67–118 (1992)
Liu, H., Feng, Y., Mao, Y., Zhou, D., Peng, J., Liu, Q.: Action-dependent control variates for policy optimization via stein identity. In: ICLR 2018 Conference (2018)
Marzolin, G.: Polygynie du Cincle plongeur (Cinclus cinclus) dans les côtes de Lorraine. Oiseau et la Revue Francaise d’Ornithologie 58(4), 277–286 (1988)
Newey, W.K.: Convergence rates and asymptotic normality for series estimators. J. Econom. 79(1), 147–168 (1997)
Nott, D.J., Drovandi, C.C., Mengersen, K., Evans, M., et al.: Approximation of Bayesian predictive \(p\)-values with regression ABC. Bayesian Anal. 13(1), 59–83 (2018)
Oates, C.J., Girolami, M., Chopin, N.: Control functionals for Monte Carlo integration. J. R. Stat. Soc. Ser. B (Statistical Methodology) 79(3), 695–718 (2017)
Owen, A., Zhou, Y.: Safe and effective importance sampling. J. Am. Stat. Assoc. 95(449), 135–143 (2000)
Owen, A.B.: Monte Carlo Theory, Methods and Examples (2013)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Portier, F., Delyon, B.: Asymptotic optimality of adaptive importance sampling. Adv. Neural Inf. Process. Syst. 31, 3134–3144 (2018)
Portier, F., Segers, J.: Monte Carlo integration with a growing number of control variates. J. Appl. Probab. 56, 1168–1186 (2019)
Ranganath, R., Gerrish, S., Blei, D.: Black box variational inference. In: Artificial Intelligence and Statistics, pp 814–822 (2014)
Rudin, W.: Real and Complex Analysis. Tata McGraw-Hill Education, New York (2006)
South, L.F., Oates, C.J., Mira, A., Drovandi, C.: Regularised zero-variance control variates for high-dimensional variance reduction. arXiv preprint arXiv:1811.05073 (2018)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58(1), 267–288 (1996)
Tibshirani, R., Wainwright, M., Hastie, T.: Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, Boca Raton (2015)
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015). arXiv:1501.01571
van de Geer, S.A., Bühlmann, P.: On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392 (2009)
Wang, C., Chen, X., Smola, A.J., Xing, E.P.: Variance reduction for stochastic gradient optimization. In: Advances in Neural Information Processing Systems, pp 181–189 (2013)
Zhang, A., Brown, L.D., Cai, T.T.: Semi-supervised inference: general theory and estimation of means. Ann. Stat. 47(5), 2538–2566 (2019)
Acknowledgements
The authors are grateful to the Associate Editor and two anonymous Reviewers for their many valuable comments and interesting suggestions. Further, the authors gratefully acknowledge the exceptionally careful proofreading done by Aigerim Zhuman (UCLouvain).
Appendices
Auxiliary results
Lemma 1
(Sub-Gaussian) Let \(X_1,\ldots ,X_n \) be independent and identically distributed random variables in \(({\mathscr {X}}, {\mathscr {A}})\) with distribution P. Let \(\varphi _1,\ldots ,\varphi _p \) be real-valued functions on \({\mathscr {X}}\) such that \(P(\varphi _k) = 0\) and \(\varphi _k \in {\mathscr {G}}(\tau ^2)\) for all \(k=1,\ldots ,p\). Then for all \(\delta > 0\), we have with probability at least \(1-\delta \),
Proof
For each \(k=1,\ldots ,p\), the centered random variable \(\sum _{i=1}^n \varphi _k(X_i)\) is sub-Gaussian with variance factor \(n \tau ^2\). By the union bound and by Chernoff’s inequality, we have, for each \(t > 0\),
Set \(t = \sqrt{2n\tau ^2\log (2p/\delta )}\) to find the result. \(\square \)
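As a sanity check of Lemma 1, one can simulate a concrete sub-Gaussian family (our choice: independent random signs, which lie in \({\mathscr {G}}(1)\) by Hoeffding's lemma, so \(\tau ^2 = 1\)) and verify that the maximal inequality holds in at least a \(1-\delta \) fraction of replications:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, delta = 200, 50, 0.05
tau2 = 1.0  # variables taking values in [-1, 1] have variance factor 1 (Hoeffding's lemma)
bound = np.sqrt(2 * n * tau2 * np.log(2 * p / delta))

reps = 2000
hits = 0
for _ in range(reps):
    # p centered sums of n independent random signs
    S = rng.choice([-1.0, 1.0], size=(n, p)).sum(axis=0)
    hits += int(np.max(np.abs(S)) <= bound)
coverage = hits / reps
print(coverage)
```

The empirical coverage is well above \(1-\delta \), reflecting the slack in the union bound.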
Lemma 2
(Smallest eigenvalue lower bound) Let \(X_1,\ldots ,X_n\) be independent and identically distributed random variables in \(({\mathscr {X}}, {\mathscr {A}})\) with distribution P. Let \(g = (g_1, \ldots , g_p)^T\) in \(L_2(P)^p\) be such that the \(p \times p\) Gram matrix \(G = P(g g^T)\) satisfies \(\lambda _{\min }(G) > 0\). Define the transformation \({{\tilde{g}}} = G^{-1/2} g\) and put \(B_{{{\tilde{g}}}} := \sup _{x \in {\mathscr {X}}} \left\Vert {\textstyle {{\tilde{g}}}(x)}\right\Vert _2^2\). For \(\delta \in (0, 1)\), the empirical Gram matrix \({\hat{G}}_n = P_n(g g^T)\) satisfies, with probability at least \(1-\delta \),
Proof
Suppose that the result is true in the special case that G is the identity matrix. In case of a general Gram matrix G, we could then apply the result for the special case to the vector of functions \({{\tilde{g}}} = G^{-1/2} g\), whose Gram matrix is the identity matrix. We would get that \(\lambda _{\min }( P_n({{\tilde{g}}} {{\tilde{g}}}^T)) > 1-\eta \) with probability at least \(1-\delta \). Since \(P_n({{\tilde{g}}} {{\tilde{g}}}^T) = G^{-1/2} {\hat{G}}_n G^{-1/2}\) and since \(u^T G^{-1} u \leqslant 1 / \lambda _{\min }(G)\) for every unit vector \(u \in {\mathbb {R}}^p\), we would have
It would then follow that
as required. Hence we only need to show the result for \(G = I\), in which case \({{\tilde{g}}} = g\).
We apply the matrix Chernoff inequality in (Tropp 2015, Theorem 5.1.1) to the random matrices \(n^{-1} g(X_i)g(X_i)^T\). These matrices are independent and symmetric with dimension \(p \times p\). Their minimum and maximum eigenvalues are between 0 and \(L = B_g/n\), with \(B_g = \sup _{x \in {\mathscr {X}}} \lambda _{\max } (g(x) g(x)^T) = \sup _{x \in {\mathscr {X}}} \left\Vert {\textstyle g(x)}\right\Vert ^2_2\). Their sum is equal to \(P_n(g g^T) = {\hat{G}}_n\), whose expectation is \(G = I\) by assumption. In the notation of the cited theorem, we have \(\mu _{\min } = \lambda _{\min }(G) = 1\), and thus, by Eq. (5.1.5) in that theorem, we have, for \(\eta \in [0, 1)\),
The term in square brackets is bounded above by \(\exp (-\eta ^2/2)\). Indeed, we have, for \(\eta \in [0, 1)\),
and
It follows that
Solving \(p \exp \left( - \frac{\eta ^2 n}{2 B_g} \right) = \delta \) in \(\eta \), we find that, with probability at least \(1-\delta \),
\(\square \)
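Lemma 2 can be illustrated numerically. The sketch below uses the orthonormal cosine system \(g_k(x) = \sqrt{2} \cos (k \pi x)\) on [0, 1] (a hypothetical choice with \(G = I\), so that \({\tilde{g}} = g\)) and checks on one draw the eigenvalue bound \(\lambda _{\min }({\hat{G}}_n) \geqslant 1 - \eta \) with \(\eta \) solving \(p \exp (-\eta ^2 n / (2 B_g)) = \delta \), as in the last step of the proof:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, delta = 5000, 5, 0.05

# Orthonormal system on [0, 1]: g_k(x) = sqrt(2) cos(k pi x), so G = P(g g^T) = I.
x = rng.uniform(size=n)
g = np.column_stack([np.sqrt(2) * np.cos(k * np.pi * x) for k in range(1, p + 1)])
G_hat = g.T @ g / n  # empirical Gram matrix

B_g = 2 * p  # sup_x ||g(x)||_2^2 <= 2p for this system
eta = np.sqrt(2 * B_g * np.log(p / delta) / n)  # solves p exp(-eta^2 n / (2 B_g)) = delta
lam_min = np.linalg.eigvalsh(G_hat).min()
print(lam_min, 1 - eta)
```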
Lemma 3
(Upper bound of moments) Let X be a random variable such that \({{\mathbb {E}}}(|X|^{2p}) \leqslant 2^{p+1} p!\) for every integer \(p \geqslant 1\). Then
in which it is implicitly understood that the series on the left-hand side converges.
Proof
Let \(\lambda \in {\mathbb {R}}\). We split the series in terms with even and odd indices k, leading to
We will bound the series on the odd indices in terms of the series on the even indices.
Since the geometric mean of two nonnegative numbers is bounded by their arithmetic mean, we have, for all \(x \geqslant 0\) and all \(a > 0\),
Applying the previous inequality to \(x = \lambda X\) and scalars \(a_p > 0\) to be chosen later,
Here, \(\mathbb {1}\) denotes an indicator function. We obtain
Define \(b_p = a_p (2p+1)\) and use the hypothesis on \({{\mathbb {E}}}(|X|^{2p})\) to see that, for any constants \(b_p > 0\),
The objective is to find a constant \(c > 0\) as small as possible and such that the right-hand side is bounded by \(\exp (c \lambda ^2) = 1 + \sum _{p=1}^\infty c^p \lambda ^{2p} / p!\). Comparing coefficients, this means that we need to determine scalars \(b_p > 0\) and \(c > 0\) in such a way that for all \(p = 1, 2,\ldots \)
or, equivalently,
The case \(p = 1\) gives
showing that, with this proof technique, we will always find \(c > 2\). Setting \(b_p \equiv b > 0\) for all integer \(p \geqslant 1\) and \(c = 2 + 1/b\), inequality (18) is automatically satisfied, so it remains to find \(b > 0\) such that, for all \(p = 2, 3, \ldots \),
The left-hand side is decreasing in p whereas the right-hand side is increasing in p. It is thus sufficient to have the inequality satisfied for \(p = 2\), i.e.,
Equating both sides leads to a nonlinear equation in b that can be solved numerically, giving the root \(b \approx 4.006156\). With \(b = 4\), inequality (19) is satisfied, as can be checked directly (\(91/72 \leqslant 81/64\)). We conclude that \(c = 2 + 1/4 = 9/4\) is a valid choice. \(\square \)
Note that the series in (17) starts at \(k = 2\). If also \({{\mathbb {E}}}(X) = 0\), the left-hand side in (17) is an upper bound for \({{\mathbb {E}}}(\exp (\lambda X))\), and we obtain the following corollary.
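The proof's numerical claims admit a mechanical check. The sketch below verifies the inequality \(91/72 \leqslant 81/64\) used with \(b = 4\), and also checks, for a concrete instance of our own choosing (not used in the paper), that the standard Gaussian satisfies the moment hypothesis of Lemma 3, since \({{\mathbb {E}}}(|X|^{2p}) = (2p-1)!! \leqslant 2^{p+1} p!\):

```python
from math import factorial

# The standard Gaussian has E|X|^{2p} = (2p-1)!! = C(2p,p) p! / 2^p <= 2^{p+1} p!,
# so it satisfies the hypothesis of Lemma 3.
def gaussian_abs_moment_2p(p):
    out = 1
    for j in range(1, 2 * p, 2):  # (2p-1)!! = 1 * 3 * ... * (2p-1)
        out *= j
    return out

moment_ok = all(
    gaussian_abs_moment_2p(p) <= 2 ** (p + 1) * factorial(p) for p in range(1, 30)
)

# The arithmetic claim used to justify b = 4 in the proof: 91/72 <= 81/64.
claim_ok = 91 * 64 <= 81 * 72
print(moment_ok, claim_ok)
```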
Corollary 1
Let Z be a centered random variable such that
Then \(\log {{\mathbb {E}}}(\exp (\lambda Z)) \leqslant 9 \lambda ^2/4\) for all \(\lambda \in {\mathbb {R}}\), i.e., \(Z \in {\mathscr {G}}(9/2)\).
Lemma 4
Let (X, Y) be a pair of uncorrelated random variables. If \(X \in {\mathscr {G}}(\nu )\) and \(|Y| \leqslant \kappa \) for some \(\nu > 0\) and \(\kappa > 0\), then \(XY \in {\mathscr {G}}((9/2)\kappa ^2 \nu )\).
Proof
The random variable \(X/\sqrt{\nu }\) is sub-Gaussian with variance factor 1. As on page 25 in Boucheron et al. (2013), this implies that \({{\mathbb {P}}}(|X/\sqrt{\nu }| > t) \leqslant 2 \exp (-t^2/2)\) for all \(t \geqslant 0\) and thus \({{\mathbb {E}}}[|X/\sqrt{\nu }|^{2p}] \leqslant 2^{p+1} p!\) for all integer \(p \geqslant 1\) (see (Boucheron et al. 2013, Theorem 2.1)).
Let \(Z = XY / (\sqrt{\nu } \kappa )\). Since X is centered and X and Y are uncorrelated, XY is centered too, and therefore so is Z. From the previous paragraph, we have \({{\mathbb {E}}}(|Z|^{2p}) \leqslant {{\mathbb {E}}}(|X/\sqrt{\nu }|^{2p}) \leqslant 2^{p+1} p!\) for all integer \(p \geqslant 1\). Corollary 1 gives for all \(\lambda \in {\mathbb {R}}\) that \(\log {{\mathbb {E}}}(\exp (\lambda Z)) \leqslant 9\lambda ^2/4\), from which
\(\square \)
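A concrete instance of Lemma 4 (our choice: \(X \sim N(0,1) \in {\mathscr {G}}(1)\) and Y uniform on \([-1, 1]\), independent of X, so that \(\kappa = \nu = 1\)) can be checked by integrating out X exactly and averaging over Y:

```python
import numpy as np

rng = np.random.default_rng(5)
# X ~ N(0,1) is in G(1); Y ~ Uniform(-1,1), independent of X, is bounded by kappa = 1.
# Lemma 4 then asserts XY is in G(9/2), i.e. log E exp(lam * X * Y) <= (9/4) lam^2.
y = rng.uniform(-1.0, 1.0, size=100_000)

def log_mgf(lam):
    # Integrate out X exactly: E[exp(lam * X * Y) | Y] = exp(lam^2 Y^2 / 2).
    return np.log(np.mean(np.exp(lam**2 * y**2 / 2.0)))

vals = {lam: log_mgf(lam) for lam in (0.5, 1.0, 2.0, 4.0)}
print(vals)
```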
Lemma 5
(Upper bound for norm-subGaussian random vector) Let X be a d-dimensional random vector with zero mean and such that \({{\mathbb {P}}}(\left\Vert {\textstyle X}\right\Vert _2 \geqslant t) \leqslant 2 \exp \left( -t^2/(2\sigma ^2) \right) \) for all \(t \geqslant 0\). Then the random matrix Y defined by
satisfies \({{\mathbb {E}}}(\exp (\theta Y)) \preceq \exp (c\theta ^2\sigma ^2) I\) for any \(\theta \in {\mathbb {R}}\), with \(c = 9/4\), where I denotes the identity matrix.
Proof
The non-zero eigenvalues of Y are \(\left\Vert {\textstyle X}\right\Vert \) and \(-\left\Vert {\textstyle X}\right\Vert \). The non-zero eigenvalues of \(Y^k\) are thus \(\left\Vert {\textstyle X}\right\Vert ^k\) and \((-\left\Vert {\textstyle X}\right\Vert )^k\) for integer \(k \geqslant 1\). It follows that \(Y^k \preceq \left\Vert {\textstyle X}\right\Vert ^k I\) for all integer \(k \geqslant 1\), and therefore also \({{\mathbb {E}}}(Y^k) \preceq {{\mathbb {E}}}(\left\Vert {\textstyle X}\right\Vert ^k) I\) for all integer \(k \geqslant 1\). Furthermore, the operator norm of \(Y^k\) is bounded by \(\left\Vert {\textstyle Y^k}\right\Vert \leqslant \left\Vert {\textstyle X}\right\Vert ^k\).
Since \({{\mathbb {E}}}(Y) = 0\), we get, for any \(\theta \in {\mathbb {R}}\),
where \(\xi = \left\Vert {\textstyle X}\right\Vert /\sigma \). The first series converges in operator norm since \(\left\Vert {\textstyle {{\mathbb {E}}}(Y^k) }\right\Vert \leqslant {{\mathbb {E}}}(\left\Vert {\textstyle Y^k}\right\Vert ) \leqslant {{\mathbb {E}}}(\left\Vert {\textstyle X}\right\Vert ^k)\).
By assumption, \({{\mathbb {P}}}(\xi > t) = {{\mathbb {P}}}(\left\Vert {\textstyle X}\right\Vert \geqslant \sigma t) \leqslant 2e^{-t^2/2}\) for all \(t \geqslant 0\) and thus \({{\mathbb {E}}}(|\xi |^{2p}) \leqslant 2^{p+1} p!\) for all integer \(p \geqslant 1\). But then we can apply Lemma 3 with \(\lambda = \theta \sigma \) and \(X = \xi \), completing the proof. \(\square \)
The following result is a special case of Jin et al. (2019, Corollary 7). Our contribution is to make the constant c in the cited result explicit. In passing, we correct an inaccuracy in the proof of Jin et al. (2019, Lemma 4), in which it was incorrectly claimed that the odd moments of a certain random matrix Y as in our Lemma 5 are all zero.
Lemma 6
(Hoeffding inequality for norm-subGaussian random vectors) Let the d-dimensional random vectors \(Z_1,\ldots ,Z_n\) be independent, have mean zero, and satisfy
for some \(\sigma > 0\). Then for any \(\delta > 0\), with probability at least \(1-\delta \), we have
Proof
Given Corollary 7 in Jin et al. (2019), the only thing to prove is that their constant can be set equal to 3. Their Corollary 7 follows from their Lemma 6 in which it is shown that when the matrix Y defined in (20) satisfies
then we have for any \(\theta >0\), with probability at least \((1-\delta )\),
Taking \(\theta = \sqrt{\log (2d/\delta )/(cn\sigma ^2)}\) yields
and we conclude with Lemma 5 (\(c = 9/4\), so that \(2\sqrt{c} = 3\)). \(\square \)
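Lemma 6 with the explicit constant 3 can be illustrated on a simple norm-subGaussian family of our own choosing: unit vectors uniform on the sphere satisfy the tail condition with \(\sigma ^2 = 1/(2\log 2)\), since \(2\exp (-t^2/(2\sigma ^2)) \geqslant 1\) for \(t \leqslant 1\) and the norm never exceeds 1. A simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, delta = 100, 3, 0.05
sigma2 = 1.0 / (2 * np.log(2.0))  # ||Z_i|| = 1 a.s. satisfies the tail bound with this sigma^2
bound = 3 * np.sqrt(n * sigma2 * np.log(2 * d / delta))  # Lemma 6 with c = 3

reps = 2000
hits = 0
for _ in range(reps):
    Z = rng.normal(size=(n, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)  # uniform on the unit sphere, mean zero
    hits += int(np.linalg.norm(Z.sum(axis=0)) <= bound)
coverage = hits / reps
print(coverage)
```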
Proof of Theorem 1
The proof is organized as follows. We first provide an upper bound on the error (Step 1). This bound involves the norm of the error made on the rescaled coefficients and is controlled in Step 2. Then (Step 3), we construct an event that has probability at least \(1-\delta \) on which we can control the terms that appear in the upper bound of Step 2. Collecting all the inequalities, we will arrive at the stated bound (Step 4).
Step 1. — Since \(f = P(f) + \beta ^\star (f)^T h + \epsilon \), the oracle estimate of P(f), which uses the unknown, optimal coefficient vector \(\beta ^\star (f)\), is
The difference between the OLS and oracle estimates is
Let \(G = P(h h^\top )\) be the \(m \times m\) Gram matrix. By assumption, G is positive definite. Write
The estimation error of the OLS estimator can thus be decomposed as
By the triangle and Cauchy–Schwarz inequalities,
Step 2. — We will show that, if \(\lambda _{\min }(P_n(\hbar \hbar ^\top )) > \left\Vert {\textstyle P_n(\hbar )}\right\Vert _2^2\), then
and thus, by (22),
Step 2.1. — Consider the column-centered \(n \times m\) design matrices
Since \({\bar{H}}^\top \mathbb {1}_n = n P_n(\hbar )\), we have
As a consequence, for \(u \in {\mathbb {R}}^m\),
by the Cauchy–Schwarz inequality. In particular, \(u^\top {\bar{H}}_c^\top {\bar{H}}_c u\) is non-zero for non-zero \(u \in {\mathbb {R}}^m\), so that \({\bar{H}}_c^\top {\bar{H}}_c\) is invertible, and so is the matrix
Also, the smallest eigenvalue of \({\bar{H}}_c^\top {\bar{H}}_c\) is bounded from below by
The largest eigenvalue of the inverse matrix \(({\bar{H}}_c^\top {\bar{H}}_c)^{-1}\) is then bounded from above by
Step 2.2. — Write \(\epsilon _c^{(n)} = (\epsilon (X_i) - P_n(\epsilon ))_{i=1}^n\) for the centered vector of error terms. Recall \(f_c^{(n)} = (f(X_i) - P_n(f))_{i=1}^n\), the centered vector of samples from the integrand. As \(f = P(f) + h^\top \beta ^\star (f) + \epsilon \), we have
From the characterization (4) of the OLS estimate of the coefficient vector and since \(H_c^\top H_c\) is invertible,
We obtain
Step 2.3. — We combine the results from Steps 2.1 and 2.2. From the upper bound (25) and the identity (26), we obtain
Finally, as \({\bar{H}}_c = (\hbar _j(X_i) - P_n(\hbar _j))_{i,j}\), we find
Equation (23) follows.
Step 3. — In view of (24), we need to ensure that \(\left|{\textstyle P_n(\epsilon )}\right|\), \(\left\Vert {\textstyle P_n(\hbar )}\right\Vert \) and \(\left\Vert {\textstyle P_n(\hbar \epsilon )}\right\Vert _2\) are small and that \(\lambda _{\min }(P_n(\hbar \hbar ^\top ))\) is large. Let \(\delta >0\). We construct an event with probability at least \(1-\delta \) on which four inequalities hold simultaneously. Recall \(B = \sup _{x \in {\mathscr {X}}} \left\Vert {\textstyle \hbar (x)}\right\Vert _2^2\), defined in (7).
Step 3.1. — Because \(\epsilon \in {\mathscr {G}}(\tau ^2)\), Chernoff’s inequality (or Lemma 1 with \(p = 1\)) implies that with probability at least \(1 - \delta /4\),
Step 3.2. — For the term \(\left\Vert {\textstyle \sum _{i=1}^n \hbar (X_i)}\right\Vert _2\), we apply the vector Bernstein bound in (Hsu et al. 2014, Lemma 9). On the one hand \(\sup _{x \in {\mathscr {X}}} \left\Vert {\textstyle \hbar (x)}\right\Vert _2 \leqslant \sqrt{B}\) and on the other hand
The cited vector Bernstein bound gives
Setting \(t = \log (4/\delta )\), we find that, with probability at least \(1-\delta /4\), we have
Since \(\log (4/\delta ) \geqslant \log (4)\), we have
and thus
The condition on n easily implies that
and thus
Step 3.3. — To control \(\left\Vert {\textstyle \sum _{i=1}^n \hbar (X_i) \epsilon (X_i)}\right\Vert _2\), we apply Lemma 6 with \(Z_i = \hbar (X_i) \epsilon (X_i)\). The random vectors \(\hbar (X_i) \epsilon (X_i)\) for \(i = 1, \ldots , n\) are independent and identically distributed and have mean zero. Since \(\left\Vert {\textstyle \hbar (X_i)}\right\Vert _2 \leqslant \sqrt{B}\) by (7) and since \(\epsilon \in {\mathscr {G}}(\tau ^2)\) by Assumption 1, we have, for all \(t > 0\),
and (21) holds with \(\sigma ^2 = B \tau ^2\). Lemma 6, with constant \(c = 3\), then implies that, with probability at least \(1-\delta /4\),
Step 3.4. — Recall the \(n \times m\) matrix \(H = (h_j(X_i))_{i,j}\) and put
The empirical Gram matrix of the vector \(\hbar = (\hbar _1, \ldots , \hbar _m)^\top \in L_2(P)^m\) based on the sample \(X_1, \ldots , X_n\) is
We apply Lemma 2 with \(g = {\tilde{g}} = \hbar \), \(p = m\), and \(\delta \) replaced by \(\delta /4\). We find that, with probability at least \(1-\delta /4\),
Since \(P_n(\hbar \hbar ^\top ) = n^{-1} {\bar{H}}^\top {\bar{H}}\), it follows that
as the assumption on n implies that \(2 B n^{-1} \log (4m/\delta ) \leqslant 1/9\).
By the union bound, the inequalities (27), (28), (29), and (30) hold simultaneously on an event with probability at least \(1-\delta \). For the remainder of the proof, we work on this event, denoted by E.
Step 4. — We combine the bound (24) on the estimation error with the bounds valid on the event E constructed in Step 3. By (31), we have
since the assumption on n implies that \(25 m n^{-1} \log (4/\delta ) \leqslant 1/3\). As \(B \geqslant m \geqslant 1\), we have
since, by assumption, \(n \geqslant 75m \log (4/\delta )\) which implies that \(\sqrt{n^{-1} \log (4/\delta )} \leqslant 1/(5\sqrt{3})\). We find
and the value \(c=3\) gives \(15 (c+\sqrt{2/3}) \approx 57.2 <58\) which is the bound stated in Theorem 1. \(\square \)
Proof of Theorem 2
For a vector \(\beta \in {\mathbb {R}}^m\) and for a non-empty set \(S \subset \{1, \ldots , m\}\), write \(\beta _S = (\beta _k)_{k \in S}\). For any matrix \(A \in {\mathbb {R}}^{n\times m}\) and \(k\in \{1,\ldots , m\}\), let \(A_k \) denote its k-th column and if \(S = \{k_1,\ldots ,k_\ell \} \subset \{1,\ldots ,m\}\) with \(k_1< \ldots < k_\ell \), write \(A_S = (A_{k_1},\ldots ,A_{ k_\ell }) \in {\mathbb {R}}^{n \times \ell }\).
The proof is organized in a similar way as the one of Theorem 1. We first provide an initial upper bound on the error (Step 1). Then we construct an event that (Step 2) has probability at least \(1-\delta \) and (Steps 3, 4, 5) on which we can control each of the terms of the previous upper bound. The combination of all steps to deduce the final statement is made clear in Step 6.
Step 1. — As in the proof of Theorem 1, with \({\hat{\beta }}_{n}^{\mathrm {ols}}(f)\) replaced by \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\), the estimation error of the LASSO estimator can be decomposed as
Writing \({\hat{u}} = {\hat{\beta }}_{n}^{\mathrm {lasso}}(f) - \beta ^\star (f)\), we get, by the triangle and Hölder inequalities,
Step 2. — Let \(\delta > 0\). We construct an event, E, with probability at least \(1-\delta \) on which four inequalities, namely (33), (34), (35) and (36), hold simultaneously.
Since \(\epsilon \in {\mathscr {G}}( \tau ^2)\), we can apply Lemma 1 with \(p=1\) to get that, with probability at least \(1 - \delta /4\),
$$\begin{aligned} \left| \sum _{i=1}^n \epsilon (X_i) \right| \leqslant \sqrt{2n \tau ^2\log (8/\delta )}. \end{aligned}$$(33)
In view of (Boucheron et al. 2013, Lemma 2.2) and Assumption 2, we have \(h_k\in {\mathscr {G}}(U_h^2)\) for all \(k=1,\ldots , m\). Hence we can apply Lemma 1 with \(p=m\) to get that, with probability at least \(1-\delta /4\),
$$\begin{aligned} \max _{k=1,\ldots , m} \left| \sum _{i=1}^n h_k(X_i) \right| \leqslant \sqrt{2n U_h^2\log (8m/\delta )}. \end{aligned}$$(34)
By virtue of Assumptions 1 and 2, we can apply Lemma 4 to find \(h_k \epsilon \in {\mathscr {G}}(C \tau ^2 U_h^2)\) with \(C = 9/2\). Hence we can apply Lemma 1 to get that, with probability at least \(1-\delta /4\),
$$\begin{aligned} \max _{k=1,\ldots , m} \left| \sum _{i=1}^n h_k(X_i) \epsilon (X_i) \right| \leqslant \sqrt{2nC\tau ^2 U_h^2\log (8m/\delta )}. \end{aligned}$$(35)
In view of (Boucheron et al. 2013, Lemma 2.2) and Assumptions 2 and 6, we have \(h_{k} h_{l} - P (h_k h_l) \in {\mathscr {G}}(U_h^4)\) for all \(k,l \in \{1,\ldots , m \}\). Hence we can apply Lemma 1 with \(p=m^2\) to get that, with probability at least \(1-\delta /4\),
$$\begin{aligned} \max _{\begin{array}{c} 1\leqslant k \leqslant m \\ 1\leqslant l \leqslant m \end{array}} \left| \sum _{i=1}^n \{ h_k(X_i)h_l(X_i) - P (h_k h_l)\} \right| \leqslant \sqrt{2n U_h^4\log (8m^2/\delta )}. \end{aligned}$$Write \(\varDelta = (P_n- P) \{h h^T\} \). Because by assumption \(2 (\ell ^\star / \gamma ^\star ) \sqrt{2 U_h^4\log (8m^2/\delta )} \leqslant \sqrt{n}\), we have
$$\begin{aligned} (\ell ^\star / \gamma ^\star ) \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\leqslant 1 / 2. \end{aligned}$$Remark that
$$\begin{aligned} \forall u\in {\mathbb {R}}^{m}, \quad n^{-1} \left\Vert {\textstyle Hu}\right\Vert _2^2 - u^TG u&= u^T \varDelta u. \end{aligned}$$Then, following (Bickel et al. 2009, equation (3.3)), use the inequality \( |u^T \varDelta u|\leqslant \left\Vert {\textstyle u}\right\Vert _1^2 \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\), to obtain that, with probability \(1-\delta /4\), for all \(u\in {\mathscr {C}} (S^\star ; 3)\),
$$\begin{aligned} \left\Vert {\textstyle Hu}\right\Vert _2^2 / n&\geqslant u^TG u - \left\Vert {\textstyle u}\right\Vert _1^2 \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\\&\geqslant u^TG u - \left\Vert {\textstyle u}\right\Vert _2^2 \ell ^\star \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\\&\geqslant u^TG u - (u^TG u) (\ell ^\star / \gamma ^\star ) \max _ {1\leqslant k,l \leqslant m} |\varDelta _{k,l}|\\&\geqslant (u^TG u ) / 2. \end{aligned}$$It follows that with probability at least \(1-\delta /4\),
$$\begin{aligned} \left\Vert {\textstyle H u}\right\Vert _2^2 \geqslant ( n\gamma ^\star /2)\left\Vert {\textstyle u}\right\Vert _2^2 . \end{aligned}$$(36)
Step 3. — We claim that, on the event E, we have
We have
and thus,
We treat both terms on the right-hand side. On the one hand, we just have obtained a lower bound for the first term. On the other hand, in view of (34) and because \(\left\Vert {\textstyle u}\right\Vert _1^2\leqslant 16 \left\Vert {\textstyle u_{S^\star }}\right\Vert _1^2\leqslant 16 \ell ^\star \left\Vert {\textstyle u}\right\Vert _2^2\), we have
as \(n \geqslant (16 \times 8) \ell ^\star (U_h^{2}/\gamma ^\star ) \log (8m/\delta )\) by assumption. In combination with (36), we find
Step 4. — We claim that, on the event E, we have
Indeed, on the left-hand side in (38) we have in virtue of (33), (34) and (35),
Since \(\ell ^\star \geqslant 1\) and \(\ell ^\star U_h^{2} \geqslant \sum _{k\in S^\star } P(h_k^2) \geqslant \gamma ^\star \), the assumed lower bound on n implies that \(n \geqslant 128 \log (8/\delta )\). As \(C = 9/2\), the factor \(\sqrt{2C}(1 + \sqrt{2 \log (8/\delta ) / (Cn)})\) is bounded by \(3 + \sqrt{2}/8\) and we get (38).
Step 5. — Recall \({\hat{u}} = {\hat{\beta }}_{n}^{\mathrm {lasso}}(f) - \beta ^{\star }(f)\). We claim that, on the event E, we have
To prove this result, we shall rely on the following lemma.
Lemma 7
If \(n \lambda \geqslant 2 \left\Vert {\textstyle H_c^T \epsilon _c^{(n)}}\right\Vert _{\infty }\) then, writing \({\hat{u}} = {\hat{\beta }}_{n}^{\mathrm {lasso}}(f) - \beta ^{\star }(f)\), we have \({\hat{u}} \in {\mathscr {C}}(S^\star ; 3)\) and
Proof
This is just a reformulation of the reasoning on p. 298 in (Tibshirani et al. 2015) with a slightly sharper upper bound. The vector \({\hat{\nu }}\) on the right-hand side of their Eq. (11.23) can be replaced by \({\hat{\nu }}_S\). For the sake of completeness, we provide the details.
In the proof we use the short-cuts \(\beta ^\star = \beta ^\star (f)\) and \({\hat{\beta }}_{n}^{\mathrm {lasso}}= {\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\). Recall \(\epsilon _c^{(n)} = f_c^{(n)} - H_c \beta ^\star (f)\) and define
Because \(G({{\hat{u}}} )\leqslant G(0)\), we have
From the triangle inequality
implying that
From Hölder’s inequality, we get
which leads to
Consequently, because \(\left\Vert {\textstyle H_c^T\epsilon _c^{(n)}}\right\Vert _\infty / n \leqslant \lambda /2\) by assumption, we obtain
The right-hand side must be nonnegative, whence \(\left\Vert {\textstyle {\hat{u}}_{\overline{S^\star }}}\right\Vert _1 \leqslant 3 \left\Vert {\textstyle {\hat{u}}_{S^\star }}\right\Vert _1\), i.e., \({\hat{u}} \in {\mathscr {C}}(S^\star ; 3)\). The bound in (40) follows as well. \(\square \)
On the event E, the conclusion of Lemma 7 is valid because the bound on \(\left\Vert {\textstyle H_c^T \epsilon _c^{(n)}}\right\Vert _\infty \) in (38) and the assumption on \(\lambda \) in Theorem 2 together imply that \(\lambda \geqslant 2 \left\Vert {\textstyle H_c^T \epsilon _c^{(n)}}\right\Vert _\infty / n\). The cone property of Lemma 7 yields \({\hat{u}} \in {\mathscr {C}}(S^\star ; 3)\) so that
Thanks to (37) and Lemma 7, and since \(\left|{\textstyle S^\star }\right| = \ell ^\star \), we get
It follows that \(\left\Vert {\textstyle {\hat{u}}_{S^\star }}\right\Vert _1 \leqslant 12 \ell ^\star \lambda /\gamma ^\star \). In combination with (41), we find (39).
Step 6. — Equation (32) gave a bound on the estimation error involving three terms. On the event E, these terms were shown to be bounded in (33), (34), and (39). It follows that, on E, we finally have
Divide by n and use \(48\sqrt{2} < 68\) to obtain (9). \(\square \)
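The estimator analyzed in Theorem 2 can be sketched end to end. The integrand, the polynomial control variate basis, the penalty level and the solver (ISTA, standing in for any Lasso solver) below are all hypothetical illustrative choices of ours, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2000, 40
truth = np.e - 1.0  # integral of exp over [0, 1]

# Hypothetical setup: integrand exp(x) on [0, 1] with centered polynomial
# controls h_k(x) = x^k - 1/(k+1), so that P(h_k) = 0.
x = rng.uniform(size=n)
f = np.exp(x)
H = np.column_stack([x**k - 1.0 / (k + 1) for k in range(1, m + 1)])

# Centered quantities, matching the notation H_c and f_c of the proofs.
Hc = H - H.mean(axis=0)
fc = f - f.mean()

# Lasso coefficients by ISTA (proximal gradient), a stand-in for any solver;
# lam plays the role of the penalty level in (5).
lam = 0.01
step = 1.0 / np.linalg.eigvalsh(Hc.T @ Hc / n).max()
beta = np.zeros(m)
for _ in range(5000):
    z = beta - step * (Hc.T @ (Hc @ beta - fc)) / n
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding

# LASSO control-variate estimate of P(f) = e - 1.
est = np.mean(f - H @ beta)
print(abs(f.mean() - truth), abs(est - truth))
```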
Proof of Theorem 3
Recall that \(S^\star = \{ j = 1, \ldots , m : \beta _j^\star (f) \ne 0\}\) with \(\ell ^\star = \left|{\textstyle S^\star }\right|\) and that \(\overline{S^\star } = \{1,\ldots ,m\} \setminus S^\star \). Further, \(H_{c,S^\star }\) is the \(n \times \ell ^\star \) matrix having columns \(H_{c,k}\) for \(k \in S^\star \), where \(H_{c,k}\) is the k-th column of \(H_c\).
Step 1. — We first establish some (non-probabilistic) properties of \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\). To this end, we consider the linear regression of the non-active control variates on the active ones: for \(k \in \overline{S^\star } = \{ j = 1, \ldots , m : \beta _j^\star (f) = 0 \}\), this produces the coefficient vector
Further, we consider the OLS oracle estimate \({\hat{\beta }}_{n}^{\star }\), which is the OLS estimator based upon the active control variates only, i.e.,
Our assumptions will imply that, with large probability, \(H_{c,S^{\star }}\) has rank \(\ell ^\star \), in which case
The following lemma provides a number of (non-probabilistic) properties of \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\), given certain conditions on \(H_c\) and \(\epsilon _c^{(n)}\). Recall that a norm \(\left\Vert {\textstyle \,\cdot \,}\right\Vert \) on \({\mathbb {R}}^p\) induces a matrix norm on \({\mathbb {R}}^{p \times p}\) via \(\left\Vert {\textstyle A}\right\Vert = \sup \{ \left\Vert {\textstyle A u}\right\Vert : u \in {\mathbb {R}}^p, \left\Vert {\textstyle u}\right\Vert = 1 \}\) for \(A \in {\mathbb {R}}^{p \times p}\).
Lemma 8
If \(H_{c,S^{\star }}\) has rank \(\ell ^\star \) and if there exists \(\kappa \in (0, 1]\) such that
then the minimizer \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\) in (5) is unique, with support \({\text {supp}}({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)) \subset S^\star \), and it satisfies
Proof
The proof of this result is contained in Tibshirani et al. (2015). The uniqueness of the LASSO solution and the property that it does not select inactive covariates follow directly from the proof of their Theorem 11.3. The only difference is that, in our case, the inequality (43) is an assumption, whereas in Tibshirani et al. (2015) it is a property of the Gaussian fixed design model. The approach in Tibshirani et al. (2015) is based upon checking the strict dual feasibility condition. The bound (44) is Eq. (11.37) in Tibshirani et al. (2015). \(\square \)
We slightly modify Lemma 8 to make the conditions (42) and (43) easier to check and to make the bound (44) easier to use.
Lemma 9
If there exists \(\nu > 0\) such that
and if there exists \(\kappa \in (0,1]\) such that
then the minimizer \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\) in (5) is unique, with support satisfying \({\text {supp}}({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)) \subset S^\star \), and it holds that
Proof
By (45), the smallest eigenvalue of the \(\ell ^\star \times \ell ^\star \) matrix \(H_{c,S^\star }^T H_{c,S^\star }\) is positive, so that it is invertible and \(H_{c,S^\star }\) has rank \(\ell ^\star \).
We show that (46) implies (42). For each \(k \in \overline{S^\star }\), the vector \({\hat{\theta }}_n^{(k)}\) has \(\ell ^\star \) components, so that
Because \({{\hat{\theta }}}_{n}^{(k)} \) is an OLS estimate, and because the largest eigenvalue of \((H_{c,S^{\star }}^T H_{c,S^{\star }})^{-1}\) is bounded from above by \((n \nu )^{-1}\), we obtain
Since \(\left\Vert {\textstyle x}\right\Vert _2 \leqslant \sqrt{m} \, \left\Vert {\textstyle x}\right\Vert _\infty \) for \(x \in {\mathbb {R}}^m\), we can conclude that
Combining the two bounds, we find that (46) indeed implies (42).
Next we show that (47) implies (43). For \(k \in \overline{S^\star }\), we have
Using (42) and (47) we deduce (43).
The conditions of Lemma 8 have been verified, and so its conclusion holds. We simplify the two terms in the upper bound (44). First, we use that
Second, for any matrix \(A \in {\mathbb {R}}^{p\times p}\), we have \(\left\Vert {\textstyle A}\right\Vert _\infty \leqslant \sqrt{p} \left\Vert {\textstyle A}\right\Vert _2\) (see, e.g., Horn and Johnson 2012, page 365), and we apply this to \((H_{c,S^{\star }}^T H_{c,S^{\star }})^{-1}\). In this way, the upper bound in (44) is dominated by
since the largest eigenvalue of \((H_{c,S^{\star }}^T H_{c,S^{\star }})^{-1}\) is at most \((n \nu )^{-1}\). Use (47) to further simplify the right-hand side, yielding (48). \(\square \)
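The two linear-algebra facts invoked in this proof, namely \(\left\Vert A\right\Vert _\infty \leqslant \sqrt{p}\, \left\Vert A\right\Vert _2\) and the identity \(\left\Vert (H^T H)^{-1}\right\Vert _2 = 1/\lambda _{\min }(H^T H)\) for a positive definite Gram matrix, can be sanity-checked numerically. The random matrix below is an arbitrary illustration, not tied to the model of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
M = rng.standard_normal((40, p))
A = M.T @ M                                 # random Gram matrix, a.s. positive definite
A_inv = np.linalg.inv(A)

norm_inf = np.abs(A_inv).sum(axis=1).max()  # induced infinity-norm: max absolute row sum
norm_2 = np.linalg.norm(A_inv, 2)           # spectral norm: largest singular value
lam_min = np.linalg.eigvalsh(A).min()       # smallest eigenvalue of A
```

For symmetric positive definite \(A\), the spectral norm of \(A^{-1}\) equals the reciprocal of the smallest eigenvalue of \(A\), which is exactly how the bound \((n\nu )^{-1}\) enters the proof.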
Step 2. — Let \(\delta \in (0, 1)\) and \(n = 1, 2, \ldots \). In a similar way as in the proof of Theorem 1, we construct an event of probability at least \(1-\delta \). This time, we need five inequalities to hold simultaneously.
-
Because \(\epsilon \in {\mathscr {G}}( \tau ^2)\), with probability at least \(1-\delta /5\),
$$\begin{aligned} \left| \sum _{i=1}^n \epsilon (X_i) \right| \leqslant \sqrt{2n \tau ^2 \log (10/\delta )}. \end{aligned}$$(49)
-
In view of (Boucheron et al. 2013, Lemma 2.2) and Assumption 2, we have \(h_k\in {\mathscr {G}}(U_h^2)\) for all \(k=1,\ldots , m\). Hence we can apply Lemma 1 with \(p=m\) to get that, with probability at least \(1-\delta /5\),
$$\begin{aligned} \max _{k=1,\ldots , m} \left| \sum _{i=1}^n h_k(X_i) \right| \leqslant \sqrt{2n U_h^2\log (10m/\delta )}. \end{aligned}$$(50)
-
By virtue of Assumptions 1 and 2, we can apply Lemma 4 to have \(h_k \epsilon \in {\mathscr {G}}(C U_h^2 \tau ^2 )\), where \(C = 9/2\). Hence we can apply Lemma 1 to get that, with probability at least \(1-\delta /5\),
$$\begin{aligned} \max _{k=1,\ldots , m} \left| \sum _{i=1}^n h_k(X_i) \epsilon (X_i) \right| \leqslant \sqrt{2C n \tau ^2 U_h^2\log (10m/\delta )}. \end{aligned}$$(51)
-
Recall that \(B^\star = \sup _{x \in {\mathscr {X}}} h_{S^\star }^T(x) G_{S^\star }^{-1} h_{S^\star }(x)\), so that
$$\begin{aligned} B^\star \leqslant \lambda _{\max }(G_{S^\star }^{-1}) \sup _{x \in {\mathscr {X}}} h_{S^\star }^T(x) h_{S^\star }(x) \leqslant \ell ^\star U_h^2 / \gamma ^{\star \star }. \end{aligned}$$
The assumption on n easily implies that \(n \geqslant 8 B^{\star } \log (5\ell ^\star /\delta )\). Applying Lemma 2 with \(p=\ell ^\star \), \(g = h_{S^\star }\), and \(\delta \) replaced by \(\delta /5\), we find that, with probability at least \(1-\delta /5\),
$$\begin{aligned} \left\Vert {\textstyle H_{S^{\star }} u}\right\Vert _2^2 \geqslant n\gamma ^{\star \star } \left\Vert {\textstyle u}\right\Vert ^2_2/2, \qquad \forall u\in {\mathbb {R}}^{\ell ^\star }. \end{aligned}$$(52)
-
Finally, because \(\left|{\textstyle h_j(x)}\right| \leqslant U_h\) for all \(x \in {\mathscr {X}}\) and \(j \in \{1,\ldots ,m\}\) and because \(P(h_k h_j) = 0\) for all \((k, j) \in \overline{S^\star } \times S^\star \), we have \(h_kh_j \in {\mathscr {G}}(U_h^4)\) for such k and j, and thus, with probability at least \(1-\delta /5\),
$$\begin{aligned} \max _{k\in \overline{S^\star }} \max _{j\in S^{\star }} \left| \sum _{i=1}^n h_k(X_i) h_j(X_i) \right| \leqslant \sqrt{2nU_h^4 \log (10 \ell ^\star m /\delta ) }. \end{aligned}$$(53)
By the union bound, the event, say E, on which (49), (50), (51), (52) and (53) are satisfied simultaneously has probability at least \(1-\delta \). We work on the event E for the rest of the proof.
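Maximal inequalities such as (50) can be illustrated empirically. The sketch below uses hypothetical control variates \(h_k(x) = \cos (2\pi k x)\), which are bounded by \(U_h = 1\) and have mean zero under the uniform distribution on \((0,1)\); over many replications, the bound in (50) should be violated with frequency below \(\delta /5\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, U_h, delta, reps = 400, 30, 1.0, 0.1, 500
# Right-hand side of (50):
thresh = np.sqrt(2 * n * U_h**2 * np.log(10 * m / delta))

k = np.arange(1, m + 1)
X = rng.uniform(size=(reps, n))
# sums[r, j] = sum_i h_{j+1}(X_i) in replication r
sums = np.cos(2 * np.pi * X[:, :, None] * k).sum(axis=1)
violation_rate = (np.abs(sums).max(axis=1) > thresh).mean()
```

In practice the empirical violation rate is essentially zero here, well below \(\delta /5 = 0.02\), reflecting the slack in sub-Gaussian tail bounds.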
Step 3. — On the event E, we have
where \(\alpha \in (0, 1/2)\) is an absolute constant whose value will be fixed in Step 6(ii). We have
and thus, by the Cauchy–Schwarz inequality and by (52),
In view of (50), we have
We thus get
A sufficient condition for (54) is thus that the term in square brackets is at least \(\alpha \), i.e.,
Since \(\ell ^\star \geqslant 1\) and \(U_{h}^2 \geqslant \gamma ^{\star \star }\), a condition of the form
is thus sufficient, with much to spare, provided \(\rho > 2/(1/2-\alpha )\). In Step 6(ii), we will choose \(\alpha \) in such a way that the constant \(\rho = 70\) appearing in the statement of the theorem is sufficient.
Step 4. — On the event E, we have
Indeed, writing \(A = \overline{S^\star } \times S^\star \), by virtue of (50) and (53), the left-hand side is bounded by
which is (56).
Step 5. — On the event E, we have
The proof is the same as that of the first part of (38).
Step 6. — We will verify that on the event E, the three assumptions of Lemma 9 are satisfied with \(\kappa = 1/2\) and \(\nu = \alpha \gamma ^{\star \star }\), with \(\alpha \) as in Step 3.
-
(i)
Eq. (45) with \(\nu = \alpha \gamma ^{\star \star }\) is just (54).
-
(ii)
Eq. (46) with \(\nu = \alpha \gamma ^{\star \star }\) and \(\kappa = 1/2\) follows from (56) provided we have
$$\begin{aligned}&\frac{\ell ^\star }{\alpha \gamma ^{\star \star } n} \left( \sqrt{2nU_h^4 \log (10 \ell ^\star m /\delta ) } + 2 U_h^2\log (10m/\delta ) \right) \\&\quad \leqslant 1 - \frac{1}{2}. \end{aligned}$$
To check whether this is satisfied, we will make use of the elementary inequality (see footnote 2)
$$\begin{aligned} \forall (a,b,c) \in (0, \infty )^3, \, \forall x \geqslant \sqrt{b^2 + 4ac}/a, \quad a x^2 \geqslant bx + c. \end{aligned}$$with \(x = \sqrt{n}\) and
$$\begin{aligned} a= & {} \alpha \gamma ^{\star \star } / (2\ell ^\star ), \\ b= & {} \sqrt{2 U_h^4 \log (10 \ell ^\star m / \delta )}, \\ c= & {} 2U_h^2 \log (10m/\delta ). \end{aligned}$$
It suffices that \(n = x^2\) is bounded from below by \((b^2 + 4ac)/a^2 = (b/a)^2 + 4c/a\), which is
$$\begin{aligned}&\frac{2 U_h^4 \log (10 \ell ^\star m / \delta )}{(\alpha \gamma ^{\star \star } / (2\ell ^\star ))^2} + 4 \frac{2U_h^2 \log (10m/\delta )}{\alpha \gamma ^{\star \star } / (2\ell ^\star )} \\&\quad =\frac{8}{\alpha ^2} \log (10 \ell ^\star m/\delta ) \left( \frac{\ell ^\star U_h^2}{ \gamma ^{\star \star }}\right) ^2\\&+\frac{16}{\alpha } \log (10 m/\delta ) \left( \frac{\ell ^\star U_h^2}{ \gamma ^{\star \star }}\right) . \end{aligned}$$But \(\ell ^\star \geqslant 1\) and \(\gamma ^{\star \star } \leqslant (1/\ell ^\star ) \sum _{j \in S^\star } P(h_j^2) \leqslant U_h^2\), so that a sufficient condition is that
$$\begin{aligned} n \geqslant \left( \frac{8}{\alpha ^2} + \frac{16}{\alpha }\right) \log (10 \ell ^\star m/\delta ) [\ell ^\star (U_h^2 / \gamma ^{\star \star })]^2. \end{aligned}$$The constant \(\rho \) in (55) must thus be such that
$$\begin{aligned} \rho \geqslant \max \left( \frac{2}{1/2 - \alpha }, \frac{8}{\alpha ^2} + \frac{16}{\alpha } \right) . \end{aligned}$$The minimum of the right-hand side as a function of \(\alpha \in (0, 1/2)\) occurs at \(\alpha = \sqrt{2}/3\) and is equal to \(2/(1/2 - \sqrt{2}/3) \approx 69.9\). Taking \(\rho = 70\) as in the assumption on n is thus sufficient.
-
(iii)
Eq. (47) with \(\kappa = 1/2\) follows from (57), since
$$\begin{aligned} \sqrt{n \tau ^2 U_h^2\log (10m/\delta )} \left( \sqrt{2C} + 2\sqrt{ \log (10/\delta ) / n} \right) \leqslant \lambda n /4 \end{aligned}$$by the assumed lower bound on \(\lambda \). Indeed, since \(\ell ^\star \geqslant 1\) and \(U_{h}^{2} \geqslant \gamma ^{\star \star }\), the assumed lower bounds on n imply that \(n \geqslant 70 \log (10/\delta )\), so that \( 2\sqrt{ \log (10/\delta ) / n}\) is bounded by \(2/\sqrt{70}\); recall \(C = 9/2\). Since \(4 \cdot (3 + 2/\sqrt{70}) \approx 12.9\), the assumed lower bound for \(\lambda \) suffices.
Step 7. — By the previous step, the conclusions of Lemma 9 with \(\kappa = 1/2\) and \(\nu = \alpha \gamma ^{\star \star }\) hold on the event E, where \(\alpha = \sqrt{2}/3\) was specified in Step 6(ii). The minimizer \({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)\) in (5) is thus unique and we have \({\text {supp}}({\hat{\beta }}_{n}^{\mathrm {lasso}}(f)) \subset S^\star \).
To show the reverse inclusion, we need to verify that \(\left|{\textstyle {\hat{\beta }}_{n,k}^{\mathrm {lasso}}(f)}\right| > 0\) for all \(k \in S^\star \). To this end, we apply (48) with \(\kappa = 1/2\) and \(\nu = \alpha \gamma ^{\star \star }\), which becomes
For any \(k\in S^\star \), we thus have
But for \(\alpha = \sqrt{2}/3\), we have \(5/(4\alpha ) \approx 2.65\). Since \(\min _{j \in S^\star } \left|{\textstyle \beta _j^\star (f)}\right| > 3 \sqrt{\ell ^{\star }} \lambda / \gamma ^{\star \star }\) by the assumed upper bound for \(\lambda \), we find \(\left|{\textstyle {\hat{\beta }}_{n,k}^{\mathrm {lasso}}(f)}\right| > 0\), as required. \(\square \)
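The numerical constants appearing in Steps 6 and 7 of the proof can be verified in a few lines (a sanity check only):

```python
import math

alpha = math.sqrt(2) / 3                    # the choice made in Step 6(ii)

# Step 6(ii): the two constraints on rho as functions of alpha in (0, 1/2).
# The first increases and the second decreases, so max(g1, g2) is minimized
# where the curves cross; they cross exactly at alpha = sqrt(2)/3, where
# both equal 36 + 24*sqrt(2) ≈ 69.94 < 70.
g1 = 2.0 / (0.5 - alpha)
g2 = 8.0 / alpha**2 + 16.0 / alpha

# Step 6(iii): with C = 9/2, 4*(sqrt(2*C) + 2/sqrt(70)) ≈ 12.96 < 13.
c_iii = 4 * (math.sqrt(2 * 4.5) + 2 / math.sqrt(70))

# Step 7: 5/(4*alpha) ≈ 2.652 < 3.
c_vii = 5 / (4 * alpha)
```

That \(g_1\) and \(g_2\) agree exactly at \(\alpha = \sqrt{2}/3\) follows from \(12/(3 - 2\sqrt{2}) = 12(3 + 2\sqrt{2}) = 36 + 24\sqrt{2}\), which is why taking \(\rho = 70\) in the statement of the theorem suffices.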
Proof of Theorem 4
Recall that the LSLASSO estimator is defined as an OLS estimate computed on the active variables selected by the LASSO based on a subsample of size \(N \in \{1,\ldots ,n\}\). Let \({\hat{\beta }}_N^{\mathrm {lasso}}(f)\) denote the LASSO coefficient vector in (5) based on the subsample \(X_1, \ldots , X_N\) and let
denote the estimated active set of \({\hat{\ell }} = |{\hat{S}}_N|\) control variates. The LSLASSO estimate \({\hat{\alpha }}_n^{\mathrm {lslasso}}(f)\) based on the full sample \(X_1, \ldots , X_n\) is defined as the OLS estimator based on the control variates \(h_k\) for \(k \in {\hat{S}}_N\): writing \(H_{{\hat{S}}_N}\) for the \(n \times {\hat{\ell }}\) matrix with columns \((h_k(X_i))_{i=1}^n\) for \(k \in {\hat{S}}_N\), we have
Therefore, we can derive a concentration inequality by combining the support recovery property (Theorem 3) with the concentration inequality for the OLS estimate (Theorem 1) applied to the active control variates only.
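The two-stage LSLASSO scheme just described, Lasso selection on a subsample of size N followed by OLS with the selected control variates on the full sample, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the plain coordinate-descent Lasso, the test integrand, and the cosine control variates are all assumptions made for the example.

```python
import numpy as np

def lasso_cd(H, y, lam, n_sweeps=200):
    """Coordinate-descent Lasso for (1/(2n))||y - c - H b||^2 + lam*||b||_1;
    the intercept c is handled by centering y and the columns of H."""
    n, m = H.shape
    Hc = H - H.mean(axis=0)
    yc = y - y.mean()
    col_sq = (Hc ** 2).sum(axis=0) / n
    b = np.zeros(m)
    for _ in range(n_sweeps):
        for j in range(m):
            r = yc - Hc @ b + Hc[:, j] * b[j]          # partial residual
            rho = Hc[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def lslasso(f_vals, H, N, lam):
    """LSLASSO: Lasso on the first N samples selects the active control
    variates; the OLS intercept on the full sample then estimates P(f)."""
    S = np.flatnonzero(np.abs(lasso_cd(H[:N], f_vals[:N], lam)) > 1e-10)
    Z = np.column_stack([np.ones(len(f_vals)), H[:, S]])
    coef, *_ = np.linalg.lstsq(Z, f_vals, rcond=None)
    return coef[0], S                                  # estimate of P(f), selected set

rng = np.random.default_rng(0)
n, N, m = 4000, 1000, 20
X = rng.uniform(size=n)
# Hypothetical control variates: mean zero under Uniform(0, 1), bounded by 1.
H = np.column_stack([np.cos(2 * np.pi * k * X) for k in range(1, m + 1)])
f_vals = 1.0 + H[:, 0] + 0.5 * H[:, 1]                 # P(f) = 1; only two active controls
est, S = lslasso(f_vals, H, N, lam=0.05)
```

With only two of the twenty control variates truly active, the first-stage Lasso keeps the second-stage OLS problem small, which is exactly the computational rationale for LSLASSO discussed in the paper.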
Let \(\delta >0\) and \(n \geqslant 1\). We construct an event with probability at least \(1-\delta \) on which the support recovery property and the concentration inequality for the OLS estimate hold simultaneously. Recall that \(S^\star = {\text {supp}}(\beta ^\star (f))\) is the true set of \(\ell ^\star = \left|{\textstyle S^\star }\right|\) active control variables.
-
Thanks to Theorem 3, with probability at least \(1-\delta /2\),
$$\begin{aligned} {{\hat{S}}}_N = S^\star . \end{aligned}$$(58)
Indeed, the conditions on N and \(\lambda \) in Theorem 4 are such that we can apply Theorem 3 with n and \(\delta \) replaced by N and \(\delta /2\), respectively.
-
Thanks to Theorem 1, with probability at least \(1-\delta /2\),
$$\begin{aligned} \left|{\textstyle {\hat{\alpha }}_{n}^{\mathrm {ols}}(f,h_{S^\star }) - P(f)}\right|\leqslant & {} \sqrt{2 \log (16/\delta )} \frac{\tau }{\sqrt{n}} \nonumber \\&+ 58 \sqrt{B^\star \ell ^\star \log (16 \ell ^\star /\delta ) \log (8/\delta )}\frac{\tau }{n}, \end{aligned}$$(59)
where, for any \(S\subset \{1,\ldots , m\}\), \( {\hat{\alpha }}_{n}^{\mathrm {ols}}(f,h_{S})\) denotes the OLS estimate of P(f) based on the control variates \(h_{S}\). Indeed, we apply Theorem 1 with h and \(\delta \) replaced by \(h_{S^\star }\) and \(\delta /2\), respectively. The required lower bound on n is now \(n \geqslant \max \left( 18 B^\star \log (8 \ell ^\star /\delta ), 75 \ell ^\star \log (8/\delta ) \right) \). By assumption, \(N \geqslant 75 [\ell ^\star (U_h^2/\gamma ^{\star \star })]^2 \log (20 \ell ^\star /\delta )\), so the required lower bound is already satisfied by N, and hence certainly by \(n \geqslant N\).
By the union bound, the event on which (58) and (59) are satisfied simultaneously has probability at least \(1-\delta \). On this event, we can, by definition of \({\hat{\alpha }}_n^\mathrm {lslasso}(f)\) and by (58), write the integration error as
But the right-hand side is bounded by (59), yielding (13), as required.
Cite this article
Leluc, R., Portier, F. & Segers, J. Control variate selection for Monte Carlo integration. Stat Comput 31, 50 (2021). https://doi.org/10.1007/s11222-021-10011-z