On the optimization landscape of tensor decompositions

  • Full Length Paper
  • Series B
  • Mathematical Programming

Abstract

Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It has become increasingly important to understand why these heuristics work for NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that “all local optima are (approximately) global optima”, and thus they can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult. In this paper, we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant \(\varepsilon > 0\), among the set of points with function values a \((1+\varepsilon )\)-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best-known result only characterized the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, the gradient ascent algorithm is guaranteed to solve this problem. However, achieving such an initialization with random guesses would still require a super-polynomial number of attempts. Our main technique uses the Kac–Rice formula and random matrix theory. To the best of our knowledge, this is the first time the Kac–Rice formula has been successfully applied to counting the number of local optima of a highly structured random polynomial with dependent coefficients.

Notes

  1. The power method in this case is equivalent to setting \(x^{t+1} = \frac{\nabla f(x^t)}{\Vert \nabla f(x^t)\Vert }\), or \(x = \frac{\sum _{i=1}^n \langle {a_i,x}\rangle ^3 a_i}{\Vert \sum _{i=1}^n \langle {a_i,x}\rangle ^3 a_i\Vert }\) in our model. It is exactly equivalent to gradient ascent with a properly chosen finite learning rate; a small numerical sketch of this update is given after these notes.

  2. We note that by stochastic gradient descent we mean the algorithm that is analyzed in [4]. To get a global maximum in polynomial time (polynomial in \(\log (1/\varepsilon )\) to get \(\varepsilon \) precision), one also needs to slightly modify stochastic gradient descent in the following way: one can run SGD until reaching 1/d accuracy and then switch to gradient descent. Since the problem is locally strongly convex, the local convergence is linear.

  3. We omit the long list of regularity conditions here for simplicity. See more details in [40, Theorem 12.1.1]. The most important regularity condition requires that f is Morse with probability 1. This is implied by Lemma 2 below. The proof of this lemma is the same as the proof of Lemma 3.1 in Auffinger et al. [42].

  4. For example, the gradient is defined in [44, Section 3.6, Equation (3.31)], and the Hessian is defined in [44, Section 5.5, Definition 5.5.1].  [44, Example 5.4.1] gives the Riemannian connection of the sphere \(S^{d-1}\) which can be used to compute the Hessian.
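
To make footnote 1 concrete, the following is a minimal numerical sketch (our addition, not part of the paper) of the power-method update for the random model \(a_i\sim \mathcal {N}(0,\mathrm{Id}_d)\); the dimensions, the number of iterations, and the random seed are illustrative only.

```python
import numpy as np

# Power-method update from footnote 1:
#   x^{t+1} = sum_i <a_i, x^t>^3 a_i / || sum_i <a_i, x^t>^3 a_i ||,
# i.e., gradient ascent on f(x) = sum_i <a_i, x>^4 with a particular step size.
rng = np.random.default_rng(0)
d, n, iters = 50, 500, 100                        # illustrative sizes
a = rng.standard_normal((n, d))                   # rows are the components a_i
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
for _ in range(iters):
    g = a.T @ ((a @ x) ** 3)                      # proportional to the gradient of f
    x = g / np.linalg.norm(g)
# Correlation with the closest (normalized) component; in this regime the iterate
# typically ends up close to one of the a_i, as the paper notes happens in practice.
print(np.max(np.abs((a / np.linalg.norm(a, axis=1, keepdims=True)) @ x)))
```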

References

  1. Ge, R., Ma, T.: On the optimization landscape of tensor decompositions. In: Advances in Neural Information Processing Systems, pp. 3653–3663 (2017)

  2. Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems, pp. 2933–2941 (2014)

  3. Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., LeCun, Y.: The loss surfaces of multilayer networks. In: AISTATS (2015)

  4. Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points—online stochastic gradient for tensor decomposition. In: Proceedings of The 28th Conference on Learning Theory, pp. 797–842 (2015)

  5. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, pp. 1246–1257 (2016)

  6. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)

  7. Sun, J., Qu, Q., Wright, J.: When are nonconvex problems not scary? Preprint arXiv:1510.06096 (2015)

  8. Bandeira, A.S., Boumal, N., Voroninski, V.: On the low-rank approach for semidefinite programs arising in synchronization and community detection. In: Conference on Learning Theory, pp. 361–382 (2016)

  9. Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)

  10. Kawaguchi, K.: Deep Learning without Poor Local Minima. ArXiv e-prints (2016)

  11. Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)

  12. Hardt, M., Ma, T.: Identity matters in deep learning. In: ICLR (2017)

  13. Hardt, M., Ma, T., Recht, B.: Gradient descent learns linear dynamical systems. J. Mach. Learn. Res. 19(1), 1025–1068 (2018)

  14. Chang, J.T.: Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci. 137, 51–73 (1996)

  15. Mossel, E., Roch, S.: Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab. 16(2), 583–614 (2006)

  16. Hsu, D., Kakade, S.M., Zhang, T.: A spectral algorithm for learning hidden Markov models. J. Comput. Syst. Sci. 78(5), 1460–1480 (2012)

  17. Anandkumar, A., Hsu, D., Kakade, S.M.: A method of moments for mixture models and hidden Markov models. In: COLT (2012)

  18. Anandkumar, A., Foster, D.P., Hsu, D., Kakade, S.M., Liu, Y-K: A spectral algorithm for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems 25 (2012)

  19. Hsu, D., Kakade, S.M.: Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In: Fourth Innovations in Theoretical Computer Science (2013)

  20. Comon, P., Luciani, X., De Almeida, A.: Tensor decompositions, alternating least squares and other tales. J. Chemom. 23(7–8), 393–405 (2009)

  21. Kolda, T.G., Mayo, J.R.: Shifted power method for computing tensor eigenpairs. SIAM J. Matrix Anal. Appl. 32(4), 1095–1124 (2011)

  22. Ge, R., Lee, J.D., Ma, T.: Learning one-hidden-layer neural networks with landscape design. In: International Conference on Learning Representations (2018)

  23. Janzamin, M., Sedghi, H., Anandkumar, A.: Beating the perils of non-convexity: guaranteed training of neural networks using tensor methods. ArXiv e-prints (2015)

  24. Cohen, N., Shashua, A.: Convolutional rectifier networks as generalized tensor decompositions. In: International Conference on Machine Learning, pp. 955–963 (2016)

  25. Arora, S., Ge, R., Ma, T., Moitra, A.: Simple, efficient and neural algorithms for sparse coding. In: Proceedings of The 28th Conference on Learning Theory (2015)

  26. De Lathauwer, L., Castaing, J., Cardoso, J.-F.: Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Trans. Signal Process. 55(6), 2965–2973 (2007)

  27. Bhaskara, A., Charikar, M., Moitra, A., Vijayaraghavan, A.: Smoothed analysis of tensor decompositions. In: Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pp. 594–603. ACM (2014)

  28. Goyal, N., Vempala, S., Xiao, Y.: Fourier PCA and robust tensor decomposition. In: Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pp. 584–593 (2014)

  29. Barak, B., Kelner, J.A., Steurer, D.: Dictionary learning and tensor decomposition via the sum-of-squares method. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14–17, 2015, pp. 143–151 (2015)

  30. Ge, R., Ma, T.: Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms. Preprint arXiv:1504.05287 (2015)

  31. Hopkins, S.B., Schramm, T., Shi, J., Steurer, D.: Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In: Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18–21, 2016, pp. 178–191 (2016)

  32. Ma, T., Shi, J., Steurer, D.: Polynomial-time tensor decompositions with sum-of-squares. In: FOCS 2016 (2016)

  33. Håstad, J.: Tensor rank is NP-complete. J. Algorithms 11(4), 644–654 (1990)

  34. Hillar, C.J., Lim, L.-H.: Most tensor problems are NP-hard. J. ACM (JACM) 60(6), 45 (2013)

  35. Ma, T., Shi, J., Steurer, D.: Polynomial-time tensor decompositions with sum-of-squares. In: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 438–446. IEEE (2016)

  36. Anandkumar, A., Ge, R., Janzamin, M.: Learning overcomplete latent variable models through tensor methods. In: Proceedings of the Conference on Learning Theory (COLT), Paris, France (2015)

  37. Cartwright, D., Sturmfels, B.: The number of eigenvalues of a tensor. Linear Algebra Appl. 438(2), 942–952 (2013)

  38. Abo, H., Seigal, A., Sturmfels, B.: Eigenconfigurations of tensors. Algebraic Geom. Methods Discrete Math. 685, 1–25 (2017)

  39. Anandkumar, A., Ge, R.: Analyzing tensor power method dynamics in overcomplete regime. JMLR (2016)

  40. Adler, R.J., Taylor, J.E.: Random Fields and Geometry. Springer, Berlin (2009)

  41. Auffinger, A., Arous, G.B., Černỳ, J.: Random matrices and complexity of spin glasses. Commun. Pure Appl. Math. 66(2), 165–201 (2013)

  42. Auffinger, A., Arous, G.B., et al.: Complexity of random smooth functions on the high-dimensional sphere. Ann. Probab. 41(6), 4214–4247 (2013)

  43. Boumal, N., Absil, P.-A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numerical Anal. 39(1), 1–33 (2019)

  44. Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2007)

  45. Wikipedia: Schauder fixed point theorem—Wikipedia, the free encyclopedia. Accessed 26 May 2016

  46. Wainwright, M.: Basic tail and concentration bounds (2015)

  47. Hsu, D., Kakade, S.M., Zhang, T.: A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 17(52), 1–6 (2012)

  48. Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 28(5), 1302–1338 (2000)

  49. Wikipedia: Incomplete gamma function—Wikipedia, the free encyclopedia. Accessed 13 September 2016

  50. Weisstein, E.W.: Incomplete gamma function. From MathWorld—A Wolfram Web Resource (2016)

  51. Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)

  52. Candes, E.J.: The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique 346(9), 589–592 (2008)

  53. Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin (2013)

  54. Rudelson, M., Vershynin, R.: Non-asymptotic theory of random matrices: extreme singular values. In: Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures, pp. 1576–1602. World Scientific (2010)

  55. Baraniuk, R., Davenport, M., DeVore, R., Wakin, M.: A simple proof of the restricted isometry property for random matrices. Constr. Approx. 28(3), 253–263 (2008)

Acknowledgements

This paper was done in part while the authors were hosted by the Simons Institute. We are indebted to Ramon van Handel for very useful comments on random matrix theory. We thank Nicolas Boumal and Daniel Stern for helpful discussions. This research was supported by the NSF, the Office of Naval Research, and the Simons Foundation.

Author information

Corresponding author

Correspondence to Rong Ge.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A shorter version of this paper has appeared in NIPS 2017 [1].

Electronic supplementary material

Supplementary material 1 (pdf 329 KB)

Appendices

A Brief introduction to manifold optimization

Let \({\mathcal {M}}\) be a Riemannian manifold. For every \(x\in {\mathcal {M}}\), let \(T_x {\mathcal {M}}\) be the tangent space to \({\mathcal {M}}\) at x, and let \(P_x\) denote the projection onto the tangent space \(T_x {\mathcal {M}}\). Let \(\mathrm{grad}\ f(x)\in T_x {\mathcal {M}}\) be the gradient of f at x on \({\mathcal {M}}\) and \(\mathrm{Hess}\ f(x)\) be the Riemannian Hessian. Note that \(\mathrm{Hess}\ f(x)\) is a linear mapping from \(T_x {\mathcal {M}}\) to itself.

In this paper, we work with \({\mathcal {M}}=S^{d-1}\), and we view it as a submanifold of \(\mathbb {R}^d\). We also view f as the restriction of a smooth function \(\bar{f}\) to the manifold \({\mathcal {M}}\). In this case, we have that \(T_x {\mathcal {M}}= \{z\in \mathbb {R}^d: z^{\top } x = 0\}\), and \(P_x = \mathrm{Id}- xx^{\top }\). The gradient can be computed by,

$$\begin{aligned} \mathrm{grad}\ f(x) = P_x \nabla \bar{f}(x)\,,\end{aligned}$$

where \(\nabla \) is the usual gradient in the ambient space \(\mathbb {R}^d\). Moreover, the Riemannian Hessian is defined as,

$$\begin{aligned} \forall \xi \in T_x{\mathcal {M}}, \quad \mathrm{Hess}\ f(x)[\xi ]&= P_x (\nabla \mathrm{grad}\ f(x) [\xi ]) \\&= P_x \nabla ^2 \bar{f}(x) \xi - (x^{\top }\nabla \bar{f}(x)) \xi \,.\end{aligned}$$

We refer the reader to the book [44] for the derivation of these formulae and the exact definitions of the gradient and Hessian on manifolds (see footnote 4).
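
To make these formulas concrete, here is a small numerical sketch (our addition, not part of the paper) that evaluates the Riemannian gradient and a Hessian-vector product on \(S^{d-1}\) for the objective \(\bar{f}(x) = \sum _{i=1}^n \langle {a_i,x}\rangle ^4\) studied in this paper, together with finite-difference checks along a geodesic; all sizes and names are illustrative.

```python
import numpy as np

def sphere_grad_hess(a, x):
    """Riemannian gradient and Hessian-vector product on S^{d-1} for
    fbar(x) = sum_i <a_i, x>^4, using grad f(x) = P_x grad_fbar(x) and
    Hess f(x)[xi] = P_x hess_fbar(x) xi - (x^T grad_fbar(x)) xi."""
    c = a @ x                                      # inner products <a_i, x>
    egrad = 4 * a.T @ (c ** 3)                     # Euclidean gradient of fbar
    P = lambda v: v - x * (x @ v)                  # projection onto the tangent space at x
    grad = P(egrad)
    hess = lambda xi: P(12 * a.T @ ((c ** 2) * (a @ xi))) - (x @ egrad) * xi
    return grad, hess

rng = np.random.default_rng(0)
d, n = 40, 400                                     # illustrative sizes
a = rng.standard_normal((n, d))
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
grad, hess = sphere_grad_hess(a, x)

# Finite-difference checks along the geodesic y(t) = cos(t) x + sin(t) xi, whose
# first and second derivatives of f at t = 0 are <grad f(x), xi> and <Hess f(x)[xi], xi>.
xi = rng.standard_normal(d)
xi -= x * (x @ xi)
xi /= np.linalg.norm(xi)
f = lambda y: np.sum((a @ y) ** 4)
geo = lambda t: np.cos(t) * x + np.sin(t) * xi
t = 1e-4
print((f(geo(t)) - f(geo(-t))) / (2 * t), grad @ xi)                 # should agree closely
print((f(geo(t)) - 2 * f(x) + f(geo(-t))) / t ** 2, hess(xi) @ xi)   # should agree closely
```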

B Toolbox

In this section we collect most of the probabilistic statements used in the paper. Most of them follow from basic (but sometimes tedious) calculations. We provide full proofs for completeness.

B.1 Sub-exponential random variables

In this subsection we give the formal definition of sub-exponential random variables, which are used heavily in our analysis, and summarize their properties.

Definition 1

(cf. [46, Definition 2.2]) A random variable X with mean \(\mu = {\mathbb {E}}\left[ X\right] \) is sub-exponential if there are non-negative parameters \((\nu ,b)\) such that

$$\begin{aligned} {\mathbb {E}}\left[ e^{\lambda (X-\mu )}\right] \le e^{\frac{\nu ^2\lambda ^2}{2}}, \quad \forall \lambda \in \mathbb {R}, |\lambda | < 1/b\,.\end{aligned}$$

The following lemma gives a sufficient condition for X to be a sub-exponential random variable.

Lemma 17

Suppose random variable X satisfies that for any \(t\ge 0\),

$$\begin{aligned} \Pr \left[ |X-\mu | \ge p\sqrt{t} + qt\right] \le 2e^{-t}\,.\end{aligned}$$

Then X is a sub-exponential random variable with parameter \((\nu ,b)\) satisfying \(\nu \lesssim p\) and \(b\lesssim q\).

The following lemma, a converse of Lemma 17, controls the tail of a sub-exponential random variable.

Lemma 18

([46, Proposition 2.2]) Suppose X is a sub-exponential random variable with mean 0 and parameters \((\nu ,b)\). Then,

$$\begin{aligned} \Pr \left[ X\ge x\right] \le \max \{e^{-x^2/(2\nu ^2)}, e^{-x/(2b)}\} \le e^{-x^2/(2\nu ^2)}+e^{-x/(2b)}\,.\end{aligned}$$

A sum of independent sub-exponential random variables remains sub-exponential (with different parameters).

Lemma 19

(cf. [46]) Suppose the independent random variables \(X_1,\dots ,X_n\) are sub-exponential with parameters \((\nu _1,b_1),\dots , (\nu _n,b_n)\), respectively. Then \(X_1+\dots +X_n\) is a sub-exponential random variable with parameters \((\nu ^*,b^*)\), where \(\nu ^* =\sqrt{\sum _{k\in [n]} \nu _k^2}\) and \(b^* = \max _k b_k\).

The following lemma shows that certain truncated polynomials of a Gaussian random variable are sub-exponential random variables.

Lemma 20

Let \(x\sim \mathcal {N}(0,1)\) and \(z = (x^4 -3x^2){\mathbf {1}}({|x|\le \tau ^{1/4}})\), where \(\tau > C\) for a sufficiently large constant C. Then z is a sub-exponential random variable with parameters \((\nu ,b) \) satisfying \(\nu = \sqrt{42}\) and \(b=(4+o_{\tau }(1))\sqrt{\tau }\).

Proof

We verify that z satisfies the Bernstein condition

$$\begin{aligned} {\mathbb {E}}\left[ |z|^k\right] \le \frac{1}{2}k! \nu ^2 b^{k-2} \quad \forall k \ge 2 \end{aligned}$$
(B.1)

For \(k = 2\), we have that \({\mathbb {E}}\left[ z^2\right] \le {\mathbb {E}}\left[ x^{8}-6x^6+9x^4\right] = 42 = \frac{1}{2} \cdot 2! \cdot \nu ^2\). For any k with \((1-o_{\tau }(1))\frac{e\sqrt{\tau }}{4}\ge k \ge 3\), we have

$$\begin{aligned} {\mathbb {E}}\left[ |z|^k\right]&\le {\mathbb {E}}\left[ x^{4k}{\mathbf {1}}({\sqrt{3} \le |x|\le \tau ^{1/4}})\right] + \left( \frac{9}{4}\right) ^k \quad \left( \text {since}\ |z|\le \frac{9}{4}\ \text {when}\ |x|\le \sqrt{3}\right) \nonumber \\&= (4k-1)!! + (9/4)^k \le 2^{2k} (2k)! + (9/4)^k \nonumber \\&\le 2^{2k} ((2k+1)/e)^{2k+1}+ (9/4)^k \quad (\text {by Claim}~10)\nonumber \\&\le \frac{1}{2}(1+o_{\tau }(1)) k! (4\sqrt{\tau })^{k-2}\nonumber \\&\qquad \left( \text {since}\ (1-o_{\tau }(1))\frac{e\sqrt{\tau }}{4} \ge k\ \text {and}\ \tau \ \text {is large enough}\right) \nonumber \\&\le \frac{1}{2}k! b^{k-2} \end{aligned}$$
(B.2)

On the other hand, when \(k\ge (1-o_{\tau }(1))\frac{e\sqrt{\tau }}{4}\), we have

$$\begin{aligned} {\mathbb {E}}\left[ |z|^k\right]&\le \tau ^k + \left( \frac{9}{4}\right) ^k \quad (\text {since}\ |z|\le \tau \ \text {a.s.})\\&\le \left( \frac{4k\sqrt{\tau }}{e}(1+o_{\tau }(1))\right) ^k \\&\le \frac{1}{2}e(k/e)^k b^{k-2} \quad (\text {since}\ k\gtrsim \sqrt{\tau }\ \text {is large enough and} \ b =4 (1+o_{\tau }(1))\sqrt{\tau })\\&\le \frac{1}{2}k!b^{k-2} \end{aligned}$$

\(\square \)

B.2 Norm of Gaussian random variables

The following lemma shows that the squared norm of a Gaussian random vector has a sub-exponential upper tail and a sub-Gaussian lower tail.

Lemma 21

([47, Proposition 1.1]) Let \(A\in \mathbb {R}^{n\times n}\) be a matrix and \(\Sigma = A^{\top }A\). Let \(x\sim \mathcal {N}(0,\mathrm{Id}_n)\). Then for all \(t \ge 0\),

$$\begin{aligned}&\Pr \left[ \Vert Ax\Vert ^2 - \mathrm{tr}(\Sigma ) \ge 2\sqrt{\mathrm{tr}(\Sigma ^2)t} +2\Vert \Sigma \Vert t \right] \le e^{-t}\\&\Pr \left[ \Vert Ax\Vert ^2 - \mathrm{tr}(\Sigma ) \le - 2\sqrt{\mathrm{tr}(\Sigma ^2)t} \right] \le e^{-t} \,.\end{aligned}$$

Moreover, \(\Vert Ax\Vert ^2\) is a sub-exponential random variable with parameter \((\nu ,b)\) with \(\nu \lesssim \sqrt{\mathrm{tr}(\Sigma ^2)}\) and \(b\lesssim \Vert \Sigma \Vert \).

We note that this is a slight strengthening of [47, Proposition 1.1], adding the lower tail. The strengthening is simple: the key step in [47, Proposition 1.1] is to work in the eigenbasis of \(\Sigma \) and apply [48, Lemma 1], and the lower tail can be bounded in exactly the same way using the lower-tail guarantee of [48, Lemma 1]. The second statement follows directly from Lemma 17.
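
For intuition, the following Monte Carlo sketch (our addition, not part of the proof) compares the empirical frequency of the upper-tail event in Lemma 21 with the bound \(e^{-t}\); the matrix, its scaling, and the number of trials are arbitrary choices.

```python
import numpy as np

# Empirical check of the upper tail of Lemma 21:
# Pr[ ||Ax||^2 - tr(Sigma) >= 2*sqrt(tr(Sigma^2)*t) + 2*||Sigma||*t ] <= e^{-t}.
rng = np.random.default_rng(1)
n, trials, t = 30, 200000, 2.0
A = rng.standard_normal((n, n)) / np.sqrt(n)
Sigma = A.T @ A
tr, tr2 = np.trace(Sigma), np.trace(Sigma @ Sigma)
op = np.linalg.norm(Sigma, 2)                  # spectral norm of Sigma
threshold = tr + 2 * np.sqrt(tr2 * t) + 2 * op * t

x = rng.standard_normal((trials, n))
vals = np.sum((x @ A.T) ** 2, axis=1)          # ||A x||^2 for each sample
print(np.mean(vals >= threshold), np.exp(-t))  # empirical tail frequency vs. the bound e^{-t}
```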

The following lemma is a simple helper lemma showing that the norm of the product of a Gaussian matrix and a fixed vector has a sub-Gaussian lower tail.

Lemma 22

Let \(v\in \mathbb {R}^m\) be a fixed vector and let \(z_1,\dots , z_m\in \mathbb {R}^{n}\) be independent Gaussian random vectors with mean \(\mu \in \mathbb {R}^n\) and covariance \(\Sigma \). Then \(X = [z_1,\dots , z_m]v\) is a Gaussian random vector with mean \((v^{\top }{\mathbf {1}})\mu \) and covariance \(\Vert v\Vert ^2 \Sigma \).

Moreover, we have that for any \(t \ge 0\),

$$\begin{aligned} \Pr \left[ \Vert {X}\Vert ^2 \ge \mathrm{tr}(\Sigma ) -\Vert {\Sigma }\Vert -2\sqrt{\mathrm{tr}(\Sigma )t}\right] \ge 1-e^{-t} \,.\end{aligned}$$
(B.3)

Proof

The mean and covariance of X can be computed straightforwardly. Towards establishing (B.3), let \(\bar{\mu } =\mu /\Vert {\mu }\Vert \) and define \(X' = (\mathrm{Id}- \bar{\mu } \bar{\mu }^{\top })X\). Thus we have \(\Vert {X}\Vert \ge \Vert {X'}\Vert \), and \(X'\) is a Gaussian random vector with mean 0 and covariance \(\Sigma ' = (\mathrm{Id}- \bar{\mu }\bar{\mu }^{\top })\Sigma (\mathrm{Id}- \bar{\mu } \bar{\mu }^{\top })\). Then by Lemma 21, we have that

$$\begin{aligned} \Pr \left[ \Vert {X}\Vert ^2\ge \mathrm{tr}(\Sigma ') - 2\sqrt{\mathrm{tr}(\Sigma ')t}\right] \ge \Pr \left[ \Vert {X'}\Vert ^2\ge \mathrm{tr}(\Sigma ') - 2\sqrt{\mathrm{tr}(\Sigma ')t}\right] \ge 1- e^{-t} \end{aligned}$$

Finally, we observe that \(\mathrm{tr}(\Sigma )\ge \mathrm{tr}(\Sigma ')\ge \mathrm{tr}(\Sigma )-\Vert {\Sigma }\Vert \), and this together with the equation above completes the proof. \(\square \)

B.3 Moments of sub-exponential random variables

In this subsection, we give several lemmas estimating the moments of (truncated) sub-exponential random variables.

We start with a reduction lemma stating that one can control the moments of a sub-exponential random variable by the moments of Gaussian and exponential random variables.

Lemma 23

Suppose X is a sub-exponential random variable with mean 0 and parameters \((\nu ,b)\). Let \(X_1\sim {\mathcal {N}}(0,\nu ^2)\) and let \(X_2\) be an exponential random variable with mean 2b. Let \(p\ge 1\) be an integer and \(s \ge 0\). Then,

$$\begin{aligned} {\mathbb {E}}\left[ X^p{\mathbf {1}}({X\ge s})\right]&\le s^p (e^{-s^2/(2\nu ^2)} +e^{-s/(2b)}) \\&\quad + p \sqrt{2\pi }\nu {\mathbb {E}}\left[ X_1^{p-1} {\mathbf {1}}({X_1\ge s})\right] + 2pb{\mathbb {E}}\left[ X_2^{p-1} {\mathbf {1}}({X_2\ge s})\right] \,.\end{aligned}$$

Proof

Let p(x) be the density of X and let \(G(x) = \Pr [X\ge x]\). By Lemma 18 we have \(G(x)\le e^{-x^2/(2\nu ^2)}+e^{-x/(2b)}\). Then we have

$$\begin{aligned} {\mathbb {E}}\left[ X^p{\mathbf {1}}({X\ge s})\right]&= \int _s^{\infty } x^p p(x) dx =-\int _s^{\infty } x^p dG(x)\\&= - x^p G(x)\big \vert _s^{\infty } + \int _s^{\infty } px^{p-1}G(x)dx \quad (\text {By integration by parts})\\&\le s^p (e^{-s^2/(2\nu ^2)}+e^{-s/(2b)}) + \int _s^{\infty } px^{p-1}(e^{-x^2/(2\nu ^2)}+e^{-x/(2b)})dx\\&= s^p (e^{-s^2/(2\nu ^2)}+e^{-s/(2b)}) + p \sqrt{2\pi }\nu {\mathbb {E}}\left[ X_1^{p-1}{\mathbf {1}}({X_1\ge s})\right] \\&\quad + 2pb{\mathbb {E}}\left[ X_2^{p-1}{\mathbf {1}}({X_2\ge s})\right] \,.\end{aligned}$$

\(\square \)

Next we bound the p-th moment of a sub-exponential random variable with zero mean.

Lemma 24

Let p be an even integer and let X be a sub-exponential random variable with mean 0 and parameters \((\nu ,b)\). Then we have that

$$\begin{aligned} {\mathbb {E}}\left[ |X|^p\right] \lesssim \nu ^p p!! + (2b)^p p! \end{aligned}$$
(B.4)

Proof

Let \(X_1\sim {\mathcal {N}}(0,\nu ^2)\) and let \(X_2\) be an exponential random variable with mean 2b. By Lemma 23, we have that

$$\begin{aligned} {\mathbb {E}}\left[ X^p {\mathbf {1}}({X\ge 0}) \right]&\le p \sqrt{2\pi }\nu {\mathbb {E}}\left[ X_1^{p-1}{\mathbf {1}}({X_1\ge 0})\right] + 2pb{\mathbb {E}}\left[ X_2^{p-1}\right] \,.\\&\lesssim \nu ^p p!! + (2b)^p p! \,.\end{aligned}$$

For the negative part we can obtain a similar bound, \({\mathbb {E}}\left[ (-X)^p{\mathbf {1}}({X\le 0})\right] \lesssim \nu ^p p!! +(2b)^p p!\). These together complete the proof. \(\square \)

The lemma above implies a bound for the moments of a sub-exponential random variable with arbitrary mean via Hölder's inequality.

Lemma 25

Let X be a sub-exponential random variable with mean \(\mu \) and parameters \((\nu ,b)\). Let p be an integer. Then for any \(\varepsilon \in (0,1)\) we have,

$$\begin{aligned} {\mathbb {E}}|X|^p \lesssim \frac{1}{\varepsilon ^{p-1}} |\mu |^p +\frac{1}{(1-\varepsilon )^{p-1}} \left( \nu ^p p!! + (2b)^p p!\right) \,.\end{aligned}$$

Proof

Write \(X = \mu + Z\), where Z has mean 0. Then for any \(\varepsilon \in (0,1)\), by Hölder's inequality,

$$\begin{aligned} {\mathbb {E}}\left[ |X|^p \right] = {\mathbb {E}}\left[ (\mu +Z)^p\right]&\le \frac{1}{\varepsilon ^{p-1}} |\mu |^p + \frac{1}{(1-\varepsilon )^{p-1}} {\mathbb {E}}\left[ |Z|^p\right] \,.\\&\lesssim \frac{1}{\varepsilon ^{p-1}} |\mu |^p \\&\quad + \frac{1}{(1-\varepsilon )^{p-1}} \left( \nu ^p p!! + (2b)^p p!\right) \,.\ \quad (\text {by Lemma}~24) \end{aligned}$$

\(\square \)

The next few claims and lemmas bound the moments of a truncated sub-exponential random variable. We use Lemma 23 to reduce the estimates to the Gaussian and exponential cases. Thus, we first bound the moments of a truncated Gaussian.

Claim 7

Suppose \(X\sim \mathcal {N}(0,1)\). Then, as \(p\rightarrow \infty \), for \(s \ge \sqrt{p}\) we have \({\mathbb {E}}\left[ |X|^p{\mathbf {1}}({|X|\ge s})\right] \le (1+o_p(1))^p \cdot p \cdot e^{-s^2/2}s^p\).

Proof

We assume p to be an odd number for simplicity; the even case can be bounded by the odd case straightforwardly. Letting \(\Gamma (\cdot ,\cdot )\) denote the upper incomplete gamma function, we have

$$\begin{aligned} \int _{s}^{\infty } x^p e^{-x^2/2}dx&= -2^{\frac{p-1}{2}} \cdot \Gamma \left( \frac{p+1}{2}, x^2/2\right) \Big \vert _{s}^{\infty }\\&\quad \quad \left( \text {since}\ \frac{\partial \Gamma (s,x)}{\partial x} = -x^{s-1}e^{-x}\ [49]\right) \\&= 2^{\frac{p-1}{2}}\cdot \Gamma \left( \frac{p+1}{2}, s^2/2\right) \\&= 2^{\frac{p-1}{2}} ((p-1)/2)! \\&\quad \cdot e^{-s^2/2}\sum _{k=0}^{(p-1)/2} \frac{1}{k!} \left( \frac{s^2}{2}\right) ^k \quad (\text {see equation (2) of [50]})\\&\le (1+o_p(1))^p \cdot 2^{\frac{p-1}{2}} ((p-1)/2)! \\&\quad \cdot e^{-s^2/2} \cdot (p-1)/2 \cdot \left( \frac{s^2e}{p-1} \right) ^{(p-1)/2}\\&\quad \left( \text {since}\ k! \ge e(k/e)^k\ \text {and}\ \frac{1}{k!} \left( \frac{s^2}{2}\right) ^k\right. \\&\qquad \quad \left. \text {is maximized at} \ k = (p-1)/2\ \text {when}\ s^2\ge p\right) \\&\le (1+o_p(1))^p \cdot p \cdot e^{-s^2/2}s^p\,.\end{aligned}$$

\(\square \)
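
As a quick numerical sanity check (our addition), one can compare the truncated moment with the quantity \(p\, e^{-s^2/2}s^p\) (i.e., the bound without the \((1+o_p(1))^p\) factor) by quadrature; the values of p and s below are illustrative.

```python
import numpy as np
from scipy import integrate

# Numerical check of Claim 7 for X ~ N(0,1): compare E[|X|^p 1(|X| >= s)]
# with p * e^{-s^2/2} * s^p for one illustrative choice of (p, s) with s >= sqrt(p).
p, s = 9, 4.0
density = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
moment, _ = integrate.quad(lambda x: 2 * x ** p * density(x), s, np.inf)  # |X|^p is symmetric
bound = p * np.exp(-s ** 2 / 2) * s ** p
print(moment, bound)   # the truncated moment falls well below the bound here
```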

The next Claim estimates the moments of a truncated exponential random variable.

Claim 8

Suppose X has the exponential distribution with mean 1 (that is, its density is \(p(x) = e^{-x}\) for \(x\ge 0\)). Then, as \(p\rightarrow \infty \), for any \(s\ge (1+\varepsilon )p\) with \(\varepsilon \in (0,1)\), we have \({\mathbb {E}}\left[ X^p {\mathbf {1}}({X\ge s})\right] \lesssim e^{-s}s^p/\varepsilon \).

Proof

We have that

$$\begin{aligned} \int _s^{\infty } e^{-x}x^pdx&= - e^{-x}x^p\sum _{k=0}^{p} \frac{p!}{(p-k)!} x^{-k}\Big \vert _{s}^{\infty }\\&= e^{-s}s^p\sum _{k=0}^{p} \frac{p!}{(p-k)!} s^{-k} \\&\le e^{-s}s^p \sum _k \left( \frac{1}{1+\varepsilon }\right) ^k \quad (\text {Since}\ s\ge (1+\varepsilon )p)\\&\lesssim e^{-s}s^p/\varepsilon \,. \end{aligned}$$

\(\square \)

Combining the two claims above and using Lemma 23, we obtain the following bounds for truncated moments of sub-exponential random variables.

Lemma 26

Suppose X is a sub-exponential random variable with mean \(\mu \) and parameters \((\nu ,b)\). Then, for integer \(p\ge 2\), real numbers \(\eta ,\varepsilon \in (0,1)\) and \(s\ge \max \{\nu \sqrt{p}, 2bp(1+\eta )\}\), we have

$$\begin{aligned}&{\mathbb {E}}\left[ |X|^p{\mathbf {1}}({|X|\ge \mu + s})\right] \\&\le \frac{1}{\varepsilon ^{p-1}} |\mu |^p +\frac{1}{(1-\varepsilon )^{p-1}} s^p \left( (1+\zeta _p)^p p^2 e^{-s^2/(2\nu ^2)} + p e^{-s/(2b)} /\eta \right) \,.\end{aligned}$$

where \(\zeta _p \rightarrow 0\) as \(p\rightarrow \infty \).

Proof

Write \(X = \mu + Z\), where Z has mean 0. Then for any \(\varepsilon \in (0,1)\), by Hölder's inequality,

$$\begin{aligned} |X|^p = (\mu +Z)^p \le \frac{1}{\varepsilon ^{p-1}} |\mu |^p +\frac{1}{(1-\varepsilon )^{p-1}} |Z|^p \,.\end{aligned}$$
(B.5)

Therefore it remains to bound \({\mathbb {E}}\left[ |Z|^p{\mathbf {1}}({|Z|\ge s})\right] \). Since Z is sub-exponential, by Lemma 18 its tail satisfies \(\Pr \left[ Z\ge z\right] \le e^{-z^2/(2\nu ^2)}+e^{-z/(2b)}\). Let \(Z_1\sim {\mathcal {N}}(0,\nu ^2)\) and let \(Z_2\) be an exponential random variable with mean 2b. Then by Lemma 23, we have

$$\begin{aligned} {\mathbb {E}}\left[ |Z|^p{\mathbf {1}}({|Z|\ge s}) \right]&\le s^p (e^{-s^2/(2\nu ^2)}+e^{-s/(2b)}) + p \sqrt{2\pi }\nu {\mathbb {E}}\left[ Z_1^{p-1}{\mathbf {1}}({Z_1\ge s})\right] \\&\quad + 2pb {\mathbb {E}}\left[ Z_2^{p-1}{\mathbf {1}}({Z_2\ge s})\right] \\&\le s^p (e^{-s^2/(2\nu ^2)}+e^{-s/(2b)}) \\&\quad +p\sqrt{2\pi }\nu ^p {\mathbb {E}}\left[ (Z_1/\nu )^{p-1} {\mathbf {1}}({Z_1/\nu \ge s/\nu })\right] \\&\quad + p(2b)^p{\mathbb {E}}\left[ (Z_2/2b)^{p-1}{\mathbf {1}}({Z_2/(2b)\ge s/(2b)})\right] \\&\le s^p (e^{-s^2/(2\nu ^2)}+e^{-s/(2b)}) \\&\quad + p^2\sqrt{2\pi }e^{-s^2/(2\nu ^2)} s^p (1+o_p(1))^p + pe^{-s/(2b)}s^p /\eta \\&\lesssim p^2\sqrt{2\pi }e^{-s^2/(2\nu ^2)}s^p (1+o_p(1))^p +pe^{-s/(2b)}s^p /\eta \,.\end{aligned}$$

Combining the inequality above with Eq. (B.5) completes the proof. \(\square \)

Finally, the following is a helper claim that is used in the proof of Claim 8.

Claim 9

For any \(\mu \in \mathbb {R}\) and \(\beta > 0\), we have that

$$\begin{aligned} \int e^{-\beta x} (\mu +x)^d dx = - \frac{e^{-\beta x}(\mu +x)^d}{\beta } \sum _{k=0}^{d} \frac{d!}{(d-k)!}\left( (\mu +x)\beta \right) ^{-k} \end{aligned}$$
(B.6)

Proof

Let F(x) denote the RHS of (B.6). We differentiate F(x) and obtain that

$$\begin{aligned} \frac{dF(x)}{dx}&= e^{-\beta x}(\mu +x)^d\sum _{k=0}^{d} \frac{d!}{(d-k)!}\left( (\mu +x)\beta \right) ^{-k} \\&\quad -\frac{e^{-\beta x}}{\beta } \sum _{k=0}^{d-1} \frac{d!}{(d-k-1)!}\frac{(\mu +x)^{d-k-1}}{\beta ^k}\\&= e^{-\beta x}\sum _{k=0}^{d} \frac{d!}{(d-k)!}\frac{(\mu +x)^{d-k}}{\beta ^k} -e^{-\beta x}\sum _{k=1}^{d} \frac{d!}{(d-k)!}\frac{(\mu +x)^{d-k}}{\beta ^k}\\&= e^{-\beta x} (\mu +x)^d \,.\end{aligned}$$

Therefore, the claim follows by integrating both sides of the equation above. \(\square \)
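
A symbolic spot-check of Eq. (B.6) for one concrete degree (our addition; the choice \(d=4\) is arbitrary):

```python
import sympy as sp

# Differentiating the claimed antiderivative in Eq. (B.6) should recover
# e^{-beta x} (mu + x)^d; we verify this symbolically for d = 4.
x, mu, beta = sp.symbols('x mu beta', positive=True)
d = 4
F = -sp.exp(-beta * x) * (mu + x) ** d / beta * sum(
    sp.factorial(d) / sp.factorial(d - k) * ((mu + x) * beta) ** (-k)
    for k in range(d + 1))
print(sp.simplify(sp.diff(F, x) * sp.exp(beta * x) - (mu + x) ** d))  # prints 0
```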

B.4 Auxiliary Lemmas

Lemma 27

Let \(\bar{q}\) be a unit vector in \(\mathbb {R}^n\), \(w\in \mathbb {R}^n\), and \(z \sim \mathcal {N}(0,\sigma ^2\cdot (\mathrm{Id}_n-\bar{q}\bar{q}^{\top }))\). Then \(\Vert {z\odot w}\Vert ^{2}\) is a sub-exponential random variable with parameters \((\nu ,b)\) satisfying \(\nu \lesssim \sigma ^2\Vert w\Vert _4^2\) and \(b\lesssim \sigma ^2\Vert w\Vert _{\infty }^2\). Moreover, it has mean \(\sigma ^2(\Vert w\Vert ^2 - \langle {\bar{q}^{\odot 2},w^{\odot 2}}\rangle )\).

Proof

We have that \(z\odot w\) is a Gaussian random vector with covariance matrix \(\Sigma = \sigma ^2 {{\,\mathrm{diag}\,}}(w) P_{\bar{q}}{{\,\mathrm{diag}\,}}(w)\), where \(P_{\bar{q}} = \mathrm{Id}_n-\bar{q}\bar{q}^{\top }\). Then, \(z\odot w\) can be written as \(\Sigma ^{1/2}x\) where \(x\sim \mathcal {N}(0,\mathrm{Id}_n)\). Then by Lemma 21, we have that \(\Vert z\odot w\Vert ^2 =\Vert \Sigma ^{1/2}x\Vert ^2\) is a sub-exponential random variable with parameters \((\nu ,b)\) satisfying

$$\begin{aligned} \nu ^2 \lesssim \mathrm{tr}(\Sigma ^2)&= \sigma ^4\mathrm{tr}({{\,\mathrm{diag}\,}}(w) P_{\bar{q}} {{\,\mathrm{diag}\,}}(w^{\odot 2})P_{\bar{q}}{{\,\mathrm{diag}\,}}(w)) \\&= \sigma ^4\mathrm{tr}(P_{\bar{q}}{{\,\mathrm{diag}\,}}(w^{\odot 2})P_{\bar{q}}{{\,\mathrm{diag}\,}}(w^{\odot 2})) \\&\le \sigma ^4\mathrm{tr}({{\,\mathrm{diag}\,}}(w^{\odot 2})P_{\bar{q}}{{\,\mathrm{diag}\,}}(w^{\odot 2})) \\&= \sigma ^4\mathrm{tr}({{\,\mathrm{diag}\,}}(w^{\odot 4})P_{\bar{q}}) \\&\le \sigma ^4\Vert w\Vert _4^4\,.\end{aligned}$$

Here at the third and fifth lines we used the fact that \(\mathrm{tr}(PA)\le \mathrm{tr}(A)\) holds for any symmetric PSD matrix A and projection matrix P.

Moreover, we have \(b\lesssim \Vert \Sigma \Vert \le \sigma ^2\max _k w_k^2 \). Finally, the mean of \(\Vert z\odot w\Vert ^2\) is \(\mathrm{tr}(\Sigma ) =\sigma ^2(\Vert w\Vert ^2 - \langle {\bar{q}^{\odot 2},w^{\odot 2}}\rangle )\), which completes the proof. \(\square \)
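
A short Monte Carlo check of the mean computed in Lemma 27 (our addition; the dimension, \(\sigma \), and the number of samples are illustrative):

```python
import numpy as np

# Check E ||z (.) w||^2 = sigma^2 (||w||^2 - <qbar^2, w^2>) empirically,
# where z ~ N(0, sigma^2 (Id - qbar qbar^T)) and (.) is the entrywise product.
rng = np.random.default_rng(2)
n, sigma, trials = 20, 1.3, 100000
qbar = rng.standard_normal(n); qbar /= np.linalg.norm(qbar)
w = rng.standard_normal(n)
P = np.eye(n) - np.outer(qbar, qbar)      # projection, so sigma*P*g has covariance sigma^2*P
g = rng.standard_normal((trials, n))
z = sigma * (g @ P)
empirical = np.mean(np.sum((z * w) ** 2, axis=1))
theory = sigma ** 2 * (np.sum(w ** 2) - np.dot(qbar ** 2, w ** 2))
print(empirical, theory)                  # should agree up to Monte Carlo error
```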

Claim 10

(folklore) We have that \(e\left( \frac{n}{e}\right) ^n \le n!\le e\left( \frac{n+1}{e}\right) ^{n+1}\). It follows that

$$\begin{aligned} e\left( \frac{2(n-1)}{e}\right) ^{n-1}\le (2n-2)!!\le (2n-1)!!\le (2n)!! = 2^n n! \le \frac{e}{2}\left( \frac{2(n+1)}{e}\right) ^{n+1}\,.\end{aligned}$$

Claim 11

For any \(\varepsilon \in (0,1)\) and \(A,B\in \mathbb {R}\), we have \((A+B)^d \le \max \{(1+\varepsilon )^d|A|^d, (1/\varepsilon +1)^d |B|^d\}\)

Claim 12

For any vectors a, b and any \(\varepsilon >0\), we have \(\Vert a+b\Vert _6^6 \le (1+\varepsilon ) \Vert a\Vert _6^6 + O(1/\varepsilon ^5) \Vert b\Vert _6^6\).

Proof

Since the inequality can be applied coordinate-wise, we only need to prove it when a, b are real numbers instead of vectors. For numbers a, b, \((a+b)^6 = a^6 +6a^5b+ 15a^4b^2 + 20a^3b^3+15a^2b^4+6ab^5+b^6\). All the intermediate terms \(a^ib^j\) (with \(i+j = 6\) and \(i,j\ge 1\)) can be bounded by \(\varepsilon /56 \cdot a^6 +O(1/\varepsilon ^5) b^6\), therefore we know \((a+b)^6 \le (1+\varepsilon ) a^6 + O(1/\varepsilon ^5)b^6\). \(\square \)

Lemma 28

For any vector \(z \in \mathbb {R}^n\), and any partition of [n] into two subsets S and L, and any \(\eta \in (0,1)\), we have that,

$$\begin{aligned} Q(z)^d (\left\Vert {{z}}\right\Vert _{6}^{6})^{-d/2} \le \frac{1}{\eta ^{d/2-1}} \left( \frac{Q(z_S)^2}{\left\Vert {{z_S}}\right\Vert _{6}^{6}} \right) ^{d/2} +\frac{1}{(1-\eta )^{d/2-1}} \left( \frac{Q(z_L)^2}{\left\Vert {{z_L}}\right\Vert _{6}^{6}}\right) ^{d/2} \,.\end{aligned}$$

Proof

By the Cauchy–Schwarz inequality, we have that

$$\begin{aligned} \left( \left\Vert {z}\right\Vert _4^4-\left\Vert {z}\right\Vert ^2\right) ^2 (\left\Vert {z}\right\Vert _6^6)^{-1} =\frac{(Q(z_S) + Q(z_L))^2}{\left\Vert {{z_S}}\right\Vert _{6}^{6}+\left\Vert {{z_L}}\right\Vert _{6}^{6}} \le \frac{Q(z_S)^2}{\left\Vert {{z_S}}\right\Vert _{6}^{6}} + \frac{Q(z_L)^2}{\left\Vert {{z_L}}\right\Vert _{6}^{6}}\,.\end{aligned}$$

Therefore, we obtain that

$$\begin{aligned} \left( \left\Vert {z}\right\Vert _4^4-\left\Vert {z}\right\Vert ^2\right) ^d (\left\Vert {z}\right\Vert _6^6)^{-d/2}&\le \left( \frac{Q(z_S)^2}{\left\Vert {{z_S}}\right\Vert _{6}^{6}} +\frac{Q(z_L)^2}{\left\Vert {{z_L}}\right\Vert _{6}^{6}}\right) ^{d/2}\\&\le \frac{1}{\eta ^{d/2-1}}\left( \frac{Q(z_S)^2}{\left\Vert {{z_S}}\right\Vert _{6}^{6}} \right) ^{d/2} +\frac{1}{(1-\eta )^{d/2-1}}\left( \frac{Q(z_L)^2}{\left\Vert {{z_L}}\right\Vert _{6}^{6}}\right) ^{d/2}\,.\\&\quad (\text {by Hölder's inequality}) \end{aligned}$$

\(\square \)

B.5 Restricted isometry property and corollaries

In this subsection we describe the Restricted Isometry Property (RIP), which was introduced in [51]. This is the crucial tool that we use in Sect. 5.

Definition 2

(RIP property) Suppose \(A \in \mathbb {R}^{d\times n}\) is a matrix whose columns have unit norm. We say that the matrix A satisfies the \((k, \delta )\)-RIP property if for any subset S of columns of size at most k, the \(d\times |S|\) matrix \(A_S\) formed by the columns in S satisfies

$$\begin{aligned} 1-\delta \le \sigma _{\min }(A_S) \le \sigma _{\max }(A_S) \le 1+\delta , \end{aligned}$$

where \(\sigma _{\min }\) and \(\sigma _{\max }\) denote the smallest and largest singular values.

In our case the random components \(a_i\) do not have unit norm, so we abuse notation and say that A is RIP if the column-normalized matrix satisfies the RIP property.

Intuitively, the RIP condition means that the norm of Ax is very close to \(\Vert x\Vert \) whenever the number of nonzero entries in x is at most k. The following theorem shows that a random Gaussian matrix is RIP with high probability:

Theorem 6

([52]) For any constant \(\delta > 0\), when \(k = d/(\Delta \log n)\) for a large enough constant \(\Delta \), with high probability a random Gaussian matrix is \((k,\delta )\)-RIP.

In our analysis, we mostly use the RIP property to show that any given vector x cannot have large correlations with many of the components \(a_i\).
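
For intuition, the following sketch (our addition, not part of the paper) estimates the restricted isometry behaviour of a column-normalized Gaussian matrix by sampling random column subsets; since it does not enumerate all subsets, it only lower-bounds the true restricted isometry constant, and the sizes are illustrative.

```python
import numpy as np

# Estimate (k, delta)-RIP behaviour of a column-normalized Gaussian matrix by
# sampling random column subsets and recording extreme singular values.
# Sampling subsets only lower-bounds the worst-case deviation.
rng = np.random.default_rng(3)
d, n, k, subsets = 200, 1000, 10, 500
A = rng.standard_normal((d, n)) / np.sqrt(d)
A /= np.linalg.norm(A, axis=0, keepdims=True)    # exact column normalization

worst = 0.0
for _ in range(subsets):
    S = rng.choice(n, size=k, replace=False)
    svals = np.linalg.svd(A[:, S], compute_uv=False)
    worst = max(worst, abs(svals[0] - 1), abs(svals[-1] - 1))
print("largest deviation of singular values from 1:", worst)
```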

B.6 Concentration inequalities

We first show that if we only look at the components with small correlations with x, the function value cannot be much larger than 3n.

Lemma 29

With high probability over the choice of the \(a_i\)'s, for any unit vector x and any threshold \(\tau > 1\),

$$\begin{aligned} \sum _{i=1}^n {\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^4 \le 3n + O((\sqrt{nd}+d\tau ^4)\log d). \end{aligned}$$

Proof

Let \(X_i = {\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^4 -{\mathbb {E}}[{\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^4]\). Notice that \({\mathbb {E}}[{\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^4] \le {\mathbb {E}}[\langle {a_i,x}\rangle ^4] = 3\). Clearly \({\mathbb {E}}[X_i] = 0\) and \(X_i \le \tau ^4\). We can apply Bernstein’s inequality, and we know for any x,

$$\begin{aligned} \Pr \left[ \sum _{i=1}^n X_i \ge t\right] \le \exp \left( -\frac{t^2}{15nd+\tau ^4t}\right) . \end{aligned}$$

When \(t \ge C((\sqrt{nd}+d\tau ^4)\log d)\) for a large enough constant C, the probability is smaller than \(\exp (-C'd\log d)\) for a large constant \(C'\), so we can take a union bound over all vectors in an \(\varepsilon \)-net with \(\varepsilon = 1/d^3\). This is true even if we set the threshold to \(2\tau \).

For any x not in the \(\varepsilon \)-net, let \(x'\) be its closest neighbor in the \(\varepsilon \)-net. When the \(a_i\)'s have norm bounded by \(O(\sqrt{d})\) (which happens with high probability), it is easy to show that whenever \(|\langle {x,a_i}\rangle |\le \tau \) we always have \(|\langle {x',a_i}\rangle |\le 2\tau \). Therefore the sum for x cannot be much larger than the sum for \(x'\). \(\square \)
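
The following Monte Carlo snippet (our addition) illustrates Lemma 29 for a single random direction x; the lemma itself is uniform over all unit vectors, which this check does not capture, and all sizes are illustrative.

```python
import numpy as np

# Evaluate the truncated fourth-moment sum from Lemma 29 for one random unit x
# and compare it with 3n (its approximate expectation for a large threshold tau).
rng = np.random.default_rng(4)
d, n, tau = 100, 10000, 4.0
a = rng.standard_normal((n, d))
x = rng.standard_normal(d); x /= np.linalg.norm(x)
c = a @ x                                                  # <a_i, x> ~ N(0, 1)
truncated_sum = np.sum(np.where(np.abs(c) <= tau, c ** 4, 0.0))
print(truncated_sum, 3 * n)   # should be close to, and not much larger than, 3n
```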

Using very similar techniques we can also show that the contribution to the gradient from these small components is small.

Lemma 30

With high probability over the choice of the \(a_i\)'s, for any unit vector x and any threshold \(\tau > 1\),

$$\begin{aligned} \left\| \sum _{i=1}^n {\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^3 a_i\right\| \le O(n + (\sqrt{nd}+d\tau ^4)\log d). \end{aligned}$$

Proof

Let \(v = \sum _{i=1}^n {\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^3 a_i\). First notice that the inner product \(\langle {v,x}\rangle \) is exactly the sum we bounded in Lemma 29. Therefore we only need to bound the norm in the direction orthogonal to x. Since the \(a_i\)'s are Gaussian, \(\langle {a_i,x}\rangle \) and \(P_x a_i\) are independent, so we can apply the vector Bernstein inequality, and we know

$$\begin{aligned} \Pr \left[ \left\| \sum _{i=1}^n {\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^3 P_x a_i \right\| \ge t\right] \le d\exp \left( -\frac{t^2}{15nd+\tau ^4t}\right) . \end{aligned}$$

By the same \(\varepsilon \)-net argument as in Lemma 29 we get the desired bound. \(\square \)

Finally, we prove a similar bound for the tensor T applied to the vector x twice.

Lemma 31

With high probability over the choice of the \(a_i\)'s, for any unit vector x and any threshold \(\tau > 1\),

$$\begin{aligned} \left\| \sum _{i=1}^n {\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^2 a_ia_i^{\top }\right\| \le O(n + (\sqrt{nd}+d\tau ^4)\log d). \end{aligned}$$

Proof

Let \(H = \sum _{i=1}^n {\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^2 a_i a_i^{\top }\). First notice that the quadratic form \(x^{\top } H x\) is exactly the sum we bounded in Lemma 29. We will bound the spectral norm in the direction orthogonal to x and use the fact that H is a PSD matrix, so \(\Vert H\Vert \le 2(x^{\top } H x + \Vert P_x H P_x\Vert )\) (this is basically the fact that \((a+b)^2 \le 2a^2+2b^2\)). Since the \(a_i\)'s are Gaussian, \(\langle {a_i,x}\rangle \) and \(P_x a_i\) are independent, so we can apply the matrix Bernstein inequality; because \({\mathbb {E}}[P_x a_i a_i^{\top } P_x] = P_x\), we know

$$\begin{aligned} \Pr \left[ \left\| \sum _{i=1}^n {\mathbf {1}}({|\langle {a_i,x}\rangle | \le \tau }) \langle {a_i,x}\rangle ^2 (P_x a_i a_i^{\top } P_x - P_x)\right\| \ge t\right] \le d\exp \left( -\frac{t^2}{15nd+\tau ^4t}\right) . \end{aligned}$$

By the same \(\varepsilon \)-net argument as in Lemma 29 we get the desired bound. \(\square \)

C Missing proofs in Section 3

Proof of Lemma 4

Let \(A = [a_1,\dots , a_n]\). First of all, by standard random matrix theory (see, e.g., [53, Section 3.3] or [54, Theorem 2.6]), we have that with high probability, \(\left\Vert {A}\right\Vert \le \sqrt{n}+1.1\sqrt{d}\). Therefore, with high probability, \(\sum \langle {a_i,x}\rangle ^2 \le \left\Vert {A}\right\Vert ^2 \le n + 3\sqrt{nd}\). By the RIP property of the random matrix A (see [55, Theorem 5.2] or Theorem 6), we have that with high probability, Eq. (3.4) holds.

Finally, using a straightforward \(\varepsilon \)-net argument and union bound, we have that for every unit vector x, \(\sum _{i=1}^{n}\langle {a_i,x}\rangle ^4\ge 15n- O(\sqrt{nd}\log d)\). Therefore, when \(1/\delta ^2 \cdot d\log ^2 d \le n\), Eq. (3.5) also holds. \(\square \)
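
As a quick numerical illustration of the first step of this proof (our addition), one can check the operator norm bound for a single draw of the Gaussian matrix; the sizes below are illustrative, and the paper's statement is with high probability rather than for every draw.

```python
import numpy as np

# Spot-check of ||A|| <= sqrt(n) + 1.1*sqrt(d) for a d x n matrix A = [a_1, ..., a_n]
# with i.i.d. standard Gaussian entries (a_i ~ N(0, Id_d)).
rng = np.random.default_rng(5)
d, n = 200, 2000
A = rng.standard_normal((d, n))
print(np.linalg.norm(A, 2), np.sqrt(n) + 1.1 * np.sqrt(d))  # the norm should fall below the bound
```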

Cite this article

Ge, R., Ma, T. On the optimization landscape of tensor decompositions. Math. Program. 193, 713–759 (2022). https://doi.org/10.1007/s10107-020-01579-x
