
The Global Optimization Geometry of Shallow Linear Neural Networks


Abstract

We examine the squared error loss landscape of shallow linear neural networks. We show—with significantly milder assumptions than previous works—that the corresponding optimization problems have benign geometric properties: there are no spurious local minima, and the Hessian at every saddle point has at least one negative eigenvalue. This means that at every saddle point there is a direction of negative curvature that algorithms can exploit to further decrease the objective value. These geometric properties imply that many local search algorithms (such as gradient descent, which is widely used for training neural networks) can provably solve the training problem with global convergence.

Notes

  1. From an optimization perspective, non-strict saddle points and local minima have similar first- and second-order information, so it is hard for first- and second-order methods (such as gradient descent) to distinguish between them.

References

  1. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM (2017)

  2. Baldi, P., Hornik, K.: Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2(1), 53–58 (1989)

  3. Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)

  4. Blum, A., Rivest, R.L.: Training a 3-node neural network is NP-complete. In: Advances in Neural Information Processing Systems (NIPS), pp. 494–501 (1989)

  5. Borgerding, M., Schniter, P., Rangan, S.: AMP-inspired deep networks for sparse linear inverse problems. IEEE Trans. Signal Process. 65(16), 4293–4308 (2017)

  6. Byrd, R.H., Gilbert, J.C., Nocedal, J.: A trust region method based on interior point techniques for nonlinear programming. Math. Program. 89(1), 149–185 (2000)

  7. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756 (2016)

  8. Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods. SIAM, Philadelphia (2000)

  9. Curtis, F.E., Robinson, D.P.: Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412 (2017)

  10. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)

  11. Du, S.S., Lee, J.D., Tian, Y.: When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129 (2017)

  12. Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points—online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)

  13. Ge, R., Jin, C., Zheng, Y.: No spurious local minima in nonconvex low rank problems: a unified geometric analysis. In: International Conference on Machine Learning, pp. 1233–1242 (2017)

  14. Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)

  15. Haeffele, B.D., Vidal, R.: Global optimality in neural network training. In: Conference on Computer Vision and Pattern Recognition, pp. 7331–7339 (2017)

  16. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)

  17. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)

  18. Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. In: International Conference on Machine Learning, pp. 1724–1732 (2017)

  19. Kamilov, U.S., Mansour, H.: Learning optimal nonlinearities for iterative thresholding algorithms. IEEE Signal Process. Lett. 23(5), 747–751 (2016)

  20. Kawaguchi, K.: Deep learning without poor local minima. In: Advances in Neural Information Processing Systems, pp. 586–594 (2016)

  21. Laurent, T., von Brecht, J.: Deep linear neural networks with arbitrary loss: all local minima are global. arXiv preprint arXiv:1712.01473 (2017)

  22. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)

  23. Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406 (2017)

  24. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: Conference on Learning Theory, pp. 1246–1257 (2016)

  25. Li, Q., Zhu, Z., Tang, G.: The non-convex geometry of low-rank matrix optimization. Inf. Inference: J. IMA 8, 51–96 (2019)

  26. Li, X., Wang, Z., Lu, J., Arora, R., Haupt, J., Liu, H., Zhao, T.: Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296 (2016)

  27. Li, X., Zhu, Z., So, A.M.C., Vidal, R.: Nonconvex robust low-rank matrix recovery. arXiv preprint arXiv:1809.09237 (2018)

  28. Li, Y., Zhang, Y., Huang, X., Ma, J.: Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval. IEEE Trans. Geosci. Remote Sens. 99, 1–16 (2018)

  29. Li, Y., Zhang, Y., Huang, X., Yuille, A.L.: Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images. ISPRS J. Photogramm. Remote Sens. 146, 182–196 (2018)

  30. Liu, H., Yue, M.C., Man-Cho So, A.: On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM J. Optim. 27(4), 2426–2446 (2017)

  31. Lu, H., Kawaguchi, K.: Depth creates no bad local minima. arXiv preprint arXiv:1702.08580 (2017)

  32. Mousavi, A., Patel, A.B., Baraniuk, R.G.: A deep learning approach to structured signal recovery. In: 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1336–1343 (2015)

  33. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)

  34. Nouiehed, M., Razaviyayn, M.: Learning deep models: critical points and local openness. arXiv preprint arXiv:1803.02968 (2018)

  35. Park, D., Kyrillidis, A., Caramanis, C., Sanghavi, S.: Non-square matrix sensing without spurious local minima via the Burer–Monteiro approach. In: Artificial Intelligence and Statistics, pp. 65–74 (2017)

  36. Qu, Q., Zhang, Y., Eldar, Y.C., Wright, J.: Convolutional phase retrieval via gradient descent. arXiv preprint arXiv:1712.00716 (2017)

  37. Safran, I., Shamir, O.: Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968 (2017)

  38. Schalkoff, R.J.: Artificial Neural Networks, vol. 1. McGraw-Hill, New York (1997)

  39. Soudry, D., Carmon, Y.: No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361 (2016)

  40. Soudry, D., Hoffer, E.: Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777 (2017)

  41. Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. 18, 1–68 (2018)

  42. Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Inf. Theory 63(2), 853–884 (2017)

  43. Tian, Y.: An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In: International Conference on Machine Learning, pp. 3404–3413 (2017)

  44. Werbos, P.: Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University (1974)

  45. Yun, C., Sra, S., Jadbabaie, A.: Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444 (2017)

  46. Zhu, Z., Li, Q., Tang, G., Wakin, M.B.: The global optimization geometry of low-rank matrix optimization. arXiv preprint arXiv:1703.01256 (2017)

  47. Zhu, Z., Li, Q., Tang, G., Wakin, M.B.: Global optimality in low-rank matrix optimization. IEEE Trans. Signal Process. 66(13), 3614–3628 (2018)

Corresponding author

Correspondence to Zhihui Zhu.

Additional information


ZZ and MBW were supported by NSF Grant CCF–1409261, NSF CAREER Grant CCF–1149225, and the DARPA Lagrange Program under ONR/SPAWAR contract N660011824020. DS was supported by the Israel Science Foundation (Grant No. 31/1031) and by the Taub Foundation.

Appendices

Proof of Lemma 1

1.1 Proof of (9)

Intuitively, the regularizer \(\rho \) in (3) forces the two factors \(\varvec{W}_2\) and \(\varvec{W}_1\varvec{X}\) to be balanced (i.e., \(\varvec{W}_2^\mathrm {T}\varvec{W}_2 = \varvec{W}_1 \varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\)). We show that, with this regularizer, any critical point of g obeys (9). To establish this, first note that any critical point \(\varvec{Z}\) of \(g(\varvec{Z})\) satisfies \(\nabla g(\varvec{Z})=\varvec{0}\), i.e.,

$$\begin{aligned} \begin{aligned}&\nabla _{\varvec{W}_1} g(\varvec{W}_1,\varvec{W}_2)= \varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\\&\quad - \mu (\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T})\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}= \mathbf{0}, \end{aligned} \end{aligned}$$
(19)

and

$$\begin{aligned} \begin{aligned}&\nabla _{\varvec{W}_2} g(\varvec{W}_1,\varvec{W}_2)= (\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\\&\quad + \mu \varvec{W}_2(\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}) = \mathbf{0}. \end{aligned} \end{aligned}$$
(20)

By (19), we obtain

$$\begin{aligned} \begin{aligned}&\varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\\&\quad = \mu (\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T})\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}. \end{aligned} \end{aligned}$$
(21)

Multiplying (20) on the left by \(\varvec{W}_2^\mathrm {T}\) and substituting the expression for \(\varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\) from (21) into the result gives

$$\begin{aligned}&(\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T})\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\\&\quad + \varvec{W}_2^\mathrm {T}\varvec{W}_2(\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}) = \varvec{0}, \end{aligned}$$

which, after expanding the products and canceling the cross terms, is equivalent to

$$\begin{aligned} \varvec{W}_2^\mathrm {T}\varvec{W}_2\varvec{W}_2^\mathrm {T}\varvec{W}_2 = \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}. \end{aligned}$$

Note that \(\varvec{W}_2^\mathrm {T}\varvec{W}_2\) and \(\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\) are the principal square roots (i.e., PSD square roots) of \(\varvec{W}_2^\mathrm {T}\varvec{W}_2\varvec{W}_2^\mathrm {T}\varvec{W}_2\) and \(\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\), respectively. Utilizing the fact that for any \(k\ge 1\) a PSD matrix \(\varvec{A}\) has a unique PSD \(k\)th root, i.e., a unique PSD matrix \(\varvec{B}\) such that \(\varvec{B}^k = \varvec{A}\) [16, Theorem 7.2.6], we obtain

$$\begin{aligned} \varvec{W}_2^\mathrm {T}\varvec{W}_2 = \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\end{aligned}$$

for any critical point \(\varvec{Z}\).
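
To make (9) concrete, here is a minimal NumPy sketch that runs plain gradient descent on an objective of the assumed form \(g(\varvec{W}_1,\varvec{W}_2) = \frac{1}{2}\Vert \varvec{W}_2\varvec{W}_1\varvec{X}-\varvec{Y}\Vert _F^2 + \frac{\mu }{4}\Vert \varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\Vert _F^2\), whose gradients match (19) and (20); the dimensions, step size, and iteration count are illustrative choices, not values taken from the paper. Near any critical point the balancedness gap in (9) should be numerically negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2, N, mu = 5, 3, 4, 50, 0.5           # illustrative sizes (hidden width d1)
X = rng.standard_normal((d0, N)) / np.sqrt(N)  # rescale so that X X^T is well conditioned
Y = rng.standard_normal((d2, N))
W1 = 0.5 * rng.standard_normal((d1, d0))
W2 = 0.5 * rng.standard_normal((d2, d1))

def grads(W1, W2):
    R = W2 @ W1 @ X - Y                          # residual W2 W1 X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T          # balancedness gap appearing in (9)
    g1 = W2.T @ R @ X.T - mu * D @ W1 @ X @ X.T  # gradient w.r.t. W1, cf. (19)
    g2 = R @ X.T @ W1.T + mu * W2 @ D            # gradient w.r.t. W2, cf. (20)
    return g1, g2

step = 0.01
for _ in range(100_000):
    g1, g2 = grads(W1, W2)
    W1, W2 = W1 - step * g1, W2 - step * g2

gap = np.linalg.norm(W2.T @ W2 - W1 @ X @ X.T @ W1.T)
grad = np.linalg.norm(np.concatenate([g.ravel() for g in grads(W1, W2)]))
print(f"gradient norm {grad:.2e}, balancedness gap {gap:.2e}")  # both should be small
```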

1.2 Proof of (10)

To show (10), we first plug (9) back into (19) and (20), simplifying the first-order optimality equation as

$$\begin{aligned} \begin{aligned}&\varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}= \mathbf{0},\\&(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}= \mathbf{0}. \end{aligned} \end{aligned}$$
(22)

What remains is to find all \((\varvec{W}_1,\varvec{W}_2)\) that satisfy the above equation.

Let \(\varvec{W}_2 = \varvec{L}{\varvec{\Pi }}\varvec{R}^\mathrm {T}\) be a full SVD of \(\varvec{W}_2\), where \(\varvec{L}\in {\mathbb {R}}^{d_2\times d_2}\) and \(\varvec{R}\in {\mathbb {R}}^{d_1\times d_1}\) are orthogonal matrices. Define

$$\begin{aligned} {\widetilde{\varvec{W}}}_2 = \varvec{W}_2 \varvec{R}= \varvec{L}{\varvec{\Pi }}, \ {\widetilde{\varvec{W}}}_1 = \varvec{R}^\mathrm {T}\varvec{W}_1\varvec{U}{\varvec{\varSigma }}. \end{aligned}$$
(23)

Since \(\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}= \varvec{W}_2^\mathrm {T}\varvec{W}_2 \) [see (9)], we have

$$\begin{aligned} {\widetilde{\varvec{W}}}_1{\widetilde{\varvec{W}}}_1^\mathrm {T}= {\widetilde{\varvec{W}}}_2^\mathrm {T}{\widetilde{\varvec{W}}}_2 = {\varvec{\Pi }}^\mathrm {T}{\varvec{\Pi }}. \end{aligned}$$
(24)

Noting that \({\varvec{\Pi }}^\mathrm {T}{\varvec{\Pi }}\) is a diagonal matrix with nonnegative diagonal entries, it follows that the columns of \({\widetilde{\varvec{W}}}_1^\mathrm {T}\) (equivalently, the rows of \({\widetilde{\varvec{W}}}_1\)) are mutually orthogonal, though some of them may be zero.

Due to (22), we have

$$\begin{aligned} \begin{aligned}&{\widetilde{\varvec{W}}}_2^\mathrm {T}({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - \varvec{Y}\varvec{V}){\varvec{\varSigma }}\varvec{U}^\mathrm {T}\\&\quad = \varvec{R}^\mathrm {T}(\varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}) = \mathbf{0},\\&({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - \varvec{Y}\varvec{V}){\varvec{\varSigma }}\varvec{U}^\mathrm {T}{\widetilde{\varvec{W}}}_1^\mathrm {T}\\&\quad = (\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\varvec{R}= \mathbf{0}, \end{aligned} \end{aligned}$$
(25)

where we utilized the reduced SVD \(\varvec{X}= \varvec{U}{\varvec{\varSigma }}\varvec{V}^\mathrm {T}\) in (8). Note that the diagonal entries of \({\varvec{\varSigma }}\) are all positive, and recall that

$$\begin{aligned} {\widetilde{\varvec{Y}}} = \varvec{Y}\varvec{V}. \end{aligned}$$

Then, (25) gives

$$\begin{aligned} \begin{aligned}&{\widetilde{\varvec{W}}}_2^\mathrm {T}({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - {\widetilde{\varvec{Y}}}) = \mathbf{0}, \\&({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - {\widetilde{\varvec{Y}}}){\widetilde{\varvec{W}}}_1^\mathrm {T}= \mathbf{0}. \end{aligned} \end{aligned}$$
(26)

We now compute all \({\widetilde{\varvec{W}}}_2\) and \({\widetilde{\varvec{W}}}_1\) satisfying (26). To that end, for an arbitrary \(i\in [d_1]\), let \(\varvec{\phi }\in {\mathbb {R}}^{d_2}\) be the \(i\)th column of \({\widetilde{\varvec{W}}}_2\) and \(\varvec{\psi }\in {\mathbb {R}}^{d_0}\) be the \(i\)th row of \({\widetilde{\varvec{W}}}_1\). Due to (24), we have

$$\begin{aligned} \Vert \varvec{\phi }\Vert _2 = \Vert \varvec{\psi }\Vert _2. \end{aligned}$$
(27)

It follows from (26) that

$$\begin{aligned} {\widetilde{\varvec{Y}}}^\mathrm {T}\varvec{\phi }= \Vert \varvec{\phi }\Vert _2^2 \varvec{\psi }, \end{aligned}$$
(28)
$$\begin{aligned} {\widetilde{\varvec{Y}}} \varvec{\psi }= \Vert \varvec{\psi }\Vert _2^2 \varvec{\phi }. \end{aligned}$$
(29)

Multiplying (28) by \({\widetilde{\varvec{Y}}}\) and plugging (29) into the resulting equation gives

$$\begin{aligned} {\widetilde{\varvec{Y}}} {\widetilde{\varvec{Y}}}^\mathrm {T}\varvec{\phi }= \Vert \varvec{\phi }\Vert _2^4\varvec{\phi }, \end{aligned}$$
(30)

where we used (27). Similarly, we have

$$\begin{aligned} {\widetilde{\varvec{Y}}}^\mathrm {T}{\widetilde{\varvec{Y}}}\varvec{\psi }= \Vert \varvec{\psi }\Vert _2^4\varvec{\psi }. \end{aligned}$$
(31)

Let \({\widetilde{\varvec{Y}}} = \varvec{P}{\varvec{\Lambda }}\varvec{Q}^\mathrm {T}= \sum _{j=1}^r\lambda _j \varvec{p}_j\varvec{q}_j^\mathrm {T}\) be the reduced SVD of \({\widetilde{\varvec{Y}}}\). It follows from (30) that \(\varvec{\phi }\) is either the zero vector (i.e., \(\varvec{\phi }= \varvec{0}\)) or proportional to a left singular vector of \({\widetilde{\varvec{Y}}}\) (i.e., \(\varvec{\phi }= \alpha \varvec{p}_j\) for some scalar \(\alpha \) and some \(j\in [r]\)). Plugging \(\varvec{\phi }= \alpha \varvec{p}_j\) into (30) gives

$$\begin{aligned} \lambda _j^2 = \alpha ^4. \end{aligned}$$

Thus, \(\varvec{\phi }= \pm \sqrt{\lambda _j}\varvec{p}_j\). If \(\varvec{\phi }= \varvec{0}\), then due to (27), we have \(\varvec{\psi }= \varvec{0}\). If \(\varvec{\phi }= \pm \sqrt{\lambda _j}\varvec{p}_j\), then plugging into (28) gives

$$\begin{aligned} \varvec{\psi }= \pm \sqrt{\lambda _j}\varvec{q}_j. \end{aligned}$$

Thus, we conclude that

$$\begin{aligned} (\varvec{\phi },\varvec{\psi })\in \left\{ \pm \sqrt{\lambda _1}(\varvec{p}_1,\varvec{q}_1),\ldots ,\pm \sqrt{\lambda _r}(\varvec{p}_r,\varvec{q}_r),(\varvec{0},\varvec{0}) \right\} , \end{aligned}$$

which together with (24) implies that any critical point \(\varvec{Z}\) belongs to (10) by absorbing the sign ± into \(\varvec{R}\).

We now prove the reverse direction. For any \(\varvec{Z}\in \mathcal {C}\), we compute the gradient of g at this point and directly verify that it satisfies (19) and (20), i.e., \(\varvec{Z}\) is a critical point of \(g(\varvec{Z})\). This completes the proof of Lemma 1.
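
As a numerical sanity check on the characterization (10), the sketch below (same assumed form of g and the same illustrative conventions as the earlier sketch) assembles a point of \(\mathcal {C}\) from a subset of singular pairs of \({\widetilde{\varvec{Y}}} = \varvec{Y}\varvec{V}\) and an arbitrary orthogonal \(\varvec{R}\), and verifies that the gradients (19) and (20) vanish there, along with the balancedness gap (9).

```python
import numpy as np

rng = np.random.default_rng(2)
d0, d1, d2, N, mu = 6, 4, 3, 30, 0.5
X = rng.standard_normal((d0, N))
Y = rng.standard_normal((d2, N))

U, sig, Vt = np.linalg.svd(X, full_matrices=False)  # reduced SVD X = U Sigma V^T (sig > 0)
Ytil = Y @ Vt.T                                     # \tilde{Y} = Y V
P, lam, Qt = np.linalg.svd(Ytil, full_matrices=False)

# build \tilde{W}_2, \tilde{W}_1 from a subset of singular pairs, padding with zero columns/rows
keep = [0, 1]
W2t = np.zeros((d2, d1))
W1t = np.zeros((d1, d0))
for i, j in enumerate(keep):
    W2t[:, i] = np.sqrt(lam[j]) * P[:, j]
    W1t[i, :] = np.sqrt(lam[j]) * Qt[j, :]

R, _ = np.linalg.qr(rng.standard_normal((d1, d1)))  # arbitrary orthogonal R
W2 = W2t @ R                                        # W_2 = \tilde{W}_2 R
W1 = R.T @ W1t @ np.diag(1 / sig) @ U.T             # W_1 = R^T \tilde{W}_1 Sigma^{-1} U^T

Res = W2 @ W1 @ X - Y
D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
g1 = W2.T @ Res @ X.T - mu * D @ W1 @ X @ X.T       # (19)
g2 = Res @ X.T @ W1.T + mu * W2 @ D                 # (20)
print(np.linalg.norm(g1), np.linalg.norm(g2), np.linalg.norm(D))  # all three ~ 0
```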

Proof of Lemma 2

Since \(\varvec{Z}\) is a global minimum of \(g(\varvec{Z})\) if and only if \({\widetilde{\varvec{Z}}}\) is a global minimum of \(\widetilde{g}({\widetilde{\varvec{Z}}})\), we know that any \(\varvec{Z}\in \mathcal {X}\) is a global minimum of \(g(\varvec{Z})\). It remains to show that any \(\varvec{Z}\in \mathcal {C}\setminus \mathcal {X}\) is a strict saddle. For this purpose, we first compute the Hessian quadratic form \(\nabla ^2 g(\varvec{Z})[{\varvec{\varDelta }},{\varvec{\varDelta }}]\) for any \({\varvec{\varDelta }}= \begin{bmatrix}{\varvec{\varDelta }}_2\\{\varvec{\varDelta }}_1^\mathrm {T}\end{bmatrix}\) (with \({\varvec{\varDelta }}_1\in {\mathbb {R}}^{d_1\times d_0},{\varvec{\varDelta }}_2\in {\mathbb {R}}^{d_2\times d_1}\)) as

$$\begin{aligned}&\nabla ^2 g(\varvec{Z})[{\varvec{\varDelta }},{\varvec{\varDelta }}]\nonumber \\&\quad = \left\| (\varvec{W}_2 {\varvec{\varDelta }}_1 + {\varvec{\varDelta }}_2\varvec{W}_1)\varvec{X}\right\| _F^2\nonumber \\&\quad \quad + 2\left\langle {\varvec{\varDelta }}_2{\varvec{\varDelta }}_1,(\varvec{W}_2 \varvec{W}_1 \varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\right\rangle \nonumber \\&\quad \quad + \mu \big (\langle \varvec{W}_2^\mathrm {T}\varvec{W}_2 {-} \varvec{W}_1 \varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T},{\varvec{\varDelta }}_2^\mathrm {T}{\varvec{\varDelta }}_2 {-} {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}\rangle + \nonumber \\&\quad \quad \, \frac{1}{2}\Vert \varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 + {\varvec{\varDelta }}_2^\mathrm {T}\varvec{W}_2 {-} \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}{-} {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\Vert _F^2 \big )\nonumber \\&\quad = \left\| (\varvec{W}_2 {\varvec{\varDelta }}_1 + {\varvec{\varDelta }}_2\varvec{W}_1)\varvec{X}\right\| _F^2\nonumber \\&\quad \quad + 2\left\langle {\varvec{\varDelta }}_2{\varvec{\varDelta }}_1,(\varvec{W}_2 \varvec{W}_1 \varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\right\rangle +\nonumber \\&\quad \quad \, \frac{\mu }{2}\Vert \varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 + {\varvec{\varDelta }}_2^\mathrm {T}\varvec{W}_2 {-} \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}{-} {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\Vert _F^2,\nonumber \\ \end{aligned}$$
(32)

where the second equality follows because any critical point \(\varvec{Z}\) satisfies (9). We continue the proof by considering two cases; in each case we provide an explicit expression for the set \(\mathcal {X}\) containing all the global minima and construct a direction of negative curvature for g at every point in \(\mathcal {C}\setminus \mathcal {X}\).
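
Before treating the two cases, the quadratic form (32) itself can be checked numerically: under the same assumed form of g as in the earlier sketches (with \(\mu /4\) on the regularizer and purely illustrative dimensions), the first expression of (32) should match a finite-difference second directional derivative at an arbitrary point, not only at critical points.

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d1, d2, N, mu = 5, 3, 4, 40, 0.5
X = rng.standard_normal((d0, N)) / np.sqrt(N)
Y = rng.standard_normal((d2, N))

def g(W1, W2):
    Res = W2 @ W1 @ X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
    return 0.5 * np.linalg.norm(Res) ** 2 + 0.25 * mu * np.linalg.norm(D) ** 2

def hess_quad(W1, W2, D1, D2):
    # first expression of (32): Hessian quadratic form at (W1, W2) along (D1, D2)
    Res = W2 @ W1 @ X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
    E = W2.T @ D2 + D2.T @ W2 - W1 @ X @ X.T @ D1.T - D1 @ X @ X.T @ W1.T
    return (np.linalg.norm((W2 @ D1 + D2 @ W1) @ X) ** 2
            + 2 * np.sum((D2 @ D1) * (Res @ X.T))
            + mu * (np.sum(D * (D2.T @ D2 - D1 @ X @ X.T @ D1.T))
                    + 0.5 * np.linalg.norm(E) ** 2))

W1, W2 = rng.standard_normal((d1, d0)), rng.standard_normal((d2, d1))
D1, D2 = rng.standard_normal((d1, d0)), rng.standard_normal((d2, d1))

t = 1e-4   # central second difference of t -> g(W + t*Delta)
fd = (g(W1 + t * D1, W2 + t * D2) - 2 * g(W1, W2) + g(W1 - t * D1, W2 - t * D2)) / t ** 2
print(hess_quad(W1, W2, D1, D2), fd)   # the two values should agree to several digits
```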

Case i: \(r\le d_1\). In this case, \(\min {\widetilde{g}}({\widetilde{\varvec{Z}}}) = 0\) and \({\widetilde{g}}({\widetilde{\varvec{Z}}})\) achieves its global minimum 0 if and only if \({\widetilde{\varvec{W}}}_2{\widetilde{\varvec{W}}}_1 = \varvec{Y}\varvec{V}\). Thus, we rewrite \(\mathcal {X}\) as

$$\begin{aligned} \begin{aligned} \mathcal {X}= \bigg \{\varvec{Z}= \begin{bmatrix}{\widetilde{\varvec{W}}}_2 \varvec{R}\\ \varvec{U}{\varvec{\varSigma }}^{-1}{\widetilde{\varvec{W}}}_1^\mathrm {T}\varvec{R}\end{bmatrix} \in \mathcal {C}: \widetilde{\varvec{W}}_2{\widetilde{\varvec{W}}}_1 = \varvec{Y}\varvec{V}\bigg \}, \end{aligned} \end{aligned}$$

which further implies that

$$\begin{aligned} \mathcal {C}\setminus \mathcal {X}= \bigg \{\varvec{Z}=&\begin{bmatrix}\widetilde{\varvec{W}}_2 \varvec{R}\\ \varvec{U}{\varvec{\varSigma }}^{-1}{\widetilde{\varvec{W}}}_1^\mathrm {T}\varvec{R}\end{bmatrix} \in \mathcal {C}:\\ {}&\varvec{Y}\varvec{V}- {\widetilde{\varvec{W}}}_2{\widetilde{\varvec{W}}}_1 =\sum _{i\in \varOmega }\lambda _i\varvec{p}_i\varvec{q}_i^\mathrm {T},\varOmega \subset [r] \bigg \}. \end{aligned}$$

Thus, for any \(\varvec{Z}\in \mathcal {C}\setminus \mathcal {X}\), the corresponding \({\widetilde{\varvec{W}}}_2{\widetilde{\varvec{W}}}_1\) is a low-rank approximation to \(\varvec{Y}\varvec{V}\).

Let \(k\in \varOmega \). We have

$$\begin{aligned} \varvec{p}_k^\mathrm {T}{\widetilde{\varvec{W}}}_2 = \varvec{0}, \ {\widetilde{\varvec{W}}}_1 \varvec{q}_k = \varvec{0}. \end{aligned}$$
(33)

In words, \(\varvec{p}_k\) and \(\varvec{q}_k\) are orthogonal to the columns of \({\widetilde{\varvec{W}}}_2\) and to the rows of \({\widetilde{\varvec{W}}}_1\), respectively. Let \(\varvec{\alpha }\in {\mathbb {R}}^{d_1}\) be an eigenvector associated with the smallest eigenvalue of \({\widetilde{\varvec{Z}}}^\mathrm {T}{\widetilde{\varvec{Z}}}\). Since \({\widetilde{\varvec{Z}}}\) is rank deficient, this smallest eigenvalue is zero, and hence

$$\begin{aligned} 0 =\varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{Z}}}^\mathrm {T}{\widetilde{\varvec{Z}}} \varvec{\alpha }= \varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_2^\mathrm {T}{\widetilde{\varvec{W}}}_2 \varvec{\alpha }+ \varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1{\widetilde{\varvec{W}}}_1^\mathrm {T}\varvec{\alpha }, \end{aligned}$$

which further implies

$$\begin{aligned} {\widetilde{\varvec{W}}}_2 \varvec{\alpha }= \varvec{0},\ {\widetilde{\varvec{W}}}_1^\mathrm {T}\varvec{\alpha }= \varvec{0}. \end{aligned}$$
(34)

With this property, we construct \({\varvec{\varDelta }}\) by setting \({\varvec{\varDelta }}_{2} = \varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\) and \({\varvec{\varDelta }}_{1} = \varvec{R}^\mathrm {T}\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}\).

Now, we show that \(\varvec{Z}\) is a strict saddle by arguing that \(g(\varvec{Z})\) has a strictly negative curvature along the constructed direction \({\varvec{\varDelta }}\), i.e., \([\nabla ^2g(\varvec{Z})]({\varvec{\varDelta }},{\varvec{\varDelta }})<0\). For this purpose, we compute the three terms in (32) as follows:

$$\begin{aligned} \left\| (\varvec{W}_2 {\varvec{\varDelta }}_1 + {\varvec{\varDelta }}_2\varvec{W}_1)\varvec{X}\right\| _F^2 = 0 \end{aligned}$$
(35)

since \(\varvec{W}_2{\varvec{\varDelta }}_1 = \varvec{W}_2\varvec{R}^\mathrm {T}\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}= {\widetilde{\varvec{W}}}_2\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}= \varvec{0}\) and \({\varvec{\varDelta }}_2\varvec{W}_1 = \varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\varvec{W}_1 = \varvec{p}_k\varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}= \varvec{0}\) by utilizing (34);

$$\begin{aligned} \Vert \varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 + {\varvec{\varDelta }}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}- {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^T\Vert _F^2 = 0 \end{aligned}$$

since it follows from (33) that \(\varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 = \varvec{R}^\mathrm {T}{\widetilde{\varvec{W}}}_2^\mathrm {T}\varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}= \varvec{0}\) and

$$\begin{aligned}&\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}= \varvec{R}^\mathrm {T}{\widetilde{\varvec{W}}}_1{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}\varvec{U}{\varvec{\varSigma }}^2\varvec{U}^\mathrm {T}\varvec{U}{\varvec{\varSigma }}^{-1}\varvec{q}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\\&\quad = \varvec{R}^\mathrm {T}{\widetilde{\varvec{W}}}_1 \varvec{q}_k\varvec{\alpha }^\mathrm {T}\varvec{R}= \varvec{0}; \end{aligned}$$

and

$$\begin{aligned}&\left\langle {\varvec{\varDelta }}_2{\varvec{\varDelta }}_1,(\varvec{W}_2 \varvec{W}_1 \varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\right\rangle \\&\quad = \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T},({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - \varvec{Y}\varvec{V}){\varvec{\varSigma }}\varvec{U}^\mathrm {T}\right\rangle \\&\quad = \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T},{\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 \right\rangle - \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T}, \varvec{Y}\varvec{V}\right\rangle = - \lambda _k, \end{aligned}$$

where the last equality utilizes (33). Thus, we have

$$\begin{aligned} \nabla ^2 g(\varvec{Z})[{\varvec{\varDelta }},{\varvec{\varDelta }}] = -2\lambda _k \le -2\lambda _r. \end{aligned}$$

We finally obtain (14) by noting that

$$\begin{aligned} \Vert {\varvec{\varDelta }}\Vert _F^2&= \Vert {\varvec{\varDelta }}_1\Vert _F^2 + \Vert {\varvec{\varDelta }}_2\Vert _F^2 \\&= \Vert \varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\Vert _F^2 + \Vert \varvec{R}^\mathrm {T}\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}\Vert _F^2\\&= 1 + \Vert {\varvec{\varSigma }}^{-1} \varvec{q}_k\Vert _F^2 \le 1 + \Vert {\varvec{\varSigma }}^{-1} \Vert _F^2 \Vert \varvec{q}_k\Vert _F^2\\&= 1 + \Vert {\varvec{\varSigma }}^{-1} \Vert _F^2, \end{aligned}$$

where the inequality follows from the Cauchy–Schwarz inequality \(|\varvec{a}^\mathrm {T}\varvec{b}|\le \Vert \varvec{a}\Vert _2\Vert \varvec{b}\Vert _2\).
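
The Case i construction can be replayed numerically. The sketch below (same assumed g and illustrative dimensions as before, with \(r = 3 \le d_1 = 4\)) builds a critical point that omits the leading singular pair of \({\widetilde{\varvec{Y}}}\), forms \({\varvec{\varDelta }}\) exactly as above, and evaluates the quadratic form (32); the result should equal \(-2\lambda _k\) (here \(k = 1\) in the paper's indexing), which is at most \(-2\lambda _r\).

```python
import numpy as np

rng = np.random.default_rng(3)
d0, d1, d2, N, mu = 6, 4, 3, 30, 0.5     # rank(YV) = 3 <= d1 = 4: Case i
X = rng.standard_normal((d0, N))
Y = rng.standard_normal((d2, N))

U, sig, Vt = np.linalg.svd(X, full_matrices=False)
Ytil = Y @ Vt.T
P, lam, Qt = np.linalg.svd(Ytil, full_matrices=False)

k, keep = 0, [1, 2]                      # omit the leading pair -> a non-global critical point
W2t, W1t = np.zeros((d2, d1)), np.zeros((d1, d0))
for i, j in enumerate(keep):
    W2t[:, i] = np.sqrt(lam[j]) * P[:, j]
    W1t[i, :] = np.sqrt(lam[j]) * Qt[j, :]
R, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
W2 = W2t @ R
W1 = R.T @ W1t @ np.diag(1 / sig) @ U.T

# alpha: eigenvector for the smallest eigenvalue of Ztil^T Ztil (a zero eigenvalue here)
Ztil = np.vstack([W2t, W1t.T])
alpha = np.linalg.eigh(Ztil.T @ Ztil)[1][:, 0]

D2 = np.outer(P[:, k], alpha) @ R                              # Delta_2 = p_k alpha^T R
D1 = R.T @ np.outer(alpha, Qt[k, :]) @ np.diag(1 / sig) @ U.T  # Delta_1 = R^T alpha q_k^T Sigma^{-1} U^T

def hess_quad(W1, W2, D1, D2):
    # first expression of (32)
    Res = W2 @ W1 @ X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
    E = W2.T @ D2 + D2.T @ W2 - W1 @ X @ X.T @ D1.T - D1 @ X @ X.T @ W1.T
    return (np.linalg.norm((W2 @ D1 + D2 @ W1) @ X) ** 2
            + 2 * np.sum((D2 @ D1) * (Res @ X.T))
            + mu * (np.sum(D * (D2.T @ D2 - D1 @ X @ X.T @ D1.T))
                    + 0.5 * np.linalg.norm(E) ** 2))

print(hess_quad(W1, W2, D1, D2), -2 * lam[k])  # the two numbers should coincide
```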

Case ii: \(r> d_1\). In this case, minimizing \({\widetilde{g}}({\widetilde{\varvec{Z}}})\) in (12) is equivalent to finding a best rank-\(d_1\) approximation of \(\varvec{Y}\varvec{V}\). Let \(\Gamma \) denote the set of indices of the singular pairs \(\{(\varvec{p}_j,\varvec{q}_j)\}\) that are included in \({\widetilde{\varvec{Z}}}\), that is,

$$\begin{aligned} \left\{ {\widetilde{\varvec{z}}}_i,i\in [d_1]\right\} = \left\{ \varvec{0},\sqrt{\lambda _j}\begin{bmatrix}\varvec{p}_j\\\varvec{q}_j\end{bmatrix},j\in \Gamma \right\} . \end{aligned}$$

Then, for any \({\widetilde{\varvec{Z}}}\), we have

$$\begin{aligned} \varvec{Y}\varvec{V}- {\widetilde{\varvec{W}}}_2{\widetilde{\varvec{W}}}_1 = \sum _{i\notin \Gamma }\lambda _i\varvec{p}_i\varvec{q}_i^\mathrm {T}\end{aligned}$$

and

$$\begin{aligned} \widetilde{g}(\widetilde{\varvec{Z}}) = \frac{1}{2}\Vert \widetilde{\varvec{W}}_2\widetilde{\varvec{W}}_1 - \varvec{Y}\varvec{V}\Vert _F^2 = \frac{1}{2}\sum _{i\notin \Gamma }\lambda _i^2, \end{aligned}$$

which implies that \(\widetilde{\varvec{Z}}\) is a global minimum of \({\widetilde{g}}(\widetilde{\varvec{Z}})\) if and only if

$$\begin{aligned} \Vert \widetilde{\varvec{W}}_2\widetilde{\varvec{W}}_1 - \varvec{Y}\varvec{V}\Vert _F^2 = \sum _{i>d_1}\lambda _i^2. \end{aligned}$$

To simplify the following analysis, we assume \(\lambda _{d_1}> \lambda _{d_1 +1}\); the argument is similar in the case of a repeated singular value at \(\lambda _{d_1}\) (i.e., \(\lambda _{d_1}= \lambda _{d_1 +1} = \cdots \)). In this case, for any \(\varvec{Z}\in \mathcal {C}\setminus \mathcal {X}\) (which is not a global minimum), there exists \(\varOmega \subset [r]\) containing some \(k\in \varOmega \) with \(k\le d_1\) such that

$$\begin{aligned} \varvec{Y}\varvec{V}- \widetilde{\varvec{W}}_2\widetilde{\varvec{W}}_1 =\sum _{i\in \varOmega }\lambda _i\varvec{p}_i\varvec{q}_i^\mathrm {T}. \end{aligned}$$

Similar to Case i, we have

$$\begin{aligned} \varvec{p}_k^\mathrm {T}\widetilde{\varvec{W}}_2 = \varvec{0}, \ \widetilde{\varvec{W}}_1 \varvec{q}_k = \varvec{0}. \end{aligned}$$
(36)

Let \(\varvec{\alpha }\in {\mathbb {R}}^{d_1}\) be the eigenvector associated with the smallest eigenvalue of \(\widetilde{\varvec{Z}}^\mathrm {T}\widetilde{\varvec{Z}}\). By the form of \(\widetilde{\varvec{Z}}\) in (10), we have

$$\begin{aligned} \Vert \widetilde{\varvec{W}}_2 \varvec{\alpha }\Vert _2^2 = \Vert \widetilde{\varvec{W}}_1^\mathrm {T}\varvec{\alpha }\Vert _2^2 \le \lambda _{d_1+1}, \end{aligned}$$
(37)

where the inequality attains equality when \(d_1+1\in \varOmega \). As in Case i, we construct \({\varvec{\varDelta }}\) by setting \({\varvec{\varDelta }}_{2} = \varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\) and \({\varvec{\varDelta }}_{1} = \varvec{R}^\mathrm {T}\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}\). We now show that \(\varvec{Z}\) is a strict saddle by verifying that g has strictly negative curvature along the constructed direction \({\varvec{\varDelta }}\), i.e., \([\nabla ^2g(\varvec{Z})]({\varvec{\varDelta }},{\varvec{\varDelta }})<0\). To this end, we compute the three terms in (32) as follows:

$$\begin{aligned}&\left\| (\varvec{W}_2 {\varvec{\varDelta }}_1 + {\varvec{\varDelta }}_2\varvec{W}_1)\varvec{X}\right\| _F^2\\&\quad = \left\| {\widetilde{\varvec{W}}}_2\varvec{\alpha }\varvec{q}_k^\mathrm {T}\varvec{V}^\mathrm {T}+ \varvec{p}_k\varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1 \varvec{V}^\mathrm {T}\right\| _F^2\\&\quad = \left\| {\widetilde{\varvec{W}}}_2\varvec{\alpha }\right\| _F^2 + \left\| \varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1 \right\| _F^2 + 2\left\langle \widetilde{\varvec{W}}_2\varvec{\alpha }\varvec{q}_k^\mathrm {T}, \varvec{p}_k\varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1 \right\rangle \\&\quad \le 2\lambda _{d_1 +1}, \end{aligned}$$

where the last line follows from (36) and (37);

$$\begin{aligned} \Vert \varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 + {\varvec{\varDelta }}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}- {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\Vert _F^2 = 0 \end{aligned}$$

holds by the same argument as in Case i; and

$$\begin{aligned}&\left\langle {\varvec{\varDelta }}_2{\varvec{\varDelta }}_1,(\varvec{W}_2 \varvec{W}_1 \varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\right\rangle \\&\quad = \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T},({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - \varvec{Y}\varvec{V}){\varvec{\varSigma }}\varvec{U}^\mathrm {T}\right\rangle \\&\quad = \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T},{\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 \right\rangle - \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T}, \varvec{Y}\varvec{V}\right\rangle \\&\quad = - \lambda _k \le -\lambda _{d_1}, \end{aligned}$$

where the last equality uses (36), and the inequality follows from the fact that \(k\le d_1\). Thus, we have

$$\begin{aligned} \nabla ^2 g(\varvec{Z})[{\varvec{\varDelta }},{\varvec{\varDelta }}] \le -2(\lambda _{d_1} - \lambda _{d_1+1}), \end{aligned}$$

completing the proof of Lemma 2.
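
Case ii can be checked in the same way. In the sketch below (same assumed g, now with \(r = 5 > d_1 = 3\)), the constructed critical point keeps the second through fourth singular pairs of \({\widetilde{\varvec{Y}}}\), so the leading pair (\(k = 1\) in the notation above) is omitted while the \((d_1+1)\)-th pair is included; along the constructed \({\varvec{\varDelta }}\), the quadratic form (32) evaluates to \(2\lambda _{d_1+1} - 2\lambda _k\), which is at most \(-2(\lambda _{d_1} - \lambda _{d_1+1})\).

```python
import numpy as np

rng = np.random.default_rng(4)
d0, d1, d2, N, mu = 6, 3, 5, 30, 0.5        # rank(YV) = 5 > d1 = 3: Case ii
X = rng.standard_normal((d0, N))
Y = rng.standard_normal((d2, N))

U, sig, Vt = np.linalg.svd(X, full_matrices=False)
Ytil = Y @ Vt.T
P, lam, Qt = np.linalg.svd(Ytil, full_matrices=False)

k, keep = 0, [1, 2, 3]                      # omit lambda_1, keep lambda_2, lambda_3, lambda_4
W2t, W1t = np.zeros((d2, d1)), np.zeros((d1, d0))
for i, j in enumerate(keep):
    W2t[:, i] = np.sqrt(lam[j]) * P[:, j]
    W1t[i, :] = np.sqrt(lam[j]) * Qt[j, :]
R, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
W2 = W2t @ R
W1 = R.T @ W1t @ np.diag(1 / sig) @ U.T

Ztil = np.vstack([W2t, W1t.T])
alpha = np.linalg.eigh(Ztil.T @ Ztil)[1][:, 0]   # smallest eigenvalue is 2*lambda_{d1+1} here

D2 = np.outer(P[:, k], alpha) @ R
D1 = R.T @ np.outer(alpha, Qt[k, :]) @ np.diag(1 / sig) @ U.T

def hess_quad(W1, W2, D1, D2):
    # first expression of (32)
    Res = W2 @ W1 @ X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
    E = W2.T @ D2 + D2.T @ W2 - W1 @ X @ X.T @ D1.T - D1 @ X @ X.T @ W1.T
    return (np.linalg.norm((W2 @ D1 + D2 @ W1) @ X) ** 2
            + 2 * np.sum((D2 @ D1) * (Res @ X.T))
            + mu * (np.sum(D * (D2.T @ D2 - D1 @ X @ X.T @ D1.T))
                    + 0.5 * np.linalg.norm(E) ** 2))

curv = hess_quad(W1, W2, D1, D2)
# curv should match the middle value; both are at most the last value (the Case ii bound)
print(curv, 2 * (lam[d1] - lam[k]), -2 * (lam[d1 - 1] - lam[d1]))
```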

Cite this article

Zhu, Z., Soudry, D., Eldar, Y.C. et al. The Global Optimization Geometry of Shallow Linear Neural Networks. J Math Imaging Vis 62, 279–292 (2020). https://doi.org/10.1007/s10851-019-00889-w
