
The Global Optimization Geometry of Shallow Linear Neural Networks


Abstract

We examine the squared error loss landscape of shallow linear neural networks. We show—with significantly milder assumptions than previous works—that the corresponding optimization problems have benign geometric properties: there are no spurious local minima, and the Hessian at every saddle point has at least one negative eigenvalue. This means that at every saddle point there is a direction of negative curvature that algorithms can exploit to further decrease the objective value. These geometric properties imply that many local search algorithms (such as gradient descent, which is widely used for training neural networks) can provably solve the training problem with global convergence.

Notes

  1. From an optimization perspective, non-strict saddle points and local minima have similar first- and second-order information, so it is hard for first- and second-order methods (such as gradient descent) to distinguish between them.

References

  1. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM (2017)

  2. Baldi, P., Hornik, K.: Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2(1), 53–58 (1989)

  3. Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)

  4. Blum, A., Rivest, R.L.: Training a 3-node neural network is NP-complete. In: Advances in Neural Information Processing Systems (NIPS), pp. 494–501 (1989)

  5. Borgerding, M., Schniter, P., Rangan, S.: AMP-inspired deep networks for sparse linear inverse problems. IEEE Trans. Signal Process. 65(16), 4293–4308 (2017)

  6. Byrd, R.H., Gilbert, J.C., Nocedal, J.: A trust region method based on interior point techniques for nonlinear programming. Math. Program. 89(1), 149–185 (2000)

  7. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756 (2016)

  8. Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods. SIAM, Philadelphia (2000)

  9. Curtis, F.E., Robinson, D.P.: Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412 (2017)

  10. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)

  11. Du, S.S., Lee, J.D., Tian, Y.: When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129 (2017)

  12. Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points—online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)

  13. Ge, R., Jin, C., Zheng, Y.: No spurious local minima in nonconvex low rank problems: a unified geometric analysis. In: International Conference on Machine Learning, pp. 1233–1242 (2017)

  14. Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)

  15. Haeffele, B.D., Vidal, R.: Global optimality in neural network training. In: Conference on Computer Vision and Pattern Recognition, pp. 7331–7339 (2017)

  16. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)

  17. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)

  18. Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. In: International Conference on Machine Learning, pp. 1724–1732 (2017)

  19. Kamilov, U.S., Mansour, H.: Learning optimal nonlinearities for iterative thresholding algorithms. IEEE Signal Process. Lett. 23(5), 747–751 (2016)

  20. Kawaguchi, K.: Deep learning without poor local minima. In: Advances in Neural Information Processing Systems, pp. 586–594 (2016)

  21. Laurent, T., von Brecht, J.: Deep linear neural networks with arbitrary loss: all local minima are global. arXiv preprint arXiv:1712.01473 (2017)

  22. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)

  23. Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406 (2017)

  24. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: Conference on Learning Theory, pp. 1246–1257 (2016)

  25. Li, Q., Zhu, Z., Tang, G.: The non-convex geometry of low-rank matrix optimization. Inf. Inference: J. IMA 8, 51–96 (2019)

  26. Li, X., Wang, Z., Lu, J., Arora, R., Haupt, J., Liu, H., Zhao, T.: Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296 (2016)

  27. Li, X., Zhu, Z., So, A.M.C., Vidal, R.: Nonconvex robust low-rank matrix recovery. arXiv preprint arXiv:1809.09237 (2018)

  28. Li, Y., Zhang, Y., Huang, X., Ma, J.: Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval. IEEE Trans. Geosci. Remote Sens. 99, 1–16 (2018)

  29. Li, Y., Zhang, Y., Huang, X., Yuille, A.L.: Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images. ISPRS J. Photogramm. Remote Sens. 146, 182–196 (2018)

  30. Liu, H., Yue, M.C., Man-Cho So, A.: On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM J. Optim. 27(4), 2426–2446 (2017)

  31. Lu, H., Kawaguchi, K.: Depth creates no bad local minima. arXiv preprint arXiv:1702.08580 (2017)

  32. Mousavi, A., Patel, A.B., Baraniuk, R.G.: A deep learning approach to structured signal recovery. In: 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1336–1343 (2015)

  33. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)

  34. Nouiehed, M., Razaviyayn, M.: Learning deep models: critical points and local openness. arXiv preprint arXiv:1803.02968 (2018)

  35. Park, D., Kyrillidis, A., Caramanis, C., Sanghavi, S.: Non-square matrix sensing without spurious local minima via the Burer–Monteiro approach. In: Artificial Intelligence and Statistics, pp. 65–74 (2017)

  36. Qu, Q., Zhang, Y., Eldar, Y.C., Wright, J.: Convolutional phase retrieval via gradient descent. arXiv preprint arXiv:1712.00716 (2017)

  37. Safran, I., Shamir, O.: Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968 (2017)

  38. Schalkoff, R.J.: Artificial Neural Networks, vol. 1. McGraw-Hill, New York (1997)

  39. Soudry, D., Carmon, Y.: No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361 (2016)

  40. Soudry, D., Hoffer, E.: Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777 (2017)

  41. Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. 18, 1–68 (2018)

  42. Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Inf. Theory 63(2), 853–884 (2017)

  43. Tian, Y.: An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In: International Conference on Machine Learning, pp. 3404–3413 (2017)

  44. Werbos, P.: Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University (1974)

  45. Yun, C., Sra, S., Jadbabaie, A.: Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444 (2017)

  46. Zhu, Z., Li, Q., Tang, G., Wakin, M.B.: The global optimization geometry of low-rank matrix optimization. arXiv preprint arXiv:1703.01256 (2017)

  47. Zhu, Z., Li, Q., Tang, G., Wakin, M.B.: Global optimality in low-rank matrix optimization. IEEE Trans. Signal Process. 66(13), 3614–3628 (2018)

Corresponding author

Correspondence to Zhihui Zhu.

Additional information


ZZ and MBW were supported by NSF Grant CCF–1409261, NSF CAREER Grant CCF–1149225, and the DARPA Lagrange Program under ONR/SPAWAR contract N660011824020. DS was supported by the Israel Science Foundation (Grant No. 31/1031) and by the Taub Foundation.

Appendices

Proof of Lemma 1

1.1 Proof of (9)

Intuitively, the regularizer \(\rho \) in (3) forces the two factors \(\varvec{W}_2\) and \(\varvec{W}_1\varvec{X}\) to be balanced (i.e., \(\varvec{W}_2^\mathrm {T}\varvec{W}_2 = \varvec{W}_1 \varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\)). We show that, with this regularizer, any critical point of g obeys (9). To establish this, first note that any critical point \(\varvec{Z}\) of \(g(\varvec{Z})\) satisfies \(\nabla g(\varvec{Z})=\varvec{0}\), i.e.,

$$\begin{aligned} \begin{aligned}&\nabla _{\varvec{W}_1} g(\varvec{W}_1,\varvec{W}_2)= \varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\\&\quad - \mu (\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T})\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}= \mathbf{0}, \end{aligned} \end{aligned}$$
(19)

and

$$\begin{aligned} \begin{aligned}&\nabla _{\varvec{W}_2} g(\varvec{W}_1,\varvec{W}_2)= (\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\\&\quad + \mu \varvec{W}_2(\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}) = \mathbf{0}. \end{aligned} \end{aligned}$$
(20)

By (19), we obtain

$$\begin{aligned} \begin{aligned}&\varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\\&\quad = \mu (\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T})\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}. \end{aligned} \end{aligned}$$
(21)

Multiplying (20) on the left by \(\varvec{W}_2^\mathrm {T}\) and substituting the expression for \(\varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\) from (21) into the result gives

$$\begin{aligned}&(\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T})\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\\&\quad + \varvec{W}_2^\mathrm {T}\varvec{W}_2(\varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}) = \varvec{0}, \end{aligned}$$

which, after expanding the products and canceling the cross terms, is equivalent to

$$\begin{aligned} \varvec{W}_2^\mathrm {T}\varvec{W}_2\varvec{W}_2^\mathrm {T}\varvec{W}_2 = \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}. \end{aligned}$$

Note that \(\varvec{W}_2^\mathrm {T}\varvec{W}_2\) and \(\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\) are the principal square roots (i.e., PSD square roots) of \(\varvec{W}_2^\mathrm {T}\varvec{W}_2\varvec{W}_2^\mathrm {T}\varvec{W}_2\) and \(\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\), respectively. Utilizing the fact that for any \(k\ge 1\) a PSD matrix \(\varvec{A}\) has a unique PSD \(k\)th root, i.e., a unique PSD matrix \(\varvec{B}\) such that \(\varvec{B}^k = \varvec{A}\) [16, Theorem 7.2.6], we obtain

$$\begin{aligned} \varvec{W}_2^\mathrm {T}\varvec{W}_2 = \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\end{aligned}$$

for any critical point \(\varvec{Z}\).
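
To make (9) concrete, here is a minimal NumPy sketch that runs plain gradient descent on an objective of the assumed form \(g(\varvec{W}_1,\varvec{W}_2) = \frac{1}{2}\Vert \varvec{W}_2\varvec{W}_1\varvec{X}-\varvec{Y}\Vert _F^2 + \frac{\mu }{4}\Vert \varvec{W}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\Vert _F^2\), whose gradients match (19) and (20); the dimensions, step size, and iteration count are illustrative choices, not values taken from the paper. Near any critical point the balancedness gap in (9) should be numerically negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2, N, mu = 5, 3, 4, 50, 0.5           # illustrative sizes (hidden width d1)
X = rng.standard_normal((d0, N)) / np.sqrt(N)  # rescale so that X X^T is well conditioned
Y = rng.standard_normal((d2, N))
W1 = 0.5 * rng.standard_normal((d1, d0))
W2 = 0.5 * rng.standard_normal((d2, d1))

def grads(W1, W2):
    R = W2 @ W1 @ X - Y                          # residual W2 W1 X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T          # balancedness gap appearing in (9)
    g1 = W2.T @ R @ X.T - mu * D @ W1 @ X @ X.T  # gradient w.r.t. W1, cf. (19)
    g2 = R @ X.T @ W1.T + mu * W2 @ D            # gradient w.r.t. W2, cf. (20)
    return g1, g2

step = 0.01
for _ in range(100_000):
    g1, g2 = grads(W1, W2)
    W1, W2 = W1 - step * g1, W2 - step * g2

gap = np.linalg.norm(W2.T @ W2 - W1 @ X @ X.T @ W1.T)
grad = np.linalg.norm(np.concatenate([g.ravel() for g in grads(W1, W2)]))
print(f"gradient norm {grad:.2e}, balancedness gap {gap:.2e}")  # both should be small
```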

1.2 Proof of (10)

To show (10), we first plug (9) back into (19) and (20), simplifying the first-order optimality equation as

$$\begin{aligned} \begin{aligned}&\varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}= \mathbf{0},\\&(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}= \mathbf{0}. \end{aligned} \end{aligned}$$
(22)

What remains is to find all \((\varvec{W}_1,\varvec{W}_2)\) that satisfy the above equation.

Let \(\varvec{W}_2 = \varvec{L}{\varvec{\Pi }}\varvec{R}^\mathrm {T}\) be a full SVD of \(\varvec{W}_2\), where \(\varvec{L}\in {\mathbb {R}}^{d_2\times d_2}\) and \(\varvec{R}\in {\mathbb {R}}^{d_1\times d_1}\) are orthogonal matrices. Define

$$\begin{aligned} {\widetilde{\varvec{W}}}_2 = \varvec{W}_2 \varvec{R}= \varvec{L}{\varvec{\Pi }}, \ {\widetilde{\varvec{W}}}_1 = \varvec{R}^\mathrm {T}\varvec{W}_1\varvec{U}{\varvec{\varSigma }}. \end{aligned}$$
(23)

Since \(\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}= \varvec{W}_2^\mathrm {T}\varvec{W}_2 \) [see (9)], we have

$$\begin{aligned} {\widetilde{\varvec{W}}}_1{\widetilde{\varvec{W}}}_1^\mathrm {T}= {\widetilde{\varvec{W}}}_2^\mathrm {T}{\widetilde{\varvec{W}}}_2 = {\varvec{\Pi }}^\mathrm {T}{\varvec{\Pi }}. \end{aligned}$$
(24)

Noting that \({\varvec{\Pi }}^\mathrm {T}{\varvec{\Pi }}\) is a diagonal matrix with nonnegative diagonal entries, it follows that the columns of \({\widetilde{\varvec{W}}}_1^\mathrm {T}\) (equivalently, the rows of \({\widetilde{\varvec{W}}}_1\)) are mutually orthogonal, though some of them may be zero.

Due to (22), we have

$$\begin{aligned} \begin{aligned}&{\widetilde{\varvec{W}}}_2^\mathrm {T}({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - \varvec{Y}\varvec{V}){\varvec{\varSigma }}\varvec{U}^\mathrm {T}\\&\quad = \varvec{R}^\mathrm {T}(\varvec{W}_2^\mathrm {T}(\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}) = \mathbf{0},\\&({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - \varvec{Y}\varvec{V}){\varvec{\varSigma }}\varvec{U}^\mathrm {T}{\widetilde{\varvec{W}}}_1^\mathrm {T}\\&\quad = (\varvec{W}_2 \varvec{W}_1\varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\varvec{R}= \mathbf{0}, \end{aligned} \end{aligned}$$
(25)

where we utilized the reduced SVD \(\varvec{X}= \varvec{U}{\varvec{\varSigma }}\varvec{V}^\mathrm {T}\) in (8). Note that the diagonal entries of \({\varvec{\varSigma }}\) are all positive, and recall that

$$\begin{aligned} {\widetilde{\varvec{Y}}} = \varvec{Y}\varvec{V}. \end{aligned}$$

Then, (25) gives

$$\begin{aligned} \begin{aligned}&{\widetilde{\varvec{W}}}_2^\mathrm {T}({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - {\widetilde{\varvec{Y}}}) = \mathbf{0}, \\&({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - {\widetilde{\varvec{Y}}}){\widetilde{\varvec{W}}}_1^\mathrm {T}= \mathbf{0}. \end{aligned} \end{aligned}$$
(26)

We now compute all \({\widetilde{\varvec{W}}}_2\) and \({\widetilde{\varvec{W}}}_1\) satisfying (26). To that end, for an arbitrary \(i\in [d_1]\), let \(\varvec{\phi }\in {\mathbb {R}}^{d_2}\) be the \(i\)th column of \({\widetilde{\varvec{W}}}_2\) and \(\varvec{\psi }\in {\mathbb {R}}^{d_0}\) be the \(i\)th row of \({\widetilde{\varvec{W}}}_1\). Due to (24), we have

$$\begin{aligned} \Vert \varvec{\phi }\Vert _2 = \Vert \varvec{\psi }\Vert _2. \end{aligned}$$
(27)

It follows from (26) that

$$\begin{aligned} {\widetilde{\varvec{Y}}}^\mathrm {T}\varvec{\phi }= \Vert \varvec{\phi }\Vert _2^2 \varvec{\psi }, \end{aligned}$$
(28)
$$\begin{aligned} {\widetilde{\varvec{Y}}} \varvec{\psi }= \Vert \varvec{\psi }\Vert _2^2 \varvec{\phi }. \end{aligned}$$
(29)

Multiplying (28) by \({\widetilde{\varvec{Y}}}\) and plugging (29) into the resulting equation gives

$$\begin{aligned} {\widetilde{\varvec{Y}}} {\widetilde{\varvec{Y}}}^\mathrm {T}\varvec{\phi }= \Vert \varvec{\phi }\Vert _2^4\varvec{\phi }, \end{aligned}$$
(30)

where we used (27). Similarly, we have

$$\begin{aligned} {\widetilde{\varvec{Y}}}^\mathrm {T}{\widetilde{\varvec{Y}}}\varvec{\psi }= \Vert \varvec{\psi }\Vert _2^4\varvec{\psi }. \end{aligned}$$
(31)

Let \({\widetilde{\varvec{Y}}} = \varvec{P}{\varvec{\Lambda }}\varvec{Q}^\mathrm {T}= \sum _{j=1}^r\lambda _j \varvec{p}_j\varvec{q}_j^\mathrm {T}\) be the reduced SVD of \({\widetilde{\varvec{Y}}}\). It follows from (30) that \(\varvec{\phi }\) is either the zero vector (i.e., \(\varvec{\phi }= \varvec{0}\)) or proportional to a left singular vector of \({\widetilde{\varvec{Y}}}\) (i.e., \(\varvec{\phi }= \alpha \varvec{p}_j\) for some scalar \(\alpha \) and some \(j\in [r]\)). Plugging \(\varvec{\phi }= \alpha \varvec{p}_j\) into (30) gives

$$\begin{aligned} \lambda _j^2 = \alpha ^4. \end{aligned}$$

Thus, \(\varvec{\phi }= \pm \sqrt{\lambda _j}\varvec{p}_j\). If \(\varvec{\phi }= \varvec{0}\), then due to (27), we have \(\varvec{\psi }= \varvec{0}\). If \(\varvec{\phi }= \pm \sqrt{\lambda _j}\varvec{p}_j\), then plugging into (28) gives

$$\begin{aligned} \varvec{\psi }= \pm \sqrt{\lambda _j}\varvec{q}_j. \end{aligned}$$

Thus, we conclude that

$$\begin{aligned} (\varvec{\phi },\varvec{\psi })\in \left\{ \pm \sqrt{\lambda _1}(\varvec{p}_1,\varvec{q}_1),\ldots ,\pm \sqrt{\lambda _r}(\varvec{p}_r,\varvec{q}_r),(\varvec{0},\varvec{0}) \right\} , \end{aligned}$$

which together with (24) implies that any critical point \(\varvec{Z}\) belongs to (10) by absorbing the sign ± into \(\varvec{R}\).

We now prove the reverse direction. For any \(\varvec{Z}\in \mathcal {C}\), we compute the gradient of g at this point and directly verify that it satisfies (19) and (20), i.e., \(\varvec{Z}\) is a critical point of \(g(\varvec{Z})\). This completes the proof of Lemma 1.
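
As a numerical sanity check on the characterization (10), the sketch below (same assumed form of g and the same illustrative conventions as the earlier sketch) assembles a point of \(\mathcal {C}\) from a subset of singular pairs of \({\widetilde{\varvec{Y}}} = \varvec{Y}\varvec{V}\) and an arbitrary orthogonal \(\varvec{R}\), and verifies that the gradients (19) and (20) vanish there, along with the balancedness gap (9).

```python
import numpy as np

rng = np.random.default_rng(2)
d0, d1, d2, N, mu = 6, 4, 3, 30, 0.5
X = rng.standard_normal((d0, N))
Y = rng.standard_normal((d2, N))

U, sig, Vt = np.linalg.svd(X, full_matrices=False)  # reduced SVD X = U Sigma V^T (sig > 0)
Ytil = Y @ Vt.T                                     # \tilde{Y} = Y V
P, lam, Qt = np.linalg.svd(Ytil, full_matrices=False)

# build \tilde{W}_2, \tilde{W}_1 from a subset of singular pairs, padding with zero columns/rows
keep = [0, 1]
W2t = np.zeros((d2, d1))
W1t = np.zeros((d1, d0))
for i, j in enumerate(keep):
    W2t[:, i] = np.sqrt(lam[j]) * P[:, j]
    W1t[i, :] = np.sqrt(lam[j]) * Qt[j, :]

R, _ = np.linalg.qr(rng.standard_normal((d1, d1)))  # arbitrary orthogonal R
W2 = W2t @ R                                        # W_2 = \tilde{W}_2 R
W1 = R.T @ W1t @ np.diag(1 / sig) @ U.T             # W_1 = R^T \tilde{W}_1 Sigma^{-1} U^T

Res = W2 @ W1 @ X - Y
D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
g1 = W2.T @ Res @ X.T - mu * D @ W1 @ X @ X.T       # (19)
g2 = Res @ X.T @ W1.T + mu * W2 @ D                 # (20)
print(np.linalg.norm(g1), np.linalg.norm(g2), np.linalg.norm(D))  # all three ~ 0
```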

Proof of Lemma 2

Since \(\varvec{Z}\) is a global minimum of \(g(\varvec{Z})\) if and only if \({\widetilde{\varvec{Z}}}\) is a global minimum of \(\widetilde{g}({\widetilde{\varvec{Z}}})\), we know that any \(\varvec{Z}\in \mathcal {X}\) is a global minimum of \(g(\varvec{Z})\). It remains to show that any \(\varvec{Z}\in \mathcal {C}\setminus \mathcal {X}\) is a strict saddle. For this purpose, we first compute the Hessian quadratic form \(\nabla ^2 g(\varvec{Z})[{\varvec{\varDelta }},{\varvec{\varDelta }}]\) for any \({\varvec{\varDelta }}= \begin{bmatrix}{\varvec{\varDelta }}_2\\{\varvec{\varDelta }}_1^\mathrm {T}\end{bmatrix}\) (with \({\varvec{\varDelta }}_1\in {\mathbb {R}}^{d_1\times d_0},{\varvec{\varDelta }}_2\in {\mathbb {R}}^{d_2\times d_1}\)) as

$$\begin{aligned}&\nabla ^2 g(\varvec{Z})[{\varvec{\varDelta }},{\varvec{\varDelta }}]\nonumber \\&\quad = \left\| (\varvec{W}_2 {\varvec{\varDelta }}_1 + {\varvec{\varDelta }}_2\varvec{W}_1)\varvec{X}\right\| _F^2\nonumber \\&\quad \quad + 2\left\langle {\varvec{\varDelta }}_2{\varvec{\varDelta }}_1,(\varvec{W}_2 \varvec{W}_1 \varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\right\rangle \nonumber \\&\quad \quad + \mu \big (\langle \varvec{W}_2^\mathrm {T}\varvec{W}_2 {-} \varvec{W}_1 \varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T},{\varvec{\varDelta }}_2^\mathrm {T}{\varvec{\varDelta }}_2 {-} {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}\rangle + \nonumber \\&\quad \quad \, \frac{1}{2}\Vert \varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 + {\varvec{\varDelta }}_2^\mathrm {T}\varvec{W}_2 {-} \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}{-} {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\Vert _F^2 \big )\nonumber \\&\quad = \left\| (\varvec{W}_2 {\varvec{\varDelta }}_1 + {\varvec{\varDelta }}_2\varvec{W}_1)\varvec{X}\right\| _F^2\nonumber \\&\quad \quad + 2\left\langle {\varvec{\varDelta }}_2{\varvec{\varDelta }}_1,(\varvec{W}_2 \varvec{W}_1 \varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\right\rangle +\nonumber \\&\quad \quad \, \frac{\mu }{2}\Vert \varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 + {\varvec{\varDelta }}_2^\mathrm {T}\varvec{W}_2 {-} \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}{-} {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\Vert _F^2,\nonumber \\ \end{aligned}$$
(32)

where the second equality follows because any critical point \(\varvec{Z}\) satisfies (9). We continue the proof by considering two cases; in each case we provide an explicit expression for the set \(\mathcal {X}\) containing all the global minima and construct a direction of negative curvature for g at every point in \(\mathcal {C}\setminus \mathcal {X}\).
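
Before treating the two cases, the quadratic form (32) itself can be checked numerically: under the same assumed form of g as in the earlier sketches (with \(\mu /4\) on the regularizer and purely illustrative dimensions), the first expression of (32) should match a finite-difference second directional derivative at an arbitrary point, not only at critical points.

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d1, d2, N, mu = 5, 3, 4, 40, 0.5
X = rng.standard_normal((d0, N)) / np.sqrt(N)
Y = rng.standard_normal((d2, N))

def g(W1, W2):
    Res = W2 @ W1 @ X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
    return 0.5 * np.linalg.norm(Res) ** 2 + 0.25 * mu * np.linalg.norm(D) ** 2

def hess_quad(W1, W2, D1, D2):
    # first expression of (32): Hessian quadratic form at (W1, W2) along (D1, D2)
    Res = W2 @ W1 @ X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
    E = W2.T @ D2 + D2.T @ W2 - W1 @ X @ X.T @ D1.T - D1 @ X @ X.T @ W1.T
    return (np.linalg.norm((W2 @ D1 + D2 @ W1) @ X) ** 2
            + 2 * np.sum((D2 @ D1) * (Res @ X.T))
            + mu * (np.sum(D * (D2.T @ D2 - D1 @ X @ X.T @ D1.T))
                    + 0.5 * np.linalg.norm(E) ** 2))

W1, W2 = rng.standard_normal((d1, d0)), rng.standard_normal((d2, d1))
D1, D2 = rng.standard_normal((d1, d0)), rng.standard_normal((d2, d1))

t = 1e-4   # central second difference of t -> g(W + t*Delta)
fd = (g(W1 + t * D1, W2 + t * D2) - 2 * g(W1, W2) + g(W1 - t * D1, W2 - t * D2)) / t ** 2
print(hess_quad(W1, W2, D1, D2), fd)   # the two values should agree to several digits
```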

Case i: \(r\le d_1\). In this case, \(\min {\widetilde{g}}({\widetilde{\varvec{Z}}}) = 0\) and \({\widetilde{g}}({\widetilde{\varvec{Z}}})\) achieves its global minimum 0 if and only if \({\widetilde{\varvec{W}}}_2{\widetilde{\varvec{W}}}_1 = \varvec{Y}\varvec{V}\). Thus, we rewrite \(\mathcal {X}\) as

$$\begin{aligned} \begin{aligned} \mathcal {X}= \bigg \{\varvec{Z}= \begin{bmatrix}{\widetilde{\varvec{W}}}_2 \varvec{R}\\ \varvec{U}{\varvec{\varSigma }}^{-1}{\widetilde{\varvec{W}}}_1^\mathrm {T}\varvec{R}\end{bmatrix} \in \mathcal {C}: \widetilde{\varvec{W}}_2{\widetilde{\varvec{W}}}_1 = \varvec{Y}\varvec{V}\bigg \}, \end{aligned} \end{aligned}$$

which further implies that

$$\begin{aligned} \mathcal {C}\setminus \mathcal {X}= \bigg \{\varvec{Z}=&\begin{bmatrix}\widetilde{\varvec{W}}_2 \varvec{R}\\ \varvec{U}{\varvec{\varSigma }}^{-1}{\widetilde{\varvec{W}}}_1^\mathrm {T}\varvec{R}\end{bmatrix} \in \mathcal {C}:\\ {}&\varvec{Y}\varvec{V}- {\widetilde{\varvec{W}}}_2{\widetilde{\varvec{W}}}_1 =\sum _{i\in \varOmega }\lambda _i\varvec{p}_i\varvec{q}_i^\mathrm {T},\varOmega \subset [r] \bigg \}. \end{aligned}$$

Thus, for any \(\varvec{Z}\in \mathcal {C}\setminus \mathcal {X}\), the corresponding \({\widetilde{\varvec{W}}}_2{\widetilde{\varvec{W}}}_1\) is a low-rank approximation to \(\varvec{Y}\varvec{V}\).

Let \(k\in \varOmega \). We have

$$\begin{aligned} \varvec{p}_k^\mathrm {T}{\widetilde{\varvec{W}}}_2 = \varvec{0}, \ {\widetilde{\varvec{W}}}_1 \varvec{q}_k = \varvec{0}. \end{aligned}$$
(33)

In words, \(\varvec{p}_k\) and \(\varvec{q}_k\) are orthogonal to the columns of \({\widetilde{\varvec{W}}}_2\) and to the rows of \({\widetilde{\varvec{W}}}_1\), respectively. Let \(\varvec{\alpha }\in {\mathbb {R}}^{d_1}\) be an eigenvector associated with the smallest eigenvalue of \({\widetilde{\varvec{Z}}}^\mathrm {T}{\widetilde{\varvec{Z}}}\). Since \({\widetilde{\varvec{Z}}}\) is rank deficient, this smallest eigenvalue is zero, and hence

$$\begin{aligned} 0 =\varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{Z}}}^\mathrm {T}{\widetilde{\varvec{Z}}} \varvec{\alpha }= \varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_2^\mathrm {T}{\widetilde{\varvec{W}}}_2 \varvec{\alpha }+ \varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1{\widetilde{\varvec{W}}}_1^\mathrm {T}\varvec{\alpha }, \end{aligned}$$

which further implies

$$\begin{aligned} {\widetilde{\varvec{W}}}_2 \varvec{\alpha }= \varvec{0},\ {\widetilde{\varvec{W}}}_1^\mathrm {T}\varvec{\alpha }= \varvec{0}. \end{aligned}$$
(34)

With this property, we construct \({\varvec{\varDelta }}\) by setting \({\varvec{\varDelta }}_{2} = \varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\) and \({\varvec{\varDelta }}_{1} = \varvec{R}^\mathrm {T}\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}\).

Now, we show that \(\varvec{Z}\) is a strict saddle by arguing that \(g(\varvec{Z})\) has a strictly negative curvature along the constructed direction \({\varvec{\varDelta }}\), i.e., \([\nabla ^2g(\varvec{Z})]({\varvec{\varDelta }},{\varvec{\varDelta }})<0\). For this purpose, we compute the three terms in (32) as follows:

$$\begin{aligned} \left\| (\varvec{W}_2 {\varvec{\varDelta }}_1 + {\varvec{\varDelta }}_2\varvec{W}_1)\varvec{X}\right\| _F^2 = 0 \end{aligned}$$
(35)

since \(\varvec{W}_2{\varvec{\varDelta }}_1 = \varvec{W}_2\varvec{R}^\mathrm {T}\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}= {\widetilde{\varvec{W}}}_2\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}= \varvec{0}\) and \({\varvec{\varDelta }}_2\varvec{W}_1 = \varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\varvec{W}_1 = \varvec{p}_k\varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}= \varvec{0}\) by utilizing (34);

$$\begin{aligned} \Vert \varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 + {\varvec{\varDelta }}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}- {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^T\Vert _F^2 = 0 \end{aligned}$$

since it follows from (33) that \(\varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 = \varvec{R}^\mathrm {T}{\widetilde{\varvec{W}}}_2^\mathrm {T}\varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}= \varvec{0}\) and

$$\begin{aligned}&\varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}= \varvec{R}^\mathrm {T}{\widetilde{\varvec{W}}}_1{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}\varvec{U}{\varvec{\varSigma }}^2\varvec{U}^\mathrm {T}\varvec{U}{\varvec{\varSigma }}^{-1}\varvec{q}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\\&\quad = \varvec{R}^\mathrm {T}{\widetilde{\varvec{W}}}_1 \varvec{q}_k\varvec{\alpha }^\mathrm {T}\varvec{R}= \varvec{0}; \end{aligned}$$

and

$$\begin{aligned}&\left\langle {\varvec{\varDelta }}_2{\varvec{\varDelta }}_1,(\varvec{W}_2 \varvec{W}_1 \varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\right\rangle \\&\quad = \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T},({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - \varvec{Y}\varvec{V}){\varvec{\varSigma }}\varvec{U}^\mathrm {T}\right\rangle \\&\quad = \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T},{\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 \right\rangle - \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T}, \varvec{Y}\varvec{V}\right\rangle = - \lambda _k, \end{aligned}$$

where the last equality utilizes (33). Thus, we have

$$\begin{aligned} \nabla ^2 g(\varvec{Z})[{\varvec{\varDelta }},{\varvec{\varDelta }}] = -2\lambda _k \le -2\lambda _r. \end{aligned}$$

We finally obtain (14) by noting that

$$\begin{aligned} \Vert {\varvec{\varDelta }}\Vert _F^2&= \Vert {\varvec{\varDelta }}_1\Vert _F^2 + \Vert {\varvec{\varDelta }}_2\Vert _F^2 \\&= \Vert \varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\Vert _F^2 + \Vert \varvec{R}^\mathrm {T}\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}\Vert _F^2\\&= 1 + \Vert {\varvec{\varSigma }}^{-1} \varvec{q}_k\Vert _F^2 \le 1 + \Vert {\varvec{\varSigma }}^{-1} \Vert _F^2 \Vert \varvec{q}_k\Vert _F^2\\&= 1 + \Vert {\varvec{\varSigma }}^{-1} \Vert _F^2, \end{aligned}$$

where the inequality follows from the Cauchy–Schwarz inequality \(|\varvec{a}^\mathrm {T}\varvec{b}|\le \Vert \varvec{a}\Vert _2\Vert \varvec{b}\Vert _2\).
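
The Case i construction can be replayed numerically. The sketch below (same assumed g and illustrative dimensions as before, with \(r = 3 \le d_1 = 4\)) builds a critical point that omits the leading singular pair of \({\widetilde{\varvec{Y}}}\), forms \({\varvec{\varDelta }}\) exactly as above, and evaluates the quadratic form (32); the result should equal \(-2\lambda _k\) (here \(k = 1\) in the paper's indexing), which is at most \(-2\lambda _r\).

```python
import numpy as np

rng = np.random.default_rng(3)
d0, d1, d2, N, mu = 6, 4, 3, 30, 0.5     # rank(YV) = 3 <= d1 = 4: Case i
X = rng.standard_normal((d0, N))
Y = rng.standard_normal((d2, N))

U, sig, Vt = np.linalg.svd(X, full_matrices=False)
Ytil = Y @ Vt.T
P, lam, Qt = np.linalg.svd(Ytil, full_matrices=False)

k, keep = 0, [1, 2]                      # omit the leading pair -> a non-global critical point
W2t, W1t = np.zeros((d2, d1)), np.zeros((d1, d0))
for i, j in enumerate(keep):
    W2t[:, i] = np.sqrt(lam[j]) * P[:, j]
    W1t[i, :] = np.sqrt(lam[j]) * Qt[j, :]
R, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
W2 = W2t @ R
W1 = R.T @ W1t @ np.diag(1 / sig) @ U.T

# alpha: eigenvector for the smallest eigenvalue of Ztil^T Ztil (a zero eigenvalue here)
Ztil = np.vstack([W2t, W1t.T])
alpha = np.linalg.eigh(Ztil.T @ Ztil)[1][:, 0]

D2 = np.outer(P[:, k], alpha) @ R                              # Delta_2 = p_k alpha^T R
D1 = R.T @ np.outer(alpha, Qt[k, :]) @ np.diag(1 / sig) @ U.T  # Delta_1 = R^T alpha q_k^T Sigma^{-1} U^T

def hess_quad(W1, W2, D1, D2):
    # first expression of (32)
    Res = W2 @ W1 @ X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
    E = W2.T @ D2 + D2.T @ W2 - W1 @ X @ X.T @ D1.T - D1 @ X @ X.T @ W1.T
    return (np.linalg.norm((W2 @ D1 + D2 @ W1) @ X) ** 2
            + 2 * np.sum((D2 @ D1) * (Res @ X.T))
            + mu * (np.sum(D * (D2.T @ D2 - D1 @ X @ X.T @ D1.T))
                    + 0.5 * np.linalg.norm(E) ** 2))

print(hess_quad(W1, W2, D1, D2), -2 * lam[k])  # the two numbers should coincide
```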

Case ii: \(r> d_1\). In this case, minimizing \({\widetilde{g}}({\widetilde{\varvec{Z}}})\) in (12) is equivalent to finding a best rank-\(d_1\) approximation of \(\varvec{Y}\varvec{V}\). Let \(\Gamma \) denote the set of indices of the singular pairs \(\{(\varvec{p}_j,\varvec{q}_j)\}\) that are included in \({\widetilde{\varvec{Z}}}\), that is,

$$\begin{aligned} \left\{ {\widetilde{\varvec{z}}}_i,i\in [d_1]\right\} = \left\{ \varvec{0},\sqrt{\lambda _j}\begin{bmatrix}\varvec{p}_j\\\varvec{q}_j\end{bmatrix},j\in \Gamma \right\} . \end{aligned}$$

Then, for any \({\widetilde{\varvec{Z}}}\), we have

$$\begin{aligned} \varvec{Y}\varvec{V}- {\widetilde{\varvec{W}}}_2{\widetilde{\varvec{W}}}_1 = \sum _{i\notin \Gamma }\lambda _i\varvec{p}_i\varvec{q}_i^\mathrm {T}\end{aligned}$$

and

$$\begin{aligned} \widetilde{g}(\widetilde{\varvec{Z}}) = \frac{1}{2}\Vert \widetilde{\varvec{W}}_2\widetilde{\varvec{W}}_1 - \varvec{Y}\varvec{V}\Vert _F^2 = \frac{1}{2}\sum _{i\notin \Gamma }\lambda _i^2, \end{aligned}$$

which implies that \(\widetilde{\varvec{Z}}\) is a global minimum of \({\widetilde{g}}(\widetilde{\varvec{Z}})\) if and only if

$$\begin{aligned} \Vert \widetilde{\varvec{W}}_2\widetilde{\varvec{W}}_1 - \varvec{Y}\varvec{V}\Vert _F^2 = \sum _{i>d_1}\lambda _i^2. \end{aligned}$$

To simplify the following analysis, we assume \(\lambda _{d_1}> \lambda _{d_1 +1}\); the argument is similar in the case of a repeated singular value at \(\lambda _{d_1}\) (i.e., \(\lambda _{d_1}= \lambda _{d_1 +1} = \cdots \)). In this case, for any \(\varvec{Z}\in \mathcal {C}\setminus \mathcal {X}\) (which is not a global minimum), there exists \(\varOmega \subset [r]\) containing some \(k\in \varOmega \) with \(k\le d_1\) such that

$$\begin{aligned} \varvec{Y}\varvec{V}- \widetilde{\varvec{W}}_2\widetilde{\varvec{W}}_1 =\sum _{i\in \varOmega }\lambda _i\varvec{p}_i\varvec{q}_i^\mathrm {T}. \end{aligned}$$

Similar to Case i, we have

$$\begin{aligned} \varvec{p}_k^\mathrm {T}\widetilde{\varvec{W}}_2 = \varvec{0}, \ \widetilde{\varvec{W}}_1 \varvec{q}_k = \varvec{0}. \end{aligned}$$
(36)

Let \(\varvec{\alpha }\in {\mathbb {R}}^{d_1}\) be the eigenvector associated with the smallest eigenvalue of \(\widetilde{\varvec{Z}}^\mathrm {T}\widetilde{\varvec{Z}}\). By the form of \(\widetilde{\varvec{Z}}\) in (10), we have

$$\begin{aligned} \Vert \widetilde{\varvec{W}}_2 \varvec{\alpha }\Vert _2^2 = \Vert \widetilde{\varvec{W}}_1^\mathrm {T}\varvec{\alpha }\Vert _2^2 \le \lambda _{d_1+1}, \end{aligned}$$
(37)

where the inequality attains equality when \(d_1+1\in \varOmega \). As in Case i, we construct \({\varvec{\varDelta }}\) by setting \({\varvec{\varDelta }}_{2} = \varvec{p}_k\varvec{\alpha }^\mathrm {T}\varvec{R}\) and \({\varvec{\varDelta }}_{1} = \varvec{R}^\mathrm {T}\varvec{\alpha }\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T}\). We now show that \(\varvec{Z}\) is a strict saddle by verifying that g has strictly negative curvature along the constructed direction \({\varvec{\varDelta }}\), i.e., \([\nabla ^2g(\varvec{Z})]({\varvec{\varDelta }},{\varvec{\varDelta }})<0\). To this end, we compute the three terms in (32) as follows:

$$\begin{aligned}&\left\| (\varvec{W}_2 {\varvec{\varDelta }}_1 + {\varvec{\varDelta }}_2\varvec{W}_1)\varvec{X}\right\| _F^2\\&\quad = \left\| {\widetilde{\varvec{W}}}_2\varvec{\alpha }\varvec{q}_k^\mathrm {T}\varvec{V}^\mathrm {T}+ \varvec{p}_k\varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1 \varvec{V}^\mathrm {T}\right\| _F^2\\&\quad = \left\| {\widetilde{\varvec{W}}}_2\varvec{\alpha }\right\| _F^2 + \left\| \varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1 \right\| _F^2 + 2\left\langle \widetilde{\varvec{W}}_2\varvec{\alpha }\varvec{q}_k^\mathrm {T}, \varvec{p}_k\varvec{\alpha }^\mathrm {T}{\widetilde{\varvec{W}}}_1 \right\rangle \\&\quad \le 2\lambda _{d_1 +1}, \end{aligned}$$

where the last line follows from (36) and (37);

$$\begin{aligned} \Vert \varvec{W}_2^\mathrm {T}{\varvec{\varDelta }}_2 + {\varvec{\varDelta }}_2^\mathrm {T}\varvec{W}_2 - \varvec{W}_1\varvec{X}\varvec{X}^\mathrm {T}{\varvec{\varDelta }}_1^\mathrm {T}- {\varvec{\varDelta }}_1\varvec{X}\varvec{X}^\mathrm {T}\varvec{W}_1^\mathrm {T}\Vert _F^2 = 0 \end{aligned}$$

holds by the same argument as in Case i; and

$$\begin{aligned}&\left\langle {\varvec{\varDelta }}_2{\varvec{\varDelta }}_1,(\varvec{W}_2 \varvec{W}_1 \varvec{X}- \varvec{Y})\varvec{X}^\mathrm {T}\right\rangle \\&\quad = \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T}{\varvec{\varSigma }}^{-1}\varvec{U}^\mathrm {T},({\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 - \varvec{Y}\varvec{V}){\varvec{\varSigma }}\varvec{U}^\mathrm {T}\right\rangle \\&\quad = \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T},{\widetilde{\varvec{W}}}_2 {\widetilde{\varvec{W}}}_1 \right\rangle - \left\langle \varvec{p}_k\varvec{q}_k^\mathrm {T}, \varvec{Y}\varvec{V}\right\rangle \\&\quad = - \lambda _k \le -\lambda _{d_1}, \end{aligned}$$

where the last equality uses (36), and the inequality follows from the fact that \(k\le d_1\). Thus, we have

$$\begin{aligned} \nabla ^2 g(\varvec{Z})[{\varvec{\varDelta }},{\varvec{\varDelta }}] \le -2(\lambda _{d_1} - \lambda _{d_1+1}), \end{aligned}$$

completing the proof of Lemma 2.
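
Case ii can be checked in the same way. In the sketch below (same assumed g, now with \(r = 5 > d_1 = 3\)), the constructed critical point keeps the second through fourth singular pairs of \({\widetilde{\varvec{Y}}}\), so the leading pair (\(k = 1\) in the notation above) is omitted while the \((d_1+1)\)-th pair is included; along the constructed \({\varvec{\varDelta }}\), the quadratic form (32) evaluates to \(2\lambda _{d_1+1} - 2\lambda _k\), which is at most \(-2(\lambda _{d_1} - \lambda _{d_1+1})\).

```python
import numpy as np

rng = np.random.default_rng(4)
d0, d1, d2, N, mu = 6, 3, 5, 30, 0.5        # rank(YV) = 5 > d1 = 3: Case ii
X = rng.standard_normal((d0, N))
Y = rng.standard_normal((d2, N))

U, sig, Vt = np.linalg.svd(X, full_matrices=False)
Ytil = Y @ Vt.T
P, lam, Qt = np.linalg.svd(Ytil, full_matrices=False)

k, keep = 0, [1, 2, 3]                      # omit lambda_1, keep lambda_2, lambda_3, lambda_4
W2t, W1t = np.zeros((d2, d1)), np.zeros((d1, d0))
for i, j in enumerate(keep):
    W2t[:, i] = np.sqrt(lam[j]) * P[:, j]
    W1t[i, :] = np.sqrt(lam[j]) * Qt[j, :]
R, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
W2 = W2t @ R
W1 = R.T @ W1t @ np.diag(1 / sig) @ U.T

Ztil = np.vstack([W2t, W1t.T])
alpha = np.linalg.eigh(Ztil.T @ Ztil)[1][:, 0]   # smallest eigenvalue is 2*lambda_{d1+1} here

D2 = np.outer(P[:, k], alpha) @ R
D1 = R.T @ np.outer(alpha, Qt[k, :]) @ np.diag(1 / sig) @ U.T

def hess_quad(W1, W2, D1, D2):
    # first expression of (32)
    Res = W2 @ W1 @ X - Y
    D = W2.T @ W2 - W1 @ X @ X.T @ W1.T
    E = W2.T @ D2 + D2.T @ W2 - W1 @ X @ X.T @ D1.T - D1 @ X @ X.T @ W1.T
    return (np.linalg.norm((W2 @ D1 + D2 @ W1) @ X) ** 2
            + 2 * np.sum((D2 @ D1) * (Res @ X.T))
            + mu * (np.sum(D * (D2.T @ D2 - D1 @ X @ X.T @ D1.T))
                    + 0.5 * np.linalg.norm(E) ** 2))

curv = hess_quad(W1, W2, D1, D2)
# curv should match the middle value; both are at most the last value (the Case ii bound)
print(curv, 2 * (lam[d1] - lam[k]), -2 * (lam[d1 - 1] - lam[d1]))
```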

Cite this article

Zhu, Z., Soudry, D., Eldar, Y.C. et al. The Global Optimization Geometry of Shallow Linear Neural Networks. J Math Imaging Vis 62, 279–292 (2020). https://doi.org/10.1007/s10851-019-00889-w
