
Proximal Methods Avoid Active Strict Saddles of Weakly Convex Functions

Foundations of Computational Mathematics

Abstract

We introduce a geometrically transparent strict saddle property for nonsmooth functions. This property guarantees that simple proximal algorithms on weakly convex problems, when randomly initialized, converge only to local minimizers. We argue that the strict saddle property may be a realistic assumption in applications, since it provably holds for generic semi-algebraic optimization problems.


Notes

  1. This work appeared concurrently with our manuscript.

  2. Weakly convex functions also go by other names such as lower-\(C^2\), uniformly prox-regular, paraconvex, and semiconvex. We refer the reader to the seminal works on the topic [2, 50, 53, 56, 58].

  3. A function is called semi-algebraic if its graph decomposes into a finite union of sets, each defined by finitely many polynomial inequalities.

  4. Perhaps more appropriate would be the terms active strict saddle and the active strict saddle property. For brevity, we omit the word “active.”

  5. Weak convexity is not essential here, provided one modifies the definitions appropriately. Moreover, this guarantee holds more generally for functions definable in an o-minimal structure.

  6. The domain of \(d^2 f_{\mathcal {M}}(\bar{y})(u|\cdot )\) consists of w satisfying \((\langle \nabla ^2 G_1(\bar{y})u,u\rangle ,\ldots , \langle \nabla ^2 G_{n-r}(\bar{y})u,u\rangle )=-\nabla G(\bar{y})w\), where \(G_i\) are the coordinate functions of G.

  7. What we call an active manifold here is called an identifiable manifold in [19]—the reference we most closely follow. The term active is more evocative in the context of the current work.

  8. Note that due to the convention \(\inf _{\emptyset }=+\infty \), the entire space \(\mathcal {M}=\mathbb {R}^d\) is the active manifold for a globally \(C^p\)-smooth function f around any of its critical points.

  9. Better terminology would be the terms active strict saddle and the active strict saddle property. To streamline the notation, we omit the word active, as it should be clearly understood from context.

  10. A function is semi-algebraic if its graph can be written as a finite union of sets each cut out by finitely many polynomial inequalities.

  11. For example, let F be a \(C^2\) function defined on a neighborhood U of \(\bar{x}\) that agrees with f on \(U\cap \mathcal {M}\). Using a partition of unity (e.g., [36, Lemma 2.26]), one can extend F from a slightly smaller neighborhood to be \(C^2\) on all of \(\mathbb {R}^d\).

  12. We should note that metric regularity of F at \((\bar{x},\bar{v})\) is equivalent to (A.1) for an arbitrary set-valued map F with closed graph, provided we interpret \(N_{\mathrm{gph}\,F}(\bar{x},\bar{v})\) as the limiting normal cone [57, Definition 6.3].

References

  1. F. Al-Khayyal and J. Kyparisis. Finite convergence of algorithms for nonlinear programs and variational inequalities. J. Optim. Theory Appl., 70(2):319–332, 1991.

  2. P. Albano and P. Cannarsa. Singularities of semiconcave functions in Banach spaces. In Stochastic analysis, control, optimization and applications, Systems Control Found. Appl., pages 171–190. Birkhäuser Boston, Boston, MA, 1999.

  3. H. Attouch, J. Bolte, and B.F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.

  4. D. Avdiukhin, C. Jin, and G. Yaroslavtsev. Escaping saddle points with inequality constraints via noisy sticky projected gradient descent. Optimization for Machine Learning Workshop, 2019.

  5. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.

  6. S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

  7. J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.

  8. J.F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer, New York, 2000.

  9. J.V. Burke. Descent methods for composite nondifferentiable optimization problems. Math. Programming, 33(3):260–279, 1985.

  10. J.V. Burke. On the identification of active constraints. II. The nonconvex case. SIAM J. Numer. Anal., 27(4):1081–1103, 1990.

  11. J.V. Burke and J.J. Moré. On the identification of active constraints. SIAM J. Numer. Anal., 25(5):1197–1211, 1988.

  12. P.H. Calamai and J.J. Moré. Projected gradient methods for linearly constrained problems. Math. Prog., 39(1):93–116, 1987.

  13. V. Charisopoulos, Y. Chen, D. Davis, M. Díaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. Foundations of Computational Mathematics, pages 1–89, 2021.

  14. F.H. Clarke, Yu.S. Ledyaev, R.J. Stern, and P.R. Wolenski. Nonsmooth Analysis and Control Theory. Texts in Math. 178, Springer, New York, 1998.

  15. C. Criscitiello and N. Boumal. Efficiently escaping saddle points on manifolds. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

  16. D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

  17. D. Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News, 26(2), 2018.

  18. D. Drusvyatskiy, A.D. Ioffe, and A.S. Lewis. Generic minimizing behavior in semialgebraic optimization. SIAM Journal on Optimization, 26(1):513–534, 2016.

  19. D. Drusvyatskiy and A.S. Lewis. Optimality, identifiability, and sensitivity. Math. Program., 147(1-2, Ser. A):467–498, 2014.

  20. D. Drusvyatskiy and A.S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.

  21. D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1-2):503–558, 2019.

  22. S.S. Du, C. Jin, J.D. Lee, M.I. Jordan, A. Singh, and B. Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in neural information processing systems, pages 1067–1077, 2017.

  23. J.C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.

  24. J.C. Dunn. On the convergence of projected gradient processes to singular critical points. J. Optim. Theory Appl., 55(2):203–216, 1987.

  25. M.C. Ferris. Finite termination of the proximal point algorithm. Math. Program., 50(3, (Ser. A)):359–366, 1991.

  26. S.D. Flåm. On finite convergence and constraint identification of subgradient projection methods. Math. Program., 57:427–437, 1992.

  27. R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

  28. R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1233–1242. JMLR.org, 2017.

  29. R. Ge, J.D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

  30. N. Hallak and M. Teboulle. Finding second-order stationary points in constrained minimization: A feasible direction approach. Journal of Optimization Theory and Applications, 186(2):480–503, 2020.

  31. W.L. Hare and A.S. Lewis. Identifying active manifolds. Algorithmic Oper. Res., 2(2):75–82, 2007.

  32. C. Jin, P. Netrapalli, and M. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pages 4880–4889. PMLR, 2020.

  33. C. Jin, R. Ge, P. Netrapalli, S.M. Kakade, and M.I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724–1732. JMLR.org, 2017.

  34. J.D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M.I. Jordan, and B. Recht. First-order methods almost always avoid strict saddle points. Math. Program., 176(1-2):311–337, 2019.

  35. J.D. Lee, M. Simchowitz, M.I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.

  36. J.M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer, 2013.

  37. S. Lee and S.J. Wright. Manifold identification in dual averaging for regularized stochastic online learning. Journal of Machine Learning Research, 13(Jun):1705–1744, 2012.

  38. C. Lemaréchal, F. Oustry, and C. Sagastizábal. The U-Lagrangian of a convex function. Trans. Amer. Math. Soc., 352:711–729, 1996.

  39. A.S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725, 2002.

  40. A.S. Lewis and S.J. Wright. A proximal method for composite minimization. Math. Program., pages 1–46, 2015.

  41. A.S. Lewis and S. Zhang. Partial smoothness, tilt stability, and generalized Hessians. SIAM Journal on Optimization, 23(1):74–94, 2013.

  42. B. Martinet. Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informat. Rech. Opérationnelle, 4(Sér. R-3):154–158, 1970.

  43. B. Martinet. Détermination approchée d’un point fixe d’une application pseudo-contractante. Cas de l’application prox. C. R. Acad. Sci. Paris Sér. A-B, 274:A163–A165, 1972.

  44. A. Mokhtari, A. Ozdaglar, and A. Jadbabaie. Escaping saddle points in constrained optimization. In Advances in Neural Information Processing Systems, pages 3629–3639, 2018.

  45. B.S. Mordukhovich. Variational analysis and generalized differentiation. I, volume 330 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2006. Basic theory.

  46. J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.

  47. Yu. Nesterov. Modified Gauss–Newton scheme with worst case guarantees for global performance. Optimisation Methods and Software, 22(3):469–483, 2007.

  48. Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

  49. M. Nouiehed, J.D. Lee, and M. Razaviyayn. Convergence to second-order stationarity for constrained non-convex optimization. arXiv preprint arXiv:1810.02024, 2018.

  50. E.A. Nurminskii. The quasigradient method for the solving of the nonlinear programming problems. Cybernetics, 9(1):145–150, 1973.

  51. I. Panageas and G. Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.

  52. E. Pauwels. The value function approach to convergence analysis in composite optimization. Operations Research Letters, 44(6):790–795, 2016.

  53. R.A. Poliquin and R.T. Rockafellar. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc., 348:1805–1838, 1996.

  54. R.T. Rockafellar. Convex analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.

  55. R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optimization, 14(5):877–898, 1976.

  56. R.T. Rockafellar. Favorable classes of Lipschitz-continuous functions in subgradient optimization. In Progress in nondifferentiable optimization, volume 8 of IIASA Collaborative Proc. Ser. CP-82, pages 125–143. Int. Inst. Appl. Sys. Anal., Laxenburg, 1982.

  57. R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol 317, Springer, Berlin, 1998.

  58. S. Rolewicz. On paraconvex multifunctions. In Third Symposium on Operations Research (Univ. Mannheim, Mannheim, 1978), Section I, volume 31 of Operations Res. Verfahren, pages 539–546. Hain, Königstein/Ts., 1979.

  59. A. Shapiro. Second order sensitivity analysis and asymptotic theory of parametrized nonlinear programs. Mathematical Programming, 33(3):280–299, 1985.

  60. M. Shub. Global stability of dynamical systems. Springer Science & Business Media, 2013.

  61. J. Sun, Q. Qu, and J. Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.

  62. J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.

  63. Y. Sun, N. Flammarion, and M. Fazel. Escaping from saddle points on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

  64. S.J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31:1063–1079, 1993.

Acknowledgements

We thank John Duchi for his insightful comments on an early version of the manuscript. We also thank the anonymous referees for numerous suggestions that have improved the readability of the paper.

Author information

Corresponding author

Correspondence to Dmitriy Drusvyatskiy.

Additional information

Communicated by Michael Overton.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

D. Drusvyatskiy: Research of Drusvyatskiy was supported by the NSF DMS 1651851 and CCF 1740551 awards.

Appendices

Proofs of Theorems 2.9 and 5.2

In this section, we prove Theorems 2.9 and 5.2. We should note that Theorem 2.9, appropriately restated, holds well beyond the weakly convex function class. To simplify the notational overhead, however, we impose the weak convexity assumption throughout.

We will require some basic notation from variational analysis; for details, we refer the reader to [57]. A set-valued map \(F:\mathbb {R}^d\rightrightarrows \mathbb {R}^m\) assigns to each point \(x\in \mathbb {R}^d\) a set F(x) in \(\mathbb {R}^m\). The graph of F is defined by

$$\begin{aligned} \mathrm{gph}\,F:=\{(x,v):v\in F(x)\}.\end{aligned}$$

A map \(F:\mathbb {R}^d\rightrightarrows \mathbb {R}^m\) is called metrically regular at \((\bar{x},\bar{v})\in \mathrm{gph}\,F\) if there exists a constant \(\kappa >0\) such that the estimate holds:

$$\begin{aligned} \mathrm{dist}(x,F^{-1}(v))\le \kappa \mathrm{dist}(v,F(x))\end{aligned}$$

for all x near \(\bar{x}\) and all v near \(\bar{v}\). If the graph \(\mathrm{gph}\,F\) is a \(C^1\)-smooth manifold around \((\bar{x},\bar{v})\), then metric regularity at \((\bar{x},\bar{v})\) is equivalent to the condition [57, Theorem 9.43(d)] (Footnote 12):

$$\begin{aligned} (0,u)\in N_{\mathrm{gph}\,F}(\bar{x},\bar{v})\quad \Longrightarrow \quad u=0. \end{aligned}$$
(A.1)
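For orientation (this worked example is ours and not part of the argument), consider the special case of a single-valued \(C^1\) map \(F:\mathbb {R}^d\rightarrow \mathbb {R}^m\). Its graph is a \(C^1\)-smooth manifold whose tangent space at \((\bar{x},\bar{v})\) is \(\{(w,\nabla F(\bar{x})w):w\in \mathbb {R}^d\}\), and therefore

$$\begin{aligned} N_{\mathrm{gph}\,F}(\bar{x},\bar{v})=\{(-\nabla F(\bar{x})^* u, u):u\in \mathbb {R}^m\}. \end{aligned}$$

In this case, (A.1) asks that \(\nabla F(\bar{x})^* u=0\) force \(u=0\), that is, that \(\nabla F(\bar{x})\) be surjective, which is the classical criterion for metric regularity of smooth maps.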

We begin with the following lemma.

Lemma A.1

(Subdifferential metric regularity in smooth minimization). Consider the optimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^d} f(x)\quad \text {subject to}\quad x\in \mathcal {M},\end{aligned}$$

where \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) is a \(C^2\)-smooth function and \(\mathcal {M}\) is a \(C^2\)-smooth manifold. Let \(\bar{x}\in \mathcal {M}\) satisfy the criticality condition \(0\in \partial f_{\mathcal {M}}(\bar{x})\) and suppose that the subdifferential map \(\partial f_{\mathcal {M}}:\mathbb {R}^d\rightrightarrows \mathbb {R}^d\) is metrically regular at \((\bar{x},0)\). Then, the guarantee holds:

$$\begin{aligned} \inf _{u\in \mathbb {S}^{d-1}\cap T_{\mathcal {M}}(\bar{x})} d^2 f_{\mathcal {M}}(\bar{x})(u)\ne 0. \end{aligned}$$
(A.2)

Proof

First, appealing to (A.1), we conclude that the implication holds:

$$\begin{aligned} (0,u)\in N_{\mathrm{gph}\,\partial f_{\mathcal {M}}}(\bar{x},0)\quad \Longrightarrow \quad u=0. \end{aligned}$$
(A.3)

Let us now interpret the condition (A.3) in Lagrangian terms. To this end, let \(G=0\) be the local defining equations for \(\mathcal {M}\) around \(\bar{x}\). Define the Lagrangian function

$$\begin{aligned} \mathcal {L}(x,\lambda )=f(x)+\langle G(x),\lambda \rangle ,\end{aligned}$$

and let \(\bar{\lambda }\) be the unique Lagrange multiplier vector satisfying \(\nabla _x \mathcal {L}(\bar{x},\bar{\lambda })=0\). According to [41, Corollary 2.9], we have the following expression:

$$\begin{aligned} (0,u)\in N_{\mathrm{gph}\,\partial f_{\mathcal {M}}}(\bar{x},0)\quad \Longleftrightarrow \quad u\in T_{\mathcal {M}}(\bar{x})\quad \text {and}\quad L u \in N_{\mathcal {M}}(\bar{x}), \end{aligned}$$
(A.4)

where \(L:=\nabla ^2_{xx}\mathcal {L}(\bar{x},\bar{\lambda })\) denotes the Hessian of the Lagrangian. Combining (A.3) and (A.4), we deduce that the only vector \(u\in T_{\mathcal {M}}(\bar{x})\) satisfying \(L u\in N_{\mathcal {M}}(\bar{x})\) is the zero vector \(u=0\).

Now, for the sake of contradiction, suppose that (A.2) fails. Then the quadratic form \(Q(u)=\langle L u,u\rangle \) is nonnegative on \(T_{\mathcal {M}}(\bar{x})\) and there exists \(0\ne \bar{u}\in T_{\mathcal {M}}(\bar{x})\) satisfying \(Q(\bar{u})=0\). Consequently, \(\bar{u}\) minimizes \(Q(\cdot )\) over the subspace \(T_{\mathcal {M}}(\bar{x})\); first-order optimality then forces \(\nabla Q(\bar{u})=2L\bar{u}\) to be orthogonal to \(T_{\mathcal {M}}(\bar{x})\), that is, \(L\bar{u}\in N_{\mathcal {M}}(\bar{x})\), a clear contradiction. \(\square \)

The following corollary for active manifolds will now quickly follow.

Corollary A.2

(Subdifferential metric regularity and active manifolds). Consider a closed and weakly convex function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\cup \{\infty \}\). Suppose that f admits a \(C^2\)-smooth active manifold around a critical point \(\bar{x}\) and that the subdifferential map \(\partial f:\mathbb {R}^d\rightrightarrows \mathbb {R}^d\) is metrically regular at \((\bar{x},0)\). Then, \(\bar{x}\) is either a strong local minimizer of f or satisfies the curvature condition \(d^2 f_{\mathcal {M}}(\bar{x})(u)<0\) for some \(u\in T_{\mathcal {M}}(\bar{x})\).

Proof

The result [19, Proposition 10.2] implies that \(\mathrm{gph}\,\partial f\) coincides with \(\mathrm{gph}\,\partial f_{\mathcal {M}}\) on a neighborhood of \((\bar{x},0)\). Therefore, the subdifferential map \(\partial f_{\mathcal {M}}:\mathbb {R}^d\rightrightarrows \mathbb {R}^d\) is metrically regular at \((\bar{x},0)\). Using Lemma A.1, we obtain the guarantee:

$$\begin{aligned} \inf _{u\in \mathbb {S}^{d-1}\cap T_{\mathcal {M}}(\bar{x})} d^2 f_{\mathcal {M}}(\bar{x})(u)\ne 0. \end{aligned}$$

If the infimum is strictly negative, the proof is complete. Otherwise, the infimum is strictly positive. In this case, \(\bar{x}\) is a strong local minimizer of \(f_{\mathcal {M}}\), and therefore by [19, Proposition 7.2] a strong local minimizer of f. \(\square \)
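To make the dichotomy concrete, consider the illustrative example (ours, not taken from the main text) \(f(x,y)=|x|-y^2\) on \(\mathbb {R}^2\), which is weakly convex and has the origin as a critical point. The manifold \(\mathcal {M}=\{0\}\times \mathbb {R}\) is an active manifold around the origin, since any sequence with subgradients tending to zero must eventually have vanishing first coordinate, and one can check that \(\partial f\) is metrically regular at \(((0,0),0)\). Restricting to the manifold gives

$$\begin{aligned} f_{\mathcal {M}}(0,y)=-y^2, \qquad d^2 f_{\mathcal {M}}(0,0)(u)=-2<0 \quad \text {for } u=(0,1)\in T_{\mathcal {M}}(0,0), \end{aligned}$$

so the origin falls into the second alternative: it is not a local minimizer, but it exhibits strictly negative curvature along the active manifold.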

We are now ready for the proofs of Theorems 2.9 and 5.2.

Proof of Theorem 2.9

The result [18, Corollary 4.8] shows that for almost all \(v\in \mathbb {R}^d\), the function \(g(x):=f(x)-\langle v,x\rangle \) has at most finitely many critical points. Moreover, each such critical point \(\bar{x}\) lies on some \(C^2\) active manifold \(\mathcal {M}\) of g, and the subdifferential map \(\partial g:\mathbb {R}^d\rightrightarrows \mathbb {R}^d\) is metrically regular at \((\bar{x},0)\). Applying Corollary A.2 to g for such generic vectors v, we deduce that every critical point \(\bar{x}\) of g is either a strong local minimizer or a strict saddle of g. The proof is complete. \(\square \)

Proof of Theorem 5.2

The proof is identical to that of Theorem 2.9 with [18, Theorem 5.2] playing the role of [18, Corollary 4.8]. \(\square \)

Pathological Example

Theorem B.1

Consider the following function

$$\begin{aligned} f(x, y) = \frac{1}{2}(|x| + |y|)^2 - \frac{\rho }{2} x^2 \end{aligned}$$

Assume that \(\lambda > \rho \). Define the mapping \(S :\mathbb {R}^2 \rightarrow \mathbb {R}^2\) by the following formula:

$$\begin{aligned} S(x, y) = {\left\{ \begin{array}{ll} 0 &{} \text {if }(x, y) = 0;\\ \left( 0, \frac{\lambda }{1+\lambda } y\right) &{} \text {if }|x| \le \frac{1}{1+\lambda } |y|; \\ \left( \frac{\lambda }{1+\lambda - \rho }x, 0\right) &{} \text {if }|y| \le \frac{1}{1+\lambda -\rho }|x|,\\ \end{array}\right. } \end{aligned}$$

and if \(\frac{1}{ (1+\lambda - \rho )} |x|< |y| < (1+\lambda ) |x|\), we have

$$\begin{aligned} S(x,y) = {\left\{ \begin{array}{ll} \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} -1 \\ -1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} &{} \text {if }\mathrm {sign}(x) = \mathrm {sign}(y);\\ \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )-1} \begin{bmatrix} (1+\lambda ) &{} 1 \\ 1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}&\text {if }\mathrm {sign}(x) \ne \mathrm {sign}(y). \end{array}\right. } \end{aligned}$$

Then, \(\mathrm{prox}_{(1/\lambda ) f}(x, y) = S(x,y)\).

Proof

Let us write \((x_+, y_+) = S(x,y)\) for the components of \(S(x,y)\). By the first-order optimality conditions, we have \(\mathrm{prox}_{(1/\lambda ) f}(x, y) = (x_+, y_+) \) if and only if

$$\begin{aligned}&\lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+) \in \\&{\left\{ \begin{array}{ll} \{x_+ + \mathrm {sign}(x_+)|y_+|\}\times \{\mathrm {sign}(y_+)|x_+| + y_+\} &{} \text {if }x_+ \ne 0\text { and }y_+ \ne 0;\\ ([-1, 1]y_+)\times \{y_+\} &{} \text {if }x_+ = 0\text { and }y_+ \ne 0;\\ \{x_+\}\times ([-1, 1]x_+) &{} \text {if }x_+ \ne 0\text { and }y_+ = 0;\\ \{0\}\times \{0\} &{} \text {if }x_+ = 0\text { and }y_+ = 0.\\ \end{array}\right. } \end{aligned}$$

Let us show that \((x_+, y_+)\) indeed satisfies this inclusion.

  1. 1.

    If \((x,y) = 0\), then \(x_+ = y_+ = 0\), and the pair satisfies the inclusion.

  2. 2.

    If \(|x| \le \frac{1}{1 + \lambda }|y|\) and \(y \ne 0\), then \(x_+ = 0\), \(y_+ = \frac{\lambda }{1+\lambda }y\), and

    $$\begin{aligned} \lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+) = \lambda \left( x, \frac{1}{1 + \lambda }y\right) \in ([-1, 1]y_+) \times \{y_+\}. \end{aligned}$$

    Thus, the pair satisfies the inclusion.

  3. 3.

    If \(|y| \le \frac{1}{1+\lambda -\rho }|x|\) and \(x \ne 0\), then \(x_+ = \frac{\lambda }{(1+\lambda - \rho )}x\), \(y_+ = 0\), and

    $$\begin{aligned}&\lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+)\\&= \lambda \left( x - \frac{\lambda -\rho }{(1+\lambda - \rho )}x, y\right) \in \{x_+\}\times ([-1, 1]x_+). \end{aligned}$$

Thus, the pair satisfies the inclusion. For the remaining two cases, let us assume that \(\frac{1}{ (1+\lambda - \rho )} |x|< |y| < (1+\lambda ) |x|\).

  1. 4.

    If \(\mathrm {sign}(x) = \mathrm {sign}(y)\), let \(s = \mathrm {sign}(x)\) and note that

    $$\begin{aligned} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} -1 \\ -1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix}\\&= \frac{s\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda )|x| -|y| \\ -|x| + (1+\lambda - \rho )|y| \end{bmatrix} \end{aligned}$$

    From this equation we learn \(\mathrm {sign}(x_+) = \mathrm {sign}(y_+) = s\). Inverting the matrix, we also learn

    $$\begin{aligned} \lambda \begin{bmatrix} x \\ y \end{bmatrix} =\begin{bmatrix} (1+\lambda - \rho ) &{} 1 \\ 1 &{} (1+\lambda ) \end{bmatrix} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \begin{bmatrix} x_+ + \lambda (1 - \rho /\lambda )x_+ + y_+ \\ x_+ + y_+ + \lambda y_+ \end{bmatrix} \\&= \begin{bmatrix} x_+ + \mathrm {sign}(x_+) |y_+| + \lambda (1 - \rho /\lambda )x_+ \\ \mathrm {sign}(y_+) |x_+| + y_+ + \lambda y_+ \end{bmatrix}. \end{aligned}$$

    Thus, the pair satisfies the inclusion.

  2. 5.

    If \(\mathrm {sign}(x) \ne \mathrm {sign}(y)\), let \(s = \mathrm {sign}(x)\) and note that

    $$\begin{aligned} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} 1 \\ 1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix}\\&= \frac{s\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda )|x| -|y| \\ |x| - (1+\lambda - \rho )|y| \end{bmatrix} \end{aligned}$$

    From this equation we learn \(\mathrm {sign}(x_+) \ne \mathrm {sign}(y_+) \). Inverting the matrix we also learn

    $$\begin{aligned} \lambda \begin{bmatrix} x \\ y \end{bmatrix} =\begin{bmatrix} (1+\lambda - \rho ) &{} -1 \\ -1 &{} (1+\lambda ) \end{bmatrix} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \begin{bmatrix} x_+ + \lambda (1 - \rho /\lambda )x_+ - y_+ \\ -x_+ + y_+ + \lambda y_+ \end{bmatrix} \\&= \begin{bmatrix} x_+ + \mathrm {sign}(x_+) |y_+| + \lambda (1 - \rho /\lambda )x_+ \\ \mathrm {sign}(y_+) |x_+| + y_+ + \lambda y_+ \end{bmatrix}. \end{aligned}$$

    Thus, the pair satisfies the inclusion.

Therefore, the proof is complete. \(\square \)
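The closed-form expression is easy to probe numerically. The sketch below is our own illustration rather than part of the paper: the parameter values, the helper names S_closed_form and prox_numeric, and the use of NumPy/SciPy are all assumptions. It compares the piecewise formula of Theorem B.1 with a direct minimization of \(w\mapsto f(w)+\frac{\lambda }{2}\Vert w-(x,y)\Vert ^2\), which is a convex problem since \(\lambda >\rho \).

# Numerical check of the closed-form proximal map S from Theorem B.1.
import numpy as np
from scipy.optimize import minimize

rho, lam = 2.0, 5.0  # any choice with lam > rho, as assumed in Theorem B.1

def f(w):
    x, y = w
    return 0.5 * (abs(x) + abs(y)) ** 2 - 0.5 * rho * x ** 2

def S_closed_form(x, y):
    # piecewise formula for prox_{(1/lam) f}(x, y) stated in Theorem B.1
    if x == 0.0 and y == 0.0:
        return np.zeros(2)
    if abs(x) <= abs(y) / (1 + lam):
        return np.array([0.0, lam / (1 + lam) * y])
    if abs(y) <= abs(x) / (1 + lam - rho):
        return np.array([lam / (1 + lam - rho) * x, 0.0])
    off_diag = -1.0 if np.sign(x) == np.sign(y) else 1.0
    c = lam / ((1 + lam) * (1 + lam - rho) - 1)
    M = np.array([[1 + lam, off_diag], [off_diag, 1 + lam - rho]])
    return c * (M @ np.array([x, y]))

def prox_numeric(x, y):
    # prox_{(1/lam) f}(x, y) by minimizing the convex subproblem directly
    obj = lambda w: f(w) + 0.5 * lam * ((w[0] - x) ** 2 + (w[1] - y) ** 2)
    res = minimize(obj, x0=np.array([x, y]), method="Nelder-Mead",
                   options={"xatol": 1e-12, "fatol": 1e-14, "maxiter": 10000})
    return res.x

rng = np.random.default_rng(0)
pts = rng.uniform(-2.0, 2.0, size=(20, 2))
err = max(np.linalg.norm(S_closed_form(x, y) - prox_numeric(x, y)) for x, y in pts)
print(f"maximum deviation over {len(pts)} random points: {err:.2e}")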

Corollary B.2

(Convergence to Saddles). Assume the setting of Theorem B.1. Let \(\alpha \in (0, 1]\) and define the operator \(T : = (1-\alpha ) I + \alpha S\) on \(\mathbb {R}^2\). Then, the cone \( \mathcal {K}= \{(x,y) :|x| \le (1+\lambda )^{-1}y\} \) satisfies \(T\mathcal {K}\subseteq \mathcal {K}\). Moreover, for any \((x, y) \in \mathcal {K}\), the iterates \(T^k(x,y) = ((1-\alpha )^k x, (1 - \alpha (1 - \lambda (1+\lambda )^{-1}))^ky)\) converge linearly to the origin as k tends to infinity.

Proof

Since \(\mathcal {K}\) is convex and \(T(x,y)\) is a convex combination of \((x,y)\) and \(S(x,y)\), it suffices to show that \(S \mathcal {K}\subseteq \mathcal {K}\). This follows from Theorem B.1: for \((x,y)\in \mathcal {K}\) we have \(S(x,y)=(0,\frac{\lambda }{1+\lambda }y)\in \mathcal {K}\). The formula for \(T^k(x,y)\) then follows by induction, since \(T(x,y)=((1-\alpha )x, (1-\alpha (1+\lambda )^{-1})y)\) for \((x,y)\in \mathcal {K}\). \(\square \)
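The convergence claim is also easy to visualize. The following standalone sketch is ours; the values of \(\lambda \), \(\alpha \), and the starting point are arbitrary choices. It uses the observation from the proof that \(S(x,y)=(0,\frac{\lambda }{1+\lambda }y)\) on \(\mathcal {K}\), so the damped iteration can be run directly; with \(\rho >1\) the origin is a saddle point of f, yet the iterates approach it at a linear rate.

# Iterating T = (1 - alpha) I + alpha S on the cone K = {(x, y) : |x| <= y/(1+lam)}.
# On K the proximal map reduces to S(x, y) = (0, lam/(1+lam) * y), so no solver is needed.
lam, alpha = 5.0, 0.5    # illustrative parameters; alpha lies in (0, 1]

def T(x, y):
    sx, sy = 0.0, lam / (1 + lam) * y         # S(x, y) for (x, y) in K
    return (1 - alpha) * x + alpha * sx, (1 - alpha) * y + alpha * sy

x, y = 0.1, 1.0          # lies in K since 0.1 <= 1.0 / (1 + lam)
for k in range(1, 26):
    x, y = T(x, y)
    if k % 5 == 0:
        # agrees with the closed form ((1 - alpha)^k * 0.1, (1 - alpha/(1 + lam))^k * 1.0)
        print(f"k={k:2d}  x={x:.3e}  y={y:.3e}")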


Cite this article

Davis, D., Drusvyatskiy, D. Proximal Methods Avoid Active Strict Saddles of Weakly Convex Functions. Found Comput Math 22, 561–606 (2022). https://doi.org/10.1007/s10208-021-09516-w

