Preconditioned Accelerated Gradient Descent Methods for Locally Lipschitz Smooth Objectives with Applications to the Solution of Nonlinear PDEs

Abstract

We develop a theoretical foundation for the application of Nesterov’s accelerated gradient descent method (AGD) to the approximation of solutions of a wide class of partial differential equations (PDEs). This is achieved by proving the existence of an invariant set and exponential convergence rates when its preconditioned version (PAGD) is applied to minimize locally Lipschitz smooth, strongly convex objective functionals. We introduce a second-order ordinary differential equation (ODE) with a built-in preconditioner and show that PAGD is an explicit time-discretization of this ODE, which requires a natural time step restriction for energy stability. At the continuous time level, we show exponential convergence of the ODE solution to its steady state using a simple energy argument. At the discrete level, assuming the aforementioned step size restriction, the existence of an invariant set is proved and a matching exponential rate of convergence of the PAGD scheme is derived by mimicking the energy argument and the convergence proof at the continuous level. Applications of the PAGD method to numerical PDEs are demonstrated with certain nonlinear elliptic PDEs using pseudo-spectral methods for spatial discretization, and several numerical experiments are conducted. The results confirm the global, geometric, and mesh size-independent convergence of the PAGD method, with an accelerated rate that improves on that of the preconditioned gradient descent (PGD) method.

Data Availability

The results of numerical experiments are either fully documented in the manuscript or can be made available on reasonable request.

Code Availability

The code can be made available on reasonable request.

References

  1. Adams, R.A., Fournier, J.J.F.: Sobolev Spaces, Volume 140 of Pure and Applied Mathematics (Amsterdam), 2nd edn. Elsevier/Academic Press, Amsterdam (2003). ISBN 0-12-044143-8

  2. Ainsworth, M., Mao, Z.: Well-posedness of the Cahn–Hilliard equation with fractional free energy and its Fourier Galerkin approximation. Chaos Solitons Fractals 102, 264–273 (2017). https://doi.org/10.1016/j.chaos.2017.05.022

  3. Ainsworth, M., Mao, Z.: Fractional phase-field crystal modelling: analysis, approximation and pattern formation. IMA J. Appl. Math. 85(2), 231–262 (2020). https://doi.org/10.1093/imamat/hxaa004

  4. Allen-Zhu, Z., Orecchia, L.: Linear coupling: an ultimate unification of gradient and mirror descent. In: 8th Innovations in Theoretical Computer Science Conference, LIPIcs. Leibniz International Proceedings in Informatics, vol. 67, Art. No. 3, 22 pp. Schloss Dagstuhl Leibniz-Zentrum für Informatik, Wadern (2017)

  5. Antil, H., Otárola, E., Salgado, A.J.: Optimization with respect to order in a fractional diffusion model: analysis, approximation and algorithmic aspects. J. Sci. Comput. 77(1), 204–224 (2018). https://doi.org/10.1007/s10915-018-0703-0

  6. Attouch, H., Goudou, X., Redont, P.: The heavy ball with friction method. I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Commun. Contemp. Math. 2(1), 1–34 (2000). https://doi.org/10.1142/S0219199700000025

  7. Barrett, J.W., Liu, W.B.: Finite element approximation of the \(p\)-Laplacian. Math. Comput. 61(204), 523–537 (1993). https://doi.org/10.2307/2153239

  8. Beck, A.: First-Order Methods in Optimization, Volume 25 of MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Optimization Society, Philadelphia, PA (2017). ISBN 978-1-611974-98-0. https://doi.org/10.1137/1.9781611974997.ch1

  9. Benyamin, M., Calder, J., Sundaramoorthi, G., Yezzi, A.: Accelerated variational PDEs for efficient solution of regularized inversion problems. J. Math. Imaging Vis. 62(1), 10–36 (2020). https://doi.org/10.1007/s10851-019-00910-2

  10. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific Optimization and Computation Series, 2nd edn. Athena Scientific, Belmont (1999)

  11. Bonito, A., Borthagaray, J.P., Nochetto, R.H., Otárola, E., Salgado, A.J.: Numerical methods for fractional diffusion. Comput. Vis. Sci. 19(5–6), 19–46 (2018). https://doi.org/10.1007/s00791-018-0289-y

  12. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  13. Calder, J., Yezzi, A.: PDE acceleration: a convergence rate analysis and applications to obstacle problems. Res. Math. Sci. 6(4), Paper No. 35, 30 pp. (2019). https://doi.org/10.1007/s40687-019-0197-x

  14. Canuto, C., Hussaini, M.Y., Quarteroni, A., Zang, T.A.: Spectral Methods: Fundamentals in Single Domains. Scientific Computation. Springer-Verlag, Berlin (2006). ISBN 978-3-540-30725-9; 3-540-30725-7

  15. Chen, L., Hu, X., Wise, S.M.: Convergence analysis of the fast subspace descent method for convex optimization problems. Math. Comput. 89(325), 2249–2282 (2020). https://doi.org/10.1090/mcom/3526

  16. Ciarlet, P.G.: Introduction to Numerical Linear Algebra and Optimisation. Cambridge Texts in Applied Mathematics. Cambridge University Press, Cambridge (1989). ISBN 0-521-32788-1; 0-521-33984-7. With the assistance of Bernadette Miara and Jean-Marie Thomas; translated from the French by A. Buttigieg

  17. Ciarlet, P.G.: Linear and Nonlinear Functional Analysis with Applications. Society for Industrial and Applied Mathematics, Philadelphia (2013)

  18. Evans, L.C.: Partial Differential Equations, Volume 19 of Graduate Studies in Mathematics, 2nd edn. American Mathematical Society, Providence (2010). ISBN 978-0-8218-4974-3. https://doi.org/10.1090/gsm/019

  19. Feng, W., Salgado, A.J., Wang, C., Wise, S.M.: Preconditioned steepest descent methods for some nonlinear elliptic equations involving p-Laplacian terms. J. Comput. Phys. 334, 45–67 (2017). https://doi.org/10.1016/j.jcp.2016.12.046

  20. Feng, W., Guan, Z., Lowengrub, J., Wang, C., Wise, S.M., Chen, Y.: A uniquely solvable, energy stable numerical scheme for the functionalized Cahn–Hilliard equation and its convergence analysis. J. Sci. Comput. 76(3), 1938–1967 (2018). https://doi.org/10.1007/s10915-018-0690-1

  21. Goudou, X., Munier, J.: The gradient and heavy ball with friction dynamical systems: the quasiconvex case. Math. Program. 116(1–2, Ser. B), 173–191 (2009). https://doi.org/10.1007/s10107-007-0109-5

  22. Jovanović, B.S., Süli, E.: Analysis of Finite Difference Schemes. Springer Series in Computational Mathematics, vol. 46. Springer, London (2014). ISBN 978-1-4471-5459-4; 978-1-4471-5460-0. https://doi.org/10.1007/978-1-4471-5460-0

  23. Laborde, M., Oberman, A.: A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In: Chiappa, S., Calandra, R. (eds.) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Volume 108 of Proceedings of Machine Learning Research, pp. 602–612 (2020). PMLR. http://proceedings.mlr.press/v108/laborde20a.html

  24. Luo, H., Chen, L.: From differential equation solvers to accelerated first-order methods for convex optimization (2020). arXiv:1909.03145

  25. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)

  26. Nesterov, Y.E.: Introductory Lectures on Convex Optimization. Applied Optimization, vol. 87. Kluwer Academic Publishers, Boston (2004). ISBN 1-4020-7553-7. https://doi.org/10.1007/978-1-4419-8853-9

  27. Phelps, R.R.: Convex Functions, Monotone Operators and Differentiability. Lecture Notes in Mathematics, vol. 1364, 2nd edn. Springer-Verlag, Berlin (1993). ISBN 3-540-56715-1

  28. Poljak, B.T.: Some methods of speeding up the convergence of iterative methods. Ž. Vyčisl. Mat. i Mat. Fiz. 4, 791–803 (1964)

  29. Schaeffer, H., Hou, T.Y.: An accelerated method for nonlinear elliptic PDE. J. Sci. Comput. 69(2), 556–580 (2016). https://doi.org/10.1007/s10915-016-0215-8

  30. Shen, J., Tang, T., Wang, L.-L.: Spectral Methods: Algorithms, Analysis and Applications. Springer Series in Computational Mathematics, vol. 41. Springer, Heidelberg (2011). ISBN 978-3-540-71040-0. https://doi.org/10.1007/978-3-540-71041-7

  31. Shi, B., Du, S.S., Jordan, M.I., Su, W.J.: Understanding the acceleration phenomenon via high-resolution differential equations (2018). arXiv:1810.08907

  32. Siegel, J.W.: Accelerated first-order methods: differential equations and Lyapunov functions (2019). arXiv:1903.05671

  33. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2510–2518. Curran Associates, Inc. (2014)

  34. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. USA 113(47), E7351–E7358 (2016). https://doi.org/10.1073/pnas.1614734113

  35. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization (2018). arXiv:1611.02635

Funding

SMW acknowledges partial financial support from NSF-DMS 1719854. AJS has been partially supported by NSF-DMS 1720123.

Author information

Corresponding author

Correspondence to Abner J. Salgado.

Ethics declarations

Conflict of interest

We declare that we have no conflict of interest.

Ethics Approval

This manuscript has not been submitted to any other journal, and it will not be submitted elsewhere while it is under review.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: An IVP as the Limit of the PAGD Method

A.1. Derivation of the ODE

Let us start with the same approach as in [33]. We assume, as an ansatz, that PAGD is a discretization of an ODE whose solution, or trajectory, is \(X:[0,\infty )\rightarrow \mathbb {H}\). We also assume that X is sufficiently smooth, e.g., twice continuously differentiable in time. For a fixed \(t\in (0,\infty )\), the assumed smoothness of X, together with the identification \(t= {\sqrt{s}}k\) and Taylor’s formula in a normed vector space (e.g., [17, Theorem 7.9-1]), implies:

$$\begin{aligned} \frac{x_{k+1}-x_k}{{\sqrt{s}}}&=\dot{X}(t)+\frac{1}{2}\ddot{X}(t){\sqrt{s}}+{\text {o}}\bigl ( {\sqrt{s}} \bigr ) \quad \text { as } s\rightarrow 0, \nonumber \\ \frac{x_{k}-x_{k-1}}{{\sqrt{s}}}&=\dot{X}(t)-\frac{1}{2}\ddot{X}(t){\sqrt{s}}+{\text {o}}\bigl ( {\sqrt{s}} \bigr ) \quad \text { as } s\rightarrow 0, \nonumber \\ {\sqrt{s}}{\mathcal {L}^{-1}}G'(y_k)&={\sqrt{s}}{\mathcal {L}^{-1}}G'(X(t))+{\text {o}}\bigl ( {\sqrt{s}} \bigr ) \quad \text { as } s\rightarrow 0. \end{aligned}$$
(A.1)

The last identity follows from the continuity of \(G'\), that of \({\mathcal {L}^{-1}}\), and (3.2), from which we can deduce \(y_k\rightarrow X(t)\) as \(s\rightarrow 0\). Plugging (3.2) into (3.3) and dividing by \({\sqrt{s}}\), we have \( \frac{x_{k+1}-x_k}{{\sqrt{s}}}- \lambda \frac{x_{k}-x_{k-1}}{{\sqrt{s}}}+{\sqrt{s}}{\mathcal {L}^{-1}}G'(y_k)=0 \). Substituting the above Taylor expansions, and then rearranging, we arrive at

$$\begin{aligned} \frac{1}{2}(1+\lambda ) \ddot{X}(t) +\frac{1-\lambda }{{\sqrt{s}}}\dot{X}(t)+ {\mathcal {L}^{-1}}G'(X(t))+{\text {o}}\bigl ( 1 \bigr )=0 \quad \text { as } s\rightarrow 0. \end{aligned}$$
(A.2)

To make this estimate consistent, interpret \(\lambda \) as a function of s and further assume that \((1-\lambda )/{\sqrt{s}}\rightarrow 2\eta \) as \({\sqrt{s}}\rightarrow 0\) for some \(\eta \in (0,\infty )\), which yields

$$\begin{aligned} \ddot{X}(t)+2\eta \dot{X}(t)+{\mathcal {L}^{-1}}G'(X(t))=0. \end{aligned}$$
(A.3)
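
The following sketch is not part of the original argument; it is a minimal numerical illustration of the limit just derived, in the simplest possible setting: one space dimension, \(\mathcal {L}=\mathrm{Id}\), and \(G(x)=\mu x^2/2\). The concrete recursion used below, \(y_k=x_k+\lambda (x_k-x_{k-1})\) and \(x_{k+1}=y_k-sG'(y_k)\) with \(\lambda =\frac{1-\eta {\sqrt{s}}}{1+\eta {\sqrt{s}}}\), is reconstructed from the identity displayed before (A.2) and from Remark A.1; all parameter values are illustrative.

```python
# Sketch (assumptions as stated above): compare PAGD iterates with the
# closed-form solution of (A.3) for G(x) = mu*x^2/2, L = Id, in 1D.
import numpy as np

mu, eta, x0, T = 4.0, 0.5, 1.0, 5.0          # G'(x) = mu*x; damping; x(0); final time
omega = np.sqrt(mu - eta ** 2)               # underdamped case: mu > eta^2

def X_exact(t):
    # Solution of X'' + 2*eta*X' + mu*X = 0 with X(0) = x0, X'(0) = 0.
    return x0 * np.exp(-eta * t) * (np.cos(omega * t) + (eta / omega) * np.sin(omega * t))

def pagd_vs_ode(s):
    lam = (1.0 - eta * np.sqrt(s)) / (1.0 + eta * np.sqrt(s))
    K = int(T / np.sqrt(s))                  # identification t = k*sqrt(s)
    x_prev, x, err = x0, x0, 0.0             # y_0 = x_0: no initial momentum
    for k in range(K):
        y = x + lam * (x - x_prev)           # extrapolation step
        x_prev, x = x, y - s * mu * y        # x_{k+1} = y_k - s*G'(y_k)
        err = max(err, abs(x - X_exact((k + 1) * np.sqrt(s))))
    return err

for s in (1e-2, 1e-3, 1e-4):
    print(f"s = {s:.0e}:  max_k |x_k - X(k*sqrt(s))| = {pagd_vs_ode(s):.3e}")
```

The reported discrepancy shrinks as \(s\rightarrow 0\), which is consistent with (A.3) being the limiting dynamics.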

A.2. Derivation of the Initial Conditions

The initialization \(y_0 = x_0\) and (3.3) with \(k=0\) imply

$$\begin{aligned} \frac{x_1-x_0}{{\sqrt{s}}}=-{\sqrt{s}}{\mathcal {L}^{-1}}G'(x_0) . \end{aligned}$$

Take the limit \(s\rightarrow 0\) and conclude \(\dot{X}(0)=0\) since \(G'\) and \(\dot{X}\) are assumed to be continuous. Therefore, we arrive at the desired IVP (4.1).
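
Written out, the limit reads

$$\begin{aligned} \dot{X}(0)=\lim _{s\rightarrow 0}\frac{x_1-x_0}{{\sqrt{s}}}=-\lim _{s\rightarrow 0}{\sqrt{s}}{\mathcal {L}^{-1}}G'(x_0)=0, \end{aligned}$$

while \(X(0)=x_0\) holds directly from the identification \(x_k\longleftrightarrow X(k{\sqrt{s}})\) at \(k=0\).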

Remark A.1

(momentum method) A similar procedure can be carried out far more easily for the so-called momentum method (MM). To see this, we recall that

$$\begin{aligned} \ddot{X}(t)\approx \frac{x_{k+1}-2x_k+x_{k-1}}{s}, \quad \dot{X}(t)\approx \frac{x_k-x_{k-1}}{{\sqrt{s}}}, \quad G'(X(t))\approx G'(x_k) . \end{aligned}$$

Then, the discrete version of the ODE (4.1) becomes

$$\begin{aligned} x_{k+1}=x_k-sG'(x_k)+(1-2\eta {\sqrt{s}})(x_k-x_{k-1}), \end{aligned}$$

which is MM with the weight \(1-2\eta {\sqrt{s}}\); see [28, p. 12 (9)]. This weight is close to \(\lambda \):

$$\begin{aligned} \lambda =\frac{1-\eta {\sqrt{s}}}{1+\eta {\sqrt{s}}}=1-\frac{2\eta {\sqrt{s}}}{1+\eta {\sqrt{s}}}\approx 1-2\eta {\sqrt{s}}. \end{aligned}$$

In this sense, MM seems more natural and more amenable to analysis than AGD. \(\square \)

The limiting behavior of MM can also be explained by the IVP (4.1). Observe that the only essential difference between MM and PAGD is where \(G'\) is evaluated, namely at \(x_k\) and at \(y_k\), respectively. In the limit \(s\rightarrow 0\), \(x_k\) and \(y_k\) are indistinguishable in this setting. However, PAGD exhibits less oscillation than MM, since evaluating \(G'\) at \(y_k\) amounts to “foreseeing” the uphill of the objective functional, if it exists, along the trajectory and “steering” to avoid unnecessary oscillations. Recently, higher order Taylor expansions have been shown to help differentiate their performances (see [31]).
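
To make the structural difference tangible, the following small sketch (again not from the original text) runs MM and an AGD-type iteration side by side on an ill-conditioned quadratic with \(\mathcal {L}=\mathrm{Id}\); the only difference between the two branches is whether the gradient is evaluated at \(x_k\) or at the extrapolated point \(y_k\). The common weight \(\beta =1-2\eta {\sqrt{s}}\approx \lambda \) follows Remark A.1, and all parameter values are illustrative; comparing the recorded error histories gives a rough sense of the oscillation behavior discussed above.

```python
# Sketch (assumptions as stated above): MM vs. AGD-type iteration on
# G(x) = x^T A x / 2 with an ill-conditioned A and identity preconditioner.
import numpy as np

A = np.diag([1.0, 100.0])                   # ill-conditioned Hessian
grad = lambda x: A @ x
s, eta = 1.0 / 100.0, 0.5                   # step size and damping (illustrative)
beta = 1.0 - 2.0 * eta * np.sqrt(s)         # MM weight, ~ lambda for small s
x0 = np.array([1.0, 1.0])

def run(method, K=300):
    x_prev, x, errs = x0.copy(), x0.copy(), []
    for _ in range(K):
        if method == "MM":                  # gradient evaluated at x_k
            x_new = x - s * grad(x) + beta * (x - x_prev)
        else:                               # gradient evaluated at the look-ahead point y_k
            y = x + beta * (x - x_prev)
            x_new = y - s * grad(y)
        x_prev, x = x, x_new
        errs.append(np.linalg.norm(x))      # distance to the minimizer x* = 0
    return np.array(errs)

for method in ("MM", "AGD"):
    e = run(method)
    print(f"{method}: error after 300 steps = {e[-1]:.2e}, largest intermediate error = {e.max():.2e}")
```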

Appendix 2: PAGD as a Discretization of the IVP

Let us denote the step size by \({\sqrt{s}}\), rather than s, to keep the setting in line with the PAGD algorithm. Again, it is helpful to keep in mind the correspondences: time \(t\longleftrightarrow k{\sqrt{s}}\) (\(k=0,1,2,\ldots \)) and position \(X(t)\longleftrightarrow x_k\). First, we show that \(y_k\) corresponds to a position “drifted” over \([t,t+{\sqrt{s}}]\) in the absence of the potential landscape. This drift can be modeled by \(\ddot{X}(t)+2\eta \dot{X}(t)=0\), which leads to another energy law \( \frac{1}{2}\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}^2 = \frac{1}{2}\bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}}^2 -2\eta \int _t^{t+{\sqrt{s}}}\bigl \Vert \dot{X}(\tau ) \bigr \Vert _{\mathcal {L}}^2{d }\tau . \) Approximating the speed in the integrand by the average \(\frac{1}{2}(\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}+\bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}})\), a short calculation (recorded below) gives \(\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}=\lambda \bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}}\). Since the drift dynamics takes place along a single direction, this implies \(\dot{X}(t+{\sqrt{s}})=\lambda \dot{X}(t)\). The approximations \(\dot{X}(t)\approx \frac{x_k-x_{k-1}}{{\sqrt{s}}}\) and \(\dot{X}(t+{\sqrt{s}})\approx \frac{y_k-x_k}{{\sqrt{s}}}\) lead us to (3.2).
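
The short calculation can be recorded explicitly. Writing \(a=\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}\) and \(b=\bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}}\), and approximating the integral by \({\sqrt{s}}\) times the squared average speed, the energy law becomes

$$\begin{aligned} \frac{1}{2}a^2=\frac{1}{2}b^2-2\eta {\sqrt{s}}\Bigl ( \frac{a+b}{2} \Bigr )^2 \;\Longleftrightarrow \; (a-b)(a+b)=-\eta {\sqrt{s}}(a+b)^2 \;\Longleftrightarrow \; a=\frac{1-\eta {\sqrt{s}}}{1+\eta {\sqrt{s}}}\,b=\lambda b, \end{aligned}$$

provided \(a+b\ne 0\); this is precisely the weight \(\lambda \) of Remark A.1.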

Next, we discretize the vector V(t). Since we do not know the minimizer in practice, we remove it from the definition of \(v_k\) and discretize \(V(t)+x^*=X(t)+\frac{1}{\eta }\dot{X}(t)\). The approximations \(X(t)\approx y_k\) and \(\dot{X}(t)\approx \frac{y_k-x_k}{{\sqrt{s}}}\) suggest

$$\begin{aligned} v_k = y_{k} +\frac{1}{\theta }(y_k-x_{k}), \end{aligned}$$
(B.1)

which leads to the definition of \(\{v_k\}_{k\ge 1}\) (3.4) upon combining with the definition of \(\{y_k\}\).

Finally, to get the main iterates \(\{x_{k}\}_{k\ge 1}\), we discretize (4.3) using the approximations \(\dot{V} (t) \approx \frac{v_{k+1}-v_k}{{\sqrt{s}}}\) and \(\dot{X} (t) \approx \frac{y_{k}-x_k}{{\sqrt{s}}}\), and evaluate \(G'\) at \(y_k\); it follows that \( \eta \frac{v_{k+1}-v_k}{{\sqrt{s}}}+\eta \frac{y_k-x_k}{{\sqrt{s}}}+{\mathcal {L}^{-1}}G'(y_k)=0 \). Plugging in (3.4) and (B.1), one obtains (3.3), the definition of \(\{x_k\}_{k\ge 1}\).
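
Collecting the updates derived in these appendices, one possible finite-dimensional implementation reads as follows. This is a sketch, not a transcription of the algorithm in the main text: the updates \(y_k=x_k+\lambda (x_k-x_{k-1})\) and \(x_{k+1}=y_k-s{\mathcal {L}^{-1}}G'(y_k)\) are reconstructed from Appendices A and B, the quadratic model problem and all parameter values are invented for illustration, and the auxiliary sequence \(\{v_k\}\), needed only for the analysis, is not formed.

```python
# Sketch (assumptions as stated above): PAGD with a genuine preconditioner L
# applied to G(u) = u^T A u / 2 - b^T u, where A is a scaled 1D Laplacian
# plus a variable reaction term and L is a constant-coefficient operator.
import numpy as np

n = 64
h = 1.0 / (n + 1)
Lap = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
       + np.diag(-np.ones(n - 1), -1)) / h ** 2
A = Lap + np.diag(1.0 + np.linspace(0.0, 1.0, n))     # SPD Hessian of G
b = np.ones(n)
L = Lap + np.eye(n)                                   # preconditioner
grad = lambda u: A @ u - b

s, eta = 0.5, 0.5                                     # illustrative parameters
lam = (1.0 - eta * np.sqrt(s)) / (1.0 + eta * np.sqrt(s))

u_prev, u = np.zeros(n), np.zeros(n)
for k in range(1, 201):
    y = u + lam * (u - u_prev)                          # "drift"/extrapolation step, cf. (3.2)
    u_prev, u = u, y - s * np.linalg.solve(L, grad(y))  # preconditioned gradient step, cf. (3.3)
    if k % 50 == 0:
        print(f"k = {k:3d}, |G'(u_k)| = {np.linalg.norm(grad(u)):.3e}")
```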

Appendix 3: Literature Comparison

We summarize our discussion of the existing literature, and contrast it with our contributions, in Table 2.

Table 2 Summary of the existing literature, contrasted with our contributions

Cite this article

Park, JH., Salgado, A.J. & Wise, S.M. Preconditioned Accelerated Gradient Descent Methods for Locally Lipschitz Smooth Objectives with Applications to the Solution of Nonlinear PDEs. J Sci Comput 89, 17 (2021). https://doi.org/10.1007/s10915-021-01615-8
