Abstract
We develop a theoretical foundation for the application of Nesterov's accelerated gradient descent method (AGD) to the approximation of solutions of a wide class of partial differential equations (PDEs). This is achieved by proving the existence of an invariant set and exponential convergence rates when its preconditioned version (PAGD) is applied to minimize locally Lipschitz smooth, strongly convex objective functionals. We introduce a second-order ordinary differential equation (ODE) with a built-in preconditioner and show that PAGD is an explicit time discretization of this ODE, which requires a natural time step restriction for energy stability. At the continuous level, we show exponential convergence of the ODE solution to its steady state using a simple energy argument. At the discrete level, assuming the aforementioned step size restriction, the existence of an invariant set is proved and a matching exponential rate of convergence of the PAGD scheme is derived by mimicking the energy argument and the convergence proof at the continuous level. Applications of the PAGD method to numerical PDEs are demonstrated with certain nonlinear elliptic PDEs using pseudo-spectral methods for spatial discretization, and several numerical experiments are conducted. The results confirm the global, geometric, and mesh size-independent convergence of the PAGD method, with an accelerated rate that improves on that of the preconditioned gradient descent (PGD) method.
Data Availability
The results of numerical experiments are either fully documented in the manuscript or can be made available on reasonable request.
Code Availability
The code can be made available on reasonable request.
References
Adams, R.A., Fournier, J.J.F.: Sobolev Spaces, Volume 140 of Pure and Applied Mathematics (Amsterdam), 2nd edn. Elsevier/Academic Press, Amsterdam (2003). ISBN 0-12-044143-8
Ainsworth, M., Mao, Z.: Well-posedness of the Cahn–Hilliard equation with fractional free energy and its Fourier Galerkin approximation. Chaos Solitons Fractals 102, 264–273 (2017). https://doi.org/10.1016/j.chaos.2017.05.022
Ainsworth, M., Mao, Z.: Fractional phase-field crystal modelling: analysis, approximation and pattern formation. IMA J. Appl. Math. 85(2), 231–262 (2020). https://doi.org/10.1093/imamat/hxaa004
Allen-Zhu, Z., Orecchia, L.: Linear coupling: an ultimate unification of gradient and mirror descent. In: 8th Innovations in Theoretical Computer Science Conference, Volume 67 of LIPIcs. Leibniz International Proceedings in Informatics, Art. No. 3, 22 pp. Schloss Dagstuhl Leibniz-Zentrum für Informatik, Wadern (2017)
Antil, H., Otárola, E., Salgado, A.J.: Optimization with respect to order in a fractional diffusion model: analysis, approximation and algorithmic aspects. J. Sci. Comput. 77(1), 204–224 (2018). https://doi.org/10.1007/s10915-018-0703-0
Attouch, H., Goudou, X., Redont, P.: The heavy ball with friction method. I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Commun. Contemp. Math. 2(1), 1–34 (2000). https://doi.org/10.1142/S0219199700000025
Barrett, J.W., Liu, W.B.: Finite element approximation of the \(p\)-Laplacian. Math. Comput. 61(204), 523–537 (1993). https://doi.org/10.2307/2153239
Beck, A.: First-Order Methods in Optimization, Volume 25 of MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Optimization Society, Philadelphia, PA (2017). ISBN 978-1-611974-98-0. https://doi.org/10.1137/1.9781611974997.ch1
Benyamin, M., Calder, J., Sundaramoorthi, G., Yezzi, A.: Accelerated variational PDEs for efficient solution of regularized inversion problems. J. Math. Imaging Vis. 62(1), 10–36 (2020). https://doi.org/10.1007/s10851-019-00910-2
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific Optimization and Computation Series, 2nd edn. Athena Scientific, Belmont (1999)
Bonito, A., Borthagaray, J.P., Nochetto, R.H., Otárola, E., Salgado, A.J.: Numerical methods for fractional diffusion. Comput. Vis. Sci. 19(5–6), 19–46 (2018). https://doi.org/10.1007/s00791-018-0289-y
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Calder, J., Yezzi, A.: PDE acceleration: a convergence rate analysis and applications to obstacle problems. Res. Math. Sci. 6(4), Paper No. 35, 30 pp. (2019). ISSN 2522-0144. https://doi.org/10.1007/s40687-019-0197-x
Canuto, C., Hussaini, M.Y., Quarteroni, A., Zang, T.A.: Spectral Methods: Fundamentals in Single Domains. Scientific Computation. Springer-Verlag, Berlin (2006). ISBN 978-3-540-30725-9; 3-540-30725-7
Chen, L., Hu, X., Wise, S.M.: Convergence analysis of the fast subspace descent method for convex optimization problems. Math. Comput. 89(325), 2249–2282 (2020). https://doi.org/10.1090/mcom/3526
Ciarlet, P.G.: Introduction to Numerical Linear Algebra and Optimisation. Cambridge Texts in Applied Mathematics. Cambridge University Press, Cambridge (1989). ISBN 0-521-32788-1; 0-521-33984-7. Translated from the French by A. Buttigieg, with the assistance of Bernadette Miara and Jean-Marie Thomas
Ciarlet, P.G.: Linear and Nonlinear Functional Analysis with Applications. Society for Industrial and Applied Mathematics, Philadelphia (2013)
Evans, L.C.: Partial Differential Equations, Volume 19 of Graduate Studies in Mathematics, 2nd edn. American Mathematical Society, Providence (2010). ISBN 978-0-8218-4974-3. https://doi.org/10.1090/gsm/019
Feng, W., Salgado, A.J., Wang, C., Wise, S.M.: Preconditioned steepest descent methods for some nonlinear elliptic equations involving p-Laplacian terms. J. Comput. Phys. 334, 45–67 (2017). https://doi.org/10.1016/j.jcp.2016.12.046
Feng, W., Guan, Z., Lowengrub, J., Wang, C., Wise, S.M., Chen, Y.: A uniquely solvable, energy stable numerical scheme for the functionalized Cahn–Hilliard equation and its convergence analysis. J. Sci. Comput. 76(3), 1938–1967 (2018). https://doi.org/10.1007/s10915-018-0690-1
Goudou, X., Munier, J.: The gradient and heavy ball with friction dynamical systems: the quasiconvex case. Math. Program. 116(1–2, Ser. B), 173–191 (2009). https://doi.org/10.1007/s10107-007-0109-5
Jovanović, B.S., Süli, E.: Analysis of Finite Difference Schemes. Springer Series in Computational Mathematics, vol. 46. Springer, London (2014). ISBN 978-1-4471-5459-4; 978-1-4471-5460-0. https://doi.org/10.1007/978-1-4471-5460-0
Laborde, M., Oberman, A.: A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In: Chiappa, S., Calandra, R. (eds.) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Volume 108 of Proceedings of Machine Learning Research, pp. 602–612 (2020). PMLR. http://proceedings.mlr.press/v108/laborde20a.html
Luo, H., Chen, L.: From differential equation solvers to accelerated first-order methods for convex optimization (2020). arXiv:1909.03145
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
Nesterov, Y.E.: Introductory Lectures on Convex Optimization. Applied Optimization, vol. 87. Kluwer Academic Publishers, Boston (2004). ISBN 1-4020-7553-7. https://doi.org/10.1007/978-1-4419-8853-9
Phelps, R.R.: Convex Functions, Monotone Operators and Differentiability. Lecture Notes in Mathematics, vol. 1364, 2nd edn. Springer-Verlag, Berlin (1993). ISBN 3-540-56715-1
Poljak, B.T.: Some methods of speeding up the convergence of iterative methods. Ž. Vyčisl. Mat i Mat. Fiz. 4, 791–803 (1964)
Schaeffer, H., Hou, T.Y.: An accelerated method for nonlinear elliptic PDE. J. Sci. Comput. 69(2), 556–580 (2016). https://doi.org/10.1007/s10915-016-0215-8
Shen, J., Tang, T., Wang, L.-L.: Spectral Methods: Algorithms, Analysis and Applications. Springer Series in Computational Mathematics, vol. 41. Springer, Heidelberg (2011). ISBN 978-3-540-71040-0. https://doi.org/10.1007/978-3-540-71041-7
Shi, B., Du, S.S., Jordan, M.I., Su, W.J.: Understanding the acceleration phenomenon via high-resolution differential equations (2018). arXiv:1810.08907
Siegel, J.W.: Accelerated first-order methods: differential equations and Lyapunov functions (2019). arXiv:1903.05671
Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2510–2518. Curran Associates, Inc. (2014)
Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. USA 113(47), E7351–E7358 (2016). https://doi.org/10.1073/pnas.1614734113
Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization (2018). arXiv:1611.02635
Funding
SMW acknowledges partial financial support from NSF-DMS 1719854. AJS has been partially supported by NSF-DMS 1720123.
Ethics declarations
Conflict of interest
We declare that we have no conflict of interest.
Ethics Approval
We have not submitted this manuscript anywhere, and it will not be submitted anywhere while it is under review.
Appendices
Appendix 1: An IVP as the Limit of the PAGD Method
A.1 Derivation of the ODE
Let us start with the same approach as in [33]. We assume, as an ansatz, that PAGD is a discretization of an ODE, which has a solution \(X:[0,\infty )\rightarrow \mathbb {H}\), or a trajectory. We also assume that X is smooth enough, e.g., twice continuously differentiable in time. For a fixed \(t\in (0,\infty )\), the assumed smoothness of X, together with the identification \(t= {\sqrt{s}}k\) and Taylor's formula in a normed vector space (e.g., [17, Theorem 7.9-1]), implies \( x_{k\pm 1} = X(t) \pm {\sqrt{s}}\dot{X}(t) + \frac{s}{2}\ddot{X}(t) + o(s) \) and \( {\mathcal {L}^{-1}}G'(y_k) = {\mathcal {L}^{-1}}G'(X(t)) + o(1) \) as \(s\rightarrow 0\).
The last identity follows from the continuity of \(G'\), that of \({\mathcal {L}^{-1}}\), and (3.2), from which we can deduce \(y_k\rightarrow X(t)\) as \(s\rightarrow 0\). Plugging (3.2) into (3.3) and dividing by \({\sqrt{s}}\), we have \( \frac{x_{k+1}-x_k}{{\sqrt{s}}}- \lambda \frac{x_{k}-x_{k-1}}{{\sqrt{s}}}+{\sqrt{s}}{\mathcal {L}^{-1}}G'(y_k)=0 \). Substituting the above Taylor expansions, dividing by \({\sqrt{s}}\) once more, and then rearranging, we arrive at \( \frac{1+\lambda }{2}\ddot{X}(t) + \frac{1-\lambda }{{\sqrt{s}}}\dot{X}(t) + {\mathcal {L}^{-1}}G'(X(t)) + o(1) = 0. \)
To make this estimate consistent, interpret \(\lambda \) as a function of s and further assume that \((1-\lambda )/{\sqrt{s}}\rightarrow 2\eta \) as \({\sqrt{s}}\rightarrow 0\) for some \(\eta \in (0,\infty )\) (so that, in particular, \(\lambda \rightarrow 1\)), which yields \( \ddot{X}(t) + 2\eta \dot{X}(t) + {\mathcal {L}^{-1}}G'(X(t)) = 0. \)
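As a sanity check of this limit, the following sketch (a toy under stated assumptions: \(G(x)=x^2/2\), \(\mathcal {L}\) the identity, the weight \(\lambda =(1-\eta {\sqrt{s}})/(1+\eta {\sqrt{s}})\), which satisfies \((1-\lambda )/{\sqrt{s}}\rightarrow 2\eta \), and the two-step forms of (3.2)–(3.3) discussed above) compares the AGD iterates with the exact solution of \(\ddot{X}+2\eta \dot{X}+X=0\), \(X(0)=x_0\), \(\dot{X}(0)=0\):

```python
import math

def agd_trajectory(h, eta=0.5, x0=1.0, T=2.0):
    """Run the AGD recursion with step sqrt(s) = h on G(x) = x^2/2
    (so G'(x) = x and the preconditioner L is the identity) and
    return the iterate x_k corresponding to time T = k*h."""
    lam = (1.0 - eta * h) / (1.0 + eta * h)  # satisfies (1 - lam)/h -> 2*eta
    x_prev, x = x0, x0                       # x_{-1} = x_0 encodes X'(0) = 0
    for _ in range(round(T / h)):
        y = x + lam * (x - x_prev)           # extrapolation step, cf. (3.2)
        x_prev, x = x, y - h * h * y         # gradient step, cf. (3.3), s = h^2
    return x

def ode_solution(t, eta=0.5, x0=1.0):
    """Exact solution of X'' + 2*eta*X' + X = 0, X(0) = x0, X'(0) = 0, eta < 1."""
    w = math.sqrt(1.0 - eta * eta)
    return x0 * math.exp(-eta * t) * (math.cos(w * t) + (eta / w) * math.sin(w * t))

# The discrete trajectory shadows the ODE solution as s -> 0:
err = abs(agd_trajectory(0.005) - ode_solution(2.0))
```

The error at a fixed time shrinks with the step size, consistent with AGD being a discretization of the limiting ODE.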
A.2 Derivation of the Initial Conditions
The initialization \(y_0 = x_0\) and (3.3) with \(k=0\) imply \( \frac{x_1 - x_0}{{\sqrt{s}}} = -{\sqrt{s}}\,{\mathcal {L}^{-1}}G'(x_0). \)
Take the limit \(s\rightarrow 0\) and conclude \(\dot{X}(0)=0\) since \(G'\) and \(\dot{X}\) are assumed to be continuous. Therefore, we arrive at the desired IVP (4.1).
Remark A.1
(momentum method) A similar procedure can be carried out far more easily for the so-called momentum method (MM). To see this, we recall the difference quotients \( \ddot{X}(t) \approx \frac{x_{k+1} - 2x_k + x_{k-1}}{s} \) and \( \dot{X}(t) \approx \frac{x_k - x_{k-1}}{{\sqrt{s}}}. \)
Then, the discrete version of the ODE (4.1) becomes \( \frac{x_{k+1} - 2x_k + x_{k-1}}{s} + 2\eta \frac{x_k - x_{k-1}}{{\sqrt{s}}} + {\mathcal {L}^{-1}}G'(x_k) = 0, \) that is, \( x_{k+1} = x_k + (1 - 2\eta {\sqrt{s}})(x_k - x_{k-1}) - s{\mathcal {L}^{-1}}G'(x_k), \)
which is MM with the weight \(1-2\eta {\sqrt{s}}\); see [28, p. 12 (9)]. This weight is close to \(\lambda \): \( \lambda = \frac{1-\eta {\sqrt{s}}}{1+\eta {\sqrt{s}}} = 1 - 2\eta {\sqrt{s}} + {\mathcal {O}}(s). \)
In this sense, MM seems more natural and more amenable to analysis than AGD. \(\square \)
The limiting behavior of MM can also be explained by the IVP (4.1). Observe that the only essential difference between MM and PAGD is the point where \(G'\) is evaluated: \(x_k\) and \(y_k\), respectively. In the limit \(s\rightarrow 0\), \(x_k\) and \(y_k\) are indistinguishable in this setting. However, PAGD exhibits less oscillation than MM, since evaluating \(G'\) at \(y_k\) amounts to “foreseeing” an uphill stretch of the objective functional, if one exists, along the trajectory and “steering” to avoid unnecessary oscillations. Recently, a higher order Taylor expansion has been shown to help differentiate their performances (see [31]).
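The closeness of the two weights can be checked numerically. The sketch below assumes \(\lambda = (1-\eta {\sqrt{s}})/(1+\eta {\sqrt{s}})\) and verifies that \(\lambda - (1-2\eta {\sqrt{s}}) = 2\eta ^2 s + {\mathcal {O}}(s^{3/2})\), so the two schemes agree to first order in \({\sqrt{s}}\):

```python
def agd_weight(eta, h):
    """lambda = (1 - eta*h) / (1 + eta*h), where h plays the role of sqrt(s)."""
    return (1.0 - eta * h) / (1.0 + eta * h)

def mm_weight(eta, h):
    """Momentum-method weight 1 - 2*eta*sqrt(s)."""
    return 1.0 - 2.0 * eta * h

# lambda - (1 - 2*eta*h) = 2*eta^2*h^2 / (1 + eta*h) = 2*eta^2*h^2 + O(h^3),
# so the scaled difference below tends to 2*eta^2 = 0.5 for eta = 0.5.
eta = 0.5
ratios = [(agd_weight(eta, h) - mm_weight(eta, h)) / h**2 for h in (0.1, 0.05, 0.025)]
```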
Appendix 2: PAGD as a Discretization of the IVP
Let us denote the step size by \({\sqrt{s}}\), rather than s, in order to make the setting more in line with the PAGD algorithm. Again, it is helpful to keep in mind the correspondence: time \(t\longleftrightarrow k{\sqrt{s}}\) (\(k=0,1,2,\ldots \)) and position \(X(t)\longleftrightarrow x_k\). First, we will see that \(y_k\) corresponds to a “drifted” position, in the absence of the potential landscape, over \([t,t+{\sqrt{s}}]\). This can be modeled by \(\ddot{X}(t)+2\eta \dot{X}(t)=0\), which leads to another energy law \( \frac{1}{2}\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}^2 = \frac{1}{2}\bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}}^2 -2\eta \int _t^{t+{\sqrt{s}}}\bigl \Vert \dot{X}(\tau ) \bigr \Vert _{\mathcal {L}}^2{d }\tau . \) Approximating the speed in the integrand by the average \(\frac{1}{2}(\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}+\bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}})\), a short calculation yields \(\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}=\lambda \bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}}\). Since the dynamics takes place in a single direction, this implies \(\dot{X}(t+{\sqrt{s}})=\lambda \dot{X}(t)\). The approximations \(\dot{X}(t)\approx \frac{x_k-x_{k-1}}{{\sqrt{s}}}\) and \(\dot{X}(t+{\sqrt{s}})\approx \frac{y_k-x_k}{{\sqrt{s}}}\) lead us to (3.2).
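For completeness, a sketch of the short calculation, assuming the weight takes the form \(\lambda = (1-\eta {\sqrt{s}})/(1+\eta {\sqrt{s}})\): write \(a=\Vert \dot{X}(t+{\sqrt{s}})\Vert _{\mathcal {L}}\) and \(b=\Vert \dot{X}(t)\Vert _{\mathcal {L}}\); the averaged energy law then gives

```latex
\frac{1}{2}a^{2} = \frac{1}{2}b^{2} - 2\eta\sqrt{s}\left(\frac{a+b}{2}\right)^{2}
\quad\Longrightarrow\quad
(a-b)(a+b) = -\eta\sqrt{s}\,(a+b)^{2}
\quad\Longrightarrow\quad
(1+\eta\sqrt{s})\,a = (1-\eta\sqrt{s})\,b,
```

so that, dividing first by \(a+b>0\) and then by \(1+\eta \sqrt{s}\), one obtains \(a = \lambda b\) with \(\lambda = \frac{1-\eta \sqrt{s}}{1+\eta \sqrt{s}}\).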
Next, we discretize the vector V(t). Since we do not know the minimizer in practice, we remove it from the definition of \(v_k\) and discretize \(V(t)+x^*=X(t)+\frac{1}{\eta }\dot{X}(t)\). The approximations \(X(t)\approx y_k\) and \(\dot{X}(t)\approx \frac{y_k-x_k}{{\sqrt{s}}}\) suggest \( v_k = y_k + \frac{1}{\eta {\sqrt{s}}}\left( y_k - x_k \right) , \)
which leads to the definition of \(\{v_k\}_{k\ge 1}\) (3.4) upon combining with the definition of \(\{y_k\}\).
Finally, to obtain the main iterates \(\{x_{k}\}_{k\ge 1}\), we discretize (4.3) using the approximations \(\dot{V} (t) \approx \frac{v_{k+1}-v_k}{{\sqrt{s}}}\) and \(\dot{X} (t) \approx \frac{y_{k}-x_k}{{\sqrt{s}}}\), and evaluating \(G'\) at \(y_k\), which gives \( \eta \frac{v_{k+1}-v_k}{{\sqrt{s}}}+\eta \frac{y_k-x_k}{{\sqrt{s}}}+{\mathcal {L}^{-1}}G'(y_k)=0 \). Plugging in (3.4) and (A.1), one obtains (3.3), the definition of \(\{x_k\}_{k\ge 1}\).
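A minimal sketch of the resulting scheme (assuming the two-sequence forms \(y_k = x_k + \lambda (x_k - x_{k-1})\) and \(x_{k+1} = y_k - s{\mathcal {L}^{-1}}G'(y_k)\) implied by the derivation above, with \(G(x)=\frac{1}{2} x^\top Ax\) and the diagonal preconditioner \({\mathcal {L}} = \operatorname{diag}(A)\); the matrix and parameter values are illustrative):

```python
def pagd(A, x0, h, eta, iters):
    """PAGD on G(x) = x^T A x / 2 with preconditioner L = diag(A).

    h plays the role of sqrt(s). The update forms used here are an
    assumption based on the derivation above, not taken verbatim
    from the paper.
    """
    n = len(x0)
    lam = (1.0 - eta * h) / (1.0 + eta * h)        # momentum weight
    linv = [1.0 / A[i][i] for i in range(n)]       # L^{-1} for L = diag(A)
    x_prev, x = list(x0), list(x0)
    for _ in range(iters):
        y = [x[i] + lam * (x[i] - x_prev[i]) for i in range(n)]        # cf. (3.2)
        g = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]  # G'(y) = Ay
        x_prev, x = x, [y[i] - h * h * linv[i] * g[i] for i in range(n)]  # cf. (3.3)
    return x

# Ill-conditioned quadratic: cond(A) = O(100), but cond(L^{-1}A) = O(1),
# so a step size chosen for the preconditioned problem converges.
A = [[100.0, 1.0], [1.0, 1.0]]
x = pagd(A, x0=[1.0, 1.0], h=0.1, eta=0.5, iters=2000)
```

The preconditioner rescales the gradient so that a single, mesh-independent step size works across all components, which is the behavior reported in the experiments.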
Appendix 3: Literature Comparison
We summarize our discussion of the existing literature, and contrast it with our contributions, in Table 2.
Cite this article
Park, JH., Salgado, A.J. & Wise, S.M. Preconditioned Accelerated Gradient Descent Methods for Locally Lipschitz Smooth Objectives with Applications to the Solution of Nonlinear PDEs. J Sci Comput 89, 17 (2021). https://doi.org/10.1007/s10915-021-01615-8
Keywords
- Preconditioning
- Nesterov acceleration
- Momentum method
- Convex optimization
- Nonlinear elliptic partial differential equations
- Pseudo-spectral methods
- Lyapunov