Abstract
We develop a theoretical foundation for the application of Nesterov's accelerated gradient descent method (AGD) to the approximation of solutions of a wide class of partial differential equations (PDEs). This is achieved by proving the existence of an invariant set and exponential convergence rates when its preconditioned version (PAGD) is applied to minimize locally Lipschitz smooth, strongly convex objective functionals. We introduce a second-order ordinary differential equation (ODE) with a built-in preconditioner and show that PAGD is an explicit time discretization of this ODE, which requires a natural time step restriction for energy stability. At the continuous level, we show exponential convergence of the ODE solution to its steady state using a simple energy argument. At the discrete level, assuming the aforementioned step size restriction, the existence of an invariant set is proved and a matching exponential rate of convergence of the PAGD scheme is derived by mimicking the energy argument and the convergence proof at the continuous level. Applications of the PAGD method to numerical PDEs are demonstrated with certain nonlinear elliptic PDEs using pseudo-spectral methods for spatial discretization, and several numerical experiments are conducted. The results confirm the global, geometric, and mesh size-independent convergence of the PAGD method, with an accelerated rate that improves on that of the preconditioned gradient descent (PGD) method.
Data Availability
The results of numerical experiments are either fully documented in the manuscript or can be made available on reasonable request.
Code Availability
The code can be made available on reasonable request.
References
Adams, R.A., Fournier, J.J.F.: Sobolev Spaces, Volume 140 of Pure and Applied Mathematics (Amsterdam), 2nd edn. Elsevier/Academic Press, Amsterdam (2003). ISBN 0-12-044143-8
Ainsworth, M., Mao, Z.: Well-posedness of the Cahn–Hilliard equation with fractional free energy and its Fourier Galerkin approximation. Chaos Solitons Fractals 102, 264–273 (2017). https://doi.org/10.1016/j.chaos.2017.05.022
Ainsworth, M., Mao, Z.: Fractional phase-field crystal modelling: analysis, approximation and pattern formation. IMA J. Appl. Math. 85(2), 231–262 (2020). https://doi.org/10.1093/imamat/hxaa004
Allen-Zhu, Z., Orecchia, L.: Linear coupling: an ultimate unification of gradient and mirror descent. In: 8th Innovations in Theoretical Computer Science Conference, Volume 67 of LIPIcs. Leibniz International Proceedings in Informatics, Art. No. 3, 22 pp. Schloss Dagstuhl Leibniz-Zentrum für Informatik, Wadern (2017)
Antil, H., Otárola, E., Salgado, A.J.: Optimization with respect to order in a fractional diffusion model: analysis, approximation and algorithmic aspects. J. Sci. Comput. 77(1), 204–224 (2018). https://doi.org/10.1007/s10915-018-0703-0
Attouch, H., Goudou, X., Redont, P.: The heavy ball with friction method. I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Commun. Contemp. Math. 2(1), 1–34 (2000). https://doi.org/10.1142/S0219199700000025
Barrett, J.W., Liu, W.B.: Finite element approximation of the \(p\)-Laplacian. Math. Comput. 61(204), 523–537 (1993). https://doi.org/10.2307/2153239
Beck, A.: First-Order Methods in Optimization, Volume 25 of MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Optimization Society, Philadelphia, PA (2017). ISBN 978-1-611974-98-0. https://doi.org/10.1137/1.9781611974997.ch1
Benyamin, M., Calder, J., Sundaramoorthi, G., Yezzi, A.: Accelerated variational PDEs for efficient solution of regularized inversion problems. J. Math. Imaging Vis. 62(1), 10–36 (2020). https://doi.org/10.1007/s10851-019-00910-2
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific Optimization and Computation Series, 2nd edn. Athena Scientific, Belmont (1999)
Bonito, A., Borthagaray, J.P., Nochetto, R.H., Otárola, E., Salgado, A.J.: Numerical methods for fractional diffusion. Comput. Vis. Sci. 19(5–6), 19–46 (2018). https://doi.org/10.1007/s00791-018-0289-y
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Calder, J., Yezzi, A.: PDE acceleration: a convergence rate analysis and applications to obstacle problems. Res. Math. Sci. 6(4), Paper No. 35, 30 pp. (2019). ISSN 2522-0144. https://doi.org/10.1007/s40687-019-0197-x
Canuto, C., Hussaini, M.Y., Quarteroni, A., Zang, T.A.: Spectral Methods: Fundamentals in Single Domains. Scientific Computation. Springer-Verlag, Berlin (2006). ISBN 978-3-540-30725-9; 3-540-30725-7
Chen, L., Hu, X., Wise, S.M.: Convergence analysis of the fast subspace descent method for convex optimization problems. Math. Comput. 89(325), 2249–2282 (2020). https://doi.org/10.1090/mcom/3526
Ciarlet, P.G.: Introduction to Numerical Linear Algebra and Optimisation. Cambridge Texts in Applied Mathematics. Cambridge University Press, Cambridge (1989). ISBN 0-521-32788-1; 0-521-33984-7. Translated from the French by A. Buttigieg, with the assistance of Bernadette Miara and Jean-Marie Thomas
Ciarlet, P.G.: Linear and Nonlinear Functional Analysis with Applications. Society for Industrial and Applied Mathematics, Philadelphia (2013)
Evans, L.C.: Partial Differential Equations, Volume 19 of Graduate Studies in Mathematics, 2nd edn. American Mathematical Society, Providence (2010). ISBN 978-0-8218-4974-3. https://doi.org/10.1090/gsm/019
Feng, W., Salgado, A.J., Wang, C., Wise, S.M.: Preconditioned steepest descent methods for some nonlinear elliptic equations involving p-Laplacian terms. J. Comput. Phys. 334, 45–67 (2017). https://doi.org/10.1016/j.jcp.2016.12.046
Feng, W., Guan, Z., Lowengrub, J., Wang, C., Wise, S.M., Chen, Y.: A uniquely solvable, energy stable numerical scheme for the functionalized Cahn–Hilliard equation and its convergence analysis. J. Sci. Comput. 76(3), 1938–1967 (2018). https://doi.org/10.1007/s10915-018-0690-1
Goudou, X., Munier, J.: The gradient and heavy ball with friction dynamical systems: the quasiconvex case. Math. Program. 116(1–2, Ser. B), 173–191 (2009). https://doi.org/10.1007/s10107-007-0109-5
Jovanović, B.S., Süli, E.: Analysis of Finite Difference Schemes. Springer Series in Computational Mathematics, vol. 46. Springer, London (2014). ISBN 978-1-4471-5459-4; 978-1-4471-5460-0. https://doi.org/10.1007/978-1-4471-5460-0
Laborde, M., Oberman, A.: A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In: Chiappa, S., Calandra, R. (eds.) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Volume 108 of Proceedings of Machine Learning Research, pp. 602–612 (2020). PMLR. http://proceedings.mlr.press/v108/laborde20a.html
Luo, H., Chen, L.: From differential equation solvers to accelerated first-order methods for convex optimization (2020). arXiv:1909.03145
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
Nesterov, Y.E.: Introductory Lectures on Convex Optimization. Applied Optimization, vol. 87. Kluwer Academic Publishers, Boston (2004). ISBN 1-4020-7553-7. https://doi.org/10.1007/978-1-4419-8853-9
Phelps, R.R.: Convex Functions, Monotone Operators and Differentiability. Lecture Notes in Mathematics, vol. 1364, 2nd edn. Springer-Verlag, Berlin (1993). ISBN 3-540-56715-1
Poljak, B.T.: Some methods of speeding up the convergence of iterative methods. Ž. Vyčisl. Mat i Mat. Fiz. 4, 791–803 (1964)
Schaeffer, H., Hou, T.Y.: An accelerated method for nonlinear elliptic PDE. J. Sci. Comput. 69(2), 556–580 (2016). https://doi.org/10.1007/s10915-016-0215-8
Shen, J., Tang, T., Wang, L.-L.: Spectral Methods: Algorithms, Analysis and Applications. Springer Series in Computational Mathematics, vol. 41. Springer, Heidelberg (2011). ISBN 978-3-540-71040-0. https://doi.org/10.1007/978-3-540-71041-7
Shi, B., Du, S.S., Jordan, M.I., Su, W.J.: Understanding the acceleration phenomenon via high-resolution differential equations (2018). arXiv:1810.08907
Siegel, J.W.: Accelerated first-order methods: differential equations and Lyapunov functions (2019). arXiv:1903.05671
Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2510–2518. Curran Associates, Inc. (2014)
Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. USA 113(47), E7351–E7358 (2016). https://doi.org/10.1073/pnas.1614734113
Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization (2018). arXiv:1611.02635
Funding
SMW acknowledges partial financial support from NSF-DMS 1719854. AJS has been partially supported by NSF-DMS 1720123.
Ethics declarations
Conflict of interest
We declare that we have no conflict of interest.
Ethics Approval
We have not submitted this manuscript anywhere, and it will not be submitted anywhere while it is under review.
Appendices
Appendix 1: An IVP as the Limit of the PAGD Method
A.1 Derivation of the ODE
Let us start with the same approach as in [33]. We assume, as an ansatz, that PAGD is a discretization of an ODE, which has a solution \(X:[0,\infty )\rightarrow \mathbb {H}\), or a trajectory. We also assume that X is smooth enough, e.g., twice continuously differentiable in time. For a fixed \(t\in (0,\infty )\), the assumed smoothness of X, together with the identification \(t= {\sqrt{s}}k\) and Taylor's formula in a normed vector space (e.g., [17, Theorem 7.9-1]), implies \( x_{k\pm 1} = X(t) \pm {\sqrt{s}}\dot{X}(t) + \frac{s}{2}\ddot{X}(t) + o(s) \) and \( {\mathcal {L}^{-1}}G'(y_k) = {\mathcal {L}^{-1}}G'(X(t)) + o(1) \) as \(s\rightarrow 0\).
The last identity follows from the continuity of \(G'\), that of \({\mathcal {L}^{-1}}\), and (3.2), from which we can deduce \(y_k\rightarrow X(t)\) as \(s\rightarrow 0\). Plugging (3.2) into (3.3) and dividing by \({\sqrt{s}}\), we have \( \frac{x_{k+1}-x_k}{{\sqrt{s}}}- \lambda \frac{x_{k}-x_{k-1}}{{\sqrt{s}}}+{\sqrt{s}}{\mathcal {L}^{-1}}G'(y_k)=0 \). Substituting the above Taylor expansions, dividing by \({\sqrt{s}}\) once more, and then rearranging, we arrive at \( \frac{1+\lambda }{2}\ddot{X}(t) + \frac{1-\lambda }{{\sqrt{s}}}\dot{X}(t) + {\mathcal {L}^{-1}}G'(X(t)) + o(1) = 0. \)
To make this estimate consistent, interpret \(\lambda \) as a function of s and further assume that \((1-\lambda )/{\sqrt{s}}\rightarrow 2\eta \) as \({\sqrt{s}}\rightarrow 0\) for some \(\eta \in (0,\infty )\) (so that, in particular, \(\lambda \rightarrow 1\)), which yields \( \ddot{X}(t) + 2\eta \dot{X}(t) + {\mathcal {L}^{-1}}G'(X(t)) = 0. \)
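As a sanity check of this limit, the following sketch (a toy under stated assumptions: \(G(x)=x^2/2\), \(\mathcal {L}\) the identity, the weight \(\lambda =(1-\eta {\sqrt{s}})/(1+\eta {\sqrt{s}})\), which satisfies \((1-\lambda )/{\sqrt{s}}\rightarrow 2\eta \), and the two-step forms of (3.2)–(3.3) discussed above) compares the AGD iterates with the exact solution of \(\ddot{X}+2\eta \dot{X}+X=0\), \(X(0)=x_0\), \(\dot{X}(0)=0\):

```python
import math

def agd_trajectory(h, eta=0.5, x0=1.0, T=2.0):
    """Run the AGD recursion with step sqrt(s) = h on G(x) = x^2/2
    (so G'(x) = x and the preconditioner L is the identity) and
    return the iterate x_k corresponding to time T = k*h."""
    lam = (1.0 - eta * h) / (1.0 + eta * h)  # satisfies (1 - lam)/h -> 2*eta
    x_prev, x = x0, x0                       # x_{-1} = x_0 encodes X'(0) = 0
    for _ in range(round(T / h)):
        y = x + lam * (x - x_prev)           # extrapolation step, cf. (3.2)
        x_prev, x = x, y - h * h * y         # gradient step, cf. (3.3), s = h^2
    return x

def ode_solution(t, eta=0.5, x0=1.0):
    """Exact solution of X'' + 2*eta*X' + X = 0, X(0) = x0, X'(0) = 0, eta < 1."""
    w = math.sqrt(1.0 - eta * eta)
    return x0 * math.exp(-eta * t) * (math.cos(w * t) + (eta / w) * math.sin(w * t))

# The discrete trajectory shadows the ODE solution as s -> 0:
err = abs(agd_trajectory(0.005) - ode_solution(2.0))
```

The error at a fixed time shrinks with the step size, consistent with AGD being a discretization of the limiting ODE.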
A.2 Derivation of the Initial Conditions
The initialization \(y_0 = x_0\) and (3.3) with \(k=0\) imply \( \frac{x_1 - x_0}{{\sqrt{s}}} = -{\sqrt{s}}\,{\mathcal {L}^{-1}}G'(x_0). \)
Take the limit \(s\rightarrow 0\) and conclude \(\dot{X}(0)=0\) since \(G'\) and \(\dot{X}\) are assumed to be continuous. Therefore, we arrive at the desired IVP (4.1).
Remark A.1
(momentum method) A similar procedure can be carried out far more easily for the so-called momentum method (MM). To see this, we recall the difference quotients \( \ddot{X}(t) \approx \frac{x_{k+1} - 2x_k + x_{k-1}}{s} \) and \( \dot{X}(t) \approx \frac{x_k - x_{k-1}}{{\sqrt{s}}}. \)
Then, the discrete version of the ODE (4.1) becomes \( \frac{x_{k+1} - 2x_k + x_{k-1}}{s} + 2\eta \frac{x_k - x_{k-1}}{{\sqrt{s}}} + {\mathcal {L}^{-1}}G'(x_k) = 0, \) that is, \( x_{k+1} = x_k + (1 - 2\eta {\sqrt{s}})(x_k - x_{k-1}) - s{\mathcal {L}^{-1}}G'(x_k), \)
which is MM with the weight \(1-2\eta {\sqrt{s}}\); see [28, p. 12 (9)]. This weight is close to \(\lambda \): \( \lambda = \frac{1-\eta {\sqrt{s}}}{1+\eta {\sqrt{s}}} = 1 - 2\eta {\sqrt{s}} + {\mathcal {O}}(s). \)
In this sense, MM seems more natural and more amenable to analysis than AGD. \(\square \)
The limiting behavior of MM can also be explained by the IVP (4.1). Observe that the only essential difference between MM and PAGD is the point where \(G'\) is evaluated: \(x_k\) and \(y_k\), respectively. In the limit \(s\rightarrow 0\), \(x_k\) and \(y_k\) are indistinguishable in this setting. However, PAGD exhibits less oscillation than MM, since evaluating \(G'\) at \(y_k\) amounts to “foreseeing” an uphill stretch of the objective functional, if one exists, along the trajectory and “steering” to avoid unnecessary oscillations. Recently, a higher order Taylor expansion has been shown to help differentiate their performances (see [31]).
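The closeness of the two weights can be checked numerically. The sketch below assumes \(\lambda = (1-\eta {\sqrt{s}})/(1+\eta {\sqrt{s}})\) and verifies that \(\lambda - (1-2\eta {\sqrt{s}}) = 2\eta ^2 s + {\mathcal {O}}(s^{3/2})\), so the two schemes agree to first order in \({\sqrt{s}}\):

```python
def agd_weight(eta, h):
    """lambda = (1 - eta*h) / (1 + eta*h), where h plays the role of sqrt(s)."""
    return (1.0 - eta * h) / (1.0 + eta * h)

def mm_weight(eta, h):
    """Momentum-method weight 1 - 2*eta*sqrt(s)."""
    return 1.0 - 2.0 * eta * h

# lambda - (1 - 2*eta*h) = 2*eta^2*h^2 / (1 + eta*h) = 2*eta^2*h^2 + O(h^3),
# so the scaled difference below tends to 2*eta^2 = 0.5 for eta = 0.5.
eta = 0.5
ratios = [(agd_weight(eta, h) - mm_weight(eta, h)) / h**2 for h in (0.1, 0.05, 0.025)]
```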
Appendix 2: PAGD as a Discretization of the IVP
Let us denote the step size by \({\sqrt{s}}\), rather than s, in order to make the setting more in line with the PAGD algorithm. Again, it is helpful to keep in mind the correspondence: time \(t\longleftrightarrow k{\sqrt{s}}\) (\(k=0,1,2,\ldots \)) and position \(X(t)\longleftrightarrow x_k\). First, we will see that \(y_k\) corresponds to a “drifted” position, in the absence of the potential landscape, over \([t,t+{\sqrt{s}}]\). This can be modeled by \(\ddot{X}(t)+2\eta \dot{X}(t)=0\), which leads to another energy law \( \frac{1}{2}\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}^2 = \frac{1}{2}\bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}}^2 -2\eta \int _t^{t+{\sqrt{s}}}\bigl \Vert \dot{X}(\tau ) \bigr \Vert _{\mathcal {L}}^2{d }\tau . \) Approximating the speed in the integrand by the average \(\frac{1}{2}(\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}+\bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}})\), a short calculation yields \(\bigl \Vert \dot{X}(t+{\sqrt{s}}) \bigr \Vert _{\mathcal {L}}=\lambda \bigl \Vert \dot{X}(t) \bigr \Vert _{\mathcal {L}}\). Since the dynamics takes place in a single direction, this implies \(\dot{X}(t+{\sqrt{s}})=\lambda \dot{X}(t)\). The approximations \(\dot{X}(t)\approx \frac{x_k-x_{k-1}}{{\sqrt{s}}}\) and \(\dot{X}(t+{\sqrt{s}})\approx \frac{y_k-x_k}{{\sqrt{s}}}\) lead us to (3.2).
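For completeness, a sketch of the short calculation, assuming the weight takes the form \(\lambda = (1-\eta {\sqrt{s}})/(1+\eta {\sqrt{s}})\): write \(a=\Vert \dot{X}(t+{\sqrt{s}})\Vert _{\mathcal {L}}\) and \(b=\Vert \dot{X}(t)\Vert _{\mathcal {L}}\); the averaged energy law then gives

```latex
\frac{1}{2}a^{2} = \frac{1}{2}b^{2} - 2\eta\sqrt{s}\left(\frac{a+b}{2}\right)^{2}
\quad\Longrightarrow\quad
(a-b)(a+b) = -\eta\sqrt{s}\,(a+b)^{2}
\quad\Longrightarrow\quad
(1+\eta\sqrt{s})\,a = (1-\eta\sqrt{s})\,b,
```

so that, dividing first by \(a+b>0\) and then by \(1+\eta \sqrt{s}\), one obtains \(a = \lambda b\) with \(\lambda = \frac{1-\eta \sqrt{s}}{1+\eta \sqrt{s}}\).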
Next, we discretize the vector V(t). Since we do not know the minimizer in practice, we remove it from the definition of \(v_k\) and discretize \(V(t)+x^*=X(t)+\frac{1}{\eta }\dot{X}(t)\). The approximations \(X(t)\approx y_k\) and \(\dot{X}(t)\approx \frac{y_k-x_k}{{\sqrt{s}}}\) suggest \( v_k = y_k + \frac{1}{\eta {\sqrt{s}}}\left( y_k - x_k \right) , \)
which leads to the definition of \(\{v_k\}_{k\ge 1}\) (3.4) upon combining with the definition of \(\{y_k\}\).
Finally, to obtain the main iterates \(\{x_{k}\}_{k\ge 1}\), we discretize (4.3) using the approximations \(\dot{V} (t) \approx \frac{v_{k+1}-v_k}{{\sqrt{s}}}\) and \(\dot{X} (t) \approx \frac{y_{k}-x_k}{{\sqrt{s}}}\), and evaluating \(G'\) at \(y_k\), which gives \( \eta \frac{v_{k+1}-v_k}{{\sqrt{s}}}+\eta \frac{y_k-x_k}{{\sqrt{s}}}+{\mathcal {L}^{-1}}G'(y_k)=0 \). Plugging in (3.4) and (A.1), one obtains (3.3), the definition of \(\{x_k\}_{k\ge 1}\).
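A minimal sketch of the resulting scheme (assuming the two-sequence forms \(y_k = x_k + \lambda (x_k - x_{k-1})\) and \(x_{k+1} = y_k - s{\mathcal {L}^{-1}}G'(y_k)\) implied by the derivation above, with \(G(x)=\frac{1}{2} x^\top Ax\) and the diagonal preconditioner \({\mathcal {L}} = \operatorname{diag}(A)\); the matrix and parameter values are illustrative):

```python
def pagd(A, x0, h, eta, iters):
    """PAGD on G(x) = x^T A x / 2 with preconditioner L = diag(A).

    h plays the role of sqrt(s). The update forms used here are an
    assumption based on the derivation above, not taken verbatim
    from the paper.
    """
    n = len(x0)
    lam = (1.0 - eta * h) / (1.0 + eta * h)        # momentum weight
    linv = [1.0 / A[i][i] for i in range(n)]       # L^{-1} for L = diag(A)
    x_prev, x = list(x0), list(x0)
    for _ in range(iters):
        y = [x[i] + lam * (x[i] - x_prev[i]) for i in range(n)]        # cf. (3.2)
        g = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]  # G'(y) = Ay
        x_prev, x = x, [y[i] - h * h * linv[i] * g[i] for i in range(n)]  # cf. (3.3)
    return x

# Ill-conditioned quadratic: cond(A) = O(100), but cond(L^{-1}A) = O(1),
# so a step size chosen for the preconditioned problem converges.
A = [[100.0, 1.0], [1.0, 1.0]]
x = pagd(A, x0=[1.0, 1.0], h=0.1, eta=0.5, iters=2000)
```

The preconditioner rescales the gradient so that a single, mesh-independent step size works across all components, which is the behavior reported in the experiments.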
Appendix 3: Literature Comparison
We summarize our discussion of the existing literature, and contrast it with our contributions, in Table 2.
Cite this article
Park, JH., Salgado, A.J. & Wise, S.M. Preconditioned Accelerated Gradient Descent Methods for Locally Lipschitz Smooth Objectives with Applications to the Solution of Nonlinear PDEs. J Sci Comput 89, 17 (2021). https://doi.org/10.1007/s10915-021-01615-8
Keywords
- Preconditioning
- Nesterov acceleration
- Momentum method
- Convex optimization
- Nonlinear elliptic partial differential equations
- Pseudo-spectral methods
- Lyapunov