Abstract
We introduce new optimized first-order methods for smooth unconstrained convex minimization. Drori and Teboulle (Math Program 145(1–2):451–482, 2014. doi:10.1007/s10107-013-0653-0) recently described a numerical method for computing the N-iteration optimal step coefficients in a class of first-order algorithms that includes gradient methods, heavy-ball methods (Polyak in USSR Comput Math Math Phys 4(5):1–17, 1964. doi:10.1016/0041-5553(64)90137-5), and Nesterov’s fast gradient methods (Nesterov in Sov Math Dokl 27(2):372–376, 1983; Math Program 103(1):127–152, 2005. doi:10.1007/s10107-004-0552-5). However, the numerical method in Drori and Teboulle (2014) is computationally expensive for large N, and the corresponding numerically optimized first-order algorithm in Drori and Teboulle (2014) requires impractical memory and computation for large-scale optimization problems. In this paper, we propose optimized first-order algorithms that achieve a convergence bound that is two times smaller than for Nesterov’s fast gradient methods; our bound is found analytically and refines the numerical bound in Drori and Teboulle (2014). Furthermore, the proposed optimized first-order methods have efficient forms that are remarkably similar to Nesterov’s fast gradient methods.
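To make the claimed similarity to Nesterov's fast gradient methods concrete, the following is a minimal sketch of an OGM-style iteration in the kind of "efficient form" the abstract refers to. The momentum coefficients and the modified final step factor below are recalled from standard presentations of optimized gradient methods for illustration; they are assumptions of this sketch, not a verbatim transcription of the algorithm defined in the paper body.

import numpy as np

def ogm1_style(grad, x0, L, N):
    # Gradient step plus two momentum terms; Nesterov's FGM1 would omit the
    # second momentum term (theta/theta_next)*(y_next - x) and would use the
    # factor-4 theta update at every iteration, including the last one.
    x, y, theta = x0.copy(), x0.copy(), 1.0
    for i in range(N):
        y_next = x - grad(x) / L
        if i < N - 1:
            theta_next = (1 + np.sqrt(1 + 4 * theta**2)) / 2
        else:
            theta_next = (1 + np.sqrt(1 + 8 * theta**2)) / 2  # assumed larger final factor
        x = y_next + (theta - 1) / theta_next * (y_next - y) \
                   + theta / theta_next * (y_next - x)
        y, theta = y_next, theta_next
    return x

# toy usage on a quadratic with Lipschitz constant L = 100
A = np.diag([1.0, 10.0, 100.0])
x_final = ogm1_style(lambda x: A @ x, np.ones(3), L=100.0, N=50)

Compared with a Nesterov-type iteration, the only structural changes in this sketch are the additional \(\theta _i/\theta _{i+1}\)-weighted momentum term and the larger final \(\theta _N\).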
Notes
The problem \(\mathcal {B}_{\mathrm {P}}(\varvec{h},N,d,L,R)\) was shown to be independent of d in [17]; thus this paper’s results are independent of d.
Substituting \(\varvec{x}' = \frac{1}{R}\varvec{x}\) and \(\breve{f} (\varvec{x}') = \frac{1}{LR^2}f(R\varvec{x}')\in \mathcal {F}_1(\mathbb {R} ^d)\) in problem (P), we get \(\mathcal {B}_{\mathrm {P}}(\varvec{h},N,L,R) = LR^2\mathcal {B}_{\mathrm {P}}(\varvec{h},N,1,1)\). This leads to \(\varvec{\hat{h}}_{\mathrm {P}} = \hbox {arg min}_{\varvec{h}} \mathcal {B}_{\mathrm {P}}(\varvec{h},N,L,R) = \hbox {arg min}_{\varvec{h}} \mathcal {B}_{\mathrm {P}}(\varvec{h},N,1,1)\).
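For completeness, the scaling argument behind this note can be spelled out; only the substitution above is used, together with the convention (assumed here) that \(\mathcal {F}_L(\mathbb {R} ^d)\) denotes convex functions with \(L\)-Lipschitz gradient:
\[
\nabla \breve{f} (\varvec{x}') = \frac{1}{LR}\nabla f(R\varvec{x}'), \qquad
\left\Vert \nabla \breve{f} (\varvec{x}') - \nabla \breve{f} (\varvec{y}')\right\Vert
\le \frac{1}{LR}\, L \left\Vert R\varvec{x}' - R\varvec{y}'\right\Vert
= \left\Vert \varvec{x}' - \varvec{y}'\right\Vert ,
\]
so \(\breve{f} \in \mathcal {F}_1(\mathbb {R} ^d)\), while \(\Vert \varvec{x}_0' - \varvec{x}_*'\Vert = \frac{1}{R}\Vert \varvec{x}_0 - \varvec{x}_*\Vert \) and \(f(\varvec{x}) - f(\varvec{x}_*) = LR^2\bigl (\breve{f} (\varvec{x}') - \breve{f} (\varvec{x}_*')\bigr )\), so every worst-case objective gap, and hence \(\mathcal {B}_{\mathrm {P}}\), scales by exactly \(LR^2\).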
The equivalence of two of Nesterov’s fast gradient methods for smooth unconstrained convex minimization was previously mentioned without details in [18].
The fast gradient method in [12] was originally developed to generalize FGM1 to the constrained case. Here, this second form is introduced for use in later proofs.
The vector \(\varvec{e}_{N,i}^{ }\) is the \(i\hbox {th}\) standard basis vector in \(\mathbb {R} ^{N}\), with 1 in the \(i\hbox {th}\) entry and 0 in all other entries.
References
Allen-Zhu, Z., Orecchia, L.: Linear coupling: an ultimate unification of gradient and mirror descent (2015). arXiv:1407.1537
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). doi:10.1137/080716542
CVX Research, Inc.: CVX: Matlab software for disciplined convex programming, version 2.0 (2012). http://cvxr.com/cvx
Drori, Y.: Contributions to the complexity analysis of optimization algorithms. Ph.D. thesis, Tel Aviv University, Israel (2014)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014). doi:10.1007/s10107-013-0653-0
Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, Berlin (2008). http://stanford.edu/~boyd/graph_dcp.html
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization (2014). arXiv:1406.5468v1
Kim, D., Fessler, J.A.: Optimized momentum steps for accelerating X-ray CT ordered subsets image reconstruction. In: Proceedings of 3rd International Meeting on Image Formation in X-ray CT, pp. 103–106 (2014)
Kim, D., Fessler, J.A.: An optimized first-order method for image restoration. In: Proceedings of IEEE International Conference on Image Processing (2015). (to appear)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27(2), 372–376 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Dordrecht (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005). doi:10.1007/s10107-004-0552-5
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013). doi:10.1007/s10107-012-0629-5
O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015). doi:10.1007/s10208-013-9150-3
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964). doi:10.1016/0041-5553(64)90137-5
Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights (2015). arXiv:1503.01243
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods (2015). arXiv:1502.05666
Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125(2), 263–295 (2010). doi:10.1007/s10107-010-0394-2
Additional information
This research was supported in part by NIH Grants R01-HL-098686 and U01-EB-018753.
Appendix
1.1 Proof of Lemma 2
We prove that the choice in (6.9)–(6.12) satisfies the feasibility conditions (6.14) of (RD1).
Using the definition of \(\varvec{\breve{Q}}(\varvec{r},\varvec{\lambda },\varvec{{\tau }})\) in (6.7), and considering the first two conditions of (6.14), we get
where the last equality comes from \((\varvec{\lambda },\varvec{{\tau }})\in {\varLambda }\), and this reduces to the following recursion:
We use induction to prove that the solution of (10.1) is
which is equivalent to \(\varvec{\hat{\lambda }}\) (6.10). It is obvious that \(\lambda _1 = \theta _0^2\lambda _1\) since \(\theta _0 = 1\), and for \(i=2\) in (10.1), we get
Then, assuming \(\lambda _i = \theta _{i-1}^2\lambda _1\) for \(i=1,\ldots ,n\) and \(n\le N-1\), and using the second equality in (10.1) for \(i=n+1\), we get
where the last equality uses (3.2). Then we use the first equality in (10.1) to find the value of \(\lambda _1\) as
with \(\theta _N\) in (6.13).
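For readers reconstructing the algebra, the identity behind the inductive step can be made explicit. The recursion written here for (3.2), namely \(\theta _0 = 1\) and \(\theta _i = \frac{1+\sqrt{1+4\theta _{i-1}^2}}{2}\) for \(i \le N-1\), is an assumption, since the displayed equations are not reproduced above:
\[
(2\theta _i - 1)^2 = 1 + 4\theta _{i-1}^2
\;\Longrightarrow \;
\theta _i^2 - \theta _i = \theta _{i-1}^2
\;\Longrightarrow \;
\theta _i^2 = \sum _{j=0}^{i}\theta _j, \qquad i = 0,\ldots ,N-1,
\]
where the telescoping uses \(\theta _0^2 = \theta _0 = 1\). Assuming (6.13) is the factor-8 variant \(\theta _N = \frac{1+\sqrt{1+8\theta _{N-1}^2}}{2}\), the same manipulation gives \(\theta _N^2 - \theta _N = 2\theta _{N-1}^2\).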
So far, we have derived \(\varvec{\hat{\lambda }}\) (6.10) using some of the conditions in (6.14). Next, using the last two conditions in (6.14) with (3.2) and (6.15), we can easily derive the following:
which are equivalent to \(\varvec{{\hat{\tau }}}\) (6.11) and \(\hat{\gamma } \) (6.12).
Next, we derive for given \(\varvec{\hat{\lambda }}\) (6.10) and \(\varvec{{\hat{\tau }}}\) (6.11). Inserting \(\varvec{{\hat{\tau }}}\) (6.11) into the first two conditions of (6.14), we get
for \(i,k=0,\ldots ,N-1\), and considering (6.5) and (10.2), we get
Finally, using the two equivalent forms (6.2) and (10.3) of , we get
and this can be easily converted to the choice \(\hat{r}_{i,k}\) in (6.9).
For these given , we can easily notice that
for \(\varvec{\check{\theta }}= \left( \theta _0,\ldots ,\theta _{N-1},\frac{\theta _N}{2}\right) ^\top \), showing that the choice is feasible in both (RD) and (RD1). \(\square \)
1.2 Proof of (8.2)
We prove that (8.2) holds for the coefficients \(\varvec{\hat{h}}\) (7.1) of OGM1 and OGM2.
We first show the following property using induction:
Clearly, \(\hat{h}_{1,0} = 1 + \frac{2\theta _0 - 1}{\theta _1} = \theta _1\) using (3.2). Assuming \(\sum _{k=0}^{j-1}\hat{h}_{j,k} = \theta _j\) for \(j=1,\ldots ,n\) and \(n\le N-1\), we get
where the last equality uses (3.2) and (6.15).
Then, (8.2) can be easily derived using (3.2) and (6.15) as
\(\square \)
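As a quick numerical sanity check of the \(\theta _i\) algebra used in both appendix proofs, the following sketch verifies the telescoped identity and the final-step relation; the recursions coded here are the assumed forms of (3.2) and (6.13) noted earlier, not quotations from the paper.

import numpy as np

def theta_sequence(N):
    # Assumed recursions: theta_0 = 1, the factor-4 rule up to i = N-1,
    # and the factor-8 rule for the final theta_N.
    theta = [1.0]
    for i in range(1, N):
        theta.append((1 + np.sqrt(1 + 4 * theta[-1]**2)) / 2)
    theta.append((1 + np.sqrt(1 + 8 * theta[-1]**2)) / 2)   # theta_N
    return np.array(theta)

N = 30
theta = theta_sequence(N)
# telescoped identity: theta_i^2 = sum_{j<=i} theta_j for i = 0,...,N-1
assert np.allclose(theta[:N]**2, np.cumsum(theta[:N]))
# final-step relation: theta_N^2 - theta_N = 2 * theta_{N-1}^2
assert np.isclose(theta[N]**2 - theta[N], 2 * theta[N-1]**2)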
About this article
Cite this article
Kim, D., Fessler, J.A. Optimized first-order methods for smooth convex minimization. Math. Program. 159, 81–107 (2016). https://doi.org/10.1007/s10107-015-0949-3