Convergence rate of block-coordinate maximization Burer–Monteiro method for solving large SDPs

Abstract

Semidefinite programs (SDPs) with diagonal constraints arise in many optimization problems, such as Max-Cut, community detection and group synchronization. Although SDPs can be solved to arbitrary precision in polynomial time, generic convex solvers do not scale well with the dimension of the problem. In order to address this issue, Burer and Monteiro (Math Program 95(2):329–357, 2003) proposed to reduce the dimension of the problem by appealing to a low-rank factorization and to solve the resulting non-convex problem instead. In this paper, we present coordinate-ascent-based methods to solve this non-convex problem with provable convergence guarantees. More specifically, we prove that the block-coordinate maximization algorithm applied to the non-convex Burer–Monteiro formulation globally converges to a first-order stationary point with a sublinear rate without any assumptions on the problem. We further show that this algorithm converges linearly around a local maximum provided that the objective function exhibits quadratic decay. We establish that this condition generically holds when the rank of the factorization is sufficiently large. Furthermore, by incorporating the Lanczos method into the block-coordinate maximization, we propose an algorithm that is guaranteed to return a solution that provides a \(1-{\mathcal {O}}\left( 1/r\right) \) approximation to the original SDP without any assumptions, where r is the rank of the factorization. This approximation ratio is known to be optimal (up to constants) under the unique games conjecture, and we can explicitly quantify the number of iterations required to obtain such a solution.
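
To make the setting concrete, the following Python sketch shows one natural implementation of block-coordinate maximization for the Burer–Monteiro factorization studied in the paper: each step exactly maximizes \(f(\varvec{\sigma }) = \langle {\varvec{A}}, \varvec{\sigma }\varvec{\sigma }^\top \rangle \) over a single unit-norm row. The function name, cyclic sweep order, and iteration budget are illustrative assumptions, not the authors' code.

```python
import numpy as np

def bcm_burer_monteiro(A, r, n_sweeps=100, seed=0):
    """Block-coordinate maximization (BCM) for the Burer-Monteiro factorization
    of a diagonally constrained SDP: maximize <A, sigma sigma^T> over sigma in
    R^{n x r} whose rows have unit norm.  Each step maximizes the objective
    exactly over one row, which gives sigma_i <- g_i / ||g_i|| with
    g_i = sum_{j != i} A_ij sigma_j."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    sigma = rng.standard_normal((n, r))
    sigma /= np.linalg.norm(sigma, axis=1, keepdims=True)   # random feasible start
    for _ in range(n_sweeps):
        for i in range(n):                                  # cyclic sweep over the n blocks
            g = A[i] @ sigma - A[i, i] * sigma[i]           # g_i = sum_{j != i} A_ij sigma_j
            norm_g = np.linalg.norm(g)
            if norm_g > 0:
                sigma[i] = g / norm_g                       # exact block maximizer on the sphere
    return sigma

# Objective value of the returned factorization, f(sigma) = <A, sigma sigma^T>:
#   f_val = float(np.sum(A * (sigma @ sigma.T)))
```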

Notes

  1. Note that the dimension of \({\mathcal {V}}_{\varvec{\sigma }}\) depends on the rank of \(\varvec{\sigma }\), and hence the quotient space is not a manifold.

References

  1. Absil, P.-A., Baker, C.G., Gallivan, K.A.: Trust-region methods on Riemannian manifolds. Found. Comput. Math. 7(3), 303–330 (2007)

  2. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2007)

  3. Alizadeh, F., Haeberly, J.-P.A., Overton, M.L.: Complementarity and nondegeneracy in semidefinite programming. Math. Program. 77(1), 111–128 (1997)

  4. Anitescu, M.: Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim. 10(4), 1116–1135 (2000)

  5. Arora, S., Hazan, E., Kale, S.: Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, FOCS’05, pp. 339–348 (2005)

  6. Bandeira, A.S., Boumal, N., Voroninski, V.: On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv:1602.04426 (2016)

  7. Barvinok, A.I.: Problems of distance geometry and convex properties of quadratic maps. Discrete Comput. Geom. 13(2), 189–202 (1995)

  8. Bonnans, J.F., Ioffe, A.: Second-order sufficiency and quadratic growth for nonisolated minima. Math. Oper. Res. 20(4), 801–817 (1995)

  9. Boumal, N., Absil, P.-A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. arXiv preprint arXiv:1605.08101 (2016)

  10. Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)

  11. Boumal, N., Voroninski, V., Bandeira, A.S.: The non-convex Burer–Monteiro approach works on smooth semidefinite programs. In: Advances in Neural Information Processing Systems, pp. 2757–2765 (2016)

  12. Boumal, N., Voroninski, V., Bandeira, A.S.: Deterministic guarantees for Burer–Monteiro factorizations of smooth semidefinite programs. arXiv preprint arXiv:1804.02008 (2018)

  13. Briat, C.: Linear Parameter-Varying and Time-Delay Systems. Springer (2014)

  14. Briët, J., de Oliveira Filho, F.M., Vallentin, F.: The positive semidefinite Grothendieck problem with rank constraint. In: Automata, Languages and Programming, pp. 31–42 (2010)

  15. Burer, S., Monteiro, R.D.C.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)

  16. Burer, S., Monteiro, R.D.C.: Local minima and convergence in low-rank semidefinite programming. Math. Program. 103(3), 427–444 (2005)

  17. Cifuentes, D., Moitra, A.: Polynomial time guarantees for the Burer–Monteiro method. arXiv preprint arXiv:1912.01745 (2019)

  18. Coakley, E.S., Rokhlin, V.: A fast divide-and-conquer algorithm for computing the spectra of real symmetric tridiagonal matrices. Appl. Comput. Harmon. Anal. 34(3), 379–414 (2013)

  19. Erdogdu, M.A., Deshpande, Y., Montanari, A.: Inference in graphical models via semidefinite programming hierarchies. In: Advances in Neural Information Processing Systems, pp. 416–424 (2017)

  20. Gamarnik, D., Li, Q.: On the max-cut of sparse random graphs. arXiv preprint arXiv:1411.1698 (2014)

  21. Garber, D., Hazan, E.: Approximating semidefinite programs in sublinear time. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pp. 1080–1088 (2011)

  22. Goemans, M.X., Williamson, D.P.: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42(6), 1115–1145 (1995)

  23. Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.A., Vanli, N.D.: When cyclic coordinate descent outperforms randomized coordinate descent. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, volume 30, pp. 6999–7007. Curran Associates, Inc. (2017)

  24. Gurbuzbalaban, M., Ozdaglar, A., Vanli, N.D., Wright, S.J.: Randomness and permutations in coordinate descent methods. Math. Program. 181, 03 (2018)

  25. Javanmard, A., Montanari, A., Ricci-Tersenghi, F.: Phase transitions in semidefinite relaxations. Proc. Natl. Acad. Sci. 113(16), E2218–E2223 (2016)

  26. Journee, M., Bach, F., Absil, P.-A., Sepulchre, R.: Low-rank optimization on the cone of positive semidefinite matrices. SIAM J. Optim. 20(5), 2327–2351 (2010)

  27. Klein, P., Lu, H.-I.: Efficient approximation algorithms for semidefinite programs arising from MAX CUT and COLORING. In: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC’96, pp. 338–347. ACM, New York, NY, USA (1996)

  28. Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalues by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)

  29. Lee, C.-P., Wright, S.J.: Random permutations fix a worst case for cyclic coordinate descent. IMA J. Numer. Anal. 39, 07 (2016)

  30. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: 29th Annual Conference on Learning Theory, vol. 49, pp. 1246–1257. PMLR (2016)

  31. Lu, Z., Xiao, L.: Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. Technical Report MSR-TR-2013-66 (2013)

  32. Mei, S., Misiakiewicz, T., Montanari, A., Oliveira, R.I.: Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. arXiv preprint arXiv:1703.08729 (2017)

  33. Montanari, A.: A Grothendieck-type inequality for local maxima. arXiv preprint arXiv:1603.04064 (2016)

  34. Parrilo, P.A.: Semidefinite programming relaxations for semialgebraic problems. Math. Program. 96(2), 293–320 (2003)

  35. Pataki, G.: On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Math. Oper. Res. 23(2), 339–358 (1998)

  36. Patrascu, A., Necoara, I.: Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. J. Glob. Optim. 61, 05 (2013)

  37. Pumir, T., Jelassi, S., Boumal, N.: Smoothed analysis of the low-rank approach for smooth semidefinite programs. In: Advances in Neural Information Processing Systems, pp. 2281–2290 (2018)

  38. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144, 07 (2011)

  39. Steurer, D.: Fast SDP algorithms for constraint satisfaction problems. In: Proceedings of the Twenty-First Annual ACM–SIAM Symposium on Discrete Algorithms, pp. 684–697 (2010)

  40. Tropp, J.A., Yurtsever, A., Udell, M., Cevher, V.: Practical sketching algorithms for low-rank matrix approximation. SIAM J. Matrix Anal. Appl. 38(4), 1454–1485 (2017)

  41. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)

  42. Vandenberghe, L., Boyd, S.: Semidefinite programming. SIAM Rev. 38(1), 49–95 (1996)

  43. Wang, P.-W., Chang, W.-C., Kolter, J.Z.: The mixing method: coordinate descent for low-rank semidefinite programming. arXiv preprint arXiv:1706.00476 (2017)

Author information

Corresponding author

Correspondence to Nuri Denizcan Vanli.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Part of this work has previously appeared in the ICML 2018 Workshop on Modern Trends in Nonconvex Optimization for Machine Learning.

Appendices

Proof of Corollary 1

As in the proof of Theorem 1, Proposition 1 yields

$$\begin{aligned} f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k)&= 2 \left( \Vert {g_{i_k}^k}\Vert - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle \right) , \nonumber \\&= \frac{2 \Vert {g_{i_k}^k}\Vert \left( \Vert {g_{i_k}^k}\Vert - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle \right) }{\Vert {g_{i_k}^k}\Vert } ,\nonumber \\&\ge \frac{ \Vert {g_{i_k}^k}\Vert ^2 - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle ^2 }{\Vert {g_{i_k}^k}\Vert }, \end{aligned}$$
(47)

where the inequality follows from the Cauchy–Schwarz inequality, since \(\Vert {\sigma _{i_k}^k}\Vert = 1\) and hence \(\langle \sigma _{i_k}^k, g_{i_k}^k \rangle \le \Vert {g_{i_k}^k}\Vert \). Letting \(\mathbb {E}_k\) denote the expectation over \(i_k\) given \(\varvec{\sigma }^k\), we have

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \sum _{i=1}^n p_i \frac{ \Vert {g_i^k}\Vert ^2 - \langle \sigma _i^k, g_i^k \rangle ^2 }{\Vert {g_i^k}\Vert }. \end{aligned}$$

In particular, when \(p_i=\frac{1}{n}\), for all \(i \in [n]\) (i.e., the uniform sampling case), we have

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \frac{1}{n \Vert {{\varvec{A}}}\Vert _1} \, \sum _{i=1}^n \left( \Vert {g_i^k}\Vert ^2 - \langle \sigma _i^k, g_i^k \rangle ^2 \right) , \end{aligned}$$

since \(\Vert {g_i^k}\Vert \le \Vert {{\varvec{A}}}\Vert _1\), for all \(i \in [n]\) by (11). Therefore, we have

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \frac{\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2}{2n \Vert {{\varvec{A}}}\Vert _1}. \end{aligned}$$
(48)

On the other hand, when \(p_i=\frac{\Vert {g_i^k}\Vert }{\sum _{j=1}^n \Vert {g_j^k}\Vert }\) (i.e., the importance sampling case), we have

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \frac{ \sum _{i=1}^n \Vert {g_i^k}\Vert ^2 - \langle \sigma _i^k, g_i^k \rangle ^2 }{\sum _{j=1}^n \Vert {g_j^k}\Vert } = \frac{ \Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2}{2 \sum _{j=1}^n \Vert {g_j^k}\Vert }. \end{aligned}$$

Letting \(\Vert {{\varvec{A}}}\Vert _{1,1} = \sum _{i,j=1}^n |{\varvec{A}}_{ij}|\) denote the \(L_{1,1}\) norm of matrix \({\varvec{A}}\), we observe that \(\sum _{j=1}^n \Vert {g_j^k}\Vert \le \Vert {{\varvec{A}}}\Vert _{1,1}\), which in the above inequality yields

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \frac{\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2}{2\Vert {{\varvec{A}}}\Vert _{1,1}}. \end{aligned}$$
(49)

In order to prove (13), which corresponds to the uniform sampling case, suppose for contradiction that \(\mathbb {E}\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 > \epsilon \), for all \(k\in [K-1]\). Then, using the boundedness of f, we get

$$\begin{aligned} f^* - f(\varvec{\sigma }^0)\ge & {} \mathbb {E}f(\varvec{\sigma }^K) - f(\varvec{\sigma }^0) = \sum _{k=0}^{K-1} \mathbb {E}\left[ f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \right] \\= & {} \sum _{k=0}^{K-1} \mathbb {E}\left[ \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \right] . \end{aligned}$$

Using the expected functional ascent of BCM in (48) above, we get

$$\begin{aligned} f^* - f(\varvec{\sigma }^0) \ge \sum _{k=0}^{K-1} \frac{\mathbb {E}\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2}{2n\Vert {{\varvec{A}}}\Vert _1} > \frac{K \epsilon }{2n\Vert {{\varvec{A}}}\Vert _1}, \end{aligned}$$
(50)

where the last inequality follows by the assumption. Then, by contradiction, the algorithm returns a solution with \(\mathbb {E}\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 \le \epsilon \), for some \(k\in [K-1]\), provided that

$$\begin{aligned} K \ge \frac{2n\Vert {{\varvec{A}}}\Vert _1 (f^* - f(\varvec{\sigma }^0))}{\epsilon }. \end{aligned}$$

The proof of (14), which corresponds to the importance sampling case, can be obtained by using (49) (instead of (48)) in (50), and is hence omitted.
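
For illustration, the two block-selection rules compared above (uniform and importance sampling) can be sketched as follows; the function name and interface are assumptions made here for readability, not code from the paper.

```python
import numpy as np

def pick_block(sigma, A, rng, importance=False):
    """Choose the block index i_k for the next BCM update.
    Uniform sampling:    p_i = 1/n.
    Importance sampling: p_i proportional to ||g_i||, g_i = sum_{j != i} A_ij sigma_j."""
    n = A.shape[0]
    if not importance:
        return int(rng.integers(n))                      # p_i = 1/n
    G = A @ sigma - np.diag(A)[:, None] * sigma          # row i of G is g_i
    norms = np.linalg.norm(G, axis=1)
    if norms.sum() == 0.0:
        return int(rng.integers(n))                      # all block gradients vanish
    return int(rng.choice(n, p=norms / norms.sum()))     # p_i = ||g_i|| / sum_j ||g_j||
```

The trade-off mirrors the bounds above: importance sampling replaces the \(n \Vert {{\varvec{A}}}\Vert _1\) factor in (48) with \(\Vert {{\varvec{A}}}\Vert _{1,1} \le n \Vert {{\varvec{A}}}\Vert _1\) in (49), at the price of computing every block gradient before each update.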

Rest of the Proof of Theorem 2

In order to quantify how close \(\varvec{\sigma }^0\) and \(\varvec{\sigma }\) should be for this convergence rate to hold, we need to derive explicit bounds on the higher-order terms in (21) and (23), which we do in the following. The Taylor expansion of \(\varvec{\sigma }^k\) around \(\varvec{\sigma }\) yields

$$\begin{aligned} \sigma _i^k&= \sigma _i \cos (\Vert {u_i}\Vert t) + \frac{u_i}{\Vert {u_i}\Vert } \sin (\Vert {u_i}\Vert t), \\&= \sigma _i \left[ \sum _{\ell =0}^\infty \frac{(-1)^\ell }{(2\ell )!} \left( \Vert {u_i}\Vert t \right) ^{2\ell } \right] + \frac{u_i}{\Vert {u_i}\Vert } \left[ \sum _{\ell =0}^\infty \frac{(-1)^\ell }{(2\ell +1)!} \left( \Vert {u_i}\Vert t \right) ^{2\ell +1} \right] . \end{aligned}$$

Using this expansion, we can compute \(f(\varvec{\sigma }^k) = \sum _{i,j=1}^n A_{ij} {\langle \sigma _i^k, \sigma _j^k \rangle }\). The first three terms in the expansion are already given in (22) as follows

$$\begin{aligned} f(\varvec{\sigma }^k) = f(\varvec{\sigma }) + t^2 \sum _{i=1}^n \left( {\langle u_i,v_i \rangle } - \Vert {u_i}\Vert ^2 \Vert {g_i}\Vert \right) + \beta _f, \end{aligned}$$
(51)

where \(\beta _f\) represents the higher-order terms. In order to find an upper bound on \(|\beta _f|\), we apply the Cauchy–Schwarz inequality to the higher-order terms in the expansion of \(f(\varvec{\sigma }^k)\), which yields the following bound

$$\begin{aligned} |\beta _f| \le \sum _{i,j=1}^n |A_{ij}| \left( \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} ( \Vert {u_i}\Vert + \Vert {u_j}\Vert )^\ell \right) . \end{aligned}$$

As \(\Vert {\varvec{u}}\Vert _{\mathrm {F}}=1\), we have \(\Vert {u_i}\Vert \le 1\) for all \(i\in [n]\), which implies

$$\begin{aligned} |\beta _f| \le \sum _{i,j=1}^n |A_{ij}| \left( \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell \right) , \end{aligned}$$

where we note that t denotes the geodesic distance between \(\varvec{\sigma }^k\) and \([{\bar{\varvec{\sigma }}}]\) as highlighted in (19). Assuming that \(t\le 1\), we obtain the following upper bound

$$\begin{aligned} |\beta _f| \le t^3 n \Vert {{\varvec{A}}}\Vert _1 \left( \sum _{\ell =3}^\infty \frac{2^\ell }{\ell !} \right) . \end{aligned}$$

Using the inequality \(\sum _{\ell =3}^\infty \frac{2^\ell }{\ell !} = e^2-5 \le 5/2\) above, we get

$$\begin{aligned} |\beta _f| \le \frac{5n \Vert {{\varvec{A}}}\Vert _1t^3 }{2}. \end{aligned}$$

Plugging this value back in (51), we obtain

$$\begin{aligned} f(\varvec{\sigma }^k) \le f(\varvec{\sigma }) + t^2 \sum _{i=1}^n \left( {\langle u_i,v_i \rangle } - \Vert {u_i}\Vert ^2 \Vert {g_i}\Vert \right) + \frac{5n \Vert {{\varvec{A}}}\Vert _1t^3 }{2}. \end{aligned}$$
(52)

Considering the same expansion for \(\Vert {\mathrm {grad}}f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 = 2 \sum _{i=1}^n (\Vert {g_i^k}\Vert ^2 - {\langle \sigma _i^k,g_i^k \rangle }^2)\), we get the following (see (21)):

$$\begin{aligned} \Vert {\mathrm {grad}}f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 = 2 t^2 \sum _{i=1}^n \left( \Vert {u_i}\Vert \Vert {g_i}\Vert - {\langle \frac{u_i}{\Vert {u_i}\Vert },v_i \rangle } \right) ^2 + \beta _g, \end{aligned}$$
(53)

where \(\beta _g\) represents the higher-order terms. Upper bounding each higher-order term using the Cauchy–Schwarz inequality, we obtain

$$\begin{aligned} |\beta _g|\le & {} 2 \sum _{i=1}^n \left[ \sum _{j,m=1}^n |A_{ij}| |A_{im}| \left( \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} ( \Vert {u_j}\Vert + \Vert {u_m}\Vert )^\ell \right) \right. \\&\left. + \sum _{j,m=1}^n |A_{ij}| |A_{im}| \left( \sum _{\begin{array}{c} \ell ,s=0 \\ \ell +s\ge 3 \end{array}}^\infty \frac{t^{\ell +s}}{\ell ! s!} ( \Vert {u_i}\Vert + \Vert {u_j}\Vert )^{\ell +s} \right) \right] . \end{aligned}$$

Using the fact that \(\Vert {u_i}\Vert \le 1\) for all \(i\in [n]\), we get the following upper bound

$$\begin{aligned} |\beta _g| \le 2 \sum _{i=1}^n \left[ \sum _{j,m=1}^n |A_{ij}| |A_{im}| \left( \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell \right) + \sum _{j,m=1}^n |A_{ij}| |A_{im}| \left( \sum _{\begin{array}{c} \ell ,s=0 \\ \ell +s\ge 3 \end{array}}^\infty \frac{t^{\ell +s}}{\ell ! s!} 2^{\ell +s} \right) \right] . \end{aligned}$$

Using the upper bound \(\sum _{j,m=1}^n |A_{ij}| |A_{im}| \le \Vert {{\varvec{A}}}\Vert _1^2\) above, we obtain

$$\begin{aligned} |\beta _g| \le 2 \Vert {{\varvec{A}}}\Vert _1^2 \sum _{i=1}^n \left[ \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell + \sum _{\begin{array}{c} \ell ,s=0 \\ \ell +s\ge 3 \end{array}}^\infty \frac{t^{\ell +s}}{\ell ! s!} 2^{\ell +s} \right] . \end{aligned}$$

Introducing a change of variables in the last sum, we get

$$\begin{aligned} |\beta _g|&\le 2 \Vert {{\varvec{A}}}\Vert _1^2 \sum _{i=1}^n \left[ \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell + \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell \left( \sum _{s=0}^\ell \frac{\ell !}{s!(\ell -s)!} \right) \right] ,\\&= 2 \Vert {{\varvec{A}}}\Vert _1^2 \sum _{i=1}^n \left[ \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} \left( 2^\ell + 4^\ell \right) \right] . \end{aligned}$$

Assuming that \(t\le 1\), we obtain the following upper bound

$$\begin{aligned} |\beta _g| \le 2 \Vert {{\varvec{A}}}\Vert _1^2 t^3 \sum _{i=1}^n \left[ \sum _{\ell =3}^\infty \frac{1}{\ell !} \left( 2^\ell + 4^\ell \right) \right] . \end{aligned}$$

Using the inequality \(\sum _{\ell =3}^\infty \frac{2^\ell +4^\ell }{\ell !} = e^2+e^4-18 \le 44\) above, we get

$$\begin{aligned} |\beta _g| \le 88 n \Vert {{\varvec{A}}}\Vert _1^2 t^3. \end{aligned}$$

Plugging this value back in (53), we obtain

$$\begin{aligned} \Vert {\mathrm {grad}}f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 \ge 2 t^2 \sum _{i=1}^n \left( \Vert {u_i}\Vert \Vert {g_i}\Vert - {\langle \frac{u_i}{\Vert {u_i}\Vert },v_i \rangle } \right) ^2 - 88 n \Vert {{\varvec{A}}}\Vert _1^2 t^3. \end{aligned}$$
(54)

Using the same bounding technique as in (24), we get

$$\begin{aligned} \Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2&\ge \frac{\mu }{n} \left( f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k) - \frac{5n \Vert {{\varvec{A}}}\Vert _1t^3 }{2} \right) - 88 n \Vert {{\varvec{A}}}\Vert _1^2 t^3, \\&\ge \frac{\mu }{n} \left( f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k) \right) - t^3 \Vert {{\varvec{A}}}\Vert _1 \left( 3\mu + 88 n \Vert {{\varvec{A}}}\Vert _1 \right) . \end{aligned}$$

Therefore, in order for (25) to hold, we need

$$\begin{aligned} t^3 \Vert {{\varvec{A}}}\Vert _1 \left( 3\mu + 88 n \Vert {{\varvec{A}}}\Vert _1 \right) \le \frac{\mu }{2n} \left( f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k) \right) , \end{aligned}$$

which can be equivalently rewritten as follows

$$\begin{aligned} t^3 \le \frac{\mu (f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k))}{2n \Vert {{\varvec{A}}}\Vert _1 \left( 3\mu + 88 n \Vert {{\varvec{A}}}\Vert _1 \right) }. \end{aligned}$$

Since \(f(\varvec{\sigma }^k)\) is a monotonically non-decreasing sequence, as soon as \(\varvec{\sigma }^0\) is sufficiently close to \([{\bar{\varvec{\sigma }}}]\) in the sense that

$$\begin{aligned} \mathrm {dist}(\varvec{\sigma }^0, [{\bar{\varvec{\sigma }}}]) \le \left( \frac{\mu (f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k))}{2n \Vert {{\varvec{A}}}\Vert _1 \left( 3\mu + 88 n \Vert {{\varvec{A}}}\Vert _1 \right) } \right) ^{1/3}, \end{aligned}$$

the linear convergence rate presented in (26) holds.
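
As a quick numerical sanity check of the two series constants used in the bounds above (a verification added here for convenience, not part of the original proof):

```python
import math

# sum_{l>=3} 2^l / l!        = e^2 - (1 + 2 + 2)       = e^2 - 5        ~ 2.389 <= 5/2
# sum_{l>=3} (2^l + 4^l)/l!  = (e^2 - 5) + (e^4 - 13)  = e^2 + e^4 - 18 ~ 43.99 <= 44
s1 = sum(2 ** l / math.factorial(l) for l in range(3, 60))
s2 = sum((2 ** l + 4 ** l) / math.factorial(l) for l in range(3, 60))
print(s1, math.exp(2) - 5)                   # both approximately 2.389
print(s2, math.exp(2) + math.exp(4) - 18)    # both approximately 43.99
```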

Proof of Theorem 7

Before presenting the proof of Theorem 7, we first state the following theorem, which characterizes the convergence rate of the Lanczos method with random initialization.

Theorem 8

[28, Theorem 4.2] Let \({\varvec{A}}\in {{\mathbb {R}}}^{n \times n}\) be a positive semidefinite matrix, \(b\in {{\mathbb {R}}}^n\) be an arbitrary vector and \(\lambda _L^\ell ({\varvec{A}},b)\) denote the output of the Lanczos algorithm after \(\ell \) iterations when applied to find the leading eigenvalue of \({\varvec{A}}\) (denoted by \(\lambda _1({\varvec{A}})\)) with initialization b. In particular,

$$\begin{aligned} \lambda _L^\ell ({\varvec{A}},b) = \max \left\{ \frac{{\langle x,{\varvec{A}}x \rangle }}{{\langle x,x \rangle }} : 0 \ne x \in \mathrm {span}(b,\dots ,{\varvec{A}}^{\ell -1}b) \right\} . \end{aligned}$$

Assume that b is uniformly distributed over the set \(\{b\in {{\mathbb {R}}}^n : \Vert {b}\Vert =1\}\) and let \(\epsilon \in [0,1)\). Then, the probability that the Lanczos algorithm does not return an \(\epsilon \)-approximation to the leading eigenvalue of \({\varvec{A}}\) exponentially decreases as follows

$$\begin{aligned} \mathbb {P}\left( \lambda _L^\ell ({\varvec{A}},b)< (1-\epsilon ) \lambda _1({\varvec{A}}) \right) {\left\{ \begin{array}{ll} \le 1.648 \sqrt{n} e^{-\sqrt{\epsilon }(2\ell -1)}, &{} \text {if } 0<\ell <n, \\ = 0, &{} \text {if } \ell \ge n. \end{array}\right. } \end{aligned}$$

Using this result, Theorem 7 is proven as follows. Since the tangent space \(T_{\varvec{\sigma }}{\mathcal {M}}_r\) has dimension \(n(r-1)\), we can define a symmetric matrix \({\varvec{H}}\in {{\mathbb {R}}}^{n(r-1) \times n(r-1)}\) (where we drop the notational dependency on \(\varvec{\sigma }\) for simplicity) that represents the linear operator \({\mathrm {Hess}}f(\varvec{\sigma })\) in a basis \(\{ {\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)} \}\) such that \(\mathrm {span}({\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)}) = T_{\varvec{\sigma }}{\mathcal {M}}_r\). In particular, letting \(H_{ij} = {\langle {\varvec{u}}^i, {\mathrm {Hess}}f(\varvec{\sigma })[{\varvec{u}}^j] \rangle }\) yields the desired matrix \({\varvec{H}}\), and the Lanczos algorithm is run to find the leading eigenvalue of this matrix. Note, however, that \({\varvec{H}}\) need not be positive semidefinite, so we shift \({\varvec{H}}\) by a large enough multiple of the identity matrix to guarantee that the resulting matrix is positive semidefinite. In particular, by inspecting the definition of \({\mathrm {Hess}}f(\varvec{\sigma })\) in (5), it is easy to observe that \(\Vert {{\mathrm {Hess}}f(\varvec{\sigma })}\Vert _{\text {op}} \le 4\Vert {{\varvec{A}}}\Vert _1\). Therefore, it is sufficient to run the Lanczos algorithm to find the leading eigenvalue of \({\widetilde{{\varvec{H}}}} = {\varvec{H}}+4\Vert {{\varvec{A}}}\Vert _1 {\varvec{I}}\), where \({\varvec{I}}\) denotes the identity matrix of appropriate size. We initialize the Lanczos algorithm with a random vector \({\varvec{u}}\) of unit norm (i.e., \(\Vert {\varvec{u}}\Vert _{\mathrm {F}}=1\)) in the tangent space \(T_{\varvec{\sigma }}{\mathcal {M}}_r\). Notice that \({\varvec{u}}\) can equivalently be represented as a vector \(b\in {{\mathbb {R}}}^{n(r-1)}\) in the basis \(\{ {\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)} \}\) as \({\varvec{u}}= \sum _{i=1}^{n(r-1)} b_i {\varvec{u}}^i\) with \(\Vert {b}\Vert =1\). Then, by Theorem 8, we have

$$\begin{aligned} \mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b) < (1-\epsilon ) \lambda _1({\widetilde{{\varvec{H}}}}) \right) \le 1.648 \sqrt{n(r-1)} e^{-\sqrt{\epsilon }(2\ell -1)}. \end{aligned}$$

Letting \(\lambda _1({\varvec{H}})\) denote the leading eigenvalue of \({\varvec{H}}\), we run the Lanczos algorithm to obtain a vector \(b^*\) such that \(\Vert {b^*}\Vert =1\) and \({\langle b^*, {\varvec{H}}b^* \rangle } \ge \lambda _1({\varvec{H}})/2\). Thus, we want \(\mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b) < 4\Vert {{\varvec{A}}}\Vert _1 + \lambda _1({\varvec{H}})/2 \right) \) to be small. Setting \(\epsilon ^* = \frac{\lambda _1({\varvec{H}})}{16\Vert {{\varvec{A}}}\Vert _1}\), we can observe that

$$\begin{aligned} \left( 1-\epsilon ^* \right) \lambda _1({\widetilde{{\varvec{H}}}})&= \left( 1-\frac{\lambda _1({\varvec{H}})}{16\Vert {{\varvec{A}}}\Vert _1} \right) \left( 4\Vert {{\varvec{A}}}\Vert _1 + \lambda _1({\varvec{H}}) \right) , \\&= 4\Vert {{\varvec{A}}}\Vert _1 + \frac{3\lambda _1({\varvec{H}})}{4} - \frac{(\lambda _1({\varvec{H}}))^2}{16\Vert {{\varvec{A}}}\Vert _1}, \\&\ge 4\Vert {{\varvec{A}}}\Vert _1 + \frac{\lambda _1({\varvec{H}})}{2}, \end{aligned}$$

where the inequality follows since \(\lambda _1({\varvec{H}}) \le 4\Vert {{\varvec{A}}}\Vert _1\). Consequently, we have

$$\begin{aligned}&\mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b)< 4\Vert {{\varvec{A}}}\Vert _1 + \lambda _1({\varvec{H}})/2 \right) \\&\quad \le \mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b) < (1-\epsilon ^*) \lambda _1({\widetilde{{\varvec{H}}}}) \right) \le 1.648 \sqrt{n(r-1)} e^{-\sqrt{\epsilon ^*}(2\ell -1)}. \end{aligned}$$

By Theorem 6, we know that the Lanczos method is called at most \(\left\lceil 675 n \Vert {{\varvec{A}}}\Vert _1^2 / \varepsilon ^2 \right\rceil \) times while searching for an \(\varepsilon \)-approximate concave point, and at any point that is not an \(\varepsilon \)-approximate concave point we have \(\lambda _1({\varvec{H}}) \ge \varepsilon \) by definition. Then, taking a union bound over all calls to the Lanczos method, we conclude that when each call is run for \(\ell \) iterations, we have the following guarantee

$$\begin{aligned}&\mathbb {P}\left( \text {Algorithm 2+3 fails to return an } \varepsilon \text {-approximate concave point} \right) \\&\quad \le \left\lceil \frac{675 n \Vert {{\varvec{A}}}\Vert _1^2}{\varepsilon ^2} \right\rceil 1.648 \sqrt{n(r-1)} e^{-\sqrt{\frac{\varepsilon }{16\Vert {{\varvec{A}}}\Vert _1}}(2\ell -1)}. \end{aligned}$$

In order to make this failure probability at most some \(\delta \in (0,1)\), we let

$$\begin{aligned} \ell ^*= & {} \left\lceil \left( \frac{1}{2} + 2 \sqrt{\frac{\Vert {{\varvec{A}}}\Vert _1}{\varepsilon }} \right) \log \left( \frac{\left\lceil \frac{675 n \Vert {{\varvec{A}}}\Vert _1^2}{\varepsilon ^2} \right\rceil 1.648 \sqrt{n(r-1)}}{\delta } \right) \right\rceil \\= & {} \widetilde{{\mathcal {O}}}\left( \sqrt{\frac{\Vert {{\varvec{A}}}\Vert _1}{\varepsilon }} \log \left( \frac{n\sqrt{n(r-1)}}{\delta } \right) \right) , \end{aligned}$$

where the tilde hides poly-logarithmic factors in \(\Vert {{\varvec{A}}}\Vert _1 / \varepsilon \). Since the Lanczos algorithm is guaranteed to return the leading eigenvalue with probability 1 in at most \(n(r-1)\) iterations, running each Lanczos subroutine for \(\min (\ell ^*,n(r-1))\) iterations guarantees that Algorithm 2+3 returns an \(\varepsilon \)-approximate concave point with probability at least \(1-\delta \).
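
To illustrate the procedure just described, the following Python sketch runs a randomly initialized Lanczos iteration on the shifted operator \({\widetilde{{\varvec{H}}}} = {\varvec{H}}+4\Vert {{\varvec{A}}}\Vert _1 {\varvec{I}}\). The function name, the matrix-vector interface hess_matvec, and the use of full reorthogonalization are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def lanczos_top_eig(matvec, dim, num_iters, rng):
    """Largest Ritz value of a symmetric PSD operator after `num_iters` Lanczos
    iterations, started from a random unit vector (as in Theorem 8).
    Minimal sketch with full reorthogonalization for numerical stability."""
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)                       # random unit initialization b
    Q, alphas, betas = [q], [], []
    for _ in range(min(num_iters, dim)):
        w = matvec(Q[-1])
        alpha = float(Q[-1] @ w)
        alphas.append(alpha)
        w = w - alpha * Q[-1]
        if betas:
            w = w - betas[-1] * Q[-2]
        Qmat = np.array(Q)
        w = w - Qmat.T @ (Qmat @ w)              # full reorthogonalization
        beta = np.linalg.norm(w)
        if beta < 1e-12:                         # Krylov space exhausted
            break
        betas.append(beta)
        Q.append(w / beta)
    off = np.array(betas[: len(alphas) - 1])
    T = np.diag(alphas) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.eigvalsh(T)[-1]             # leading eigenvalue of the tridiagonal matrix

# Hypothetical usage in the setting of the proof: hess_matvec(u) applies Hess f(sigma)
# restricted to the tangent space (dimension n*(r-1)), and ||A||_1 is the maximum
# absolute row sum of A, so that H + 4*||A||_1*I is positive semidefinite.
#   shift = 4 * np.abs(A).sum(axis=1).max()
#   lam_shifted = lanczos_top_eig(lambda u: hess_matvec(u) + shift * u,
#                                 n * (r - 1), min(l_star, n * (r - 1)),
#                                 np.random.default_rng(0))
#   lam_hess = lam_shifted - shift               # estimate of lambda_1(H)
```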

Cite this article

Erdogdu, M.A., Ozdaglar, A., Parrilo, P.A. et al. Convergence rate of block-coordinate maximization Burer–Monteiro method for solving large SDPs. Math. Program. 195, 243–281 (2022). https://doi.org/10.1007/s10107-021-01686-3
