Abstract
The randomized Kaczmarz (RK) method is an iterative method for approximating the least-squares solution of large linear systems of equations. The standard RK method uses sequential updates, making parallel computation difficult. Here, we study a parallel version of RK where a weighted average of independent updates is used. We analyze the convergence of RK with averaging and demonstrate its performance empirically. We show that as the number of threads increases, the rate of convergence improves and the convergence horizon for inconsistent systems decreases.
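To make the method concrete, the following is a minimal sketch in Python, assuming rows are sampled with probabilities \(p_i = \Vert \mathbf{A}_i\Vert ^2 / \Vert \mathbf{A}\Vert _F^2\) and a uniform relaxation weight \(\alpha \); the function name and defaults are illustrative rather than the authors' implementation.

    import numpy as np

    def rk_averaging(A, b, q=8, alpha=1.0, iters=5000, seed=0):
        """Randomized Kaczmarz with averaging: each iteration draws q row
        indices i.i.d. and averages their weighted Kaczmarz updates."""
        rng = np.random.default_rng(seed)
        m, n = A.shape
        row_norms_sq = np.einsum('ij,ij->i', A, A)
        p = row_norms_sq / row_norms_sq.sum()   # P(i) = ||A_i||^2 / ||A||_F^2
        x = np.zeros(n)
        for _ in range(iters):
            rows = rng.choice(m, size=q, p=p)   # one independent draw per thread
            step = np.zeros(n)
            for i in rows:                      # independent updates; in practice
                r_i = b[i] - A[i] @ x           # these run on q parallel threads
                step += alpha * r_i / row_norms_sq[i] * A[i]
            x += step / q                       # average of the q updates
        return x

For a consistent system the iterates converge to the solution; for an inconsistent system they approach the least-squares solution up to a convergence horizon that shrinks as \(q\) grows, matching the behavior analyzed below.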
Additional information
Communicated by Rosemary Anne Renaut.
Molitor, Moorman, and Needell were partially supported by NSF CAREER DMS #1348721 and NSF BIGDATA #1740325. Moorman was additionally supported by NSF Grant DGE-1829071. Tu was supported by DARPA under agreement number FA8750-18-2-0066.
Appendices
A Proof of Lemma 1
Let \({\mathbb {E}}_{i}\left[ \,\cdot \,\right] \) denote \({\mathbb {E}}_{i \sim \mathcal {D}}\left[ \,\cdot \,\right] \). Expanding the definition of the weighted sampling matrix \(\mathbf{M}_k\) as a weighted average of the i.i.d. sampling matrices \(\frac{\mathbf{I}_i^\top \mathbf{I}_i}{\Vert \mathbf{A}_i\Vert ^2}\), we see that the first expectation reduces, by linearity, to a single-index expectation over \(i \sim \mathcal {D}\).
Likewise, we can compute the second expectation by separating the cases where \(i=j\) from those where \(i \ne j\) and utilizing the independence of the indices sampled in \(\tau _k\).
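Concretely, with the sampling matrix written as \(\mathbf{M}_k = \frac{1}{q} \sum _{i \in \tau _k} w_i \frac{\mathbf{I}_i^\top \mathbf{I}_i}{\Vert \mathbf{A}_i\Vert ^2}\), where the weights \(w_i\) are the diagonal entries of \(\mathbf{W}\) (a sketch assuming this form of the weighted average described above), the first computation is
\[ {\mathbb {E}}\left[ \mathbf{M}_k \right] = {\mathbb {E}}_{i}\left[ w_i \frac{\mathbf{I}_i^\top \mathbf{I}_i}{\Vert \mathbf{A}_i\Vert ^2} \right] = \sum _{i=1}^m p_i w_i \frac{\mathbf{I}_i^\top \mathbf{I}_i}{\Vert \mathbf{A}_i\Vert ^2}, \]
while the second, for a fixed matrix \(\mathbf{X}\), splits into the \(q\) diagonal pairs \(i = j\) and the \(q(q-1)\) independent pairs \(i \ne j\):
\[ {\mathbb {E}}\left[ \mathbf{M}_k \mathbf{X} \mathbf{M}_k \right] = \frac{1}{q}\, {\mathbb {E}}_{i}\left[ w_i^2\, \frac{\mathbf{I}_i^\top \mathbf{I}_i\, \mathbf{X}\, \mathbf{I}_i^\top \mathbf{I}_i}{\Vert \mathbf{A}_i\Vert ^4} \right] + \frac{q-1}{q}\, {\mathbb {E}}\left[ \mathbf{M}_k \right] \mathbf{X}\, {\mathbb {E}}\left[ \mathbf{M}_k \right] . \]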
B Proof of Theorem 1
We now prove Theorem 1 starting from the error update in Eq. (6), expanding the squared error norm of the update.
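Assuming the error update in Eq. (6) takes the standard averaged-RK form \(e^{k+1} = \left( \mathbf{I} - \mathbf{A}^\top \mathbf{M}_k \mathbf{A} \right) e^k + \mathbf{A}^\top \mathbf{M}_k r^\star \) with \(r^\star = b - \mathbf{A} x^\star \), the expansion reads
\[ \Vert e^{k+1} \Vert ^2 = \Vert \left( \mathbf{I} - \mathbf{A}^\top \mathbf{M}_k \mathbf{A} \right) e^k \Vert ^2 + 2 \left\langle \left( \mathbf{I} - \mathbf{A}^\top \mathbf{M}_k \mathbf{A} \right) e^k ,\ \mathbf{A}^\top \mathbf{M}_k r^\star \right\rangle + \Vert \mathbf{A}^\top \mathbf{M}_k r^\star \Vert ^2, \]
whose three terms are handled separately below.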
Under Assumption 1, the expectations in Lemma 1 simplify accordingly.
Upon taking expectations on both sides of Eq. (11), the middle term simplifies, yielding Eq. (12).
Making use of Lemma 1, we take the expectation of the first term in Eq. (12), and treat the second and last terms of Eq. (12) in the same way.
Combining these in Eq. (12) yields the claimed bound.
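In outline (a summary of the shape of the bound rather than its precise constants), the combination yields a recursion of the form \({\mathbb {E}}\Vert e^{k+1}\Vert ^2 \le \rho \, \Vert e^k \Vert ^2 + c\), with rate \(\rho < 1\) and a noise term \(c\) driven by \(r^\star \). Iterating gives
\[ {\mathbb {E}}\Vert e^{k} \Vert ^2 \le \rho ^k \Vert e^0 \Vert ^2 + \frac{c}{1 - \rho }, \]
i.e., linear convergence up to a horizon of squared radius \(c/(1-\rho )\); increasing the number of threads \(q\) improves \(\rho \) and shrinks the horizon, consistent with the behavior described in the abstract.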
C Proof of Theorem 2
Proof
We seek to optimize, with respect to \(\alpha \), the convergence rate constant from Corollary 1 when using uniform weights \(\mathbf{W} = \alpha \mathbf{I}\). To do this, we first simplify from a matrix polynomial to a maximum over scalar polynomials in \(\alpha \) whose coefficients depend on the singular values of \(\mathbf{A}\). We then show that the maximum occurs at either the minimum or the maximum singular value of \(\mathbf{A}\). Finally, we derive a condition determining which singular value attains the maximum, and determine the optimal \(\alpha \) that minimizes this maximum.
Defining \(\mathbf{Q}^\top \varvec{\Sigma } \mathbf{Q} = \frac{\mathbf{A}^\top \mathbf{A}}{\Vert \mathbf{A}\Vert _F^2}\) as the eigendecomposition, and the polynomial
\[ p(\sigma ) \overset{\text {def}}{=} 1 - 2\alpha \sigma + \alpha ^2\left( 1 - \frac{1}{q}\right) \sigma ^2 + \frac{\alpha ^2}{q}\,\sigma , \]
the convergence rate constant from Eq. (13) can be written as \(\sigma _{\max }\left( p\left( \frac{\mathbf{A}^\top \mathbf{A}}{\Vert \mathbf{A}\Vert _F^2}\right) \right) \). Since \(p\left( \frac{\mathbf{A}^\top \mathbf{A}}{\Vert \mathbf{A}\Vert _F^2}\right) \) is a polynomial of a symmetric matrix, its singular vectors are the same as those of its argument, while its corresponding singular values are the polynomial p applied to the singular values of the original matrix. That is,
\[ p\left( \frac{\mathbf{A}^\top \mathbf{A}}{\Vert \mathbf{A}\Vert _F^2}\right) = \mathbf{Q}^\top p\left( \varvec{\Sigma }\right) \mathbf{Q}. \]
Thus, the convergence rate constant can be written as
\[ \sigma _{\max }\left( p\left( \frac{\mathbf{A}^\top \mathbf{A}}{\Vert \mathbf{A}\Vert _F^2}\right) \right) = \max _{\sigma \in \mathrm {spec}(\varvec{\Sigma })} p(\sigma ). \]
Moreover, we can bound this extremal singular value by the maximum of the polynomial p over an interval containing the spectrum of \(\varvec{\Sigma }\):
\[ \max _{\sigma \in \mathrm {spec}(\varvec{\Sigma })} p(\sigma ) \le \max _{\sigma \in \left[ s_{\min }, s_{\max }\right] } p(\sigma ). \]
Here, the singular values of \(\varvec{\Sigma }\) are bounded from below by \(s_{\min }\overset{\text {def}}{=}\frac{\sigma ^2_{\min }(\mathbf{A})}{\Vert \mathbf{A}\Vert _F^2}\) and from above by \(s_{\max }\overset{\text {def}}{=}\frac{\sigma ^2_{\max }(\mathbf{A})}{\Vert \mathbf{A}\Vert _F^2}\), since \(\varvec{\Sigma }\) is the diagonal matrix of singular values of \(\frac{\mathbf{A}^\top \mathbf{A}}{\Vert \mathbf{A}\Vert _F^2}\). Note that the polynomial can be factored as \(p(\sigma ) = (1-\sigma \alpha )^2+\frac{\sigma \alpha ^2}{q}\left( 1-\sigma \right) \), which is positive for \(\sigma \in \left[ 0, 1\right] \), an interval containing \(\left[ s_{\min }, s_{\max }\right] \). Also, since the coefficient of the \(\sigma ^2\) term of the polynomial p is \(\alpha ^2 \left( 1 - \frac{1}{q} \right) \), which is nonnegative, the polynomial is convex in \(\sigma \) on the interval \(\left[ s_{\min }, s_{\max }\right] \). Thus, the maximum of p on the interval \(\left[ s_{\min }, s_{\max }\right] \) is attained at one of the two endpoints and we have the bound
\[ \max _{\sigma \in \left[ s_{\min }, s_{\max }\right] } p(\sigma ) = \max \left( p(s_{\min }),\, p(s_{\max }) \right) . \]
To optimize this bound with respect to \(\alpha \), we first find conditions on \(\alpha \) such that \(p(s_{\min }) < p(s_{\max })\). If \(s_{\max }=s_{\min }\), this obviously never holds; otherwise, \(s_{\max }>s_{\min }\) and
\[ (1 - s_{\min }\alpha )^2 + \frac{s_{\min }\alpha ^2}{q}\left( 1 - s_{\min }\right) < (1 - s_{\max }\alpha )^2 + \frac{s_{\max }\alpha ^2}{q}\left( 1 - s_{\max }\right) . \]
Grouping like terms and cancelling, we get
\[ 2\alpha \left( s_{\max } - s_{\min }\right) < \alpha ^2\left( s_{\max }^2 - s_{\min }^2\right) + \frac{\alpha ^2}{q}\left( \left( s_{\max } - s_{\min }\right) - \left( s_{\max }^2 - s_{\min }^2\right) \right) . \]
Since \(\frac{\alpha }{q}>0\), we can divide it from both sides, and since \(s_{\max } > s_{\min }\), we can divide both sides by \(s_{\max }-s_{\min }\), leaving
\[ 2q < \alpha \left( 1 + \left( q-1\right) \left( s_{\min } + s_{\max }\right) \right) , \]
and since the number of threads \(q\ge 1\), we can divide both sides by \(1 + \left( q-1\right) \left( s_{\min } + s_{\max }\right) > 0\) to get
\[ \alpha > \frac{2q}{1 + \left( q-1\right) \left( s_{\min } + s_{\max }\right) } \overset{\text {def}}{=} \widehat{\alpha }. \]
Thus,
\[ \min _{\alpha > 0}\, \max \left( p(s_{\min }),\, p(s_{\max }) \right) = \min \left( \min _{\alpha \ge \widehat{\alpha }} p(s_{\max }),\ \min _{0 < \alpha \le \widehat{\alpha }} p(s_{\min }) \right) . \]
For the first term, the derivative satisfies
\[ \frac{\partial }{\partial \alpha }\, p(s_{\max }) = 2 s_{\max } \left( \alpha \left( s_{\max } + \frac{1 - s_{\max }}{q} \right) - 1 \right) > 0 \]
since \(\alpha > \widehat{\alpha }\) and the coefficient is positive. Indeed, substituting for \(\widehat{\alpha }\), we get
\[ \widehat{\alpha } \left( s_{\max } + \frac{1 - s_{\max }}{q} \right) = \frac{2\left( 1 + (q-1) s_{\max } \right) }{1 + (q-1)\left( s_{\min } + s_{\max } \right) } > 1, \]
since all terms in both numerator and denominator are positive and the numerator exceeds the denominator by \(1 + (q-1)\left( s_{\max } - s_{\min }\right) > 0\). Thus, the function is monotonically increasing on \(\alpha \in [\widehat{\alpha },\infty )\), and the minimum is attained at the lower endpoint, i.e., \(\alpha ^\star = \widehat{\alpha }\).
Similarly, for the second term, \(\alpha \le \widehat{\alpha }\) and
\[ \frac{\partial }{\partial \alpha }\, p(s_{\min }) = 2 s_{\min } \left( \alpha \left( s_{\min } + \frac{1 - s_{\min }}{q} \right) - 1 \right) . \]
If
\[ (q-1)\left( s_{\max } - s_{\min } \right) \ge 1, \]
this derivative is nonpositive on \(\alpha \in (-\infty ,\alpha ^\star ]\), so the function is monotonically decreasing there, and the minimum is attained at the upper endpoint, i.e., \(\alpha = \alpha ^\star \). Otherwise, since \(p(s_{\min })\) is quadratic in \(\alpha \) with positive leading coefficient, the minimum occurs at the critical point, so we set the derivative to zero and solve for \(\alpha ^\star \):
\[ \alpha ^\star = \frac{q}{1 + (q-1) s_{\min }}. \]
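As a sanity check, the two cases above can be implemented directly; the helper below is a sketch (the name optimal_alpha is ours), with the case condition \((q-1)(s_{\max }-s_{\min }) \ge 1\) taken from the derivation above.

    import numpy as np

    def optimal_alpha(A, q):
        """Optimal uniform step size alpha* from the two cases of Appendix C."""
        s = np.linalg.svd(A, compute_uv=False)**2 / np.linalg.norm(A, 'fro')**2
        smin, smax = s.min(), s.max()
        if (q - 1) * (smax - smin) >= 1:
            # p(s_min) is still decreasing at alpha_hat: endpoint optimum
            return 2 * q / (1 + (q - 1) * (smin + smax))
        # otherwise the interior critical point of p(s_min) is optimal
        return q / (1 + (q - 1) * smin)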
D Corollary Proofs
We provide proofs for the corollaries of Sect. 2, which follow from Theorem 1.
D.1 Proof of Corollary 2
Suppose \(p_i = \frac{\Vert \mathbf{A}_i\Vert ^2}{\Vert \mathbf{A}\Vert _F^2}\) and \(\mathbf{W} = \alpha \mathbf{I}\), and start from the bound in the proof of Theorem 1. In this case, since \(\mathbf{A}^\top r^\star = 0\), we have \(\langle \mathbf{A} e^k , r^\star \rangle = 0\); combining the inner products then gives the claimed bound.
D.2 Proof of Corollary 3
Suppose \(q = 1\), \(\mathbf{W} = \mathbf{I}\), and \(p_i = \frac{\Vert \mathbf{A}_i\Vert ^2}{\Vert \mathbf{A}\Vert _F^2}\). Starting from the bound in the proof of Theorem 1, we decompose \(r^k\) and combine the inner products to obtain the claimed bound.