Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization

Nitanda, Atsushi; Murata, Tomoya; Suzuki, Taiji

doi:10.1007/s10115-021-01593-1

Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization

Regular paper
Published: 12 July 2021

Volume 63, pages 2513–2539, (2021)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Atsushi Nitanda¹,
Tomoya Murata^2,3 &
Taiji Suzuki^3,4

230 Accesses
1 Altmetric
Explore all metrics

Abstract

The minibatching technique has been extensively adopted to facilitate stochastic first-order methods because of their computational efficiency in parallel computing for large-scale machine learning and data mining. Indeed, increasing the minibatch size decreases the iteration complexity (number of minibatch queries) to converge, resulting in the decrease of the running time by processing a minibatch in parallel. However, this gain is usually saturated for too large minibatch sizes and the total computational complexity (number of access to an example) is deteriorated. Hence, the determination of an appropriate minibatch size which controls the trade-off between the iteration and total computational complexities is important to maximize performance of the method with as few computational resources as possible. In this study, we define the optimal minibatch size as the minimum minibatch size with which there exists a stochastic first-order method that achieves the optimal iteration complexity and we call such a method the optimal minibatch method. Moreover, we show that Katyusha (in: Proceedings of annual ACM SIGACT symposium on theory of computing vol 49, pp 1200–1205, ACM, 2017), DASVRDA (Murata and Suzuki, in: Advances in neural information processing systems vol 30, pp 608–617, 2017), and the proposed method which is a combination of Acc-SVRG (Nitanda, in: Advances in neural information processing systems vol 27, pp 1574–1582, 2014) with APPA (Cotter et al. in: Advances in neural information processing systems vol 27, pp 3059–3067, 2014) are optimal minibatch methods. In experiments, we compare optimal minibatch methods with several competitors on $L_1$-and $L_2$-regularized logistic regression problems and observe that iteration complexities of optimal minibatch methods linearly decrease as minibatch sizes increase up to reasonable minibatch sizes and finally attain the best iteration complexities. This confirms the computational efficiency of optimal minibatch methods suggested by the theory.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Adaptive step size rules for stochastic optimization in large-scale learning

Article 20 February 2023

Zhuang Yang & Li Ma

Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction

Article 19 April 2023

Albert S. Berahas, Jiahao Shi, … Baoyu Zhou

An Overview of Stochastic Quasi-Newton Methods for Large-Scale Machine Learning

Article Open access 25 February 2023

Tian-De Guo, Yan Liu & Cong-Ying Han

Notes

In this case, the all methods converged within one or two epochs and performed very similarly to each other.
When $\lambda _2 = 0$, we used dummy L2-regularization with $\lambda _2 = 10^{-8}$, that was roughly $\varepsilon * \Vert x_*\Vert ^2$, where $\varepsilon $ is the desired accuracy and was set to $10^{-4}$ and $x_*$ is the optimal solution.

References

Allen-Zhu Z (2017) Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of annual ACM SIGACT symposium on theory of computing 49, pp 1200–1205. ACM
Murata T, Suzuki T (2017) Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. Adv Neural Inf Process Syst 30:608–617
Google Scholar
Nitanda A (2014) Stochastic proximal gradient descent with acceleration techniques. Adv Neural Inf Process Syst 27:1574–1582
Google Scholar
Lin Q, Lu Z, Xiao L (2014) An accelerated proximal coordinate gradient method. Adv Neural Inf Process Syst 27:3059–3067
Google Scholar
Cotter A, Shamir O, Srebro N, Sridharan K (2011) Better mini-batch algorithms via accelerated gradient methods. Adv Neural Inf Process Syst 24:1647–1655
Google Scholar
Shalev-Shwartz S, Zhang T (2013) Accelerated mini-batch stochastic dual coordinate ascent. Adv Neural Inf Process Syst 26:378–385
Google Scholar
Kai Z, Liang H (2013) Minibatch and parallelization for online large margin structured learning. In: Proceedings of the 2013 Conference of the North American chapter of the association for computational linguistics: human language technologies, pp 370–379
Martin T, Singh BA, Peter R, Nati S (2013) Mini-batch primal and dual methods for svms. In: ICML (3), pp 1022–1030
Mu L, Tong Z, Yuqiang C, Alexander JS (2014) Efficient mini-batch training for stochastic optimization. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 661–670. ACM
Peilin Z, Tong Z (2014) Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv preprintarXiv:1405.3080
Jakub Konečnỳ, Jie Liu, Peter Richtárik, Martin Takáč (2016) Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J Select Top Signal Process 10(2):242–255
Article Google Scholar
Martin T, Peter R, Nathan S (2015) Distributed mini-batch sdca. arXiv preprintarXiv:1507.08322
Prateek J, Sham KM, Rahul K, Praneeth N, Aaron S (2016) Parallelizing stochastic approximation through mini-batching and tail-averaging. Stat 1050:12
Google Scholar
Nicolas LR, Mark S, Francis RB (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in neural information processing systems, pp 2663–2671
Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Adv Neural Inf Process Syst 26:315–323
Google Scholar
Defazio A, Bach F, Lacoste-Julien S (2014) Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Adv Neural Inf Process Syst 27:1646–1654
Google Scholar
Lin H, Mairal J, Harchaoui Z (2015) A universal catalyst for first-order optimization. Adv Neural Inf Process Syst 28:3384–3392
Google Scholar
Frostig R, Ge R, Kakade S, Sidford A (2015) Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization. In: Proceedings of International Conference on Machine Learning vol 32, pp 2540–2548
Blake EW, Srebro N (2016) Tight complexity bounds for optimizing composite objectives. Adv Neural Inf Process Syst 29:3639–3647
Google Scholar
Arjevani Y, Shamir O (2016) Dimension-free iteration complexity of finite sum optimization problems. Adv Neural Inf Process Syst 29:3540–3548
Google Scholar
Mark S, Le RN (2013) Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprintarXiv:1308.6370
Lam MN, Liu J, Scheinberg K, Takáč M (2017) Sarah: A novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of international conference on machine learning vol 34, pp 2613–2621
Ofer D, Ran G-B, Ohad S, Lin X (2012) Optimal distributed online prediction using mini-batches. J Mach Learn Res 13:165–202
MathSciNet MATH Google Scholar
Nidham G, Robert MG, Joseph S (2019) Optimal mini-batch and step sizes for saga. arXiv preprintarXiv:1902.00071
Yurii N (2004) Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, New York
MATH Google Scholar
Panda B, Joshua SH, Basu S, Roberto JB (2009) Planet: massively parallel learning of tree ensembles with mapreduce. Proc VLDB Endow 2(2):1426–1437
Article Google Scholar
Chi Z, Feifei L, Jeffrey J (2012) Efficient parallel knn joins for large data in mapreduce. In: Proceedings of the 15th international conference on extending database technology, pp 38–49
Tianyang S, Chengchun S, Feng L, Yu H, Lili M, Yitong F (2009) An efficient hierarchical clustering method for large datasets with map-reduce. In: 2009 International conference on parallel and distributed computing, applications and technologies, pp 494–499
Yaobin H, Haoyu T, Wuman L, Huajian M, Di M, Shengzhong F, Jianping F (2011) Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce. In: 2011 IEEE 17th international conference on parallel and distributed systems, pp 473–480
Weizhong Z, Huifang M, Qing H (2009) Parallel k-means clustering based on mapreduce. In: IEEE international conference on cloud computing, pp 674–679
Xin J, Wang Z, Chen C, Ding L, Wang G, Zhao Y (2014) Elm: distributed extreme learning machine with mapreduce. World Wide Web 17(5):1189–1204
Article Google Scholar
Wang B, Huang S, Qiu J, Yu L, Wang G (2015) Parallel online sequential extreme learning machine based on mapreduce. Neurocomputing 149:224–232
Article Google Scholar
Bi X, Zhao X, Wang G, Zhang P, Wang C (2015) Distributed extreme learning machine with kernels based on mapreduce. Neurocomputing 149:456–463
Article Google Scholar
Çatak FO (2017) Classification with boosting of extreme learning machine over arbitrarily partitioned data. Soft Comput 21(9):2269–2281
Article Google Scholar
Xi C, Qihang L, Javier P (2012) Optimal regularized dual averaging methods for stochastic optimization. In: Advances in neural information processing systems, pp 395–403
Hongzhou L, Julien M, Zaid H (2017) Catalyst acceleration for first-order convex optimization: from theory to practice. arXiv preprintarXiv:1712.05654

Download references

Acknowledgements

AN was partially supported by JSPS KAKENHI (19K20337) and JST PRESTO. TS was partially supported by JSPS KAKENHI (18H03201, and 20H00576), Japan Digital Design and JST CREST.

Author information

Authors and Affiliations

Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Fukuoka, Japan
Atsushi Nitanda
NTT DATA Mathematical Systems Inc., Tokyo, Japan
Tomoya Murata
Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Tomoya Murata & Taiji Suzuki
Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
Taiji Suzuki

Authors

Atsushi Nitanda
View author publications
You can also search for this author in PubMed Google Scholar
Tomoya Murata
View author publications
You can also search for this author in PubMed Google Scholar
Taiji Suzuki
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Atsushi Nitanda.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Convergence Analysis of Acc-SVRG with APPA

We here provide a detailed description concerning convergence analyses for Acc-SVRG with or without APPA. The following convergence theorem is a special case of $n \ge \sqrt{\kappa }$ in [3].

Theorem A

([3]). Consider Algorithm 1. Assume $n\ge \sqrt{\kappa }$ and $g_i$ are L-smooth and $\mu $-strongly convex functions. Let $p\in (0,1)$ satisfy $\frac{2p(2+p)}{1-p} \le \frac{1}{2}$. Set $b\ge \frac{8\sqrt{\kappa }n}{\sqrt{2}p(n-1)+8\sqrt{\kappa }}$, $\eta = \frac{1}{2L}$, $m = \left\lceil \frac{\sqrt{2\kappa }}{1-p}\log \left( \frac{1-p}{p}\right) \right\rceil $. Then, we get

$$\begin{aligned} {\mathbb {E}}[ f({\tilde{x}}_{s+1}) - f_* ] \le \frac{1}{2}( f({\tilde{x}}_{s}) - f_* ) \end{aligned}$$

and we get the following iteration and total complexities by setting $T=O(\log (1/\epsilon ))$.

$$\begin{aligned} I(A(b),f)&= O\left( \left( \frac{n}{b} + \sqrt{\kappa } \right) \log \left( \frac{1}{\epsilon }\right) \right) , \\ T(A(b),f)&= O\left( \left( n + b\sqrt{\kappa } \right) \log \left( \frac{1}{\epsilon }\right) \right) . \end{aligned}$$

When $n\ge \kappa $, the minibatch size of $b=O(n/\sqrt{\kappa })$ satisfies the assumption in this theorem, thus, we can deduce the first claim in the paper concerning Acc-SVRG without APPA:

$$\begin{aligned} I(A(b),f)&= O( \sqrt{\kappa } \log ( 1/\epsilon ) ), \\ T(A(b),f)&= O( n \log ( 1/\epsilon ) ). \end{aligned}$$

The following lemma is key result to derive a total complexity of an optimization method accelerated by APPA. Concretely, Lemma A gives an iteration complexity of Algorithm 2 ignoring a complexity of an inner-solver ${\mathscr {M}}$.

Lemma A

([18]) Suppose that f is $\mu $-strongly convex and $\gamma \ge 3\mu /2$. Consider Algorithm 2. If a generated sequence $\{w_s\}_{s=1}^{T_c +1}$ satisfies the following

$$\begin{aligned} {\mathbb {E}}[ F_s(w_{s+1}) ] -F_s^* \le \frac{1}{4}\left( \frac{\mu }{\mu +2\gamma }\right) ^{\frac{3}{2}}( F_s(w_{s}) - F_s^* ). \end{aligned}$$

(6)

Then, there exists a positive value $C_0$ which only depend on $w_0$ such that

$$\begin{aligned} {\mathbb {E}}[ f(w_{s}) ] - f_* \le \left( 1-\frac{1}{2}\sqrt{ \frac{\mu }{\mu +2\gamma } }\right) ^s C_0. \end{aligned}$$

(7)

From Theorem A and Lemma A, we can derive complexities of Algorithm 1 with APPA in the case of $n < \kappa $. To do so, we specify complexities of Algorithm 1 for solving subproblems with required accuracy (6) by APPA. Let $\gamma = \frac{L}{n} - \mu $. Then, the condition number of subproblems are $\frac{L+\gamma }{\mu +\gamma }= O(n)$. When $\kappa > 3n$, $\gamma > 3\mu /2$ follows. From Theorem A, we find that the required inequality (6) is satisfied with following parameters:

$$\begin{aligned} T= O\left( \log \left( \gamma /\mu \right) \right) ,\ b = O(\sqrt{n}),\ m = O(\sqrt{n}),\ \eta = \frac{1}{2(L+\gamma )}. \end{aligned}$$

Therefore, the iteration and total complexities for each run of Acc-SVRG in APPA are $O(\sqrt{n}\log (\gamma /\mu ))$ and $O(n\log (\gamma /\mu ))$. Since the required number of iterates $T_c$ of APPA to achieve $\epsilon $-accurate solution is $O\left( \sqrt{\frac{\gamma }{\mu }}\log \left( \frac{1}{\epsilon }\right) \right) = O\left( \sqrt{\frac{\kappa }{n}}\log \left( \frac{1}{\epsilon }\right) \right) $. Therefore, we get

$$\begin{aligned} I(A(b),f)&= O\left( \sqrt{\kappa } \log \left( \frac{1}{\epsilon } \right) \log \left( \frac{\kappa }{n}\right) \right) , \\ T(A(b),f)&= O\left( \sqrt{n\kappa } \log \left( \frac{1}{\epsilon } \right) \log \left( \frac{\kappa }{n}\right) \right) . \end{aligned}$$

Experimental details

Here, the details of the experiments in Sect. 6 are provided.

The details of the implemented algorithms and their tuning parameters were as follows:

SVRG [15]. We tuned the number of inner iterations m and the learning rate $\eta $.
Acc-SVRG [3]. The number of inner iterations m and the learning rate $\eta $ were tuned. We fixed the momentum parameter $\beta $ as $(1 - \sqrt{\eta \lambda _2})/(1 + \sqrt{\eta \lambda _2})$, where $\lambda _2$ is the $L_2$-regularization parameter^{Footnote 2}.
SVRG with APPA (SVRG + Algorithm 2). In Addition to the above-described tuning parameters, the positive parameter $\gamma $ described in Algorithm 2 was tuned. We fixed $\mu $ as the $L_2$-regularization parameter $\lambda _2$ $^2$.
Acc-SVRG with APPA (Acc-SVRG + Algorithm 2). In the same manner as Acc-SVRG and SVRG with APPA, m, $\eta $ and $\gamma $ are tuned.
DASVRDA [2]. We tuned the number of inner iterations m, the learning rate $\eta $, and additionally the restart interval S only for strongly convex objectives.

For tuning the above parameters, we automatically chose the values from appropriate ranges for each setting, by selecting the ones that led to the minimum objective value after running 50 passes over the data. The used parameter ranges were as follows:

Learning rate $\eta $: $\{0.1, 0.5, 1.0, 5.0, 10.0\}$.
The number of inner iterations m:

$\{n/b, 5n/b, 10n/b, 50n/b, 100n/b\}$, where n is the size of data set and b is the minibatch size.
APPA parameter $\gamma $: $\{10^k \times \lambda _2 \mid k \in \{0, 1, 2, 3, 4\} \}$, where $\lambda _2$ is the $L_2$-regularization parameter$^2$.
Restart Interval S: $\{1, 5, 10, 50, 100\}$ (only for $\lambda _2 > 0$).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Nitanda, A., Murata, T. & Suzuki, T. Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization. Knowl Inf Syst 63, 2513–2539 (2021). https://doi.org/10.1007/s10115-021-01593-1

Download citation

Received: 01 March 2020
Revised: 30 June 2021
Accepted: 03 July 2021
Published: 12 July 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s10115-021-01593-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization

Abstract

Access this article

Similar content being viewed by others

Adaptive step size rules for stochastic optimization in large-scale learning

Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction

An Overview of Stochastic Quasi-Newton Methods for Large-Scale Machine Learning

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Convergence Analysis of Acc-SVRG with APPA

Theorem A

Lemma A

Experimental details

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization

Abstract

Access this article

Similar content being viewed by others

Adaptive step size rules for stochastic optimization in large-scale learning

Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction

An Overview of Stochastic Quasi-Newton Methods for Large-Scale Machine Learning

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Convergence Analysis of Acc-SVRG with APPA

Theorem A

Lemma A

Experimental details

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation