
Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization

Regular paper · Published in Knowledge and Information Systems

Abstract

The minibatching technique has been widely adopted to accelerate stochastic first-order methods because of its computational efficiency in parallel computing for large-scale machine learning and data mining. Indeed, increasing the minibatch size decreases the iteration complexity (the number of minibatch queries) required to converge, and the running time decreases accordingly when each minibatch is processed in parallel. However, this gain typically saturates once the minibatch size becomes too large, and the total computational complexity (the number of accesses to examples) deteriorates. Hence, determining an appropriate minibatch size, which controls the trade-off between the iteration and total computational complexities, is important for maximizing performance of the method with as few computational resources as possible. In this study, we define the optimal minibatch size as the minimum minibatch size with which there exists a stochastic first-order method that achieves the optimal iteration complexity, and we call such a method an optimal minibatch method. Moreover, we show that Katyusha (Allen-Zhu, in: Proceedings of the annual ACM SIGACT symposium on theory of computing, vol 49, pp 1200–1205, ACM, 2017), DASVRDA (Murata and Suzuki, in: Advances in neural information processing systems, vol 30, pp 608–617, 2017), and the proposed method, which combines Acc-SVRG (Nitanda, in: Advances in neural information processing systems, vol 27, pp 1574–1582, 2014) with APPA (Frostig et al., in: Proceedings of the international conference on machine learning, pp 2540–2548, 2015), are optimal minibatch methods. In experiments, we compare optimal minibatch methods with several competitors on \(L_1\)- and \(L_2\)-regularized logistic regression problems and observe that the iteration complexities of the optimal minibatch methods decrease linearly as the minibatch size increases, up to reasonable minibatch sizes, and finally attain the best iteration complexities. This confirms the computational efficiency of optimal minibatch methods suggested by the theory.


Notes

  1. In this case, all the methods converged within one or two epochs and performed very similarly to each other.

  2. When \(\lambda _2 = 0\), we used a dummy \(L_2\)-regularization with \(\lambda _2 = 10^{-8}\), which was roughly \(\varepsilon \Vert x_*\Vert ^2\), where \(\varepsilon \) is the desired accuracy (set to \(10^{-4}\)) and \(x_*\) is the optimal solution.

References

  1. Allen-Zhu Z (2017) Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of annual ACM SIGACT symposium on theory of computing 49, pp 1200–1205. ACM

  2. Murata T, Suzuki T (2017) Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. Adv Neural Inf Process Syst 30:608–617

  3. Nitanda A (2014) Stochastic proximal gradient descent with acceleration techniques. Adv Neural Inf Process Syst 27:1574–1582

  4. Lin Q, Lu Z, Xiao L (2014) An accelerated proximal coordinate gradient method. Adv Neural Inf Process Syst 27:3059–3067

  5. Cotter A, Shamir O, Srebro N, Sridharan K (2011) Better mini-batch algorithms via accelerated gradient methods. Adv Neural Inf Process Syst 24:1647–1655

  6. Shalev-Shwartz S, Zhang T (2013) Accelerated mini-batch stochastic dual coordinate ascent. Adv Neural Inf Process Syst 26:378–385

  7. Zhao K, Huang L (2013) Minibatch and parallelization for online large margin structured learning. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 370–379

  8. Takáč M, Bijral AS, Richtárik P, Srebro N (2013) Mini-batch primal and dual methods for SVMs. In: ICML (3), pp 1022–1030

  9. Li M, Zhang T, Chen Y, Smola AJ (2014) Efficient mini-batch training for stochastic optimization. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 661–670. ACM

  10. Zhao P, Zhang T (2014) Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv preprint arXiv:1405.3080

  11. Konečný J, Liu J, Richtárik P, Takáč M (2016) Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J Sel Top Signal Process 10(2):242–255

  12. Takáč M, Richtárik P, Srebro N (2015) Distributed mini-batch SDCA. arXiv preprint arXiv:1507.08322

  13. Jain P, Kakade SM, Kidambi R, Netrapalli P, Sidford A (2016) Parallelizing stochastic approximation through mini-batching and tail-averaging. Stat 1050:12

  14. Le Roux N, Schmidt M, Bach F (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in neural information processing systems, pp 2663–2671

  15. Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Adv Neural Inf Process Syst 26:315–323

  16. Defazio A, Bach F, Lacoste-Julien S (2014) Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Adv Neural Inf Process Syst 27:1646–1654

  17. Lin H, Mairal J, Harchaoui Z (2015) A universal catalyst for first-order optimization. Adv Neural Inf Process Syst 28:3384–3392

  18. Frostig R, Ge R, Kakade S, Sidford A (2015) Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization. In: Proceedings of International Conference on Machine Learning vol 32, pp 2540–2548

  19. Woodworth BE, Srebro N (2016) Tight complexity bounds for optimizing composite objectives. Adv Neural Inf Process Syst 29:3639–3647

  20. Arjevani Y, Shamir O (2016) Dimension-free iteration complexity of finite sum optimization problems. Adv Neural Inf Process Syst 29:3540–3548

  21. Schmidt M, Le Roux N (2013) Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370

  22. Nguyen LM, Liu J, Scheinberg K, Takáč M (2017) SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of international conference on machine learning vol 34, pp 2613–2621

  23. Dekel O, Gilad-Bachrach R, Shamir O, Xiao L (2012) Optimal distributed online prediction using mini-batches. J Mach Learn Res 13:165–202

  24. Gazagnadou N, Gower RM, Salmon J (2019) Optimal mini-batch and step sizes for SAGA. arXiv preprint arXiv:1902.00071

  25. Nesterov Y (2004) Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, New York

  26. Panda B, Herbach JS, Basu S, Bayardo RJ (2009) PLANET: massively parallel learning of tree ensembles with MapReduce. Proc VLDB Endow 2(2):1426–1437

  27. Zhang C, Li F, Jestes J (2012) Efficient parallel kNN joins for large data in MapReduce. In: Proceedings of the 15th international conference on extending database technology, pp 38–49

  28. Sun T, Shu C, Li F, Yu H, Ma L, Fang Y (2009) An efficient hierarchical clustering method for large datasets with Map-Reduce. In: 2009 international conference on parallel and distributed computing, applications and technologies, pp 494–499

  29. He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J (2011) MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: 2011 IEEE 17th international conference on parallel and distributed systems, pp 473–480

  30. Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on MapReduce. In: IEEE international conference on cloud computing, pp 674–679

  31. Xin J, Wang Z, Chen C, Ding L, Wang G, Zhao Y (2014) Elm: distributed extreme learning machine with mapreduce. World Wide Web 17(5):1189–1204

  32. Wang B, Huang S, Qiu J, Yu L, Wang G (2015) Parallel online sequential extreme learning machine based on mapreduce. Neurocomputing 149:224–232

  33. Bi X, Zhao X, Wang G, Zhang P, Wang C (2015) Distributed extreme learning machine with kernels based on mapreduce. Neurocomputing 149:456–463

  34. Çatak FO (2017) Classification with boosting of extreme learning machine over arbitrarily partitioned data. Soft Comput 21(9):2269–2281

  35. Chen X, Lin Q, Peña J (2012) Optimal regularized dual averaging methods for stochastic optimization. In: Advances in neural information processing systems, pp 395–403

  36. Lin H, Mairal J, Harchaoui Z (2017) Catalyst acceleration for first-order convex optimization: from theory to practice. arXiv preprint arXiv:1712.05654

Acknowledgements

AN was partially supported by JSPS KAKENHI (19K20337) and JST PRESTO. TS was partially supported by JSPS KAKENHI (18H03201 and 20H00576), Japan Digital Design, and JST CREST.

Author information

Corresponding author

Correspondence to Atsushi Nitanda.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Convergence Analysis of Acc-SVRG with APPA

Here we provide a detailed description of the convergence analyses for Acc-SVRG with and without APPA. The following convergence theorem is the special case \(n \ge \sqrt{\kappa }\) of [3].

Theorem A

([3]). Consider Algorithm 1. Assume \(n\ge \sqrt{\kappa }\) and that the \(g_i\) are \(L\)-smooth and \(\mu \)-strongly convex. Let \(p\in (0,1)\) satisfy \(\frac{2p(2+p)}{1-p} \le \frac{1}{2}\). Set \(b\ge \frac{8\sqrt{\kappa }n}{\sqrt{2}p(n-1)+8\sqrt{\kappa }}\), \(\eta = \frac{1}{2L}\), and \(m = \left\lceil \frac{\sqrt{2\kappa }}{1-p}\log \left( \frac{1-p}{p}\right) \right\rceil \). Then, we get

$$\begin{aligned} {\mathbb {E}}[ f({\tilde{x}}_{s+1}) - f_* ] \le \frac{1}{2}( f({\tilde{x}}_{s}) - f_* ) \end{aligned}$$

and, by setting \(T=O(\log (1/\epsilon ))\), we obtain the following iteration and total complexities:

$$\begin{aligned} I(A(b),f)&= O\left( \left( \frac{n}{b} + \sqrt{\kappa } \right) \log \left( \frac{1}{\epsilon }\right) \right) , \\ T(A(b),f)&= O\left( \left( n + b\sqrt{\kappa } \right) \log \left( \frac{1}{\epsilon }\right) \right) . \end{aligned}$$

When \(n\ge \kappa \), the minibatch size \(b=O(n/\sqrt{\kappa })\) satisfies the assumption of this theorem; thus, we can deduce the first claim in the paper concerning Acc-SVRG without APPA:

$$\begin{aligned} I(A(b),f)&= O( \sqrt{\kappa } \log ( 1/\epsilon ) ), \\ T(A(b),f)&= O( n \log ( 1/\epsilon ) ). \end{aligned}$$
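
To make the trade-off between these two bounds concrete, the following minimal numerical sketch (not part of the original paper) evaluates the iteration and total complexity bounds of Theorem A for hypothetical values of \(n\), \(\kappa \), and \(\epsilon \), with all hidden constants dropped.

```python
# Minimal sketch (assumed constants, not from the paper): evaluate the Theorem A
# bounds I(A(b), f) = O((n/b + sqrt(kappa)) log(1/eps)) and
# T(A(b), f) = O((n + b*sqrt(kappa)) log(1/eps)) as functions of the minibatch size b.
import math

def iteration_complexity(b, n, kappa, eps):
    # Number of minibatch queries, up to constants.
    return (n / b + math.sqrt(kappa)) * math.log(1 / eps)

def total_complexity(b, n, kappa, eps):
    # Number of accesses to individual examples, up to constants.
    return (n + b * math.sqrt(kappa)) * math.log(1 / eps)

n, kappa, eps = 10**6, 10**4, 1e-4                  # hypothetical problem sizes
b_star = int(n / math.sqrt(kappa))                  # b = O(n / sqrt(kappa))
for b in [1, 10, 100, 1000, b_star, 10 * b_star]:
    print(f"b = {b:>7d}   I = {iteration_complexity(b, n, kappa, eps):12.1f}"
          f"   T = {total_complexity(b, n, kappa, eps):14.1f}")

# Up to b = O(n / sqrt(kappa)) the iteration complexity shrinks roughly linearly
# in b while the total complexity stays at O(n log(1/eps)); beyond that point the
# iteration complexity saturates at O(sqrt(kappa) log(1/eps)) and the total
# complexity starts to grow, which is exactly the trade-off discussed above.
```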

The following lemma is a key result for deriving the total complexity of an optimization method accelerated by APPA. Concretely, Lemma A gives the iteration complexity of Algorithm 2, ignoring the complexity of the inner solver \({\mathscr {M}}\).

Lemma A

([18]) Suppose that \(f\) is \(\mu \)-strongly convex and \(\gamma \ge 3\mu /2\), and consider Algorithm 2. Suppose the generated sequence \(\{w_s\}_{s=1}^{T_c +1}\) satisfies

$$\begin{aligned} {\mathbb {E}}[ F_s(w_{s+1}) ] -F_s^* \le \frac{1}{4}\left( \frac{\mu }{\mu +2\gamma }\right) ^{\frac{3}{2}}( F_s(w_{s}) - F_s^* ). \end{aligned}$$
(6)

Then, there exists a positive constant \(C_0\), depending only on \(w_0\), such that

$$\begin{aligned} {\mathbb {E}}[ f(w_{s}) ] - f_* \le \left( 1-\frac{1}{2}\sqrt{ \frac{\mu }{\mu +2\gamma } }\right) ^s C_0. \end{aligned}$$
(7)

From Theorem A and Lemma A, we can derive the complexities of Algorithm 1 with APPA in the case \(n < \kappa \). To do so, we specify the complexities of Algorithm 1 for solving the subproblems of APPA to the required accuracy (6). Let \(\gamma = \frac{L}{n} - \mu \). Then, the condition number of the subproblems is \(\frac{L+\gamma }{\mu +\gamma }= O(n)\), and when \(\kappa > 3n\), \(\gamma > 3\mu /2\) follows. From Theorem A, we find that the required inequality (6) is satisfied with the following parameters:

$$\begin{aligned} T= O\left( \log \left( \gamma /\mu \right) \right) ,\ b = O(\sqrt{n}),\ m = O(\sqrt{n}),\ \eta = \frac{1}{2(L+\gamma )}. \end{aligned}$$

Therefore, the iteration and total complexities of each run of Acc-SVRG within APPA are \(O(\sqrt{n}\log (\gamma /\mu ))\) and \(O(n\log (\gamma /\mu ))\), respectively. Moreover, the required number of outer iterations \(T_c\) of APPA to achieve an \(\epsilon \)-accurate solution is \(O\left( \sqrt{\frac{\gamma }{\mu }}\log \left( \frac{1}{\epsilon }\right) \right) = O\left( \sqrt{\frac{\kappa }{n}}\log \left( \frac{1}{\epsilon }\right) \right) \). Therefore, we get

$$\begin{aligned} I(A(b),f)&= O\left( \sqrt{\kappa } \log \left( \frac{1}{\epsilon } \right) \log \left( \frac{\kappa }{n}\right) \right) , \\ T(A(b),f)&= O\left( \sqrt{n\kappa } \log \left( \frac{1}{\epsilon } \right) \log \left( \frac{\kappa }{n}\right) \right) . \end{aligned}$$
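
For readers who prefer code to recursions, the following schematic sketch illustrates the structure of an APPA (Catalyst-type) outer loop as used in the derivation above: each outer iteration approximately minimizes the regularized subproblem \(F_s(w) = f(w) + \frac{\gamma }{2}\Vert w - y_s\Vert ^2\) and then extrapolates. The plain gradient-descent inner solver, the quadratic test objective, and the extrapolation weight below are illustrative placeholders, not the paper's exact Algorithm 2, in which the inner solver is Acc-SVRG with the parameters of Theorem A.

```python
# Schematic APPA/Catalyst-style outer loop (illustrative sketch, not Algorithm 2
# verbatim): each stage approximately minimizes F_s(w) = f(w) + (gamma/2)*||w - y_s||^2
# and then extrapolates. Gradient descent replaces Acc-SVRG as the inner solver
# purely for brevity; the test problem is a strongly convex quadratic.
import numpy as np

rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((d, d))
H = A.T @ A + 0.1 * np.eye(d)              # Hessian of the quadratic objective
c = rng.standard_normal(d)
f = lambda w: 0.5 * w @ H @ w - c @ w
grad_f = lambda w: H @ w - c

mu = np.linalg.eigvalsh(H).min()           # strong convexity parameter
L = np.linalg.eigvalsh(H).max()            # smoothness parameter
gamma = 10.0 * mu                          # illustrative; the analysis above uses gamma = L/n - mu

def inner_solver(y, w0, n_steps=600):
    """Approximately minimize F(w) = f(w) + (gamma/2)*||w - y||^2 by gradient descent."""
    w, step = w0.copy(), 1.0 / (L + gamma)
    for _ in range(n_steps):
        w = w - step * (grad_f(w) + gamma * (w - y))
    return w

q = np.sqrt(mu / (mu + gamma))
beta = (1.0 - q) / (1.0 + q)               # asymptotic Catalyst-style extrapolation weight
w = np.zeros(d)
y = np.zeros(d)
for s in range(30):                        # outer (APPA) iterations
    w_next = inner_solver(y, w)            # approximate proximal-point step
    y = w_next + beta * (w_next - w)       # momentum / extrapolation
    w = w_next

w_star = np.linalg.solve(H, c)             # exact minimizer of the quadratic
print("f(w) - f_* =", f(w) - f(w_star))
```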

Experimental details

Here, the details of the experiments in Sect. 6 are provided.

The details of the implemented algorithms and their tuning parameters were as follows (a schematic minibatch SVRG implementation is sketched after this list):

  • SVRG [15]. We tuned the number of inner iterations m and the learning rate \(\eta \).

  • Acc-SVRG [3]. The number of inner iterations m and the learning rate \(\eta \) were tuned. We fixed the momentum parameter \(\beta \) as \((1 - \sqrt{\eta \lambda _2})/(1 + \sqrt{\eta \lambda _2})\), where \(\lambda _2\) is the \(L_2\)-regularization parameter (Footnote 2).

  • SVRG with APPA (SVRG + Algorithm 2). In addition to the above-described tuning parameters, the positive parameter \(\gamma \) described in Algorithm 2 was tuned. We fixed \(\mu \) as the \(L_2\)-regularization parameter \(\lambda _2\) (Footnote 2).

  • Acc-SVRG with APPA (Acc-SVRG + Algorithm 2). In the same manner as Acc-SVRG and SVRG with APPA, m, \(\eta \), and \(\gamma \) were tuned.

  • DASVRDA [2]. We tuned the number of inner iterations m, the learning rate \(\eta \), and, additionally, the restart interval S (only for strongly convex objectives).
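
As announced above, the following is a schematic minibatch SVRG loop in the spirit of [15], applied to \(L_2\)-regularized logistic regression on synthetic data. It is a sketch of the common structure shared by the implemented methods, not the authors' code; the data, the constants, and the default values of b, \(\eta \), and m are hypothetical.

```python
# Schematic minibatch SVRG (sketch, not the authors' implementation) for
# L2-regularized logistic regression on synthetic data. b, eta, and m correspond
# to the tuning parameters listed above; all concrete values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam2 = 1000, 20, 1e-3
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def full_grad(w):
    # Gradient of (1/n) * sum_i log(1 + exp(-y_i x_i^T w)) + (lam2/2) * ||w||^2.
    z = -y * (X @ w)
    return X.T @ (-y / (1.0 + np.exp(-z))) / n + lam2 * w

def minibatch_grad(w, idx):
    Xb, yb = X[idx], y[idx]
    z = -yb * (Xb @ w)
    return Xb.T @ (-yb / (1.0 + np.exp(-z))) / len(idx) + lam2 * w

def svrg(b=16, eta=0.05, m=None, n_epochs=30):
    m = m if m is not None else n // b     # inner iterations per epoch (tunable)
    w_tilde = np.zeros(d)
    for _ in range(n_epochs):
        g_full = full_grad(w_tilde)        # full gradient at the snapshot point
        w = w_tilde.copy()
        for _ in range(m):
            idx = rng.choice(n, size=b, replace=False)
            v = minibatch_grad(w, idx) - minibatch_grad(w_tilde, idx) + g_full
            w = w - eta * v                # variance-reduced minibatch step
        w_tilde = w                        # snapshot update (last inner iterate)
    return w_tilde

w_hat = svrg(b=16, eta=0.05)
print("norm of full gradient at output:", np.linalg.norm(full_grad(w_hat)))
```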

To tune the above parameters, for each setting we automatically chose the values from appropriate ranges by selecting those that led to the minimum objective value after 50 passes over the data (a sketch of this grid search follows the list of ranges below). The parameter ranges were as follows:

  • Learning rate \(\eta \): \(\{0.1, 0.5, 1.0, 5.0, 10.0\}\).

  • The number of inner iterations m: \(\{n/b, 5n/b, 10n/b, 50n/b, 100n/b\}\), where n is the size of the dataset and b is the minibatch size.

  • APPA parameter \(\gamma \): \(\{10^k \times \lambda _2 \mid k \in \{0, 1, 2, 3, 4\} \}\), where \(\lambda _2\) is the \(L_2\)-regularization parameter (Footnote 2).

  • Restart interval S: \(\{1, 5, 10, 50, 100\}\) (only for \(\lambda _2 > 0\)).
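
The following is a minimal sketch of this tuning protocol: an exhaustive grid search over the ranges listed above, keeping the configuration with the smallest objective value after 50 passes over the data. For simplicity, all parameters are placed in a single grid, whereas in the experiments each method is tuned only over its own parameters; `run_method` is a hypothetical hook that runs one of the methods above and returns its final objective value.

```python
# Minimal sketch of the grid-search tuning described above. `run_method(params,
# n_passes)` is a hypothetical placeholder for running one of the implemented
# methods and returning the objective value reached after n_passes data passes.
import itertools

def tune(run_method, n, b, lam2, strongly_convex):
    grid = {
        "eta": [0.1, 0.5, 1.0, 5.0, 10.0],
        "m": [k * n // b for k in (1, 5, 10, 50, 100)],
        "gamma": [(10 ** k) * lam2 for k in range(5)],
    }
    if strongly_convex:                         # restart interval only when lam2 > 0
        grid["S"] = [1, 5, 10, 50, 100]
    best_params, best_obj = None, float("inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        obj = run_method(params, n_passes=50)   # objective after 50 passes over the data
        if obj < best_obj:
            best_params, best_obj = params, obj
    return best_params, best_obj

# Example with a dummy objective, for illustration only:
dummy = lambda params, n_passes: params["eta"] + 1e-6 * params["m"]
best_params, best_obj = tune(dummy, n=1000, b=16, lam2=1e-3, strongly_convex=True)
print(best_params, best_obj)
```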


Cite this article

Nitanda, A., Murata, T. & Suzuki, T. Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization. Knowl Inf Syst 63, 2513–2539 (2021). https://doi.org/10.1007/s10115-021-01593-1
