Abstract
Estimating the weights of deep feedforward neural networks (DFNNs) requires solving a very large nonconvex optimization problem that may have many local (non-global) minimizers, saddle points, and large plateaus. Moreover, the time needed to find good solutions of the training problem depends heavily on both the number of samples and the number of weights (variables). In this work, we show how block coordinate descent (BCD) methods can be fruitfully applied to the DFNN weight optimization problem and embedded in online frameworks, possibly avoiding bad stationary points. We first describe a batch BCD method able to effectively tackle the difficulties due to the network's depth; we then extend the algorithm, proposing an online BCD scheme able to scale with respect to both the number of variables and the number of samples. We report extensive numerical experiments on standard datasets using various deep networks, showing that applying BCD methods to the training problem of DFNNs improves over standard batch/online algorithms in the training phase while also guaranteeing good generalization performance.
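The core idea of a layer-wise BCD scheme, as described in the abstract, is to partition the weights by layer and update one layer at a time while the others are kept frozen. The following is a minimal NumPy sketch under that interpretation, on a two-layer tanh network with squared loss; the data, step size, and per-block gradient updates are illustrative assumptions, not the authors' exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))          # training samples
y = rng.normal(size=(64, 1))          # targets

W1 = 0.1 * rng.normal(size=(5, 8))    # block 1: hidden-layer weights
W2 = 0.1 * rng.normal(size=(8, 1))    # block 2: output-layer weights

def forward(W1, W2):
    H = np.tanh(X @ W1)               # hidden activations
    return H, H @ W2                  # activations and predictions

def loss(W1, W2):
    _, pred = forward(W1, W2)
    return 0.5 * np.mean((pred - y) ** 2)

lr = 0.1
loss_init = loss(W1, W2)
for epoch in range(200):
    # Block update 1: output layer W2, with W1 frozen
    H, pred = forward(W1, W2)
    grad_W2 = H.T @ (pred - y) / len(X)
    W2 -= lr * grad_W2
    # Block update 2: hidden layer W1, with the updated W2 frozen
    H, pred = forward(W1, W2)
    delta = ((pred - y) @ W2.T) * (1.0 - H ** 2)   # backprop through tanh
    grad_W1 = X.T @ delta / len(X)
    W1 -= lr * grad_W1
loss_final = loss(W1, W2)
```

Cycling through the blocks means each inner step solves (approximately) a much smaller problem than full gradient descent on all weights at once; an online variant would additionally subsample the rows of `X` at each block update.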
Laura Palagi was partially supported by the project Distributed optimization algorithms for Big Data of Sapienza No. RM11715C7E49E89C.
Cite this article
Palagi, L., Seccia, R. Block layer decomposition schemes for training deep neural networks. J Glob Optim 77, 97–124 (2020). https://doi.org/10.1007/s10898-019-00856-0