Abstract
Compressed sensing using \(\ell _1\) regularization is among the most powerful and popular sparsification techniques in many applications, so why has it not been used to obtain sparse deep learning models such as convolutional neural networks (CNNs)? This paper aims to answer this question and to show how to make \(\ell _1\) regularization work. Following Xiao (J Mach Learn Res 11(Oct):2543–2596, 2010), we first demonstrate that the commonly used stochastic gradient descent (SGD) training algorithm and its variants are not an appropriate match for \(\ell _1\) regularization, and we then replace SGD with a different training algorithm based on the regularized dual averaging (RDA) method. The RDA method of Xiao (J Mach Learn Res 11(Oct):2543–2596, 2010) was originally designed specifically for convex problems, but with new theoretical insight and algorithmic modifications (proper initialization and adaptivity), we have made it an effective match for \(\ell _1\) regularization. Compared to other weight pruning methods, the resulting algorithm achieves state-of-the-art sparsity for highly non-convex CNNs without compromising accuracy (for example, 95% sparsity for ResNet-18 on CIFAR-10).
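As a rough illustration of why dual averaging pairs naturally with \(\ell _1\) regularization, the sketch below implements the basic closed-form \(\ell _1\)-RDA step from Xiao (2010) for the simple choice \(h(w)=\tfrac{1}{2}\Vert w\Vert _2^2\) and \(\beta _t=\gamma \sqrt{t}\), applied to a toy convex least-squares problem. It is not the modified algorithm of this paper (the initialization and adaptivity changes are not reproduced), and the function names, the toy data and the values of `lam` and `gamma` are assumptions made only for illustration.

```python
import numpy as np


def l1_rda_step(g_bar, t, lam, gamma):
    """Basic l1-RDA update (Xiao, 2010) with h(w) = 0.5*||w||^2, beta_t = gamma*sqrt(t).

    Solves argmin_w <g_bar, w> + lam*||w||_1 + (gamma/sqrt(t)) * 0.5*||w||^2
    in closed form: coordinates whose averaged gradient has magnitude <= lam
    are set exactly to zero, the rest are soft-thresholded and rescaled.
    """
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(np.sqrt(t) / gamma) * shrunk


def run_l1_rda(X, y, lam=0.05, gamma=20.0, epochs=5, seed=0):
    """Hypothetical driver: one sample per step on a least-squares loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    g_bar = np.zeros(d)  # running average of all past stochastic gradients
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5*(x.w - y)^2
            g_bar += (grad - g_bar) / t      # incremental mean update
            w = l1_rda_step(g_bar, t, lam, gamma)
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 10))
    w_true = np.zeros(10)
    w_true[:2] = [2.0, -1.5]                 # sparse ground truth (toy data)
    y = X @ w_true + 0.01 * rng.standard_normal(200)
    w = run_l1_rda(X, y)
    print("exact zeros:", int(np.sum(w == 0.0)), "of", w.size)
```

The threshold is applied to the averaged gradient rather than to a single noisy stochastic gradient, so a coordinate is zeroed whenever its average stays below `lam` in magnitude. In a plain SGD step with an \(\ell _1\) subgradient, by contrast, the weights only fluctuate around zero and rarely become exactly zero.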
Notes
In the original paper [38] (Xiao, 2010), RDA is proposed as an online learning algorithm that processes one input at each time step.
References
Alvarez, J.M., Salzmann, M.: Learning the number of neurons in deep networks. In: Advances in Neural Information Processing Systems, pp. 2270–2278 (2016)
Bertsekas, D.P.: Incremental proximal methods for large scale convex optimization. Math. Program. 129, 163 (2011)
Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks (2017). arXiv preprint arXiv:1710.09282
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10(Dec), 2899–2934 (2009)
Eldar, Y.C., Kutyniok, G.: Compressed Sensing: Theory and Applications. Cambridge University Press, Cambridge (2012)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding (2015). arXiv preprint arXiv:1510.00149
Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015)
Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: optimal brain surgeon. In: Advances in Neural Information Processing Systems, pp. 164–171 (1993)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision (ICCV), vol. 2 (2017)
Hu, H., Peng, R., Tai, Y.-W., Tang, C.-K.: Network trimming: a data-driven neuron pruning approach towards efficient deep architectures (2016). arXiv preprint arXiv:1607.03250
Huang, Z., Wang, N.: Data-driven sparse structure selection for deep neural networks (2017). arXiv preprint arXiv:1707.01213
Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. J. Mach. Learn. Res. 10(2), 777–801 (2009)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances in Neural Information Processing Systems, pp. 598–605 (1990)
LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient backprop. In: Montavon, G., Orr, G., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade, pp. 9–48. Springer, Berlin (2012)
Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets (2016). arXiv preprint arXiv:1608.08710
Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763. IEEE (2017)
Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T.: Rethinking the value of network pruning (2018). arXiv preprint arXiv:1810.05270
Luo, J.-H., Wu, J., Lin, W.: ThiNet: a filter level pruning method for deep neural network compression (2017). arXiv preprint arXiv:1707.06342
Lustig, M., Donoho, D., Pauly, J.M.: Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn. Reson. Med. Off. J. Int. Soc. Magn. Reson. Med. 58(6), 1182–1195 (2007)
McMahan, B.: Follow-the-regularized-leader and mirror descent: equivalence theorems and l1 regularization. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 525–533 (2011)
McMahan, H.B.: A survey of algorithms and analysis for adaptive online learning. J. Mach. Learn. Res. 18(1), 3117–3166 (2017)
Mine, H., Fukushima, M.: A minimization method for the sum of a convex function and a continuously differentiable function. J. Optim. Theory Appl. 33(1), 9–23 (1981)
Mittal, D., Bhardwaj, S., Khapra, M.M., Ravindran, B.: Recovering from random pruning: on the plasticity of deep convolutional neural networks (2018)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)
Pascanu, R., Mikolov, T., Bengio, Y.: Understanding the exploding gradient problem (2012). arXiv preprint arXiv:1211.5063
Pratt, L.Y.: Comparing biases for minimal network construction with back-propagation. In: International Conference on Neural Information Processing Systems, pp. 177–185 (1988)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2074–2082 (2016)
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems, pp. 2116–2124 (2009)
Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11(Oct), 2543–2596 (2010)
Zhu, M., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model compression (2017). arXiv preprint arXiv:1710.01878
Acknowledgements
This work was partially supported by the Penn State and Peking University Joint Center for Computational Mathematics and Applications, the Beijing International Center for Mathematical Research from Peking University, and the Verne M. William Professorship Fund from Penn State University. The research of L. Zhao and L. Zhang was also supported by the China Scholarship Council (for visiting Penn State) and by HKUST16301218 Hong Kong RGC Competitive Earmarked Research Grant (for visiting Penn State), respectively. The authors wish to thank Drs. Lin Xiao and Liang Yang for helpful suggestions and discussions.
About this article
Cite this article
He, J., Jia, X., Xu, J. et al. Make \(\ell _1\) regularization effective in training sparse CNN. Comput Optim Appl 77, 163–182 (2020). https://doi.org/10.1007/s10589-020-00202-1
DOI: https://doi.org/10.1007/s10589-020-00202-1