Neurocomputing

Volume 451, 3 September 2021, Pages 81-94

GenExp: Multi-objective pruning for deep neural network based on genetic algorithm

https://doi.org/10.1016/j.neucom.2021.04.022

Abstract

Unstructured deep neural network (DNN) pruning has been widely studied. However, previous schemes focused only on compressing the model's memory footprint, which has led to relatively low reduction ratios in computational workload. This study demonstrates that the main reason behind this gap is the inconsistent distribution of memory footprint and workload across the layers of a DNN model. Based on this observation, we propose to map the network pruning flow to a multi-objective optimization problem and design an improved genetic algorithm, which can efficiently explore the whole pruning structure space with both pruning goals equally constrained, to find solutions that strike a judicious balance between the DNN's model size and workload. Experiments show that the proposed scheme achieves up to 34% further reduction in computational workload compared to the state-of-the-art pruning schemes [11], [33] for ResNet50 on the ILSVRC-2012 dataset. We have also deployed the pruned ResNet50 models on a dedicated DNN accelerator; the measured data show a considerable 6× reduction in inference time compared to an FPGA accelerator running the dense CNN model quantized in INT8 format, and a 2.27× improvement in power efficiency over a 2080Ti GPU-based implementation.

Introduction

Deep neural networks (DNNs) have achieved remarkable results in computer vision, machine translation, speech recognition, and related fields. However, their high computational and storage costs pose great challenges for deploying DNN-based algorithms, especially in embedded settings with limited hardware resources. To overcome this hurdle, various schemes have been proposed and studied to compress the model size and reduce the computational workload effectively, including model quantization (reducing the arithmetic precision of the parameters) [10], [7], [20], low-rank factorization (factorizing the weight matrix into low-rank matrices) [34], knowledge distillation (distilling the knowledge learned by a complex network and passing it to a small network) [17], [43], and network pruning (trimming unimportant weights) [12], [26], [9], [21].

Although various neural network pruning schemes have been widely studied in the literature [12], [13], [33], [4], current unstructured pruning methods still suffer from two problems. First, from the perspective of hardware implementation, the runtime performance of a deployed DNN model depends not only on its memory footprint but also on its computational workload [41], [40], [39]. However, it is often difficult to explicitly constrain both the model footprint and the computational workload reduction ratios at the same time in the model pruning flow. For instance, a previous study [41] has shown that a neural network model's computational workload is not positively correlated with its number of parameters. Therefore, for unstructured neural network pruning, a good model size compression result cannot guarantee the same effect on computational workload. As compared in Table 1, the studies of [27], [11] both compressed the number of parameters of the ResNet50 model by 5×, yet the achieved computational workload reduction rates are very different. The study of [12] compressed the model size of AlexNet by 9×, while the reduction ratio achieved in workload is only 2.4×. The work of [9] reported an impressive compression ratio of up to 17.6× for the same network model, yet the computational workload is reduced by only 2.9×. Through a layer-by-layer analysis of the sparse models reported by previous studies, we found that pruning methods that focus only on parameter compression often greedily delete parameters in layers with low computational workload, such as fully connected layers. Therefore, the actual performance boost gained from network pruning by previous schemes is relatively small compared to the reported model size reduction.
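
To make this mismatch concrete, the following sketch (our illustration, assuming PyTorch and torchvision; not part of the paper's artifact) tallies each layer's share of parameters against its share of multiply-accumulate (MAC) operations for ResNet50. Fully connected layers typically hold a large fraction of the parameters while contributing only a small fraction of the workload:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    model = resnet50().eval()
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            params = sum(p.numel() for p in module.parameters())
            if isinstance(module, nn.Conv2d):
                # MACs = kernel volume x output channels x output spatial positions
                out_h, out_w = output.shape[2:]
                macs = (module.in_channels // module.groups
                        * module.kernel_size[0] * module.kernel_size[1]
                        * module.out_channels * out_h * out_w)
            else:  # nn.Linear
                macs = module.in_features * module.out_features
            stats[name] = (params, macs)
        return hook

    for name, m in model.named_modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            m.register_forward_hook(make_hook(name))

    with torch.no_grad():
        model(torch.randn(1, 3, 224, 224))

    total_p = sum(p for p, _ in stats.values())
    total_m = sum(m for _, m in stats.values())
    for name, (p, m) in stats.items():
        print(f"{name:30s} params {p / total_p:6.2%}   MACs {m / total_m:6.2%}")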

Second, the layer-wise sparsity ratios are inherently tricky to set compared to other network training hyperparameters. Many previous works [12], [24] have shown that some layers of a DNN are sensitive to pruning, while others can be pruned significantly without degrading the model's accuracy. Consequently, exploring the pruning rate of each layer to balance model size, workload, and accuracy requires significant manual effort, and the result is usually sub-optimal and time-consuming to obtain. Moreover, as DNN models have grown deeper in recent years, hand-crafted model compression methods [12], [9], [32] can no longer achieve effective pruning. Previous works [19], [27], [5] used saliency-based formulas (including the L1 norm and Taylor expansion) to measure the importance of weights and passively obtained the sparsity rate of each layer through a global comparison. However, this type of method cannot explicitly constrain the computational workload of the sparse model. The latest study [30] proposed a histogram-based search strategy to determine the pruning ratio of each bucket; however, the correlation between model search and fine-tuning results is not verified. In addition, the search process of this method limits the accuracy loss threshold and does not fully consider the accuracy recovery enabled by fine-tuning, so the obtained pruning compression ratio is lower than that of most pruning works.
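
As a point of reference, here is a minimal sketch (our illustration, assuming PyTorch and L1-magnitude saliency in the spirit of [19], [27]) of such a global comparison: a single threshold is derived from the overall sparsity target, and each layer's pruning ratio falls out passively, leaving the sparse model's workload unconstrained:

    import torch

    def global_magnitude_prune(weights, target_sparsity):
        """weights: dict of name -> tensor; returns per-layer masks and sparsity ratios."""
        all_mags = torch.cat([w.abs().flatten() for w in weights.values()])
        k = max(1, int(target_sparsity * all_mags.numel()))
        threshold = all_mags.kthvalue(k).values  # k-th smallest magnitude overall
        masks, ratios = {}, {}
        for name, w in weights.items():
            mask = (w.abs() > threshold).float()
            masks[name] = mask
            # Per-layer sparsity is a passive outcome of the global threshold.
            ratios[name] = 1.0 - mask.mean().item()
        return masks, ratios

    # Hypothetical layers: the fc layer, having smaller weights on average, ends
    # up far sparser than conv1 even though only one global target was specified.
    layers = {"conv1": torch.randn(64, 3, 7, 7), "fc": 0.5 * torch.randn(1000, 2048)}
    masks, ratios = global_magnitude_prune(layers, target_sparsity=0.8)
    print(ratios)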

In this paper, we seek to develop a new DNN pruning approach that targets optimal pruning results in both memory footprint and computational workload, and thus significantly improves the runtime performance of the pruned model when deployed on dedicated hardware accelerators. Inspired by the recent studies of [24], [23], we introduce a genetic algorithm-based multi-objective DNN pruning flow that performs sparsity architecture exploration to automatically obtain a suitable sparsity ratio for each layer before pruning and fine-tuning the baseline model, as illustrated by Fig. 1. We name our scheme GenExp. The main contributions of this study are:

  • We propose a genetic algorithm-based multi-objective DNN pruning scheme (GenExp), which targets improved inference performance over previous studies when deployed on dedicated hardware accelerators. GenExp maps DNN pruning as a neural network sparsity architecture space exploration problem and automatically finds high-quality solutions under both the constraints of memory footprint and computational workload without manual settings.

  • We propose several improvements to the genetic algorithm, including combined Gaussian initialization, progressive shrinking mutation, and fine-grained crossover, which allow it to reach better solutions in the continuous sparsity architecture space (see the conceptual sketch after this list).

  • Unlike previous studies, which only report estimated performance numbers, we have verified the effectiveness of the proposed scheme on a dedicated DNN hardware accelerator [37]. Real-world experimental results show that GenExp reduces the ResNet50 model's execution time by up to 6× at the cost of a 0.1% accuracy drop compared to the dense model for the image classification task on ImageNet, and achieves 10× model size compression and 6.5× workload reduction for the YOLOv2 model on the PASCAL VOC object detection task.
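
The following conceptual sketch (variable names and the fitness form are our assumptions, not the paper's exact formulation) illustrates the three ingredients listed above on a chromosome of per-layer sparsity ratios: Gaussian initialization, per-gene (fine-grained) crossover, and a mutation scale that shrinks progressively over generations, with a fitness that constrains both pruning targets:

    import random

    def init_population(num_layers, pop_size, mu=0.7, sigma=0.15):
        # Combined Gaussian initialization: each gene is one layer's sparsity ratio.
        return [[min(max(random.gauss(mu, sigma), 0.0), 0.99) for _ in range(num_layers)]
                for _ in range(pop_size)]

    def crossover(a, b):
        # Fine-grained crossover: genes are exchanged individually, not in blocks.
        return [ai if random.random() < 0.5 else bi for ai, bi in zip(a, b)]

    def mutate(ind, gen, total_gens, base_sigma=0.1):
        # Progressive shrinking: the mutation scale decays as the search converges.
        sigma = base_sigma * (1.0 - gen / total_gens)
        return [min(max(g + random.gauss(0.0, sigma), 0.0), 0.99) for g in ind]

    def fitness(ind, params, macs, size_target, work_target, proxy_acc):
        # Both pruning goals are constrained equally; proxy_acc is a (hypothetical)
        # fast accuracy estimate of the pruned model without full fine-tuning.
        size_ratio = sum(p * (1 - s) for p, s in zip(params, ind)) / sum(params)
        work_ratio = sum(m * (1 - s) for m, s in zip(macs, ind)) / sum(macs)
        penalty = abs(size_ratio - size_target) + abs(work_ratio - work_target)
        return proxy_acc(ind) - penalty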

The remainder of the paper is organized as follows. Section 2 describes related work. The details of our proposed GenExp method are presented in Section 3. Experimental comparisons with other state-of-the-art pruning methods and an ablation study are provided in Section 4. We conclude our work and outline plans for further improvements in Section 5.

Related works

In this section, we summarize the approaches that are most related to our work.

Methodology

In this section, we quantitatively model and analyze the memory and workload pruning targets to reveal the main reasons for the imbalanced pruning results shown in Table 1. Then, as illustrated by Fig. 1, we propose a genetic algorithm-based DNN exploration and pruning flow (referred to as GenExp) by mapping the network pruning procedure to a multi-objective optimization problem. An improved genetic algorithm is developed for this flow, serving as an optimizer that explores the DNN pruning structure space.

Evaluation and results

In this section, we demonstrate the effectiveness of the proposed approach through six groups of experiments. First, the VGG and ResNet models were pruned on the CIFAR-10 dataset with different combinations of model size and workload pruning targets, and the 3D Pareto frontier of the improved genetic algorithm for multi-objective optimization was examined on the ResNet20 model. Second, we pruned VGG16, ResNet50, and MobileNet V2 on the ILSVRC-2012 dataset and compared the results with state-of-the-art pruning schemes.
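
For the Pareto analysis mentioned above, here is a minimal sketch (our illustration, assuming we maximize all three objectives: model-size reduction, workload reduction, and accuracy) of the dominance test that extracts a 3D Pareto frontier:

    def dominates(a, b):
        # a dominates b if it is no worse in every objective and strictly better in one.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(points):
        # Keep the points that no other point dominates.
        return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

    # Hypothetical (size reduction, workload reduction, accuracy) tuples.
    points = [(0.80, 0.60, 0.761), (0.75, 0.70, 0.758), (0.70, 0.50, 0.760)]
    print(pareto_front(points))  # the third point is dominated by the first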

Conclusion and future work

In this study, we present a genetic algorithm-based unstructured pruning scheme for DNNs. By adopting a multi-objective optimization strategy, the proposed approach achieves considerable improvement in reducing the model's computational workload over state-of-the-art schemes. Our scheme is especially useful for dedicated DNN hardware accelerators.

For future work, at the algorithmic level, we will try more complex evolutionary algorithms, such as the artificial bee colony algorithm and ant colony optimization.

CRediT authorship contribution statement

Ke Xu: Conceptualization, Methodology, Software, Writing - original draft. Dezheng Zhang: Writing - review & editing. Jianjing An: Writing - review & editing. Li Liu: Software, Validation. Lingzhi Liu: Software, Validation. Dong Wang: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was funded in part by the National Key Research and Development Program of China under Grant 2019YFB2204200, the Beijing Natural Science Foundation under Grant 4202063, the Fundamental Research Funds for the Central Universities (Grant 2020JBM020), and the BJTU-Kwai Industry-University-Research Cooperation Project.

References (43)

  • X. Liu et al., Multiobjective ResNet pruning by means of EMOAs for remote sensing scene classification, Neurocomputing (2020)
  • E.K. Burke, D.B. Varley, A genetic algorithms tutorial tool for numerical function optimisation, in: ... (1997)
  • K. Deb et al., A combined genetic adaptive search (GeneAS) for engineering design, Computer Sci. Inform. (1996)
  • K. Deb et al., A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. (2002)
  • B.L. Deng et al., Model compression and hardware acceleration for neural networks: A comprehensive survey, Proc. IEEE (2020)
  • X. Ding et al., Global sparse momentum SGD for pruning very deep neural networks
  • J. Frankle, M. Carbin, The lottery ticket hypothesis: Finding sparse, trainable neural networks, in: 7th International...
  • R. Gong et al., Differentiable soft quantization: Bridging full-precision and low-bit neural networks, in: ...
  • K. Guo, S. Zeng, J. Yu, Y. Wang, H. Yang, [DL] A survey of FPGA-based neural network inference accelerators, ACM Trans. ...
  • Y. Guo, A. Yao, Y. Chen, Dynamic network surgery for efficient DNNs, in: Advances in Neural Information Processing...
  • P. Gysel et al., Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks, IEEE Trans. Neural Networks Learn. Syst. (2018)
  • S. Han et al., Invited: Bandwidth-efficient deep learning (2018)
  • S. Han, H. Mao, W.J. Dally, Deep compression: Compressing deep neural network with pruning, trained quantization and...
  • S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in: Advances in...
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on... (2016)
  • Y. He et al., AMC: AutoML for model compression and acceleration on mobile devices (2018)
  • M.G. Kendall, A new measure of rank correlation, Biometrika (1938)
  • J. Kim, S. Park, N. Kwak, Paraphrasing complex network: Network compression via factor transfer, in: Advances in... (2018)
  • A. Konak et al., Multi-objective optimization using genetic algorithms: A tutorial (2006)
  • N. Lee, T. Ajanthan, P. Torr, SNIP: Single-shot network pruning based on connection sensitivity, in: 7th... (2019)
  • C. Leng, Z. Dou, H. Li, S. Zhu, R. Jin, Extremely low bit neural network: Squeeze the last bit out with... (2018)

    Ke Xu received the B.S. degree in Hefei University of Technology, China, in 2016. Now, he is pursuing his Ph.D degree in Institute of Information Science, Beijing Jiaotong University, Beijing, China. His research interests include neural network compression, high performance computing architectures for embedded applications and computer vision.

    Dezheng Zhang received the master degree in Guangxi Normal University, China, in 2019. He is currently pursuing his Ph.D. degree in Institute of Information Science, Beijing Jiaotong University, Beijing, China. His research interests include neural network compression, high performance computing architectures for embedded applications, computer vision and information security.

    Jianjing An received the B.S. degree in Harbin University of Science And Technology, China, in 2015. He is currently pursuing his Ph.D. degree in Institute of Information Science, Beijing Jiaotong University, Beijing, China. His research interests include convolutional neural network compression, object detection, high performance computing architectures for embedded applications and computer vision.

    Li Liu graduated from Tsinghua University with bachelor degree major in EE and obtained his Ph.D. from the University of Missouri in 2009 major in CS. He joined Marvell semiconductor and then Realtek USA as a senior algorithm engineer focusing on video codec and image processing. Now he works at Heterogeneous Computing Group of Kuaishou Technology in Palo Alto, CA. as a heterogeneous platform architect.

    Lingzhi Liu is the Local Manager of US R&D Center, the Head and Chief Architect of Heterogeneous Computing Group of Kwai.Inc. in Palo Alto, CA. He received the B.S. degree from Xi’an Jiaotong University in 1998, and the Ph.D. degree from Shanghai Jiaotong University in 2004. He had careers in Alibaba-Inc., Realtek and Intel before joining Kwai.Inc. He was an adjunct Professor of Wuhan University, China and was a Postdoctoral Researcher in EE Dept. of University of Washington 2005 to 2008. His general interests include neural network algorithm and architecture, multimedia algorithm and implementation, VLSI system design. He is a BoD member of Chinese American Semiconductor Professional Association. He is a Senior Member of IEEE.

    Dong Wang received the Ph.D. and MS degrees in electronic engineering in 2010 and 2006 from Xi’an Jiaotong University, China. He has been a visiting scholar in the Department of Electrical and Computer Engineering of University of California, Davis during 2018–2019. He is currently working at Beijing Jiaotong University as a professor in the School of Computer and Information Technology. His research interests include reconfigurable computing, high performance computing architectures for embedded applications and computer vision.
