Structured pruning of recurrent neural networks through neuron selection

Neural Networks, Volume 123, March 2020, Pages 134-141

https://doi.org/10.1016/j.neunet.2019.11.018

Abstract

Recurrent neural networks (RNNs) have recently achieved remarkable successes in a number of applications. However, the huge sizes and computational burden of these models make it difficult to deploy them on edge devices. A practically effective approach is to reduce the overall storage and computation costs of RNNs with network pruning techniques. Despite their successful applications, pruning methods based on Lasso produce irregular sparse patterns in weight matrices, which does little to deliver practical speedup. To address this issue, we propose a structured pruning method based on neuron selection, which can remove independent neurons of RNNs. More specifically, we introduce two sets of binary random variables, which can be interpreted as gates or switches on the input neurons and the hidden neurons, respectively. We demonstrate that the corresponding optimization problem can be addressed by minimizing the L0 norm of the weight matrix. Finally, experimental results on language modeling and machine reading comprehension tasks indicate the advantages of the proposed method over state-of-the-art pruning competitors. In particular, nearly 20× practical speedup during inference is achieved without losing performance for the language model on the Penn TreeBank dataset, indicating the promising performance of the proposed method.

Introduction

Recurrent neural networks (RNNs) have recently achieved remarkable successes in multiple fields such as image captioning (Anderson et al., 2018, Vinyals et al., 2016), action recognition (Pan et al., 2019, Ye et al., 2018), music segmentation (Liu et al., 2018), question answering (Mao et al., 2018, Sagara and Hagiwara, 2014), machine translation (Bahdanau et al., 2014, Luong et al., 2015, Yang et al., 2018), and language modeling (Byeon et al., 2015, Liu et al., 2018, Sutskever et al., 2014). These successes heavily rely on huge models trained on large datasets, especially for RNN variants such as Long Short Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) networks (Cho et al., 2014). With the increasing popularity of edge computing, a recent trend is to deploy these models on end devices so as to allow off-line reasoning and inference. However, these models are generally of huge sizes and incur high computation and storage costs during inference, which makes deployment difficult on devices with limited resources. To reduce the overall computation and storage costs of these models, model compression for recurrent neural networks has attracted considerable attention.

Network pruning is one of the prominent approaches to the compression of RNNs. Narang, Elsen, Diamos and Sengupta (2017a) present a connection pruning method to compress RNNs efficiently. However, the weight matrix obtained via connection pruning has random, unstructured sparsity. Such unstructured sparse formats are unfriendly to efficient computation on modern hardware systems (Lebedev and Lempitsky, 2016, Zhao et al., 2017) due to irregular memory access in modern processors. Previous studies (Wen et al., 2017, Wen et al., 2016) have shown that the speedup obtained with random sparse matrix multiplication on various hardware platforms is lower than expected. For example, varying the sparsity level in weight matrices of AlexNet in the range of 67.6%, 92.4%, 94.3%, 96.6%, and 97.2%, the speedup ratio was 0.25×, 0.52×, 1.36×, 1.04×, and 1.38×, respectively. A practical remedy to this problem is structured pruning, in which pruning individual neurons directly trims the weight matrix size, so that structured sparse matrix multiplication can efficiently utilize hardware resources.
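As a minimal illustration of why structured sparsity translates into practical speedup (the shapes and threshold below are hypothetical, not taken from the paper), removing a neuron deletes an entire row or column, so the surviving weights remain a smaller dense matrix:

```python
import numpy as np

# Hypothetical dense layer: 4 input neurons -> 3 output neurons.
W = np.random.randn(3, 4)
x = np.random.randn(4)

# Unstructured pruning: scattered zeros; the matrix shape is unchanged,
# so a standard dense matmul still touches every entry.
W_unstructured = W * (np.abs(W) > 0.5)

# Structured pruning: drop input neuron 1 and output neuron 2 entirely.
keep_in = np.array([0, 2, 3])   # surviving input neurons
keep_out = np.array([0, 1])     # surviving output neurons
W_small = W[np.ix_(keep_out, keep_in)]  # dense 2x3 matrix
x_small = x[keep_in]

y_small = W_small @ x_small  # smaller dense matmul -> real speedup
```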

Due to these promising properties, structured pruning of deep neural networks (DNNs) has been widely explored (Ding et al., 2019, He et al., 2019, He et al., 2017, Zhuang et al., 2018). However, compared with structured pruning of DNNs, there is a vital challenge originating from the recurrent structure of RNNs, which is shared across all time steps in a sequence. Structured pruning methods used for DNNs cannot be directly applied to RNNs, because independently removing links can result in a mismatch of feature dimensions and thus induce invalid recurrent units. In contrast, this problem does not exist in DNNs, where neurons can be removed independently without violating the usability of the final network structure. Accordingly, group sparsity (Louizos, Welling, & Kingma, 2017) is difficult to apply to RNNs.

To address this issue, we explore a new type of method along the line of structured pruning of RNNs through neuron selection. In detail, we introduce two sets of binary random variables, which can be interpreted as gates on the neurons, to indicate the presence of the input neurons and the hidden neurons, respectively. The two sets of binary random variables are then used to generate sparse masks for the weight matrix. More specifically, the presence of the matrix entry w_ij depends on the presence of both the ith input unit and the jth output unit, while the value of w_ij indicates the strength of the connection if w_ij ≠ 0. However, the optimization of these variables is computationally intractable, because a binary gate variable vector h has 2^{|h|} possible states. We therefore develop an efficient L0 inference algorithm for inferring the binary gate variables, motivated by work on pruning DNN weights (Louizos et al., 2017, Srinivas et al., 2017).
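The sketch below (our own paraphrase, with hypothetical gate values and shapes) shows how the two sets of binary gate variables induce a structured mask on a weight matrix: entry w_ij survives only if both the ith input gate and the jth output gate are on.

```python
import numpy as np

n_in, n_out = 5, 4
W = np.random.randn(n_out, n_in)

# Hypothetical binary gates: z_in[i] switches input neuron i,
# z_out[j] switches hidden/output neuron j.
z_in = np.array([1, 0, 1, 1, 0])
z_out = np.array([1, 1, 0, 1])

# Entry (j, i) is kept only if both gates are on: mask = z_out z_in^T.
mask = np.outer(z_out, z_in)
W_pruned = W * mask

# Rows/columns whose gate is 0 are entirely zero, so the corresponding
# neurons can be physically removed, shrinking the matrix.
print(W_pruned)
```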

While previous efforts on structured pruning of RNNs resort to the group lasso (i.e., L2,1 norm regularization) to learn sparsity (Wen et al., 2017), lasso-based methods have been shown to be insufficient for inducing sparsity in large-scale non-convex problems such as the training of DNNs (Collins and Kohli, 2014, Srinivas et al., 2017). In contrast, expected L0 minimization closely resembles the spike-and-slab priors (Mitchell and Beauchamp, 1988, Xu et al., 2016, Zhe et al., 2015) used in Bayesian variable selection (Louizos et al., 2017, Srinivas et al., 2017). Thanks to their richer parameterization, spike-and-slab priors can induce high sparsity and encourage large values at the same time, whereas the lasso shrinks all parameters until many of them are close to zero. The L0 norm regularization, by comparison, explicitly penalizes parameters for being different from zero and imposes no other restrictions. Hence, compared with the Intrinsic Sparse Structures (ISS) via lasso proposed by Wen et al. (2017), our neuron selection via L0 norm regularization can achieve higher sparsity in RNNs.
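For concreteness, a standard way to write the two penalties being contrasted (notation ours, not necessarily the paper's) is:

```latex
% Group lasso (L_{2,1}) over groups w_g of W versus the L_0 "norm":
\Omega_{\mathrm{lasso}}(W) = \lambda \sum_{g} \lVert w_g \rVert_2 ,
\qquad
\Omega_{L_0}(W) = \lambda \lVert W \rVert_0 = \lambda \sum_{i,j} \mathbb{1}\left[\, w_{ij} \neq 0 \,\right].
```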

In this paper, we propose a new type of method to prune individual neurons of RNNs. Our key contribution is the introduction of binary gates on recurrent and input units, from which sparse masks for the weight matrix can be generated, allowing for effective neuron selection under a sparsity constraint. As the first work on neuron selection in RNNs, we employ the smoothed mechanism for the L0 regularized objective proposed in Louizos et al. (2017), motivated by Srinivas et al. (2017).
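A minimal sketch of the smoothed L0 mechanism of Louizos et al. (2017) as we understand it follows; the constants gamma, zeta, beta are the commonly used defaults and the variable names are ours, not taken from this paper. Each gate is a hard-concrete sample, and the expected L0 penalty is the probability that the gate is non-zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hard-concrete relaxation of a binary gate (Louizos et al., 2017).
# log_alpha is the learnable location parameter; gamma < 0 < 1 < zeta
# stretch the interval so the clipped gate can hit exactly 0 or 1.
gamma, zeta, beta = -0.1, 1.1, 2.0 / 3.0

def sample_gate(log_alpha, rng):
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma
    return np.clip(s_bar, 0.0, 1.0)  # values in [0, 1], many exactly 0 or 1

def expected_l0(log_alpha):
    # Probability that each gate is non-zero; the sum is the expected L0 penalty.
    return sigmoid(log_alpha - beta * np.log(-gamma / zeta)).sum()

rng = np.random.default_rng(0)
log_alpha = rng.normal(size=8)   # one gate per neuron (hypothetical)
z = sample_gate(log_alpha, rng)  # differentiable surrogate for binary gates
penalty = expected_l0(log_alpha)
```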

We evaluate our structured pruning method on two tasks, i.e., language modeling and machine reading comprehension. For example, for word-level language modeling on the Penn Treebank dataset, our method achieves state-of-the-art results: the model size is reduced by more than 10 times, and inference with the resulting sparse model is nearly 20 times faster than with the original model. We also achieve encouraging results with recurrent highway networks (Zilly, Srivastava, Koutník, & Schmidhuber, 2017) on language modeling and with the BiDAF model (Seo, Kembhavi, Farhadi, & Hajishirzi, 2016) on machine reading comprehension.

Section snippets

Related work

Although model compression has achieved impressive success on DNNs (e.g., CNNs) (Ayinde et al., 2019, Han et al., 2016, Mohammed and Lim, 2017), it is difficult to directly apply these techniques to the compression of RNNs because of their recurrent structure. There are some recent efforts on the compression of RNNs. Generally, compression techniques for RNNs can be categorized into the following types: pruning (Narang, Elsen, Diamos and Sengupta, 2017a, Narang, Elsen and

Structured pruning of LSTMs through neuron selection

Without loss of generality, we focus on the compression of LSTMs (Hochreiter & Schmidhuber, 1997), a common variant of RNN that learns long-term dependencies. Note that our method can be readily applied to the compression of GRUs and vanilla RNNs. Before presenting the proposed sparsification methods, we first introduce the LSTM network:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i),
f_t = σ(W_f x_t + U_f h_{t-1} + b_f),
o_t = σ(W_o x_t + U_o h_{t-1} + b_o),
u_t = tanh(W_u x_t + U_u h_{t-1} + b_u),
c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1},
h_t = o_t ⊙ tanh(c_t),

where σ(·) is the sigmoid function,
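To make the neuron-selection idea concrete for the LSTM above, here is a short sketch (our own, with hypothetical shapes and gate values, not the authors' code) in which one mask on the input neurons and one shared mask on the hidden neurons zero out whole columns of every W and U, consistently across all four LSTM gates and all time steps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_x, n_h = 6, 5
rng = np.random.default_rng(1)

# One weight/bias triple per LSTM gate (i, f, o, u).
W = {g: rng.normal(size=(n_h, n_x)) for g in "ifou"}
U = {g: rng.normal(size=(n_h, n_h)) for g in "ifou"}
b = {g: np.zeros(n_h) for g in "ifou"}

# Hypothetical binary neuron gates: z_x selects input neurons,
# z_h selects hidden neurons; z_h must be shared by all four LSTM gates
# and all time steps so the recurrent dimensions stay consistent.
z_x = np.array([1, 1, 0, 1, 0, 1])
z_h = np.array([1, 0, 1, 1, 0])

def masked_lstm_step(x_t, h_prev, c_prev):
    x_t, h_prev = x_t * z_x, h_prev * z_h
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])
    u = np.tanh(W["u"] @ x_t + U["u"] @ h_prev + b["u"])
    c = i * u + f * c_prev
    h = o * np.tanh(c)
    return h * z_h, c * z_h  # masked hidden/cell states for the next step

h, c = np.zeros(n_h), np.zeros(n_h)
for t in range(3):
    h, c = masked_lstm_step(rng.normal(size=n_x), h, c)
```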

Experiments

To compare with Intrinsic Sparse Structures (ISS) via Lasso proposed by Wen et al. (2017), we also evaluate our structured sparsity learning method with L0 regularization on language modeling and machine reading tasks. In the case of language modeling, we seek to sparsify a stacked LSTM model (Zaremba, Sutskever, & Vinyals, 2014) and the state-of-the-art Recurrent Highway Networks (RHNs) (Zilly et al., 2017) on the Penn Treebank (PTB) dataset (Marcus, Marcinkiewicz, & Santorini, 1993). For the

Conclusion

In this paper, we propose a novel structured sparsity learning method for recurrent neural networks. By introducing binary gates on neurons, we penalize weight matrices through L0 regularization, significantly reduce the number of network parameters, and achieve practical speedup during inference. We also demonstrate the superiority of our relaxed L0 regularization over the group lasso used in previous methods. Our method can be readily used in other recurrent structures such as Gated

Acknowledgments

This work was partially supported by NSF China (No. 61572111), a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003), and Startup Funding (No. G05QNQR004).

References (50)

  • Collins, M. D., et al. (2014). Memory bounded deep convolutional networks. CoRR.
  • Ding, X., Ding, G., Guo, Y., & Han, J. (2019). Centripetal SGD for pruning very deep convolutional networks with...
  • Han, S., et al. (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR.
  • He, Y., Liu, P., Wang, Z., Hu, Z., & Yang, Y. (2019). Filter pruning via geometric median for deep convolutional neural...
  • He, Y., et al. Channel pruning for accelerating very deep neural networks.
  • Hochreiter, S., et al. (1997). Long short-term memory. Neural Computation.
  • Hubara, I., et al. (2016). Quantized neural networks: Training neural networks with low precision weights and activations. CoRR.
  • Jang, E., Gu, S., & Poole, B. (2017). Categorical reparameterization with Gumbel-softmax. In 5th international...
  • Kingma, D. P., et al. (2013). Auto-encoding variational Bayes.
  • Lebedev, V., & Lempitsky, V. (2016). Fast convnets using group-wise brain damage. In Proceedings of the IEEE conference...
  • Liu, H., He, L., Bai, H., Dai, B., Bai, K., & Xu, Z. (2018). Structured inference for recurrent hidden semi-Markov...
  • Louizos, C., et al. (2017). Learning sparse neural networks through L0 regularization. CoRR.
  • Luong, T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In...
  • Maddison, C. J., et al. (2016). The concrete distribution: A continuous relaxation of discrete random variables.
  • Marcus, M. P., et al. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics.