Structured pruning of recurrent neural networks through neuron selection
Introduction
Recurrent neural networks (RNNs) have recently achieved remarkable successes in multiple fields such as image captioning (Anderson et al., 2018, Vinyals et al., 2016), action recognition (Pan et al., 2019, Ye et al., 2018), music segmentation (Liu et al., 2018), question answering (Mao et al., 2018, Sagara and Hagiwara, 2014), machine translation (Bahdanau et al., 2014, Luong et al., 2015, Yang et al., 2018), and language modeling (Byeon et al., 2015, Liu et al., 2018, Sutskever et al., 2014). These successes rely heavily on huge models trained on large datasets, especially for RNN variants such as Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) networks (Cho et al., 2014). With the increasing popularity of edge computing, a recent trend is to deploy these models onto end devices so as to allow off-line reasoning and inference. However, these models are generally huge and incur expensive computation and storage costs during inference, which makes deployment difficult on devices with limited resources. To reduce the overall computation and storage costs of these models, model compression of recurrent neural networks has attracted wide attention.
Network pruning is one of the prominent approaches to compressing RNNs. Narang, Elsen, Diamos and Sengupta (2017a) present a connection pruning method to compress RNNs efficiently. However, the weight matrix obtained via connection pruning has random, unstructured sparsity. Such unstructured sparse formats are unfriendly to efficient computation on modern hardware systems (Lebedev and Lempitsky, 2016, Zhao et al., 2017) because they cause irregular memory access in modern processors. Previous studies (Wen et al., 2017, Wen et al., 2016) have shown that the speedup obtained with random sparse matrix multiplication on various hardware platforms is lower than expected. For example, varying the sparsity level in the weight matrices of AlexNet over 67.6%, 92.4%, 94.3%, 96.6%, and 97.2%, the speedup ratios were only 0.25, 0.52, 1.36, 1.04, and 1.38, respectively. A practical remedy to this problem is structured pruning, where pruning individual neurons directly trims the weight matrix size, so that structured sparse matrix multiplication utilizes hardware resources efficiently.
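The difference between the two sparsity formats can be sketched in a few lines of numpy. This is an illustration with hypothetical sizes, not the paper's implementation: unstructured pruning zeroes scattered entries but keeps the matrix shape, while structured pruning removes whole neurons (rows) and yields a smaller dense matrix that ordinary dense kernels handle efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 6  # hypothetical layer sizes for illustration
W = rng.normal(size=(n_out, n_in))

# Unstructured pruning: zero individual weights; the matrix keeps its
# shape, so a dense matmul does the same amount of work as before.
mask = rng.random((n_out, n_in)) > 0.5
W_unstructured = W * mask            # still (6, 8), irregular sparsity

# Structured pruning: drop whole output neurons (rows); the matrix
# itself shrinks, so a plain dense matmul is proportionally cheaper.
kept_neurons = [0, 2, 3]             # hypothetical surviving neurons
W_structured = W[kept_neurons, :]    # now (3, 8), still dense

x = rng.normal(size=n_in)
y = W_structured @ x                 # smaller dense product
assert W_unstructured.shape == (6, 8)
assert W_structured.shape == (3, 8)
assert y.shape == (3,)
```

No sparse-format library is needed for the structured case, which is exactly why it translates into real speedup on commodity hardware.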
Due to the promising properties of structured pruning, structured pruning of deep neural networks (DNNs) has been widely explored (Ding et al., 2019, He et al., 2019, He et al., 2017, Zhuang et al., 2018). However, compared with structured pruning of DNNs, there is a vital challenge originating from the recurrent structure of RNNs, which is shared across all time steps in a sequence. Structured pruning methods used in DNNs cannot be directly applied to RNNs: independently removing links can result in a mismatch of feature dimensions and thus induce invalid recurrent units. In contrast, this problem does not exist in DNNs, where neurons can be removed independently without violating the usability of the final network structure. Accordingly, group sparsity (Louizos, Welling, & Kingma, 2017) is difficult to apply to RNNs.
To address this issue, we explore a new type of method along the line of structured pruning of RNNs through neuron selection. In detail, we introduce two sets of binary random variables, which can be interpreted as gates on the neurons, to indicate the presence of the input neurons and the hidden neurons, respectively. The two sets of binary random variables are then used to generate sparse masks for the weight matrix. More specifically, the presence of the matrix entry w_ij depends on the presence of both the i-th input unit and the j-th output unit, while the value of w_ij indicates the strength of the connection when both units are present. However, the optimization of these variables is computationally intractable due to the 2^|z| possible states of the binary gate vector z. We therefore develop an efficient inference algorithm for inferring the binary gate variables, motivated by work on pruning DNN weights (Louizos et al., 2017, Srinivas et al., 2017).
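The mask construction described above can be sketched as an outer product of the two gate vectors. This is a minimal numpy illustration with made-up sizes, not the paper's code: a weight survives only if both its input-side gate and its output-side gate are on, so closing one gate zeroes an entire row or column, which is precisely what makes the sparsity structured.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 5, 4  # hypothetical sizes

# One binary gate per input neuron and per hidden (output) neuron.
z_in = rng.integers(0, 2, size=n_in)    # gates on input units
z_hid = rng.integers(0, 2, size=n_hid)  # gates on hidden units

# Mask entry M[j, i] = z_hid[j] * z_in[i]: the weight w[j, i] survives
# only if BOTH the i-th input unit and the j-th output unit are kept.
M = np.outer(z_hid, z_in)

W = rng.normal(size=(n_hid, n_in))
W_sparse = W * M

# Rows/columns whose gate is 0 are entirely zero, so the corresponding
# neurons can be physically removed, shrinking the weight matrix.
assert M.shape == (n_hid, n_in)
assert np.all(W_sparse[z_hid == 0, :] == 0)
assert np.all(W_sparse[:, z_in == 0] == 0)
```

Because whole rows and columns vanish together, the gated matrix can be compacted into a smaller dense matrix after training, rather than stored in an unstructured sparse format.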
While previous efforts on structured pruning of RNNs resort to the group lasso (i.e., ℓ1 norm regularization over groups) for learning sparsity (Wen et al., 2017), lasso-based methods have been shown to be insufficient for inducing sparsity in large-scale non-convex problems such as the training of DNNs (Collins and Kohli, 2014, Srinivas et al., 2017). In contrast, expected ℓ0 minimization closely resembles the spike-and-slab priors (Mitchell and Beauchamp, 1988, Xu et al., 2016, Zhe et al., 2015) used in Bayesian variable selection (Louizos et al., 2017, Srinivas et al., 2017). Owing to their richer parameterization, spike-and-slab priors can induce high sparsity and encourage large weight values at the same time, whereas the lasso shrinks all parameters until many of them are close to zero. Moreover, ℓ0-norm regularization explicitly penalizes parameters for being different from zero, with no other restrictions. Hence, compared with the Intrinsic Sparse Structures (ISS) via lasso proposed by Wen et al. (2017), our neuron selection via ℓ0 norm regularization can achieve higher sparsity in RNNs.
In this paper, we propose a new type of method to prune individual neurons of RNNs. Our key contribution is the introduction of binary gates on recurrent and input units, from which sparse masks for the weight matrices are generated, allowing for effective neuron selection under a sparsity constraint. As the first work on neuron selection in RNNs, we employ the smoothed mechanism for the ℓ0-regularized objective proposed in Louizos et al. (2017), motivated by Srinivas et al. (2017).
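The smoothed ℓ0 mechanism of Louizos et al. (2017) replaces each hard binary gate with a "hard concrete" random variable: a stretched, clamped sigmoid of logistic noise that is differentiable in its parameter yet attains exactly 0 or 1 with non-zero probability. The sketch below follows the published formulation with its standard constants (β = 2/3, γ = −0.1, ζ = 1.1); it is an illustrative numpy rendering, not the authors' code.

```python
import numpy as np

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1, rng=None):
    """Sample a relaxed binary gate z in [0, 1] (Louizos et al., 2017).

    A stretched, clamped sigmoid of logistic noise: differentiable with
    respect to log_alpha, yet with non-zero mass at exactly 0 and 1.
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma   # stretch beyond [0, 1]
    return np.clip(s_bar, 0.0, 1.0)      # clamp back -> exact 0s and 1s

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Closed-form probability that a gate is non-zero (the L0 penalty)."""
    return 1 / (1 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))

# One learnable parameter per neuron; very negative log_alpha pushes the
# gate toward 0 (neuron pruned), positive log_alpha toward 1 (kept).
log_alpha = np.array([-4.0, 0.0, 4.0])
z = hard_concrete_gate(log_alpha, rng=np.random.default_rng(0))
assert np.all(z >= 0) and np.all(z <= 1)
assert expected_l0(4.0) > expected_l0(-4.0)
```

At training time the expected ℓ0 penalty is added to the task loss, so gradient descent can jointly fit the weights and decide which neurons to switch off.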
We evaluate our structured pruning method on two tasks, i.e., language modeling and machine reading comprehension. For example, in the case of word-level language modeling on the Penn Treebank dataset, our method achieves state-of-the-art results: the model size is reduced by more than 10 times, and inference with the resulting sparse model is nearly 20 times faster than with the original model. We also achieve encouraging results for recurrent highway networks (Zilly, Srivastava, Koutník, & Schmidhuber, 2017) on language modeling and the BiDAF model (Seo, Kembhavi, Farhadi, & Hajishirzi, 2016) on machine reading comprehension.
Section snippets
Related work
Although model compression has achieved impressive success in DNNs (e.g., CNNs) (Ayinde et al., 2019, Han et al., 2016, Mohammed and Lim, 2017), it is difficult to directly apply these techniques to the compression of RNNs due to the recurrent structure of RNNs. There are some recent efforts on the compression of RNNs. Generally, the compression techniques for RNNs can be categorized into the following types: pruning (Narang, Elsen, Diamos and Sengupta, 2017a, Narang, Elsen and
Structured pruning of LSTMs through neuron selection
Without loss of generality, we focus on the compression of LSTMs (Hochreiter & Schmidhuber, 1997), a common variant of RNN that learns long-term dependencies. Note that our method can be readily applied to the compression of GRUs and vanilla RNNs. Before presenting the proposed sparsification methods, we first introduce the LSTM network:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
g_t = tanh(W_g x_t + U_g h_{t−1} + b_g)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid function and ⊙ denotes element-wise multiplication.
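Because h_t feeds back into the next time step, pruning an LSTM hidden unit must remove it consistently everywhere it appears: from the output rows of all four gates and from the input columns of the recurrent matrix. The sketch below illustrates this constraint on a stacked weight layout (four gates concatenated along the row axis, a common convention); the sizes and helper are hypothetical, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 8  # hypothetical sizes

# The four gates (i, f, g, o) all read x_t and h_{t-1}, so the stacked
# input weights W have shape (4*n_hid, n_in) and the recurrent weights
# U have shape (4*n_hid, n_hid).
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = rng.normal(size=4 * n_hid)

def prune_hidden_units(W, U, b, keep):
    """Remove hidden units not in `keep`, consistently across time steps.

    The SAME units must be dropped from the output rows of all four
    gates AND from the input columns of U; pruning rows and columns
    independently would mismatch the recurrent feature dimensions.
    """
    n_hid = U.shape[1]
    rows = np.concatenate([g * n_hid + keep for g in range(4)])
    return W[rows, :], U[np.ix_(rows, keep)], b[rows]

keep = np.array([0, 1, 4, 6])  # hypothetical surviving hidden units
W2, U2, b2 = prune_hidden_units(W, U, b, keep)
assert W2.shape == (16, 10) and U2.shape == (16, 4) and b2.shape == (16,)
```

This is exactly why a single gate vector over hidden units, shared by the row and column selections, suffices to keep the pruned LSTM well-formed at every time step.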
Experiments
To compare with the Intrinsic Sparse Structures (ISS) via lasso proposed by Wen et al. (2017), we also evaluate our structured sparsity learning method with ℓ0 regularization on language modeling and machine reading comprehension tasks. In the case of language modeling, we seek to sparsify a stacked LSTM model (Zaremba, Sutskever, & Vinyals, 2014) and the state-of-the-art Recurrent Highway Networks (RHNs) (Zilly et al., 2017) on the Penn Treebank (PTB) dataset (Marcus, Marcinkiewicz, & Santorini, 1993). For the
Conclusion
In this paper, we propose a novel structured sparsity learning method for recurrent neural networks. By introducing binary gates on neurons, we penalize weight matrices through ℓ0 regularization, reduce the number of network parameters significantly, and obtain practical speedup during inference. We also demonstrate the superiority of our relaxed ℓ0 regularization over the group lasso used in previous methods. Our methods can be readily used in other recurrent structures such as Gated Recurrent Units.
Acknowledgments
This work was partially supported by NSF China (No. 61572111), a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003), and Startup Funding (No. G05QNQR004).
References (50)
- Ayinde, et al. (2019). Redundant feature pruning for accelerated inference in deep neural networks. Neural Networks.
- Liu, et al. (2018). Query completion in community-based question answering search. Neurocomputing.
- Mohammed, & Lim (2017). A new hyperbox selection rule and a pruning strategy for the enhanced fuzzy min–max neural network. Neural Networks.
- Sagara, & Hagiwara (2014). Natural language neural network and its application to question-answering system. Neurocomputing.
- Yang, et al. (2018). Generative adversarial training for neural machine translation. Neurocomputing.
- Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., & Gould, S., et al. (2018). Bottom-up and top-down attention...
- Bahdanau, et al. (2014). Neural machine translation by jointly learning to align and translate.
- Bengio, et al. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation.
- Byeon, W., Breuel, T. M., Raue, F., & Liwicki, M. (2015). Scene labeling with LSTM recurrent neural networks. In CVPR...
- Cho, et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation.
- Collins, & Kohli (2014). Memory bounded deep convolutional networks. CoRR.
- Han, et al. (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR.
- He, et al. (2017). Channel pruning for accelerating very deep neural networks.
- Hochreiter, & Schmidhuber (1997). Long short-term memory. Neural Computation.
- Hubara, et al. Quantized neural networks: Training neural networks with low precision weights and activations. CoRR.
- Kingma, & Welling. Auto-encoding variational Bayes.
- Louizos, et al. (2017). Learning sparse neural networks through ℓ0 regularization. CoRR.
- Maddison, et al. The concrete distribution: A continuous relaxation of discrete random variables.
- Marcus, et al. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics.
2021, Information SciencesCitation Excerpt :The pruning of a neural network reduces redundant neurons or connection weights to deal with the overfitting problem caused by excessive neurons and synapses [6]. Several pruning strategies have been reported, such as amplitude-based weight pruning [37], cluster loss function-based pruning [48], and discrimination-aware channel pruning [49]. A heuristic neural network optimizes the parameters of the network by applying heuristic models to avoid falling into local optimum, such as genetic algorithm [10], electromagnetic algorithm [32], and adaptive annealing learning algorithm [15].