Activation Density based Mixed-Precision Quantization for Energy Efficient Neural Networks
arXiv - CS - Neural and Evolutionary Computing. Pub Date: 2021-01-12, DOI: arxiv-2101.04354
Karina Vasquez, Yeshwanth Venkatesha, Abhiroop Bhattacharjee, Abhishek Moitra, Priyadarshini Panda

As neural networks gain widespread adoption in embedded devices, there is a need for model compression techniques to facilitate deployment in resource-constrained environments. Quantization is one of the go-to methods, yielding state-of-the-art model compression. Most approaches take a fully trained model, apply different heuristics to determine the optimal bit-precision for each layer of the network, and then retrain the network to regain any drop in accuracy. Based on Activation Density (AD), the proportion of non-zero activations in a layer, we propose an in-training quantization method. Our method calculates the bit-width for each layer during training, yielding a mixed-precision model with competitive accuracy. Since we train lower-precision models throughout training, our approach yields the final quantized model at lower training complexity and also eliminates the need for retraining. We run experiments on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImageNet with VGG19/ResNet18 architectures and report accuracy and energy estimates. We achieve a ~4.5x benefit in terms of estimated multiply-and-accumulate (MAC) reduction while reducing the training complexity by 50% in our experiments. To further evaluate the energy benefits of our proposed method, we develop a mixed-precision, scalable Processing-In-Memory (PIM) hardware accelerator platform. The hardware platform incorporates shift-add functionality for handling multi-bit-precision neural network models. Evaluating the quantized models obtained with our proposed method on the PIM platform yields a ~5x energy reduction compared to 16-bit models. Additionally, we find that integrating AD-based quantization with AD-based pruning (both conducted during training) yields up to ~198x and ~44x energy reductions for the VGG19 and ResNet18 architectures, respectively, on the PIM platform compared to baseline 16-bit-precision, unpruned models.
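The abstract defines the key quantity, AD, as the proportion of non-zero activations in a layer, and states that a per-layer bit-width is derived from it during training. A minimal PyTorch sketch of that measurement step follows; the exact AD-to-bit-width mapping is not given in the abstract, so the linear rule below, and the helper names activation_density and bitwidth_from_density, are illustrative assumptions only.

import torch
import torch.nn as nn

def activation_density(act: torch.Tensor) -> float:
    # AD as defined in the abstract: proportion of non-zero activations.
    return (act != 0).float().mean().item()

def bitwidth_from_density(ad: float, max_bits: int = 16, min_bits: int = 2) -> int:
    # Assumed linear mapping from AD to bit-width; the paper's actual
    # heuristic is not specified in the abstract.
    return max(min_bits, min(max_bits, round(ad * max_bits)))

# Record AD for each ReLU output with forward hooks during a forward pass.
densities = {}

def make_hook(name):
    def hook(module, inputs, output):
        densities[name] = activation_density(output)
    return hook

model = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(),
    nn.Conv2d(16, 32, 3), nn.ReLU(),
)
for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(1, 3, 32, 32))  # stand-in for a training batch
bit_widths = {name: bitwidth_from_density(ad) for name, ad in densities.items()}
print(bit_widths)  # e.g. {'1': 8, '3': 7} -> per-layer mixed precision

In the paper's in-training scheme, such per-layer widths would then drive the quantizer for subsequent training steps, which is what removes the separate retraining pass.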

Updated: 2021-01-13