Fully integer-based quantization for mobile convolutional neural network inference
Neurocomputing (IF 6) Pub Date: 2020-12-23, DOI: 10.1016/j.neucom.2020.12.035
Peng Peng, Mingyu You, Weisheng Xu, Jiaxin Li

Deploying deep convolutional neural networks on mobile devices is challenging because of the conflict between their heavy computational overhead and the hardware’s restricted computing capacity. Network quantization is typically used to alleviate this problem. However, we found that a “datatype mismatch” issue in existing low-bitwidth quantization approaches can generate severe instruction redundancy, dramatically reducing their running efficiency on mobile devices. We therefore propose a novel quantization approach which ensures that only integer-based arithmetic is needed during the inference stage of the quantized model. To this end, we improved the quantization function to compel the quantized values to follow a standard integer format. We then proposed a logarithm-like method to simultaneously quantize the batch normalization parameters. By doing so, the quantized model keeps the advantage of low-bitwidth representation while preventing the “datatype mismatch” issue and the corresponding instruction redundancy. Comprehensive experiments show that our method can achieve prediction accuracy comparable to other state-of-the-art methods while reducing run-time latency by a large margin. Our fully integer-based quantized ResNet-18 has 4-bit weights and 4-bit activations, with only a 0.7% top-1 and 0.4% top-5 accuracy drop on the ImageNet dataset. The assembly language implementation of a series of building blocks can reach a maximum of 4.33× the speed of the original full-precision version on an ARMv8 CPU.
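The core idea — low-bitwidth integer weights and activations, with batch-norm scales quantized in a logarithm-like way so rescaling never reintroduces floating point — can be illustrated with a minimal sketch. This is a hypothetical reconstruction from the abstract, not the paper’s actual scheme: the function names, the uniform 4-bit quantizer, and the power-of-two approximation of the batch-norm scale are all assumptions for illustration.

```python
import numpy as np

def quantize_uint(x, num_bits=4):
    """Uniform quantization of a non-negative float tensor to unsigned
    num_bits integers (e.g. post-ReLU activations). Returns the integer
    tensor and the float scale used to map back."""
    qmax = 2 ** num_bits - 1
    scale = x.max() / qmax if x.max() > 0 else 1.0
    q = np.clip(np.round(x / scale), 0, qmax).astype(np.int32)
    return q, scale

def quantize_bn_scale_pow2(scale):
    """'Logarithm-like' quantization of a positive batch-norm rescale
    factor: snap it to the nearest power of two, so the float multiply
    becomes an integer bit shift at inference time."""
    return int(np.round(np.log2(scale)))

def int_only_layer(act_q, w_q, bn_shift):
    """Integer-only layer: int32 accumulation (a matmul standing in for
    a convolution) followed by the batch-norm rescale as a bit shift.
    No floating-point datatype appears, so there is no datatype
    mismatch between the accumulator and the rescaling step."""
    acc = act_q @ w_q                       # pure int32 arithmetic
    if bn_shift >= 0:
        return acc << bn_shift
    return acc >> (-bn_shift)
```

A float rescale factor of, say, 0.25 becomes `bn_shift = -2`, i.e. an arithmetic right shift by two — the kind of operation that maps directly onto integer instructions on an ARMv8 CPU, which is where the abstract’s reported speedup comes from.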




Updated: 2021-01-12