VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference
arXiv - CS - Hardware Architecture. Pub Date: 2021-02-08, DOI: arxiv-2102.04503
Steve Dai, Rangharajan Venkatesan, Haoxing Ren, Brian Zimmer, William J. Dally, Brucek Khailany

Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited. To reduce quantization-related accuracy loss, we propose using a separate scale factor for each small vector ($\approx$16-64 elements) within a single dimension of a tensor. To achieve an efficient hardware implementation, the per-vector scale factors can be implemented with low-bitwidth integers when calibrated using a two-level quantization scheme. We find that per-vector scaling consistently achieves better inference accuracy at low precision compared to conventional scaling techniques for popular neural networks, without requiring retraining. We also modify a deep learning accelerator hardware design to study the area and energy overheads of per-vector scaling support. Our evaluation demonstrates that per-vector scaled quantization with 4-bit weights and activations achieves 37% area savings and 24% energy savings while maintaining over 75% accuracy for ResNet50 on ImageNet. With 4-bit weights and 8-bit activations, it achieves near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to an 8-bit baseline.
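To make the two-level scheme described in the abstract concrete, below is a minimal NumPy sketch of per-vector scaled quantization. It is illustrative only and not the paper's exact algorithm or calibration procedure; the function and parameter names (`vs_quant`, `vector_size`, `wt_bits`, `scale_bits`) are our own. Each vector of ~16-64 elements gets its own scale factor, and those per-vector scales are themselves quantized with a single coarser per-tensor scale so they can be stored as low-bitwidth integers.

```python
import numpy as np

def vs_quant(weights, vector_size=16, wt_bits=4, scale_bits=4):
    """Sketch of per-vector scaled (two-level) quantization.

    Splits each row of a 2-D weight matrix into vectors of `vector_size`
    elements, computes a scale per vector (level 1), then quantizes the
    per-vector scales with one per-tensor scale (level 2) so the vector
    scales fit in `scale_bits`-bit integers.
    """
    qmax = 2 ** (wt_bits - 1) - 1      # e.g. 7 for 4-bit signed weights
    smax = 2 ** scale_bits - 1         # e.g. 15 for 4-bit unsigned scales

    rows, cols = weights.shape
    assert cols % vector_size == 0, "pad columns to a multiple of vector_size"
    vecs = weights.reshape(rows, cols // vector_size, vector_size)

    # Level 1: one floating-point scale per small vector.
    fp_scales = np.abs(vecs).max(axis=-1, keepdims=True) / qmax

    # Level 2: quantize the per-vector scales with a single per-tensor
    # floating-point scale, yielding low-bitwidth integer vector scales.
    tensor_scale = fp_scales.max() / smax
    int_scales = np.clip(np.round(fp_scales / tensor_scale), 1, smax)

    # Quantize weights against the reconstructed effective scale.
    eff_scales = int_scales * tensor_scale
    q = np.clip(np.round(vecs / eff_scales), -qmax - 1, qmax)

    # Dequantized copy, useful for checking reconstruction error.
    deq = (q * eff_scales).reshape(rows, cols)
    return q.astype(np.int8), int_scales.astype(np.int8), tensor_scale, deq

# Example: quantize a random 64x64 weight matrix and inspect the error.
w = np.random.randn(64, 64).astype(np.float32)
q, s_int, s_tensor, w_hat = vs_quant(w)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Compared with a single per-tensor scale, the finer per-vector scales track local dynamic range, which is what allows the low reconstruction error at 4-bit precision; the integer encoding of those scales is what keeps the hardware cost of storing and multiplying them small.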

Updated: 2021-02-11