Hardware-Centric AutoML for Mixed-Precision Quantization
International Journal of Computer Vision (IF 11.6) Pub Date: 2020-06-11, DOI: 10.1007/s11263-020-01339-6
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emerging DNN hardware accelerators have begun to support flexible bitwidths (1–8 bits) to further improve computational efficiency, which raises a great challenge: finding the optimal bitwidth for each layer requires domain experts to explore a vast design space, trading off accuracy, latency, energy, and model size, which is both time-consuming and usually sub-optimal. There are plenty of specialized hardware accelerators for neural networks, but little research has been done on designing specialized neural networks optimized for a particular hardware accelerator; the latter is demanding given the much longer design cycle of silicon than of neural nets. Conventional quantization algorithms ignore the differences between hardware architectures and quantize all layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which automatically determines the quantization policy, and we take the hardware accelerator's feedback into the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals for the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduces latency by 1.4–1.95× and energy consumption by 1.9× with negligible loss of accuracy compared with fixed-bitwidth (8-bit) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy, and model size) are drastically different. We interpret the implications of the different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.
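The abstract describes a core loop: an agent proposes a per-layer bitwidth, a hardware simulator (rather than a proxy such as FLOPs or model size) scores the proposal, and the policy is refined under a resource budget. Below is a minimal, self-contained Python sketch of that idea, not the authors' implementation: the toy `simulated_latency` model, the greedy search standing in for HAQ's DDPG reinforcement-learning agent, and all constants are assumptions made purely for illustration.

```python
# Minimal sketch of hardware-feedback-driven mixed-precision search.
# Assumptions: a toy latency model stands in for the hardware simulator,
# and a greedy heuristic stands in for HAQ's RL (DDPG) agent.
import numpy as np

def linear_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric linear quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def simulated_latency(layer_macs: int, bits: int) -> float:
    """Toy stand-in for the simulator's direct latency feedback.
    Assumes latency scales with bit-serial work (MACs x bits)."""
    return layer_macs * bits * 1e-9  # seconds; made-up constant

def greedy_bitwidth_search(layers, latency_budget):
    """Greedy stand-in for the RL agent: start at 8 bits everywhere, then
    repeatedly lower the bitwidth of the layer giving the most latency
    savings per unit of quantization error, until the budget is met."""
    bits = {name: 8 for name, _, _ in layers}
    def total_latency():
        return sum(simulated_latency(m, bits[n]) for n, m, _ in layers)
    while total_latency() > latency_budget:
        best, best_score = None, -1.0
        for name, macs, w in layers:
            if bits[name] <= 2:  # keep >= 2 bits; 1-bit needs a binary quantizer
                continue
            saved = (simulated_latency(macs, bits[name])
                     - simulated_latency(macs, bits[name] - 1))
            err = np.mean((w - linear_quantize(w, bits[name] - 1)) ** 2)
            score = saved / (err + 1e-12)
            if score > best_score:
                best, best_score = name, score
        if best is None:  # every layer already at the minimum bitwidth
            break
        bits[best] -= 1
    return bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = [("conv1", 10_000_000, rng.standard_normal((64, 3, 3, 3))),
              ("conv2", 50_000_000, rng.standard_normal((128, 64, 3, 3))),
              ("fc",    5_000_000,  rng.standard_normal((10, 128)))]
    policy = greedy_bitwidth_search(layers, latency_budget=0.3)
    print(policy)  # expensive, error-tolerant layers give up bits first
```

In the actual HAQ framework, the greedy heuristic above is replaced by an actor-critic (DDPG) agent that observes per-layer features and is rewarded by accuracy after short fine-tuning, with the simulator's latency and energy figures enforcing the constraint; the sketch only illustrates why direct hardware feedback, rather than a proxy, changes which layers lose precision first.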
