Post-Training Sparsity-Aware Quantization
arXiv - CS - Hardware Architecture | Pub Date: 2021-05-23 | DOI: arXiv:2105.11010
Gil Shomron, Freddy Gabbay, Samer Kurzum, Uri Weiser

Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and do not require extensive hardware resources or a training set. Mapping FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, as accuracy degradation becomes noticeable, due to the increase in quantization noise. In this paper, we propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities. 4-bit quantization, for example, is employed by dynamically examining the bits of 8-bit values and choosing a window of 4 bits, while first skipping zero-value bits. Moreover, instead of quantizing activation-by-activation to 4 bits, we focus on pairs of 8-bit activations and examine whether one of the two is equal to zero. If one is equal to zero, the second can opportunistically use the other's 4-bit budget; if both do not equal zero, then each is dynamically quantized to 4 bits, as described. SPARQ achieves minor accuracy degradation, 2x speedup over widely used hardware architectures, and a practical hardware implementation. The code is available at https://github.com/gilshm/sparq.
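The following is a minimal sketch, in plain Python, of the windowing idea described in the abstract: picking a 4-bit window inside an 8-bit unsigned activation while skipping leading zero bits, and letting one activation of a pair opportunistically keep its full 8-bit value when its partner is zero. The function names, the (value, shift) encoding, and the use of Python ints are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
def pick_4bit_window(x8):
    """Pick a 4-bit window from an unsigned 8-bit value, skipping leading zero bits.

    Returns (q, shift), where the dequantized approximation is q << shift.
    Values that already fit in 4 bits are kept exactly (shift == 0).
    """
    if x8 == 0:
        return 0, 0
    msb = x8.bit_length() - 1      # position of the most significant set bit
    shift = max(msb - 3, 0)        # window covers bit positions [shift+3 .. shift]
    return x8 >> shift, shift      # 4-bit value plus its shift

def quantize_pair(a8, b8):
    """Opportunistic quantization of a pair of 8-bit activations (sketch).

    If one activation is zero, the other keeps its full 8-bit value, effectively
    borrowing the zero partner's 4-bit budget; otherwise each activation is
    independently reduced to a 4-bit window as above.
    """
    if a8 == 0:
        return (0, 0), (b8, 0)
    if b8 == 0:
        return (a8, 0), (0, 0)
    return pick_4bit_window(a8), pick_4bit_window(b8)

def dequant(q, shift):
    return q << shift

if __name__ == "__main__":
    # Example: 182 = 0b10110110 keeps its top 4 bits 0b1011, shifted by 4 -> 176.
    for a, b in [(182, 0), (182, 75), (9, 3)]:
        (qa, sa), (qb, sb) = quantize_pair(a, b)
        print(f"({a}, {b}) -> ({dequant(qa, sa)}, {dequant(qb, sb)})")
```

Running the example prints `(182, 0) -> (182, 0)` (the nonzero activation keeps full precision because its partner is zero), `(182, 75) -> (176, 72)` (both truncated to a 4-bit window), and `(9, 3) -> (9, 3)` (values that already fit in 4 bits are unchanged).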

Updated: 2021-05-25