Post-Training Sparsity-Aware Quantization
arXiv - CS - Hardware Architecture | Pub Date: 2021-05-23 | DOI: arXiv:2105.11010
Gil Shomron, Freddy Gabbay, Samer Kurzum, Uri Weiser

Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and do not require extensive hardware resources or a training set. Mapping FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, as accuracy degradation becomes noticeable, due to the increase in quantization noise. In this paper, we propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities. 4-bit quantization, for example, is employed by dynamically examining the bits of 8-bit values and choosing a window of 4 bits, while first skipping zero-value bits. Moreover, instead of quantizing activation-by-activation to 4 bits, we focus on pairs of 8-bit activations and examine whether one of the two is equal to zero. If one is equal to zero, the second can opportunistically use the other's 4-bit budget; if both do not equal zero, then each is dynamically quantized to 4 bits, as described. SPARQ achieves minor accuracy degradation, 2x speedup over widely used hardware architectures, and a practical hardware implementation. The code is available at https://github.com/gilshm/sparq.
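The following is a minimal sketch, in plain Python, of the windowing idea described in the abstract: picking a 4-bit window inside an 8-bit unsigned activation while skipping leading zero bits, and letting one activation of a pair opportunistically keep its full 8-bit value when its partner is zero. The function names, the (value, shift) encoding, and the use of Python ints are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
def pick_4bit_window(x8):
    """Pick a 4-bit window from an unsigned 8-bit value, skipping leading zero bits.

    Returns (q, shift), where the dequantized approximation is q << shift.
    Values that already fit in 4 bits are kept exactly (shift == 0).
    """
    if x8 == 0:
        return 0, 0
    msb = x8.bit_length() - 1      # position of the most significant set bit
    shift = max(msb - 3, 0)        # window covers bit positions [shift+3 .. shift]
    return x8 >> shift, shift      # 4-bit value plus its shift

def quantize_pair(a8, b8):
    """Opportunistic quantization of a pair of 8-bit activations (sketch).

    If one activation is zero, the other keeps its full 8-bit value, effectively
    borrowing the zero partner's 4-bit budget; otherwise each activation is
    independently reduced to a 4-bit window as above.
    """
    if a8 == 0:
        return (0, 0), (b8, 0)
    if b8 == 0:
        return (a8, 0), (0, 0)
    return pick_4bit_window(a8), pick_4bit_window(b8)

def dequant(q, shift):
    return q << shift

if __name__ == "__main__":
    # Example: 182 = 0b10110110 keeps its top 4 bits 0b1011, shifted by 4 -> 176.
    for a, b in [(182, 0), (182, 75), (9, 3)]:
        (qa, sa), (qb, sb) = quantize_pair(a, b)
        print(f"({a}, {b}) -> ({dequant(qa, sa)}, {dequant(qb, sb)})")
```

Running the example prints `(182, 0) -> (182, 0)` (the nonzero activation keeps full precision because its partner is zero), `(182, 75) -> (176, 72)` (both truncated to a 4-bit window), and `(9, 3) -> (9, 3)` (values that already fit in 4 bits are unchanged).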

Updated: 2021-05-25