Quantized Neural Network Inference with Precision Batching
arXiv - CS - Performance. Pub Date: 2020-02-26, DOI: arxiv-2003.00822
Maximilian Lam, Zachary Yedidia, Colby Banbury, Vijay Janapa Reddi

We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without retraining/recalibration, but also 1) enables traditional hardware platforms to realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1% error margin of the full-precision baseline, outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same error tolerance.
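To make the bitlayer decomposition concrete, the sketch below is an illustrative NumPy reconstruction of the idea, not the paper's GPU kernels; the function names `decompose_bitlayers` and `precision_batched_matvec` and the two's-complement bit split are our assumptions for illustration. It quantizes a weight matrix to fixed point, splits the integer representation into 1-bit planes, and accumulates a tunable number of the most significant planes against a full-precision activation vector.

```python
import numpy as np

def decompose_bitlayers(W, num_bits=8):
    """Illustrative sketch: quantize W to signed fixed point and split the
    two's-complement integers into 1-bit planes, so that
    W ~= scale * (-2^(b-1) * B_{b-1} + sum_{k<b-1} 2^k * B_k), each B_k in {0,1}."""
    scale = np.abs(W).max() / (2 ** (num_bits - 1) - 1)
    W_int = np.clip(np.round(W / scale),
                    -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1).astype(np.int64)
    W_bits = W_int & ((1 << num_bits) - 1)   # two's-complement bit pattern
    layers = [((W_bits >> k) & 1).astype(np.float32) for k in range(num_bits)]
    return layers, scale

def precision_batched_matvec(layers, scale, x, bitlayers_to_use=None):
    """Accumulate bitlayer matvecs against a full-precision activation x,
    most significant bits first. Fewer bitlayers -> faster, less accurate."""
    num_bits = len(layers)
    if bitlayers_to_use is None:
        bitlayers_to_use = num_bits
    acc = np.zeros(layers[0].shape[0], dtype=np.float64)
    for k in range(num_bits - 1, num_bits - 1 - bitlayers_to_use, -1):
        weight = -(2 ** k) if k == num_bits - 1 else 2 ** k  # sign bit has negative weight
        acc += weight * (layers[k] @ x)       # 1-bit matrix times full-precision vector
    return scale * acc

# Usage: with all 8 bitlayers the result tracks the fixed-point product;
# fewer bitlayers give a coarser but cheaper approximation.
W = np.random.randn(64, 128).astype(np.float32)
x = np.random.randn(128).astype(np.float32)
layers, scale = decompose_bitlayers(W, num_bits=8)
y_fast = precision_batched_matvec(layers, scale, x, bitlayers_to_use=4)
y_full = precision_batched_matvec(layers, scale, x)   # ~= W @ x up to quantization error
```

The `bitlayers_to_use` argument is the runtime accuracy/speed knob described in the abstract: on a GPU, each `layers[k] @ x` product would be issued as a fast 1-bit kernel rather than a dense float matmul, which is where the reported speedups come from.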

Updated: 2020-03-03