Quantized Neural Network Inference with Precision Batching
arXiv - CS - Performance. Pub Date: 2020-02-26, DOI: arxiv-2003.00822
Maximilian Lam, Zachary Yedidia, Colby Banbury, Vijay Janapa Reddi

We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without retraining/recalibration, but also 1) enables traditional hardware platforms to realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1% error margin of the full-precision baseline, outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same error tolerance.
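To make the bitlayer decomposition concrete, the sketch below is an illustrative NumPy reconstruction of the idea, not the paper's GPU kernels; the function names `decompose_bitlayers` and `precision_batched_matvec` and the two's-complement bit split are our assumptions for illustration. It quantizes a weight matrix to fixed point, splits the integer representation into 1-bit planes, and accumulates a tunable number of the most significant planes against a full-precision activation vector.

```python
import numpy as np

def decompose_bitlayers(W, num_bits=8):
    """Illustrative sketch: quantize W to signed fixed point and split the
    two's-complement integers into 1-bit planes, so that
    W ~= scale * (-2^(b-1) * B_{b-1} + sum_{k<b-1} 2^k * B_k), each B_k in {0,1}."""
    scale = np.abs(W).max() / (2 ** (num_bits - 1) - 1)
    W_int = np.clip(np.round(W / scale),
                    -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1).astype(np.int64)
    W_bits = W_int & ((1 << num_bits) - 1)   # two's-complement bit pattern
    layers = [((W_bits >> k) & 1).astype(np.float32) for k in range(num_bits)]
    return layers, scale

def precision_batched_matvec(layers, scale, x, bitlayers_to_use=None):
    """Accumulate bitlayer matvecs against a full-precision activation x,
    most significant bits first. Fewer bitlayers -> faster, less accurate."""
    num_bits = len(layers)
    if bitlayers_to_use is None:
        bitlayers_to_use = num_bits
    acc = np.zeros(layers[0].shape[0], dtype=np.float64)
    for k in range(num_bits - 1, num_bits - 1 - bitlayers_to_use, -1):
        weight = -(2 ** k) if k == num_bits - 1 else 2 ** k  # sign bit has negative weight
        acc += weight * (layers[k] @ x)       # 1-bit matrix times full-precision vector
    return scale * acc

# Usage: with all 8 bitlayers the result tracks the fixed-point product;
# fewer bitlayers give a coarser but cheaper approximation.
W = np.random.randn(64, 128).astype(np.float32)
x = np.random.randn(128).astype(np.float32)
layers, scale = decompose_bitlayers(W, num_bits=8)
y_fast = precision_batched_matvec(layers, scale, x, bitlayers_to_use=4)
y_full = precision_batched_matvec(layers, scale, x)   # ~= W @ x up to quantization error
```

The `bitlayers_to_use` argument is the runtime accuracy/speed knob described in the abstract: on a GPU, each `layers[k] @ x` product would be issued as a fast 1-bit kernel rather than a dense float matmul, which is where the reported speedups come from.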

Updated: 2020-03-03