Quantized Neural Network Inference with Precision Batching
arXiv - CS - Performance. Pub Date: 2020-02-26. DOI: arxiv-2003.00822. Maximilian Lam, Zachary Yedidia, Colby Banbury, Vijay Janapa Reddi
We present PrecisionBatching, a quantized inference algorithm for speeding up
neural network execution on traditional hardware platforms at low bitwidths
without the need for retraining or recalibration. PrecisionBatching decomposes
a neural network into individual bitlayers and accumulates them using fast
1-bit operations while maintaining activations in full precision.
PrecisionBatching not only facilitates quantized inference at low bitwidths (<
8 bits) without retraining or recalibration, but also 1) enables
traditional hardware platforms to realize inference speedups at a
finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows
accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers
to accumulate as a tunable parameter. Across a variety of applications (MNIST,
language modeling, natural language inference) and neural network architectures
(fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of
over 8x on a GPU within a < 1% error margin of the full precision baseline,
outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same
error tolerance.
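The core idea in the abstract — decompose quantized weights into individual bitlayers, accumulate them with cheap 1-bit operations against full-precision activations, and expose the number of accumulated bitlayers as a runtime accuracy/speed knob — can be illustrated with a minimal NumPy sketch. This is not the authors' GPU implementation; the function names and the uniform two's-complement quantization scheme are our assumptions for illustration.

```python
import numpy as np

def quantize_weights(W, bits):
    """Uniformly quantize weights to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    Wq = np.round(W / scale).astype(np.int32)
    return Wq, scale

def bitlayer_matvec(Wq, scale, x, bits, num_layers=None):
    """Accumulate a matvec one weight bitlayer at a time.

    Each bitlayer is a {0,1} matrix (one bit position of the offset-shifted
    weights); its product with the full-precision activation x is weighted by
    the bit's place value and summed. Accumulating all `bits` layers
    reproduces the quantized matvec exactly; accumulating only the top
    `num_layers` layers drops low-order bits, trading accuracy for speed —
    the runtime-tunable knob the paper describes.
    """
    if num_layers is None:
        num_layers = bits
    offset = 2 ** (bits - 1)
    Wu = (Wq + offset).astype(np.uint32)  # unsigned view in [0, 2^bits)
    acc = np.zeros(Wq.shape[0])
    for k in range(bits - num_layers, bits):
        # {0,1} bitplane: on real hardware this is where fast 1-bit ops apply
        bitplane = ((Wu >> k) & 1).astype(np.float64)
        acc += (2 ** k) * (bitplane @ x)
    acc -= offset * x.sum()  # undo the unsigned offset (per output row)
    return scale * acc
```

With `num_layers=bits` the result equals the full quantized matvec `scale * (Wq @ x)`; with fewer layers the truncation error is bounded by the place values of the dropped bits, which is what lets accuracy degrade gracefully as layers are removed.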
Updated: 2020-03-03