Blended coarse gradient descent for full quantization of deep neural networks
Research in the Mathematical Sciences (IF 1.2), Pub Date: 2019-01-02, DOI: 10.1007/s40687-018-0177-6
Penghang Yin, Shuai Zhang, Jiancheng Lyu, Stanley Osher, Yingyong Qi, Jack Xin

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. A coarse gradient is generally not the gradient of any function but an artificial ascent direction. The BCGD weight update applies a coarse gradient correction to a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-widths such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD achieves 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we provide a convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.
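To make the blending step concrete, below is a minimal Python sketch of one BCGD weight update, under the assumption that the blended rule takes the form w ← (1 − ρ)·w + ρ·Q(w) − η·g, where Q is a weight quantizer and g is the coarse gradient evaluated at the quantized weights. The binary quantizer, the step sizes, and the toy quadratic loss are illustrative stand-ins, not the paper's exact choices; with ρ = 0 the rule reduces to stepping the full-precision weights with the coarse gradient taken at the quantized point, while the small blending term pulls them toward their quantization.

```python
import numpy as np

def quantize_binary(w):
    # 1-bit weight quantizer: scaled sign, with scale = mean |w|
    # (an illustrative choice, not necessarily the paper's exact quantizer).
    return np.mean(np.abs(w)) * np.sign(w)

def bcgd_step(w, coarse_grad, lr=0.01, rho=1e-5):
    # One blended coarse gradient descent (BCGD) update on the full-precision
    # weights w, assuming the blended rule
    #     w <- (1 - rho)*w + rho*Q(w) - lr * g,
    # where g is the coarse gradient evaluated at the quantized weights Q(w).
    w_q = quantize_binary(w)
    g = coarse_grad(w_q)
    return (1.0 - rho) * w + rho * w_q - lr * g

# Toy usage with a quadratic surrogate loss 0.5*||w - target||^2, whose
# gradient at the quantized point is simply Q(w) - target.
target = np.array([0.3, -0.7, 1.2])
w = np.random.randn(3)
for _ in range(200):
    w = bcgd_step(w, coarse_grad=lambda wq: wq - target)
print(quantize_binary(w))   # quantized weights used at inference time
```

In a real network, the coarse gradient would come from backpropagation through a surrogate (e.g., straight-through-style) derivative of the quantized activation, and the quantized weights Q(w), not the full-precision shadow weights, are what the deployed model uses.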

Updated: 2019-01-02