Logarithm-approximate floating-point multiplier is applicable to power-efficient neural network training
Integration (IF 1.9) Pub Date: 2020-05-14, DOI: 10.1016/j.vlsi.2020.05.002
TaiYu Cheng, Yutaka Masuda, Jun Chen, Jaehoon Yu, Masanori Hashimoto

Recently, emerging “edge computing” moves data and services from the cloud to nearby edge servers to achieve short latency and wide bandwidth and to address privacy concerns. However, edge servers, often embedded with GPU processors, strongly demand power-efficient neural network (NN) training due to their power and size limitations. Moreover, because the gradient values computed in NN training span a broad dynamic range, floating-point representation is more suitable. This paper proposes to adopt a logarithm-approximate multiplier (LAM) for multiply-accumulate (MAC) computation in NN training engines, where LAM approximates a floating-point multiplication as a fixed-point addition, resulting in smaller delay, fewer gates, and lower power consumption. We demonstrate the efficiency of LAM on two platforms: dedicated NN training hardware and an open-source GPU design. Compared with NN training using the exact multiplier, our implementation of the NN training engine for a 2-D classification dataset achieves a 10% speed-up and a 2.3X improvement in power and area efficiency. LAM is also highly compatible with conventional bit-width scaling (BWS). When BWS is applied together with LAM on five test datasets, the implemented training engines achieve more than a 4.9X power efficiency improvement with at most 1% accuracy degradation, of which a 2.2X improvement originates from LAM. The advantage of LAM can also be exploited in processors: a GPU design embedded with LAM, implemented on an FPGA and executing an NN training workload, shows a 1.32X power efficiency improvement, and the improvement reaches 1.54X with LAM + BWS. Finally, LAM-based training of deeper NNs is evaluated. For NNs with up to four hidden layers, LAM-based training achieves accuracy highly comparable to that of the exact multiplier, even with aggressive BWS.
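The core idea summarized in the abstract, approximating a floating-point multiplication as a single fixed-point addition on the operands' bit patterns (Mitchell-style logarithm approximation), can be modeled in a few lines of software. The sketch below is an illustrative Python model only, not the paper's hardware design; the names lam_multiply and mantissa_bits are hypothetical, and the mantissa_bits truncation merely mimics the spirit of bit-width scaling (BWS), whose exact scheme in the paper may differ.

import struct

F32_BIAS_BITS = 127 << 23   # exponent bias folded into the bit pattern (0x3F800000)
SIGN_MASK     = 0x80000000
MAG_MASK      = 0x7FFFFFFF

def f32_bits(x):
    # Reinterpret a Python float as IEEE-754 single-precision bits.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    # Reinterpret 32 bits back into a single-precision float.
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def lam_multiply(a, b, mantissa_bits=23):
    # Logarithm-approximate multiply: exponents add exactly, and the
    # mantissa term log2(1 + m) is approximated by m, so the product
    # collapses into one fixed-point addition on the bit patterns.
    # mantissa_bits is a hypothetical knob that drops low-order mantissa
    # bits to mimic BWS. Zeros, subnormals, and overflow are not handled.
    trunc = MAG_MASK & ~((1 << (23 - mantissa_bits)) - 1)
    ba, bb = f32_bits(a), f32_bits(b)
    sign = (ba ^ bb) & SIGN_MASK                        # sign of the product
    mag = (ba & trunc) + (bb & trunc) - F32_BIAS_BITS   # fixed-point addition
    return bits_f32(sign | (mag & MAG_MASK))

if __name__ == "__main__":
    for x, y in [(1.5, 2.0), (0.7, 0.7), (3.14, -2.5)]:
        exact, approx = x * y, lam_multiply(x, y)
        print(f"{x} * {y}: exact={exact:.4f}  LAM={approx:.4f}  "
              f"rel. err={(approx - exact) / exact:+.2%}")

Running the demo shows the characteristic behavior of the approximation: exact results when one mantissa is zero (e.g., 1.5 * 2.0) and a bounded relative error otherwise (Mitchell's approximation is known to stay within roughly 11% in the worst case), which the paper's results indicate NN training can tolerate.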



Updated: 2020-05-14