Bactran: A Hardware Batch Normalization Implementation for CNN Training Engine
IEEE Embedded Systems Letters (IF 1.6) Pub Date: 2020-02-19, DOI: 10.1109/les.2020.2975055
Yang Zhijie, Wang Lei, Luo Li, Li Shiming, Guo Shasha, Wang Shuquan

In recent years, convolutional neural networks (CNNs) have been widely used. However, their ever-increasing number of parameters makes training them on GPUs challenging, as it is expensive in both time and energy. This has prompted researchers to turn their attention to training on more energy-efficient hardware. The batch normalization (BN) layer is widely used in various state-of-the-art CNNs because it is indispensable for accelerating CNN training. As the computational cost of the convolutional layers declines, the relative importance of the BN layer continues to increase. However, traditional CNN training accelerators pay little attention to an efficient hardware implementation of the BN layer. In this letter, we design an efficient CNN training architecture using a systolic array whose processing elements support the BN functions in both the training process and the inference process. The implemented BN function is an improved, hardware-friendly BN algorithm, range batch normalization (RBN). The experimental results show that the RBN implementation saves 10% of hardware resources and reduces power by 10.1% and delay by 4.6% on average. We implement the accelerator on the VU440 field-programmable gate array, and the power consumption of its core computing engine is 8.9 W.
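The abstract does not spell out how RBN differs from standard BN, so the following sketch is a hedged illustration based on the commonly published formulation of range batch normalization: the batch standard deviation is replaced by the value range of the centered activations scaled by C(n) = 1/sqrt(2·ln n), which avoids the squares and square roots that are costly in hardware. The function name and signature here are illustrative, not taken from the paper.

```python
import numpy as np

def range_batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Illustrative range batch normalization (RBN) over a batch.

    x: array of shape (batch, features). Instead of dividing by the
    standard deviation, divide by C(n) * range(x - mean), where
    C(n) = 1 / sqrt(2 * ln(n)) and n is the batch size. This removes
    the square and square-root operations of standard BN, which is
    what makes the algorithm hardware-friendly.
    """
    n = x.shape[0]
    mu = x.mean(axis=0)                       # per-feature batch mean
    centered = x - mu
    value_range = centered.max(axis=0) - centered.min(axis=0)
    c_n = 1.0 / np.sqrt(2.0 * np.log(n))      # range-to-std scaling
    x_hat = centered / (c_n * value_range + eps)
    return gamma * x_hat + beta               # learnable scale/shift
```

In a hardware processing element, the mean, max, and min reductions map to simple accumulate/compare logic, whereas standard BN would require a multiplier for the variance and a square-root unit; this is the trade-off the letter's systolic-array design exploits.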
