Dynamic Dataflow Scheduling and Computation Mapping Techniques for Efficient Depthwise Separable Convolution Acceleration
IEEE Transactions on Circuits and Systems I: Regular Papers (IF 5.1), Pub Date: 2021-05-17, DOI: 10.1109/tcsi.2021.3078541
Baoting Li, Hang Wang, Xuchong Zhang, Jie Ren, Longjun Liu, Hongbin Sun, Nanning Zheng

Depthwise separable convolution (DSC) has become one of the essential structures for lightweight convolutional neural networks (CNNs). Nevertheless, its hardware architecture has not received much attention. Several previous hardware designs incur either high off-chip memory traffic or large on-chip memory usage, and are therefore deficient in both hardware efficiency and performance. This paper proposes two efficient dynamic design techniques, namely adaptive row-based dataflow scheduling and adaptive computation mapping, to achieve a much better trade-off between hardware efficiency and performance for DSC-based lightweight CNN accelerators. The effectiveness and efficiency of the proposed techniques have been extensively evaluated using six DSC-based lightweight CNNs. Compared with the reference architectures, the simulation results show that the proposed architectural techniques can reduce on-chip buffer size by at least 50.4% and improve the performance of convolution calculation by at least $1.18\times$, while maintaining the minimum off-chip memory traffic. MobileNetV2 is implemented on a Zynq UltraScale+ ZCU102 SoC FPGA, and the results show that the proposed accelerator achieves 381.7 frames per second (fps), which is $1.43\times$ that of the reference design, and saves about 36.3% of on-chip buffer size compared with the reference design, while maintaining the same off-chip memory traffic.
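For readers unfamiliar with why DSC suits lightweight networks, the following minimal Python sketch counts multiply-accumulate (MAC) operations for one standard convolution layer and for its depthwise separable equivalent (a per-channel k x k depthwise stage followed by a 1 x 1 pointwise stage). It illustrates the general DSC structure only and is not the scheduling or mapping technique proposed in the paper; the function names and the example layer shape are assumptions chosen for illustration.

```python
def standard_conv_macs(h_out, w_out, c_in, c_out, k):
    """MACs for a standard k x k convolution producing an h_out x w_out x c_out output."""
    return h_out * w_out * c_in * c_out * k * k


def dsc_macs(h_out, w_out, c_in, c_out, k):
    """MACs for a depthwise separable convolution with the same input/output shape."""
    depthwise = h_out * w_out * c_in * k * k   # one k x k filter per input channel
    pointwise = h_out * w_out * c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise


def dsc_cost_ratio(c_out, k):
    """Closed-form ratio of DSC MACs to standard-convolution MACs."""
    return 1 / c_out + 1 / (k * k)


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    shape = dict(h_out=14, w_out=14, c_in=32, c_out=64, k=3)
    std = standard_conv_macs(**shape)
    dsc = dsc_macs(**shape)
    print(f"standard: {std} MACs, DSC: {dsc} MACs, ratio: {dsc / std:.4f}")
```

The ratio depends only on the output channel count and the kernel size, which is why the pointwise stage tends to dominate the arithmetic in DSC layers with 3 x 3 kernels.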

Updated: 2021-07-13