Optimizing Temporal Convolutional Network inference on FPGA-based accelerators
IEEE Journal on Emerging and Selected Topics in Circuits and Systems (IF 3.7) Pub Date: 2020-09-01, DOI: 10.1109/jetcas.2020.3014503
Marco Carreras , Gianfranco Deriu , Luigi Raffo , Luca Benini , Paolo Meloni

Convolutional Neural Networks (CNNs) are extensively used in a wide range of applications, most commonly computer vision tasks such as image and video classification, recognition and segmentation. Recent research results demonstrate that multi-layer (deep) networks involving mono-dimensional convolutions and dilation can be used effectively for time-series and sequence classification and segmentation, as well as for tasks involving sequence modeling. These structures, commonly referred to as Temporal Convolutional Networks (TCNs), represent an extremely promising alternative to the recurrent architectures commonly used across a broad range of sequence modeling tasks. While FPGA-based inference accelerators for classic CNNs are widespread, the literature lacks a quantitative evaluation of their usability for inference on TCN models. In this paper we present such an evaluation, considering as a reference a CNN accelerator with specific features supporting TCN kernels, and a set of state-of-the-art TCNs as a benchmark. Experimental results show that, during TCN execution, operational intensity can be critical for overall performance. We propose a convolution scheduling based on batch processing that can boost efficiency up to 96% of theoretical peak performance. Overall we achieve up to 111.8 GOPS/s and a power efficiency of 33.8 GOPS/s/W on an Ultrascale+ ZU3EG (up to $10\times$ speedup and $3\times$ power efficiency improvement with respect to a pure software implementation).
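The two ideas at the core of the abstract can be sketched in a few lines: the dilated, causal 1D convolution that forms a TCN layer, and the way batching raises operational intensity by amortizing weight fetches across sequences. This is a minimal illustrative model, not the paper's accelerator or scheduling implementation; the function names and the simple byte-traffic model are assumptions made here for clarity.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """Causal 1D convolution with dilation, the basic TCN building block.

    x: input, shape (C_in, T); w: weights, shape (C_out, C_in, K).
    Left-pads the input so the output length stays T and no future
    samples are used (causality).
    """
    c_out, c_in, k = w.shape
    t = x.shape[1]
    pad = (k - 1) * dilation
    xp = np.pad(x, ((0, 0), (pad, 0)))  # causal (left-only) padding
    y = np.zeros((c_out, t))
    for i in range(k):                  # accumulate the K dilated taps
        y += np.einsum('oc,ct->ot', w[:, :, i],
                       xp[:, i * dilation : i * dilation + t])
    return y

def operational_intensity(c_in, c_out, k, t, batch=1, bytes_per_el=1):
    """Ops per byte moved for one 1D conv layer (illustrative model only).

    Activations are streamed per sequence, while weights are fetched once
    and reused across the whole batch, so batching raises the intensity.
    """
    ops = 2 * c_in * c_out * k * t * batch                # each MAC = 2 ops
    act_bytes = (c_in + c_out) * t * batch * bytes_per_el # read in, write out
    w_bytes = c_out * c_in * k * bytes_per_el             # fetched once
    return ops / (act_bytes + w_bytes)

# Tiny demo: 2 input channels, 4 output channels, kernel 3, dilation 2
x = np.random.randn(2, 16)
w = np.random.randn(4, 2, 3)
y = dilated_causal_conv1d(x, w, dilation=2)
print(y.shape)  # (4, 16)
print(operational_intensity(64, 64, 3, 256, batch=1))
print(operational_intensity(64, 64, 3, 256, batch=16))  # higher: weights amortized
```

With batch size 1, each weight fetched from external memory is reused only across the time dimension of a single sequence; processing several sequences per weight fetch is what lets the batched schedule approach the accelerator's compute-bound peak, consistent with the 96%-of-peak figure claimed above.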

Updated: 2020-09-01