Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs,IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IF 2.7), Pub Date: 2020-04-01, DOI: 10.1109/tcad.2019.2897701
Yun Liang, Liqiang Lu, Qingcheng Xiao, Shengen Yan

In recent years, convolutional neural networks (CNNs) have been widely adopted for computer vision tasks. Field-programmable gate arrays (FPGAs) have been extensively explored as promising hardware accelerators for CNNs because of their high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolution algorithm are often bounded by the computational capability of the FPGA (e.g., the number of DSPs). To address this problem, fast algorithms transform the feature maps into a special domain to reduce the arithmetic complexity. Winograd and the fast Fourier transform (FFT), as representative fast algorithms, first transform the input data and filters into the Winograd or frequency domain, then perform element-wise multiplication, and finally apply an inverse transformation to obtain the output. In this paper, we propose a novel architecture for implementing fast algorithms on FPGAs. Our design employs a line buffer structure to effectively reuse feature map data among different tiles. We also pipeline the Winograd/FFT processing element (PE) engine and instantiate multiple PEs through parallelization. Meanwhile, there is a complex design space to explore. We propose an analytical model to predict resource usage and performance, and then use the model to guide a fast design space exploration. Experiments using state-of-the-art CNNs show that our designs achieve the best performance and energy efficiency on FPGAs. Using Winograd, we achieve 854.6 and 2479.6 GOP/s for AlexNet and VGG16, respectively, on the Xilinx ZCU102 platform. On the Xilinx ZC706 platform, we achieve 130.4 GOP/s for ResNet using Winograd and 201.1 GOP/s for YOLO using FFT.
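
The abstract's three-step pattern for fast algorithms (transform the input and filter, multiply element-wise, inverse-transform) can be illustrated with a minimal 1D sketch in Python. It assumes the standard Winograd F(2,3) matrices (two outputs of a three-tap filter per four-element input tile) and a textbook radix-2 FFT; the tile size, the 1D simplification, and the helper names (`winograd_f23`, `fft_corr`, etc.) are illustrative assumptions, not the paper's FPGA design, which works on 2D tiles in hardware. All three paths compute the same "valid" cross-correlation that a CNN layer uses.

```python
import cmath

# Standard Winograd F(2,3) matrices (an assumption; the paper's tile sizes may differ).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]  # input transform
G = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]]  # filter transform
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]  # inverse (output) transform


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def direct_corr(d, g):
    """Reference 'valid' cross-correlation, as computed by a CNN layer."""
    k = len(g)
    return [sum(d[i + j] * g[j] for j in range(k)) for i in range(len(d) - k + 1)]


def winograd_f23(d, g):
    """Two outputs of a 3-tap filter from one 4-element input tile."""
    u = matvec(G, g)                   # filter transform (precomputable for inference)
    v = matvec(BT, d)                  # input transform
    m = [a * b for a, b in zip(u, v)]  # 4 element-wise multiplies instead of 6
    return matvec(AT, m)               # inverse transform


def winograd_corr(d, g):
    """Slide 4-element tiles with stride 2; neighbouring tiles overlap by 2."""
    assert len(g) == 3 and (len(d) - 2) % 2 == 0, "sketch assumes F(2,3) and an even output count"
    out = []
    for i in range(0, len(d) - 3, 2):
        out.extend(winograd_f23(d[i:i + 4], g))
    return out


def fft(x, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], invert), fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_corr(d, g):
    """Cross-correlation via FFT: transform, multiply element-wise, inverse-transform."""
    k = len(g)
    n = 1
    while n < len(d) + k - 1:  # full linear-convolution length, rounded up to 2^p
        n *= 2
    a = list(d) + [0] * (n - len(d))
    b = list(reversed(g)) + [0] * (n - k)  # correlation = convolution with flipped filter
    spec = [x * y for x, y in zip(fft(a), fft(b))]
    full = [z.real / n for z in fft(spec, invert=True)]
    return full[k - 1:len(d)]  # keep only the 'valid' outputs


if __name__ == "__main__":
    d = [1, 2, 3, 4, 5, 6, 7, 8]
    g = [0.5, -1, 2]
    print(direct_corr(d, g))
    print(winograd_corr(d, g))
    print(fft_corr(d, g))
```

The three printed lists agree up to floating-point rounding. In this sketch the Winograd path spends 4 multiplications per pair of outputs instead of 6, which is the kind of multiplier (DSP) saving the abstract targets; consecutive input tiles overlap by two elements, the sort of data reuse a line buffer can exploit in hardware. The FFT path instead pays for its transforms with complex arithmetic and zero-padding to a power-of-two length.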

Updated: 2020-04-01
