Stride 2 1-D, 2-D, and 3-D Winograd for Convolutional Neural Networks
IEEE Transactions on Very Large Scale Integration (VLSI) Systems ( IF 2.8 ) Pub Date : 2020-04-01 , DOI: 10.1109/tvlsi.2019.2961602
Juan Yepez , Seok-Bum Ko

Convolutional neural networks (CNNs) have been widely adopted for computer vision applications. CNNs require many multiplications, making them expensive in terms of both computational complexity and hardware. An effective way to reduce the number of required multiplications is the Winograd algorithm. Previous Winograd-based CNN implementations use the 2-D algorithm $F(2 \times 2, 3 \times 3)$, which reduces computational complexity by a factor of 2.25 over regular convolution. However, current Winograd implementations apply only when the stride (the shift displacement of a kernel over an input) is 1. In this article, we present a novel method to apply the Winograd algorithm to a stride of 2. The method is valid in one, two, or three dimensions. We also introduce new Winograd versions compatible with kernels of size 3, 5, and 7. The algorithms were successfully implemented on an NVIDIA K20c GPU. Compared with regular convolution, the stride-2 implementations are $1.44\times$ faster for a $3 \times 3$ kernel, $2.04\times$ faster for a $5 \times 5$ kernel, $2.42\times$ faster for a $7 \times 7$ kernel, and $1.73\times$ faster for a $3 \times 3 \times 3$ kernel. Additionally, a CNN accelerator using a novel processing element (PE) that performs either two 2-D Winograd stride-1 operations or one 2-D Winograd stride-2 operation per clock cycle was implemented on an Intel Arria-10 field-programmable gate array (FPGA). We accelerated the original and our proposed modified VGG-16 architectures, achieving digital signal processor (DSP) efficiencies of 1.22 giga operations per second (GOPS) per DSP and 1.33 GOPS per DSP, respectively.
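To make the multiplication savings concrete, the following is a minimal sketch of the standard 1-D Winograd algorithm $F(2, 3)$ for stride 1, the baseline that the article extends to stride 2. It produces two outputs of a 3-tap correlation with 4 multiplications instead of the 6 a direct computation needs (nesting this in two dimensions gives the $6^2/4^2 = 2.25$ factor cited above). The function names are illustrative, not from the paper.

```python
def winograd_f23(d, g):
    """1-D Winograd F(2,3): two stride-1 outputs of a 3-tap
    correlation using 4 multiplications instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    # Reference: plain stride-1 correlation, 6 multiplications.
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

print(winograd_f23([1, 2, 3, 4], [1, 2, 1]))  # matches direct_conv
```

The filter-side factors (e.g. $(g_0 + g_1 + g_2)/2$) depend only on the kernel, so in a CNN they are precomputed once per filter and the per-tile cost is only the four data-side multiplications.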

Updated: 2020-04-01