Uni-OPU: An FPGA-Based Uniform Accelerator for Convolutional and Transposed Convolutional Networks
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IF 2.8) Pub Date: 2020-07-01, DOI: 10.1109/tvlsi.2020.2995741
Yunxuan Yu, Tiandong Zhao, Mingyu Wang, Kun Wang, Lei He

In this article, we design the first full software/hardware stack, called Uni-OPU, for efficient uniform hardware acceleration of different types of transposed convolutional (TCONV) networks and conventional convolutional (CONV) networks. Specifically, a software compiler is provided to transform the computation of various TCONV layers, i.e., zero-inserting-based TCONV (zero-TCONV) and nearest-neighbor resizing-based TCONV (NN-TCONV), as well as CONV layers, into the same pattern. The compiler conducts the following optimizations: 1) eliminating up to 98.4% of the operations in TCONV by exploiting the fixed pattern of TCONV upsampling; 2) decomposing and reformulating TCONV and CONV into streaming parallel vector multiplication with a uniform address generation scheme and data flow pattern; and 3) efficient scheduling and instruction compilation to map networks onto the hardware processor. An instruction-based hardware acceleration processor is developed to efficiently speed up our uniform computation pattern, achieving throughput up to 2.35 TOPS on TCONV layers while consuming only 2.89 W of dynamic power. We evaluate Uni-OPU on a benchmark set composed of six TCONV networks from different application fields. Extensive experimental results indicate that Uni-OPU gains $1.45\times$ to $3.68\times$ higher power efficiency compared with state-of-the-art zero-TCONV accelerators. High acceleration performance is also achieved on NN-TCONV networks, whose acceleration has not been explored before. In summary, we observe $1.90\times$ and $1.63\times$ latency reductions, as well as $15.04\times$ and $12.43\times$ higher power efficiency, on zero-TCONV and NN-TCONV networks, respectively, compared with a Titan Xp GPU on average. To the best of our knowledge, ours is the first in-depth study to completely unify the computation process of zero-TCONV, NN-TCONV, and CONV layers.
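For context on the two TCONV variants named above, the following is a minimal NumPy sketch (ours, not from the paper; single channel, unit batch, and the helper names zero_insert, nn_resize, and conv2d are hypothetical) showing how both zero-TCONV and NN-TCONV reduce to the same plain convolution pattern, and why zero-TCONV leaves a fixed pattern of structural zeros that a compiler can eliminate:

    import numpy as np

    def zero_insert(x, stride):
        # Zero-TCONV upsampling: insert (stride - 1) zeros between adjacent pixels.
        h, w = x.shape
        up = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1), dtype=x.dtype)
        up[::stride, ::stride] = x
        return up

    def nn_resize(x, stride):
        # NN-TCONV upsampling: repeat each pixel `stride` times along each axis.
        return np.repeat(np.repeat(x, stride, axis=0), stride, axis=1)

    def conv2d(x, k):
        # Plain valid cross-correlation: the single uniform pattern both variants share.
        kh, kw = k.shape
        oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
        return out

    x = np.arange(1.0, 17.0).reshape(4, 4)
    k = np.ones((3, 3))

    up = zero_insert(x, stride=2)  # 7x7 map with only 16 of 49 entries nonzero
    print(f"structural zeros after zero-insertion: {1 - np.count_nonzero(up) / up.size:.1%}")

    # Both variants now run through the same conv2d kernel
    # (kernel flipping and exact padding conventions are omitted for brevity):
    y_zero = conv2d(np.pad(up, 1), k)
    y_nn = conv2d(np.pad(nn_resize(x, stride=2), 1), k)

Because the zero positions produced by zero-insertion are determined by the stride alone, every multiply-accumulate that touches them is known at compile time and can be skipped; the exact fraction eliminated depends on each layer's stride and padding, which is the source of figures such as the 98.4% quoted above.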
