EPA: The effective pipeline architecture for CNN accelerator with high performance and computing efficiency based on FPGA
Concurrency and Computation: Practice and Experience (IF 2), Pub Date: 2021-03-31, DOI: 10.1002/cpe.6198
Junjie Zhang 1 , Qiao Yin 1 , Weicheng Hu 1 , Yunfeng Li 1 , Hu Li 1 , Nan Ye 1 , Bingyao Cao 1

Thanks to recent advances in Field Programmable Gate Arrays (FPGAs), the performance bottleneck of deep learning hardware accelerators has shifted to computing capability. This paper proposes a novel FPGA-based Convolutional Neural Network (CNN) accelerator architecture, named the Effective Pipeline Architecture (EPA), to optimize resource usage in CNN computation. Unique storage strategies, adopted and optimized for different CNN models and layers, allow the fine-grained pipeline to achieve high DSP computing efficiency. Moreover, compared with traditional architectures, kernel combination and data scheduling double the throughput of general matrix multiplication across a large number of parallel DSP48E resources. As a result, the Yolov2-Tiny implementation achieves 873 Giga Operations Per Second (GOPS) at 67 Frames Per Second (FPS) using 902 DSPs, and the computing efficiency in most layers exceeds 90%. This improves on previous designs in both calculation performance and efficiency, which is significant for meeting the increasing demand for computing power.
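To put the reported figures in context, the following minimal Python sketch shows how throughput per DSP and a computing-efficiency ratio (achieved throughput divided by peak throughput) can be derived. Only the 873 GOPS and 902 DSP figures come from the abstract. The clock frequency and the number of operations per DSP per cycle are hypothetical placeholders chosen for illustration, since the abstract does not state them, and the function names are likewise made up for this example.

```python
def gops_per_dsp(gops: float, dsps: int) -> float:
    """Throughput attributed to each DSP slice, in GOPS."""
    return gops / dsps


def computing_efficiency(achieved_gops: float, dsps: int,
                         ops_per_dsp_per_cycle: int, clock_mhz: float) -> float:
    """Ratio of achieved throughput to the theoretical peak of the DSP array."""
    # Peak GOPS = DSP count * operations per DSP per cycle * clock (GHz).
    peak_gops = dsps * ops_per_dsp_per_cycle * clock_mhz / 1000.0
    return achieved_gops / peak_gops


# Figures reported in the abstract.
REPORTED_GOPS = 873.0
REPORTED_DSPS = 902

# ASSUMPTIONS (not stated in the abstract):
# - each DSP completes 2 multiply-accumulates per cycle, counted as 2 ops each;
# - the design runs at 250 MHz.
ASSUMED_OPS_PER_DSP_PER_CYCLE = 4
ASSUMED_CLOCK_MHZ = 250.0

per_dsp = gops_per_dsp(REPORTED_GOPS, REPORTED_DSPS)
efficiency = computing_efficiency(REPORTED_GOPS, REPORTED_DSPS,
                                  ASSUMED_OPS_PER_DSP_PER_CYCLE,
                                  ASSUMED_CLOCK_MHZ)

print(f"GOPS per DSP: {per_dsp:.3f}")
print(f"Efficiency under the assumed clock and op count: {efficiency:.1%}")
```

Changing the assumed clock or per-cycle operation count changes the efficiency figure directly, so the printed ratio is only meaningful once those two values are taken from the paper itself.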

Updated: 2021-03-31