High-Utilization, High-Flexibility Depth-First CNN Coprocessor for Image Pixel Processing on FPGA
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IF 2.8), Pub Date: 2021-01-14, DOI: 10.1109/tvlsi.2020.3046125
Steven Colleman, Marian Verhelst

CNNs are increasingly used for pixel processing tasks such as denoising, which poses new challenges because of the higher activation and operation counts. This article presents a CNN coprocessor architecture that addresses these challenges on field-programmable gate arrays (FPGAs) through four main contributions. First, I/O communication between the host processor and the FPGA is reduced to a minimum using a depth-first (DF) principle; three new DF approaches are presented. Second, to ensure high throughput, the increased parallelization opportunities of the proposed line-based DF operation are analyzed. Third, programmability is introduced to the compute array to enable broad deployment while maintaining high utilization of the available multipliers, i.e., the digital signal processing blocks (DSPs), independently of the kernel dimensions and without control by the host processor. This contrasts with many state-of-the-art FPGA implementations, which focus on only one algorithm and/or one kernel topology. Fourth, a model is built to investigate the influence of architecture parameters and to show the benefits of DF. The scalable design can be deployed on a wide range of FPGAs, maintaining 78%–93% DSP utilization across all algorithms (denoising, optical flow, depth estimation, segmentation, and super-resolution) and FPGA platforms. Up to 695 GOPS is achieved on a Zynq XCZU9EG board, matching state-of-the-art performance with a more flexible design. The throughput is compared with other pixel processing architectures on FPGA.
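To make the line-based depth-first idea in the abstract more concrete, the following is a minimal sketch and not the paper's model or any of its three DF approaches. It assumes square kernels, stride 1, and "same" padding, and the `ConvLayer` class, the function names, and the three-layer 64x64 example network are all hypothetical. It compares the intermediate activations that a layer-by-layer schedule would have to store or transfer with the line buffers a line-based depth-first schedule would keep on chip.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConvLayer:
    """Hypothetical layer description: square kernel, stride 1, same padding."""
    in_channels: int
    out_channels: int
    kernel: int


def layer_by_layer_words(layers: List[ConvLayer], height: int, width: int) -> int:
    """Activation words of all full intermediate feature maps (every layer output
    except the last), which a layer-by-layer schedule must store or transfer."""
    return sum(l.out_channels * height * width for l in layers[:-1])


def depth_first_line_buffer_words(layers: List[ConvLayer], width: int) -> int:
    """Activation words buffered when each layer after the first keeps only
    `kernel` lines of its input instead of the whole feature map."""
    return sum(l.kernel * l.in_channels * width for l in layers[1:])


def input_lines_for_one_output_line(layers: List[ConvLayer]) -> int:
    """Vertical receptive field of the stack: input lines needed before the
    first output line of the last layer can be produced (stride 1 assumed)."""
    return 1 + sum(l.kernel - 1 for l in layers)


# Hypothetical 3-layer network on a hypothetical 64x64 image.
net = [ConvLayer(3, 16, 3), ConvLayer(16, 16, 3), ConvLayer(16, 3, 3)]
full_maps = layer_by_layer_words(net, height=64, width=64)
line_buffers = depth_first_line_buffer_words(net, width=64)
lines_needed = input_lines_for_one_output_line(net)
print(full_maps, line_buffers, lines_needed)
```

Under these assumptions the line buffers grow with the image width only, while the full intermediate maps grow with width times height, which is one way to see why a depth-first schedule can keep host-to-FPGA traffic low for pixel processing networks.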

Updated: 2021-02-26