FPGA-Based Inter-layer Pipelined Accelerators for Filter-Wise Weight-Balanced Sparse Fully Convolutional Networks with Overlapped Tiling
Journal of Signal Processing Systems ( IF 1.8 ) Pub Date : 2021-02-13 , DOI: 10.1007/s11265-021-01642-6
Masayuki Shimoda , Youki Sada , Hiroki Nakahara

Convolutional neural networks (CNNs) exhibit state-of-the-art performance on computer-vision tasks. CNNs require high-speed, low-power, high-accuracy hardware in various scenarios, such as edge environments. However, the number of weights is so large that embedded systems cannot store them in their limited on-chip memory. An alternative approach reduces the input image size to achieve real-time processing, but this causes a considerable drop in accuracy. Although pruned sparse CNNs and dedicated accelerators have been proposed, their need for random access requires a large number of wide multiplexers to reach a high degree of parallelism, which complicates the design and makes it unsuitable for FPGA implementation. To address this problem, we propose filter-wise pruning with distillation and a block-RAM (BRAM)-based zero-weight-skipping accelerator. The pruning eliminates weights so that each filter has the same number of nonzero weights, and retraining with distillation retains comparable accuracy. Furthermore, filter-wise pruning enables our accelerator to exploit inter-filter parallelism, in which a per-layer processing block executes filters concurrently with a straightforward architecture. We also propose an overlapped tiling algorithm, in which tiles are extracted with overlap to prevent both accuracy degradation and high utilization of the BRAMs that store high-resolution images. Our evaluation on semantic-segmentation tasks showed a 1.8x speedup and an 18.0x improvement in power efficiency for our FPGA design compared with a desktop GPU. Additionally, compared with a conventional FPGA implementation, the speedup and accuracy improvement were 1.09x and 6.6 points, respectively. Therefore, our approach is well suited to FPGA implementation and achieves considerable accuracy for applications in embedded systems.
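The two techniques the abstract names can be illustrated with a minimal sketch. Note this is not the authors' implementation: `filter_wise_prune` and `overlapped_tiles` are hypothetical names, the magnitude-based criterion for choosing which weights to keep is an assumption (the paper only specifies that every filter retains the same number of nonzero weights), and the tile/overlap arithmetic is simplified.

```python
import numpy as np

def filter_wise_prune(weights, nonzeros_per_filter):
    """Keep the largest-magnitude weights in each filter and zero the rest,
    so every filter ends up with the same number of nonzero weights
    (magnitude criterion is an assumption, not from the paper)."""
    pruned = np.zeros_like(weights)
    for f in range(weights.shape[0]):          # one filter per output channel
        flat = weights[f].ravel()
        keep = np.argsort(np.abs(flat))[-nonzeros_per_filter:]
        mask = np.zeros_like(flat)
        mask[keep] = 1.0
        pruned[f] = (flat * mask).reshape(weights[f].shape)
    return pruned

def overlapped_tiles(image, tile, overlap):
    """Extract tiles with overlap, so each tile carries the halo pixels
    that convolutions near a tile border would otherwise lose.
    Edge tiles may be smaller than `tile` in this simplified version."""
    h, w = image.shape[:2]
    stride = tile - overlap
    tiles = []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles
```

With equal nonzero counts per filter, a hardware processing block can schedule one nonzero weight per filter per cycle across parallel units without the wide multiplexers that irregular sparsity requires; the overlap removes the tile-boundary accuracy loss of naive tiling.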




Updated: 2021-02-15