Layer-specific Optimization for Mixed Data Flow with Mixed Precision in FPGA Design for CNN-based Object Detectors
arXiv - CS - Hardware Architecture. Pub Date: 2020-09-03, DOI: arxiv-2009.01588
Duy Thanh Nguyen, Hyun Kim, and Hyuk-Jae Lee

Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which lead to low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN frequently differ considerably, previous hardware designs have applied a common optimization scheme to all of them. This paper proposes a layer-specific design that employs different organizations optimized for the different layers. The proposed design applies two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize off-chip access while demanding minimal on-chip memory (BRAM) resources on an FPGA device. The mixed-precision quantization aims to achieve both lossless accuracy and aggressive model compression, thereby further reducing off-chip access. A Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between accuracy and compression. This mixing scheme allows the entire network model to be stored in the BRAMs of the FPGA, aggressively reducing off-chip access and thereby achieving a significant performance enhancement. The model size is reduced by 22.66-28.93 times compared with the full-precision network, with a negligible degradation of accuracy on the VOC, COCO, and ImageNet datasets. Furthermore, the combination of mixed data flow and mixed precision significantly outperforms previous works in terms of throughput, off-chip access, and on-chip memory requirement.
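
The abstract's Bayesian optimization step, selecting a per-layer sparsity that trades accuracy against compression under an on-chip memory budget, can be illustrated with a minimal sketch. The sketch below is not the paper's implementation: the layer sizes, BRAM budget, and accuracy proxy are hypothetical stand-ins, and it assumes scikit-optimize's gp_minimize as the Bayesian optimizer.

from skopt import gp_minimize
from skopt.space import Real

# Hypothetical setup: per-layer weight counts and an assumed on-chip (BRAM)
# budget; the paper's actual network and budget are not given in the abstract.
LAYER_WEIGHTS = [3456, 18432, 73728, 294912, 1179648, 589824]  # toy counts
WEIGHT_BITS = 8            # assumed quantized weight width
BRAM_BUDGET_KB = 2048      # assumed on-chip storage budget

def compressed_size_kb(sparsity):
    # Size of the pruned, quantized model: only non-zero weights are stored.
    kept = sum(w * (1.0 - s) for w, s in zip(LAYER_WEIGHTS, sparsity))
    return kept * WEIGHT_BITS / (8 * 1024)

def proxy_accuracy_loss(sparsity):
    # Stand-in for a real fine-tune-and-evaluate step: pruning more of a
    # layer is assumed to cost more accuracy. This is only a toy model.
    return sum(s ** 2 for s in sparsity) / len(sparsity)

def objective(sparsity):
    # Trade-off: accuracy loss plus a soft penalty when the compressed
    # model no longer fits in the assumed BRAM budget.
    overflow = max(0.0, compressed_size_kb(sparsity) - BRAM_BUDGET_KB)
    return proxy_accuracy_loss(sparsity) + 1e-3 * overflow

# One sparsity variable per layer, each searched over [0, 0.95].
space = [Real(0.0, 0.95, name=f"sparsity_l{i}")
         for i in range(len(LAYER_WEIGHTS))]
result = gp_minimize(objective, space, n_calls=40, random_state=0)
print("per-layer sparsity:", [round(s, 2) for s in result.x])
print("compressed size (KB):", round(compressed_size_kb(result.x), 1))

In the paper's flow, the toy accuracy proxy would presumably be replaced by an actual fine-tuning and evaluation run, and the selected per-layer sparsities then determine whether the entire compressed model fits in BRAM.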

Updated: 2020-09-04