ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data
arXiv - CS - Hardware Architecture. Pub Date: 2021-06-15, DOI: arxiv-2106.08167
Duy Thanh Nguyen, Hyeonseung Je, Tuan Nghia Nguyen, Soojung Ryu, Kyujung Lee, Hyuk-Jae Lee

Residual blocks are a common component of recent state-of-the-art CNNs such as EfficientNet and EfficientDet. Shortcut data accounts for nearly 40% of feature-map accesses in ResNet152 [8], yet most previous DNN compilers and accelerators ignore shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerators that performs reuse-aware static memory allocation for shortcut data to maximize on-chip data reuse under given resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for groups of nodes, applying optimized data reuse to each residual block. The accelerator design, implemented on a Xilinx KCU1500 FPGA card, significantly outperforms the NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for EfficientNet inference: compared to the RTX 2080 Ti, it is 1.35-2.33x faster and 6.7-7.9x more power efficient. Compared to a baseline in which weights, inputs, and outputs are accessed from off-chip memory exactly once per layer, ShortcutFusion reduces DRAM accesses by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Given a buffer size similar to that of ShortcutMining [8], which also mines shortcut data in hardware, the proposed work reduces off-chip feature-map accesses by 5.27x while reading weights from off-chip memory exactly once.
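
The mechanism behind the DRAM savings can be illustrated with simple traffic counting. Below is a minimal Python sketch (not the authors' tool; the feature-map size and block structure are hypothetical) that tallies per-block feature-map DRAM transfers with and without keeping the shortcut tensor resident in the on-chip buffer:

def residual_block_fmap_dram_bytes(fmap_bytes, shortcut_on_chip):
    """Feature-map DRAM traffic for one residual block:
    two conv layers followed by an elementwise add with the shortcut."""
    traffic = fmap_bytes        # conv1 reads the block input from DRAM
    traffic += fmap_bytes       # conv1 writes its output
    traffic += fmap_bytes       # conv2 reads conv1's output
    traffic += fmap_bytes       # conv2 writes the pre-add result
    if not shortcut_on_chip:
        traffic += fmap_bytes   # re-read the shortcut (block input) from DRAM
    traffic += fmap_bytes       # write the block output
    return traffic

fmap = 64 * 56 * 56 * 2  # hypothetical 64x56x56 FP16 feature map, in bytes
baseline = residual_block_fmap_dram_bytes(fmap, shortcut_on_chip=False)
fused = residual_block_fmap_dram_bytes(fmap, shortcut_on_chip=True)
print(f"feature-map DRAM traffic saved: {100 * (baseline - fused) / baseline:.1f}%")

In this toy model the shortcut re-read is one of six feature-map transfers per block; in a real network the shortcut's share grows once convolution layers are fused so their intermediates never leave the chip, which is consistent with the nearly-40% figure reported above for ResNet152.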
