Memory-Efficient Dataflow Inference for Deep CNNs on FPGA
arXiv - CS - Hardware Architecture | Pub Date: 2020-11-14 | DOI: arxiv-2011.07317
Lucian Petrica, Tobias Alonso, Mairin Kroes, Nicholas Fraser, Sorin Cotofana, Michaela Blott

Custom dataflow Convolutional Neural Network (CNN) inference accelerators on FPGA are tailored to a specific CNN topology and store parameters in On-Chip Memory (OCM), resulting in high energy efficiency and low inference latency. However, in these accelerators, the shapes of the parameter memories are dictated by throughput constraints and do not map well to the underlying OCM, which becomes an implementation bottleneck. In this work, we propose an accelerator design methodology, Frequency Compensated Memory Packing (FCMP), which improves the OCM utilization efficiency of dataflow accelerators with minimal reduction in throughput and no modifications to the physical structure of FPGA OCM. To validate our methodology, we apply it to several realizations of medium-sized CIFAR-10 inference accelerators and demonstrate up to a 30% reduction in OCM utilization without loss of inference throughput, allowing us to port the accelerators from the Xilinx Zynq 7020 to the 7012S and thereby reduce application cost. We also implement a custom dataflow FPGA inference accelerator for a quantized ResNet-50 CNN with on-chip weights, the largest topology ever implemented with this accelerator architecture. We demonstrate that applying FCMP to the ResNet accelerator alleviates the OCM bottleneck, enabling the accelerator to be ported from the Alveo U250 to the smaller Alveo U280 board with less throughput loss than alternative techniques. By providing a finer-grained trade-off between throughput and OCM requirements, FCMP increases the flexibility of custom dataflow CNN inference designs on FPGA.
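The abstract gives the intuition behind FCMP without the mechanics: throughput constraints fix each layer's weight buffer to a shape that wastes part of every BRAM it occupies, and packing several logical memories into one physical BRAM, compensated by a faster memory clock, recovers that waste. The Python sketch below illustrates that accounting only; the BRAM18 geometry, the greedy width-matched grouping, and the fixed pack factor are assumptions for illustration, not the paper's actual packing algorithm.

import math

BRAM_DEPTH = 1024   # one common BRAM18 aspect ratio: 1024 x 18 bits
BRAM_WIDTH = 18

def brams_needed(depth, width):
    # Naive mapping: each logical parameter memory gets its own
    # grid of BRAM18s, rounded up in both dimensions.
    return math.ceil(depth / BRAM_DEPTH) * math.ceil(width / BRAM_WIDTH)

def brams_packed(memories, pack_factor=2):
    # Hypothetical FCMP-style packing: up to pack_factor logical
    # memories of equal width share one BRAM grid, stacked in depth.
    # The shared memory is then clocked pack_factor times faster than
    # the compute, so every consumer keeps its full read bandwidth --
    # the "frequency compensation" of the name.
    by_width = {}
    for depth, width in memories:
        by_width.setdefault(width, []).append(depth)
    total = 0
    for width, depths in by_width.items():
        depths.sort(reverse=True)
        for i in range(0, len(depths), pack_factor):
            total += brams_needed(sum(depths[i:i + pack_factor]), width)
    return total

# Per-layer weight buffers (depth, width) whose shapes are dictated
# by the layer's folding (throughput), not by BRAM geometry.
layers = [(300, 16), (600, 16), (200, 32), (450, 32)]
print("naive: ", sum(brams_needed(d, w) for d, w in layers))  # 6 BRAM18s
print("packed:", brams_packed(layers))                        # 3 BRAM18s

In this toy accounting, the pack factor is presumably what exposes the finer-grained trade-off the abstract refers to: a higher factor saves more BRAMs but demands a proportionally faster memory clock, and where that clock cannot be met, some throughput is given up instead.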

Updated: 2020-11-17