Automated CNN back-propagation pipeline generation for FPGA online training
Journal of Real-Time Image Processing (IF 3) | Pub Date: 2021-07-23 | DOI: 10.1007/s11554-021-01147-2
A. Mazouz, C. P. Bridges

Training convolutional neural networks (CNNs) on embedded platforms to support on-device learning has become essential for the future deployment of CNNs on autonomous systems. In this work, we present an automated CNN training pipeline compilation tool for Xilinx FPGAs. We automatically generate multiple hardware designs from high-level CNN descriptions using a multi-objective optimization algorithm that explores the design space by exploiting CNN parallelism. These designs, which trade off resources for throughput, allow users to tailor implementations to their hardware and applications. The training pipeline is generated from the backpropagation (BP) equations of convolution, which reveal an overlap in computation with the forward pass (FP). We translate this overlap into hardware by reusing most of the FP pipeline, reducing resource overhead. The implementation uses a streaming interface that lends itself well to data streams and live feeds rather than static data reads from memory. That is, instead of the standard array of processing elements (PEs), which is efficient for offline inference, we translate the architecture into a pipeline through which data is streamed, allowing new samples to be read as they become available. We validate the results on the Zynq-7100 using three datasets and architectures of varying size, against CPU and GPU implementations. GPUs consistently outperform FPGAs in training time in batch-processing scenarios, but in data-stream scenarios the FPGA designs achieve a significant speedup over both GPU and CPU when enough resources are dedicated to the learning task. Speedups of 2.8×, 5.8×, and 3× over the GPU were achieved on architectures trained on MNIST, SVHN, and CIFAR-10, respectively.
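The FP/BP overlap that the pipeline exploits follows directly from the backpropagation equations of convolution: the weight gradient is itself a valid cross-correlation of the layer input with the upstream error map, so it can run through the same sliding-window datapath as the forward pass. A minimal NumPy sketch of this identity (illustrative only; the function names are ours, not the authors' tool) is:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation: the forward-pass (FP) datapath."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))   # input feature map
w = rng.standard_normal((3, 3))   # kernel
y = conv2d_valid(x, w)            # forward pass -> 4x4 output

# Backward pass: dy stands in for the error dL/dy from the next layer.
dy = rng.standard_normal(y.shape)

# The weight gradient dL/dw is *itself* a valid cross-correlation of the
# input with the error map -- the same sliding-window computation as FP,
# which is why the FP pipeline can be reused in hardware:
dw = conv2d_valid(x, dy)

# Cross-check against a direct accumulation of the chain rule:
# dL/dw[a, b] = sum_{i, j} dy[i, j] * x[i + a, j + b]
dw_ref = np.zeros_like(w)
for i in range(dy.shape[0]):
    for j in range(dy.shape[1]):
        dw_ref += dy[i, j] * x[i:i + 3, j:j + 3]

assert np.allclose(dw, dw_ref)
```

In a streaming FPGA implementation, this shared structure means the BP stage can reuse the FP line buffers and multiply-accumulate array, swapping only which operands (weights vs. error values) are fed to the window.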


