O3BNN-R: An Out-Of-Order Architecture for High-Performance and Regularized BNN Inference, IEEE Transactions on Parallel and Distributed Systems


O3BNN-R: An Out-Of-Order Architecture for High-Performance and Regularized BNN Inference
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2021-01-01, DOI: 10.1109/tpds.2020.3013637
Tong Geng, Ang Li, Tianqi Wang, Chunshu Wu, Yanfei Li, Runbin Shi, Wei Wu, Martin Herbordt

Binarized Neural Networks (BNNs), which significantly reduce computational complexity and memory demand, have shown potential in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is sufficient and real-time operation is highly desired. In this article, we demonstrate that the highly condensed BNN model can be shrunk significantly by dynamically pruning irregular redundant edges. Based on two new observations on BNN-specific properties, we propose an out-of-order (OoO) architecture, O3BNN-R, which curtails edge evaluation in cases where the binary output of a neuron can be determined early at runtime during inference. As with instruction-level parallelism (ILP), such fine-grained, irregular, runtime pruning opportunities are traditionally presumed to be difficult to exploit. To further enhance the pruning opportunities, we take an algorithm/architecture co-design approach in which we augment the loss function during training with specialized regularization terms that favor edge pruning. We evaluate our design on an embedded FPGA using networks that include VGG-16, AlexNet for ImageNet, and a VGG-like network for CIFAR-10. Results show that O3BNN-R without regularization can prune, on average, 30 percent of the operations without any accuracy loss, bringing a 2.2× inference speedup and, on average, a 34× energy-efficiency improvement over state-of-the-art BNN implementations on FPGA/GPU/CPU. With regularization at training, the performance is further improved, on average, by 15 percent.
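The idea of determining a neuron's binary output early can be illustrated in software. The sketch below is not the paper's hardware design, and the abstract does not spell out the two observations it relies on. It assumes one common BNN formulation: a neuron outputs 1 when the count of matching input/weight bits (XNOR followed by popcount) reaches a threshold. Under that assumption, evaluation can stop as soon as the threshold has been reached, or as soon as it can no longer be reached with the edges that remain. The function name, the bit encoding, and the threshold rule are all illustrative assumptions.

```python
from typing import Sequence, Tuple


def binary_neuron_early_exit(
    inputs: Sequence[int], weights: Sequence[int], threshold: int
) -> Tuple[int, int]:
    """Return (output_bit, edges_evaluated) for one binarized neuron.

    Hypothetical model, assumed for illustration only: inputs and weights
    are bits (0 or 1), an edge "matches" when the two bits are equal (XNOR),
    and the neuron outputs 1 when the match count is >= threshold.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")

    n = len(inputs)
    matches = 0
    for evaluated in range(n + 1):
        remaining = n - evaluated
        # Output is already known to be 1: the threshold has been reached.
        if matches >= threshold:
            return 1, evaluated
        # Output is already known to be 0: even if every remaining edge
        # matched, the threshold could not be reached.
        if matches + remaining < threshold:
            return 0, evaluated
        # Otherwise evaluate the next edge (XNOR of input and weight bit).
        matches += int(inputs[evaluated] == weights[evaluated])

    # Unreachable: when remaining == 0 one of the two checks above fires.
    raise AssertionError("early-exit checks must resolve the output")


if __name__ == "__main__":
    x = [1, 0, 1, 1, 0, 1, 0, 0]
    w = [1, 0, 1, 0, 0, 1, 1, 0]
    print(binary_neuron_early_exit(x, w, threshold=3))
    print(binary_neuron_early_exit(x, w, threshold=7))
```

In this toy model the skipped edges are the ones that no longer affect the output bit, which is why pruning them cannot change the result. How many edges can be skipped depends on the data and the threshold, so the savings are irregular and known only at runtime.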


Updated: 2021-01-01
