Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge
arXiv - CS - Hardware Architecture. Pub Date: 2021-08-25, DOI: arxiv-2108.11441
Vinod Ganesan, Pratyush Kumar

Massively parallel systolic arrays and resource-efficient depthwise separable convolutions are two promising techniques for accelerating DNN inference on the edge. Interestingly, their combination is inefficient: the computational patterns of depthwise separable convolutions do not exhibit a rhythmic systolic flow and lack sufficient data reuse to saturate systolic arrays. We formally analyse this inefficiency and propose an efficient operator, an optimal hardware dataflow, and a superior training methodology to alleviate it. The efficient operator, called FuSeConv, is a drop-in replacement for depthwise separable convolutions. FuSeConv fully factorizes convolutions along their spatial and depth dimensions, and the resulting computation maps efficiently to systolic arrays. The optimal dataflow, called Spatial-Tiled Output Stationary (ST-OS), maximizes the efficiency of FuSeConv on systolic arrays: it maps independent convolutions to rows of the array, maximizing resource utilization with negligible VLSI overheads. Neural Operator Scaffolding (NOS) scaffolds the training of FuSeConv by distilling knowledge from the expensive depthwise separable convolutions, bridging the accuracy gap between FuSeConv networks and their baselines. Additionally, NOS can be combined with Neural Architecture Search (NAS) to trade off latency and accuracy. The HW/SW co-design of FuSeConv with ST-OS achieves a significant speedup of 4.1-9.25X with state-of-the-art efficient networks for ImageNet. The parameter efficiency of FuSeConv and its significant outperformance of depthwise separable convolutions on systolic arrays illustrate its promise as a strong solution on the edge. Training FuSeConv networks with NOS achieves accuracy comparable to the baselines. Further, by combining NOS with NAS, we design networks that define the state of the art, improving on both accuracy and latency on systolic arrays.
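
The abstract gives no code, but the operator is easy to picture. Below is a minimal PyTorch sketch of one plausible reading of "fully factorizing along the spatial and depth dimensions": the KxK depthwise convolution is replaced by parallel 1-D depthwise convolutions (1xK over half of the channels, Kx1 over the rest) ahead of the usual 1x1 pointwise convolution. The class name, the half/half channel split, and the BatchNorm/ReLU6 placement are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class FuSeConv(nn.Module):
    """Illustrative fully separable convolution block (assumption, see text).

    A KxK depthwise convolution is replaced by parallel 1-D depthwise
    convolutions: 1xK over half of the channels, Kx1 over the rest.
    Every 1-D convolution is independent, which is the property that
    lets rows of a systolic array work on different channels in parallel.
    """

    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.half = in_ch // 2
        rest = in_ch - self.half
        pad = k // 2
        # Row-wise 1xK depthwise convolution on the first half of the channels.
        self.row = nn.Conv2d(self.half, self.half, (1, k), stride=stride,
                             padding=(0, pad), groups=self.half, bias=False)
        # Column-wise Kx1 depthwise convolution on the remaining channels.
        self.col = nn.Conv2d(rest, rest, (k, 1), stride=stride,
                             padding=(pad, 0), groups=rest, bias=False)
        self.bn = nn.BatchNorm2d(in_ch)
        self.act = nn.ReLU6(inplace=True)
        # Standard 1x1 pointwise convolution, as in depthwise separable blocks.
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        a, b = x.split([self.half, x.size(1) - self.half], dim=1)
        y = torch.cat([self.row(a), self.col(b)], dim=1)
        return self.pw(self.act(self.bn(y)))

# Drop-in usage: same input/output shapes as a depthwise separable block.
x = torch.randn(1, 32, 56, 56)
print(FuSeConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Each 1-D depthwise convolution here is independent per channel and per image row (or column), which is exactly the property ST-OS exploits: independent convolutions can be tiled across the rows of a systolic array.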

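To make the ST-OS claim concrete, here is a toy cost model, not the paper's dataflow specification: assume each independent 1-D convolution is assigned to one row of an output-stationary systolic array, each of the `cols` PEs in a row holds one output element stationary, and a tile finishes after k multiply-accumulate steps. The function name and the cycle accounting are assumptions for illustration only.

```python
import math

def st_os_cycles(num_convs, out_len, k, rows, cols):
    """Toy cycle count for an ST-OS-style mapping (assumption, see text).

    Each of the `rows` array rows processes one independent 1-D
    convolution; the `cols` PEs in a row each hold one output element
    stationary and accumulate k multiply-adds for it. Tiles of
    `rows` convolutions x `cols` outputs execute back to back.
    """
    conv_tiles = math.ceil(num_convs / rows)   # tiles along the row dimension
    out_tiles = math.ceil(out_len / cols)      # tiles along the column dimension
    return conv_tiles * out_tiles * k          # k accumulation steps per tile

# A 56x56 feature map with 64 channels, fully factorized into 1xK
# depthwise convolutions, yields 64 * 56 independent length-56 convolutions.
print(st_os_cycles(num_convs=64 * 56, out_len=56, k=3, rows=64, cols=64))  # 168
```

Under this model, utilization stays high whenever the number of independent convolutions is a large multiple of the row count, which a fully factorized operator provides by construction.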

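NOS is described only at a high level here; one plausible instantiation is standard knowledge distillation, with the depthwise-separable baseline as teacher and the FuSeConv network as student. The sketch below uses temperature-scaled logit distillation; the actual method may instead (or additionally) match intermediate operator outputs. The function name and the hyperparameters `alpha` and `T` are illustrative.

```python
import torch
import torch.nn.functional as F

def nos_step(student, teacher, x, y, optimizer, alpha=0.5, T=4.0):
    """One scaffolded training step (illustrative reading of NOS).

    The depthwise-separable baseline acts as teacher; the FuSeConv
    network is the student. `alpha` balances the distillation and
    task losses; `T` is the softmax temperature. Both are assumptions.
    """
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(x)                  # teacher predictions, frozen
    s_logits = student(x)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(s_logits, y)          # standard supervised loss
    loss = alpha * kd + (1.0 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
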
Updated: 2021-08-27