DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs, IEEE Transactions on Computers



IEEE Transactions on Computers (IF 3.7), Pub Date: 2021-03-18, DOI: 10.1109/tc.2021.3066883
Alessio Burrello, Angelo Garofalo, Nazareno Bruschi, Giuseppe Tagliavini, Davide Rossi, Francesco Conti

The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads to reduce area overheads and increase energy efficiency, which requires explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY), an automatic tool to deploy DNNs on low-cost MCUs with typically less than 1 MB of on-chip SRAM. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low-power MCU-class devices on the market. On this device, DORY achieves up to 2.5× better MAC/cycle than the GreenWaves proprietary software solution and 18.1× better than the state-of-the-art result on an STM32-H743 MCU on single layers. Using our tool, GAP8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average at 4.3 fps, which is 15.4× better than an STM32-H743. We release all our developments (the DORY framework, the optimized backend kernels, and the related heuristics) as open-source software.
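To make the tiling idea in the abstract concrete, the sketch below picks a tile size for a single convolution layer so that L1 usage is as large as possible without exceeding a budget, with every buffer counted twice for double-buffering. This is an illustration only and not DORY's actual formulation: it swaps the Constraint Programming solver for a plain exhaustive search, it assumes 8-bit data, a square k × k kernel, pre-padded inputs, and untiled input channels, and the names `tile_bytes` and `best_tile` as well as the layer shape and the 64 kB budget in the usage line are hypothetical.

```python
from itertools import product


def tile_bytes(c_in, c_out, h_out, w_out, k=3, stride=1):
    """Bytes needed in L1 for one tile of a k x k convolution (1 byte/element)."""
    # Input rows and columns required to produce the output tile, halo included.
    h_in = (h_out - 1) * stride + k
    w_in = (w_out - 1) * stride + k
    inputs = c_in * h_in * w_in
    outputs = c_out * h_out * w_out
    weights = c_out * c_in * k * k
    return inputs + outputs + weights


def best_tile(c_in, c_out, h, w, l1_budget, k=3, stride=1):
    """Exhaustively pick the output tile that maximizes L1 use within the budget.

    h and w are the output height and width of the layer.
    Returns (bytes_used, tile_c_out, tile_h, tile_w), or None if nothing fits.
    """
    best = None
    candidates = product(range(1, c_out + 1), range(1, h + 1), range(1, w + 1))
    for t_cout, t_h, t_w in candidates:
        # Double-buffering keeps two copies of every tile buffer resident in L1.
        used = 2 * tile_bytes(c_in, t_cout, t_h, t_w, k, stride)
        if used <= l1_budget and (best is None or used > best[0]):
            best = (used, t_cout, t_h, t_w)
    return best


if __name__ == "__main__":
    # Hypothetical layer: 32 input channels, 64 output channels, 32 x 32 output.
    print(best_tile(c_in=32, c_out=64, h=32, w=32, l1_budget=64 * 1024))
```

A pure memory-utilization objective like this one can select tile shapes that are awkward for the compute kernels, which is the gap the abstract says DORY addresses by adding heuristics that favor performance-effective tile sizes.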


Updated: 2021-03-18
