Memory optimization at Edge for Distributed Convolution Neural Network, Transactions on Emerging Telecommunications Technologies



Memory optimization at Edge for Distributed Convolution Neural Network
Transactions on Emerging Telecommunications Technologies (IF 3.6), Pub Date: 2022-09-15, DOI: 10.1002/ett.4648
Soumyalatha Naveen, Manjunath R. Kounte


Internet of Things (IoT) edge intelligence has emerged from optimizing deep learning (DL) models so that they can be deployed on resource-constrained devices for quick decision-making. Edge intelligence also reduces network overload and latency by bringing intelligent analytics closer to the data source. DL models, however, need substantial computing resources: their high computational workload and memory footprint make them impractical to deploy and execute on IoT edge devices with limited capabilities. In addition, existing layer-based partitioning methods generate many intermediate results, resulting in a huge memory footprint. In this article, we propose a framework that enables the deployment of convolutional neural networks (CNNs) onto distributed IoT devices for faster inference and a reduced memory footprint. The framework starts from a pretrained YOLOv2 model and applies weight pruning to reduce the number of non-contributing parameters. We use the fused layer partitioning method to vertically partition the fused layers of the CNN and then distribute the partitions among the edge devices to process the input. In our experiments, multiple Raspberry Pi boards serve as edge devices, and a Raspberry Pi with a neural computing stick acts as the gateway device that combines the results from the edge devices to produce the final output. Our proposed model achieved an inference latency of 5 to 7 seconds for […] to […] fused layer partitioning on five devices, with a 9% improvement in memory footprint.
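The fused layer partitioning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy stack of single-channel 3x3, stride-1, zero-padded convolutions with ReLU (a real YOLOv2 backbone also has many channels and pooling layers), and every name here (`conv3x3_same`, `run_fused`, `tile_bounds`, `fused_tile_inference`) and the grid size used are hypothetical. The sketch splits the output map into a grid of tiles, traces each tile back to the input region it depends on, and runs the whole fused stack on that region alone, so each tile could in principle be handed to a different device.

```python
import numpy as np

def conv3x3_same(x, w):
    """3x3 cross-correlation on a 2-D map, stride 1, zero padding of 1."""
    h, wid = x.shape
    xp = np.pad(x, 1)  # constant zero padding on every side
    out = np.zeros((h, wid), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += w[dy, dx] * xp[dy:dy + h, dx:dx + wid]
    return out

def run_fused(x, weights):
    """Run the fused stack (conv + ReLU per layer) on one region."""
    for w in weights:
        x = np.maximum(conv3x3_same(x, w), 0.0)
    return x

def tile_bounds(size, parts):
    """Split the range [0, size) into `parts` contiguous intervals."""
    edges = np.linspace(0, size, parts + 1).astype(int)
    return list(zip(edges[:-1], edges[1:]))

def fused_tile_inference(x, weights, grid=(3, 3)):
    """Compute the fused stack tile by tile (illustrative grid size).

    Returns the stitched output and the largest per-tile working map,
    measured in elements, as a rough proxy for peak activation memory.
    """
    h, w = x.shape
    halo = len(weights)  # each 3x3 layer needs 1 extra input pixel per side
    out = np.zeros((h, w))
    peak = 0
    for r0, r1 in tile_bounds(h, grid[0]):
        for c0, c1 in tile_bounds(w, grid[1]):
            # Input region this output tile depends on, clipped to the image.
            ir0, ir1 = max(r0 - halo, 0), min(r1 + halo, h)
            ic0, ic1 = max(c0 - halo, 0), min(c1 + halo, w)
            region = run_fused(x[ir0:ir1, ic0:ic1], weights)
            # Keep only the tile itself; the halo ring holds invalid values.
            out[r0:r1, c0:c1] = region[r0 - ir0:r1 - ir0, c0 - ic0:c1 - ic0]
            peak = max(peak, region.size)
    return out, peak

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
layers = [rng.standard_normal((3, 3)) for _ in range(3)]
full = run_fused(image, layers)
tiled, peak = fused_tile_inference(image, layers, grid=(3, 3))
print("outputs match:", np.allclose(full, tiled))
print("full map elements:", image.size, "largest tile elements:", peak)
```

Because each tile carries its own halo, no intermediate feature maps need to be exchanged between tiles, at the cost of recomputing the overlapping borders. The largest per-tile map is smaller than the full map, which is the intuition behind the memory saving; the actual figure reported in the abstract comes from the authors' experiments, not from this sketch.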


Updated: 2022-09-15
