Memory-efficient deep learning inference with incremental weight loading and data layout reorganization on edge systems

https://doi.org/10.1016/j.sysarc.2021.102183

Abstract

Pattern recognition applications such as face recognition and agricultural product detection have drawn rapid interest in Cyber–Physical–Social Systems (CPSS). These CPSS applications rely on deep neural networks (DNNs) to perform image classification. However, traditional DNN inference in the cloud can suffer from network delay fluctuations and privacy leakage. Consequently, real-time CPSS applications are preferably deployed on edge embedded devices. Constrained by the computing power and memory limitations of edge devices, improving memory management efficacy is the key to improving the quality of service of model inference. First, this study explores an incremental loading strategy for model weights during inference. Second, the memory space at runtime is optimized through data layout reorganization in the spatial dimension. In particular, the proposed schemes are orthogonal to existing models. Experimental results demonstrate that the proposed approach reduces memory consumption by 61.05% without additional inference time overhead.

Introduction

In the era of the Internet of Things (IoT), deploying deep neural networks (DNNs) on Cyber–Physical–Social Systems (CPSS) has drawn rapid interest. Owing to modern High-Performance-Computing-Communications (HPCC) techniques, deep learning applications have brought tremendous changes to the entire industry and research community [1], [2], [3]. In the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition, the deep neural network AlexNet [4] won the championship with a 17% error rate, and deeper networks won the championship in the following years. These solutions deepen the network layers and increase the number of network parameters to obtain higher generalization and accuracy. In 2015, the ResNet [5] model used cross-layer connections to partially solve the problem of exploding or vanishing gradients during training, further reducing the error rate to 3.57%. However, these approaches also increased training complexity and hardware overhead.

In recent years, a number of studies [6], [7], [8] have been proposed to solve problems or optimize performance in Cyber–Physical Systems. For CPSS applications, with the development of modern high-performance hardware (such as the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU)) and the availability of large-scale datasets, the accuracy of image classification and object recognition keeps improving. Although network structures and operators have been continuously improved to reduce the cost of model training, model inference has gradually become one of the major bottlenecks in deep learning applications. Limited by network delay and privacy protection, models are preferably deployed independently on edge embedded devices. As the computing power of edge devices improves, low-cost, general-purpose CPUs are capable of carrying out the deep learning inference stage [9]. However, a key challenge of deploying models on edge devices is their large DRAM requirement.

The main memory overhead of a CNN model [10] involves two parts: (1) Deployment Memory: after the CNN is trained, the model structure must be deployed and the model weights loaded into memory during inference. For example, the weights of AlexNet consume more than 200 MB of DRAM space, which is a great burden for edge devices with limited resources. (2) Run-time Memory: during model inference, additional memory space is needed to store the feature maps generated by the network and the intermediate results of the parallel computation of the operators. For Deployment Memory, the network model structure and weights are often compressed for smaller memory consumption [11], [12]. On the other hand, Low-rank Decomposition [13] can be used to reduce Run-time Memory, but it does not take into account the memory management of the inference framework and thus cannot further optimize memory allocation efficacy.
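As a rough illustration of the Deployment Memory figure above, the following back-of-the-envelope sketch estimates the weight footprint of an AlexNet-sized model; the ~61 million parameter count and 32-bit floating-point storage are our assumptions for illustration, not values taken from the paper.

```cpp
// Back-of-the-envelope estimate of Deployment Memory for an AlexNet-sized model.
// The ~61 M parameter count and FP32 (4-byte) weight storage are assumptions.
#include <cstdio>

int main() {
    const double params     = 61e6;  // approximate AlexNet parameter count (assumption)
    const double bytes_each = 4.0;   // 32-bit floating-point weights
    const double mib        = params * bytes_each / (1024.0 * 1024.0);
    std::printf("Estimated weight memory: %.0f MiB\n", mib);  // roughly 233 MiB
    return 0;
}
```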

Many works have been proposed to enhance main memory management efficacy for deep learning at the system or framework layer. Peng et al. [14] explored memory swapping for GPU-based training by weighing the benefits of swapping against recomputation. Similarly, many studies [15], [16] focus on memory optimization by swapping data between GPU memory and host memory. Unfortunately, these works only target GPU-based training, whereas the memory usage of inference on edge devices is very different. Li [17] and Wu [18] enhanced the data layout to optimize the access performance of internal and external memory, but they paid little attention to the resource constraints of edge devices, e.g., insufficient memory during the inference phase.

This paper aims to optimize both the Deployment Memory and the Run-time Memory of model inference on edge embedded devices. First, it considers the data usage granularity of different layers and reuses the weight memory space in the time dimension. Second, it considers the data usage of tensors, reorganizes the layout of the runtime Blob and Workspace tensors, and reduces the total tensor memory consumption in the spatial dimension. Experimental results with a wide range of workloads show that, compared to a traditional inference framework, the proposed technique reduces memory consumption by as much as 61.05% without any time overhead.
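To make the time-dimension reuse concrete, the sketch below loads weights layer by layer into one shared buffer sized for the largest layer, so that at most one layer's weights are resident at a time. It is a minimal illustration under assumed names (Layer, weight_bytes, run); it is not the actual MDI or NCNN interface.

```cpp
// Illustrative sketch of layer-granularity weight reuse in the time dimension.
// Layer, weight_bytes, and run() are hypothetical names, not the paper's API.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Layer {
    std::size_t weight_bytes;                      // size of this layer's weights on disk
    void run(const unsigned char* /*weights*/) {}  // placeholder forward pass
};

int main() {
    std::vector<Layer> net = {{200u << 10}, {1u << 20}, {4u << 20}};  // toy 3-layer model

    // Size one shared buffer for the largest layer instead of keeping every
    // layer's weights resident at the same time.
    std::size_t max_bytes = 0;
    for (const Layer& l : net) max_bytes = std::max(max_bytes, l.weight_bytes);
    std::vector<unsigned char> weight_buf(max_bytes);

    for (Layer& l : net) {
        // A real implementation would stream this layer's weights from storage
        // into weight_buf just before the layer executes, overwriting the
        // previous layer's weights once they are no longer needed.
        l.run(weight_buf.data());
    }
    std::printf("Peak weight memory: %zu bytes\n", weight_buf.size());
    return 0;
}
```

In this toy case the peak weight memory equals the largest single layer (4 MiB) rather than the sum of all layers, which is the intuition behind reusing the weight area over time.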

This paper makes the following contributions:

  • We identify that tensors can needlessly occupy main memory for long periods during the inference phase.

  • We propose an incremental weight loading scheme that aggressively reuses allocated memory to reduce memory consumption.

  • We propose a runtime data layout reorganization scheme that avoids fragmentation into small, dispersed free memory regions.

The rest of the paper is organized as follows. Section 2 introduces the background and the related work. Section 3 presents the motivation. Section 4 describes the design details of the proposed schemes. Section 5 presents the evaluation results with some discussions. Section 6 concludes this paper.

Section snippets

Deep learning in edge systems

Deep Learning Models. Compared with traditional image classification and object detection technologies, DNNs (Deep Neural Networks) have better robustness and accuracy. A DNN is composed of Input, Hidden, and Output modules. The Hidden module is a multi-layer network, where each layer performs one or several operators, such as pooling operations, convolution operations, etc. Image classification determines the category of the input picture through the DNN model. One

Motivation

To analyze the memory utilization of model inference, we conducted an experiment to study the memory behaviors of a set of popular models in the inference stage on a real edge device (detailed experiment setup can be found in Section 5).

Methodology

Edge servers on embedded devices heavily rely on deep learning inference. Our analysis revealed that the data dependency during inference execution is almost static, which suggests that the data to be accessed is highly predictable. For resource-constrained embedded devices [31], [32] at the edge, the model needs to be serialized before the actual inference phase. This study proposes a Memory-Efficient Deep-Learning Inference (MDI) approach. First, during the model inference
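Because the data dependencies are almost static, tensor lifetimes can be derived before execution, and tensors whose lifetimes do not overlap can share the same physical buffer. The sketch below is our minimal illustration of that spatial-dimension idea under assumed names (Tensor, first_use, last_use); it is not the paper's exact reorganization algorithm.

```cpp
// Minimal sketch of lifetime-based tensor buffer reuse: two tensors whose
// lifetimes do not overlap may occupy the same buffer. Names are hypothetical.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Tensor {
    int first_use;       // index of the first layer that produces/reads the tensor
    int last_use;        // index of the last layer that needs the tensor
    std::size_t bytes;   // tensor size
};

int main() {
    // Toy schedule: lifetimes are known statically from the inference graph.
    std::vector<Tensor> tensors = {{0, 2, 1u << 20}, {1, 3, 2u << 20}, {3, 5, 1u << 20}};

    std::vector<Tensor> resident;  // one entry per physical buffer, holding its current tensor
    std::size_t peak = 0, naive = 0;
    for (const Tensor& t : tensors) naive += t.bytes;

    for (const Tensor& t : tensors) {
        bool reused = false;
        for (Tensor& slot : resident) {
            // Reuse a buffer whose current tensor dies before this one is born
            // and which is large enough to hold the new tensor.
            if (slot.last_use < t.first_use && slot.bytes >= t.bytes) {
                slot.first_use = t.first_use;  // the buffer now holds the new tensor
                slot.last_use  = t.last_use;
                reused = true;
                break;
            }
        }
        if (!reused) { resident.push_back(t); peak += t.bytes; }
    }
    std::printf("Peak tensor memory: %zu bytes (naive sum: %zu bytes)\n", peak, naive);
    return 0;
}
```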

Experiment

Experiment Platform: Our experiments are based on an embedded deep learning platform equipped with a quad-core ARM® Cortex®-A57 MPCore and 1 GB of 128-bit LPDDR4 memory. The proposed MDI approach involves two schemes, Incremental Weight Loading and Data Layout Reorganization, which are applied to the model weights and to the tensors (Blob, Workspace) generated at runtime, respectively. The two schemes are orthogonal and transparent to models. In the experiment, we use NCNN [21] as the baseline

Conclusion

This paper proposes Incremental Weight Loading and Data Layout Reorganization to improve the memory utilization of the model weights and the runtime data, respectively. The first method uses the layer as its granularity and reuses the weight memory area in the time dimension; the second uses tensors as its granularity, reorganizes the layout of the Blob and Workspace tensors generated at runtime, and reduces the total tensor memory consumption in the spatial dimension. Comprehensive use

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.sysarc.2021.102183.

Acknowledgments

This work was partially supported by the China Postdoctoral Science Foundation (No. 2020M671637), National Science Youth Fund of Jiangsu Province (No. BK20190224 and BK20200462), the Jiangsu Postdoctoral Science Foundation (No. 2019K224), and Ministry of Science and Technology, Taiwan (No. 109-2221-E-009-075).

References (35)

  • Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, C. Zhang, Learning efficient convolutional networks through network slimming, ...

  • M. Rastegari et al., XNOR-Net: ImageNet classification using binary convolutional neural networks

  • S. Han et al., Learning both weights and connections for efficient neural network

  • E.L. Denton et al., Exploiting linear structure within convolutional networks for efficient evaluation

  • X. Peng, X. Shi, H. Dai, H. Jin, W. Ma, Q. Xiong, F. Yang, X. Qian, Capuchin: Tensor-based GPU memory management for ...

  • M. Rhu et al., vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design

  • L. Wang, J. Ye, Y. Zhao, W. Wu, A. Li, S.L. Song, Z. Xu, T. Kraska, Superneurons: Dynamic GPU memory management for ...

    This paper is an extended version of [1].
