Memory-efficient deep learning inference with incremental weight loading and data layout reorganization on edge systems☆
Introduction
In the era of the Internet of Things (IoT), deploying deep neural networks (DNNs) on Cyber–Physical–Social Systems (CPSS) has drawn rapid interest. Owing to modern High-Performance Computing and Communications (HPCC) techniques, deep learning applications have brought tremendous changes to both industry and research [1], [2], [3]. In the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition, the deep neural network AlexNet [4] won the championship with a 17% error rate, and deep models have won the championship ever since. These solutions deepen the network layers and increase the number of network parameters to obtain higher generalization and accuracy. In 2015, the ResNet [5] model used cross-layer connections to partially solve the problem of gradient explosion or vanishing during training, further reducing the error rate to 3.57%. Meanwhile, these approaches increased the training complexity and hardware overhead.
In recent years, a number of studies [6], [7], [8] have been proposed to solve problems or optimize performance in Cyber–Physical Systems. For CPSS applications, with the development of modern high-performance hardware (such as the Graphics Processing Unit (GPU) and the Tensor Processing Unit (TPU)) and the availability of large-scale datasets, the accuracy of image classification and target recognition keeps improving. Although model network structures and operators are continually improved to reduce the cost of model training, model inference has gradually become one of the major bottlenecks in deep learning applications. Limited by network delay and privacy protection, models are preferably deployed independently on edge-end embedded devices. With the improvement of edge-device computing power, low-cost, general-purpose CPUs are capable of conducting the deep learning network inference stage [9]. However, the challenge of deploying a model on edge devices is its huge DRAM requirement.
The main memory overhead of a CNN model [10] involves two parts. (1) Deployment Memory: after CNN training, we need to deploy the model structure and load the model weights into memory during inference. For example, the weights of AlexNet consume more than 200 MB of DRAM space, which is a great burden for edge devices with limited resources. (2) Run-time Memory: during model inference, additional memory space is needed to store the feature maps generated by the network and the intermediate results of the parallel computation of the operators. For Deployment Memory, the network model structure and weights are often compressed for smaller memory consumption [11], [12]. On the other hand, Low-rank Decomposition [13] can be used to reduce Run-time Memory, but it does not take into account the memory management in the inference framework and thus cannot further optimize the memory allocation efficiency.
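As a rough illustration of how Deployment Memory arises, the weight footprint of a model can be estimated directly from its layer shapes; the shapes below are illustrative placeholders, not the exact configuration of any model discussed here:

```python
# Rough Deployment Memory estimate: parameter count x bytes per element.
# Layer shapes are illustrative, not a real model's exact configuration.

def weight_bytes(layer_shapes, dtype_bytes=4):
    """Sum the weight footprint of a list of weight-tensor shapes (float32 by default)."""
    total = 0
    for shape in layer_shapes:
        n = 1
        for dim in shape:
            n *= dim
        total += n * dtype_bytes
    return total

# A conv layer (out_ch, in_ch, kh, kw) and a fully connected layer (out, in).
layers = [(96, 3, 11, 11), (4096, 9216)]
print(f"{weight_bytes(layers) / (1024 * 1024):.1f} MB")
```

Even in this toy configuration the fully connected layer dominates the footprint, which is consistent with the observation that weight storage alone can strain an edge device's DRAM.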
Many works have been proposed to enhance main-memory management efficiency for deep learning at the system or framework layer. Peng et al. [14] explored memory swapping for GPU-based training by weighing the benefits of swapping against recalculation. Similarly, many studies [15], [16] focus on memory optimization by swapping data between GPU memory and host memory. Unfortunately, these works apply only to GPU-based training, while the memory usage of inference on edge devices is very different. Li [17] and Wu [18] enhanced the data layout to optimize the access performance of internal and external memory, but they paid little attention to the resource-constrained nature of edge devices, e.g., insufficient memory during the inference phase.
This paper aims to optimize the Deployment Memory and the Run-time Memory of model inference on edge-end embedded devices at the same time. First, it considers the data usage granularity of different layers and reuses the weight memory space in the time dimension. Second, it considers the data usage of tensors, reorganizes the layout of the runtime Blob and Workspace tensors, and reduces the total tensor memory consumption in the spatial dimension. Experimental results with a wide range of workloads show that, compared to a traditional inference framework, the proposed technique can reduce memory consumption by as much as 61.05% without any time overhead.
This paper makes the following contributions:
- Identified that main memory can be occupied by tensors uselessly for a long time during the inference phase.
- Proposed an incremental weight loading scheme that aggressively reuses allocated memory for reduced memory consumption.
- Proposed a runtime data layout reorganization scheme to avoid fragmentation into small, dispersed free memory spaces.
The rest of the paper is organized as follows. Section 2 introduces the background and the related work. Section 3 presents the motivation. Section 4 describes the design details of the proposed schemes. Section 5 presents the evaluation results with some discussions. Section 6 concludes this paper.
Section snippets
Deep learning in edge systems
Deep Learning Models. Compared with traditional image classification and object detection technologies, DNNs (Deep Neural Networks) have better robustness and accuracy. A DNN is composed of an input module, a hidden module, and an output module. The hidden module is a multi-layer network, where each layer performs one or several operators, such as pooling operations, convolution operations, etc. Image classification determines the category of an input picture through the DNN model. One
Motivation
To analyze the memory utilization of model inference, we conducted an experiment to study the memory behaviors of a set of popular models in the inference stage on a real edge device (detailed experiment setup can be found in Section 5).
Methodology
Edge servers on embedded devices rely heavily on deep learning inference. Our analysis revealed that the data dependency during inference execution is almost static, which suggests that the data to be accessed is highly predictable. For resource-constrained embedded devices [31], [32] at the edge side, the model needs to be serialized before the real inference phase. This study proposes a Memory-Efficient Deep-Learning Inference (MDI) approach. First, during the model inference
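Because the dataflow is static, the lifetime (first use, last use) of every runtime tensor is known before inference starts, so a memory layout can be planned offline. The first-fit sketch below illustrates this idea only; it is not the paper's exact reorganization algorithm:

```python
# Sketch of offline tensor layout planning. Since the dataflow is static,
# each tensor's lifetime (first_use, last_use) is known ahead of time, so
# tensors whose lifetimes do not overlap can share one region of a single
# pre-allocated arena. First-fit placement, illustrative only.

def plan_layout(tensors):
    """tensors: list of (first_use, last_use, size) tuples.
    Returns (per-tensor offsets, total arena size)."""
    placed = []   # (offset, size, first_use, last_use) of placed tensors
    offsets = []
    arena = 0
    for first, last, size in tensors:
        # Byte ranges busy during this tensor's lifetime (inclusive overlap).
        busy = sorted((o, o + s) for o, s, f, l in placed
                      if first <= l and f <= last)
        off = 0
        for b0, b1 in busy:
            if off + size <= b0:   # fits in the gap before this busy range
                break
            off = max(off, b1)     # otherwise skip past the busy range
        placed.append((off, size, first, last))
        offsets.append(off)
        arena = max(arena, off + size)
    return offsets, arena
```

For three tensors of sizes 100, 50, and 100 whose first and third lifetimes do not overlap, the planner places the third tensor back at offset 0, so the arena needs 150 units rather than 250, avoiding small dispersed free fragments.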
Experiment
Experiment Platform: our experiments are based on an embedded deep learning platform equipped with a Quad-Core ARM® Cortex®-A57 MPCore and 1 GB of 128-bit LPDDR4 memory. The proposed MDI approach involves two schemes, Incremental Weight Loading and Data Layout Reorganization, which are applied to the model weights and to the tensors (Blob, Workspace) generated at runtime, respectively. The two schemes are orthogonal and transparent to models. In the experiment, we use NCNN [21] as the baseline
Conclusion
This paper proposes Incremental Weight Loading and Data Layout Reorganization to improve the memory utilization of the model weights and the runtime data, respectively. The first method uses the layer as the granularity and reuses the weight memory area in the time dimension; the second method uses the tensor as the granularity, reorganizes the layout of the Blob and Workspace tensors generated at runtime, and reduces the total memory consumption of the tensors in the spatial dimension. Comprehensive use
Declaration of Competing Interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.sysarc.2021.102183.
Acknowledgments
This work was partially supported by the China Postdoctoral Science Foundation (No. 2020M671637), National Science Youth Fund of Jiangsu Province (No. BK20190224 and BK20200462), the Jiangsu Postdoctoral Science Foundation (No. 2019K224), and Ministry of Science and Technology, Taiwan (No. 109-2221-E-009-075).
References (35)
- et al., PHDFS: Optimizing I/O performance of HDFS in deep learning cloud computing platform, J. Syst. Archit. (2020)
- et al., HLC-PCP: A resource synchronization protocol for certifiable mixed criticality scheduling, J. Syst. Archit. (2016)
- et al., Pruning deep reinforcement learning for dual user experience and storage lifetime improvement on mobile devices, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2020)
- et al., A fast learning algorithm for deep belief nets, Neural Comput. (2006)
- A survey on modeling and improving reliability of DNN algorithms and accelerators, J. Syst. Archit. (2020)
- et al., Imagenet classification with deep convolutional neural networks
- K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference...
- et al., Cost and makespan-aware workflow scheduling in hybrid clouds, J. Syst. Archit. (2019)
- et al., Schedulability analysis and stack size minimization with preemption thresholds and mixed-criticality scheduling, J. Syst. Archit. (2017)
- et al., On-device neural net inference with mobile gpus (2019)
- Xnor-net: Imagenet classification using binary convolutional neural networks
- Learning both weights and connections for efficient neural network
- Exploiting linear structure within convolutional networks for efficient evaluation
- vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
☆ This paper is an extended version of [1].