Memory-efficient deep learning inference with incremental weight loading and data layout reorganization on edge systems

https://doi.org/10.1016/j.sysarc.2021.102183

Abstract

Pattern recognition applications such as face recognition and agricultural product detection have drawn rapid interest in Cyber–Physical–Social Systems (CPSS). These CPSS applications rely on deep neural networks (DNNs) to perform image classification. However, traditional DNN inference in the cloud can suffer from network delay fluctuations and privacy leakage. Consequently, real-time CPSS applications are preferably deployed on edge embedded devices. Constrained by the computing power and memory limitations of edge devices, improving memory management efficacy is the key to improving the quality of service of model inference. First, this study explores an incremental loading strategy for model weights during inference. Second, the memory space at runtime is optimized through data layout reorganization in the spatial dimension. In particular, the proposed schemes are orthogonal to existing models. Experimental results demonstrate that the proposed approach reduces memory consumption by 61.05% without additional inference time overhead.

Introduction

In the era of the Internet of Things (IoT), deploying deep neural networks (DNNs) on Cyber–Physical–Social Systems (CPSS) has drawn rapid interest. Owing to modern High-Performance-Computing-Communications (HPCC) techniques, deep learning applications have brought tremendous changes to the entire industry and research community [1], [2], [3]. In the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition, the deep neural network AlexNet [4] won the championship with a 17% error rate, and deeper networks won the championship in the following years. These solutions deepen the network layers and increase the number of network parameters to obtain higher generalization and accuracy. In 2015, the ResNet [5] model used cross-layer connections to partially solve the problem of exploding or vanishing gradients during training, further reducing the error rate to 3.57%. However, these approaches also increased training complexity and hardware overhead.

In recent years, a number of studies [6], [7], [8] have been proposed to solve problems or optimize performance in Cyber–Physical Systems. For CPSS applications, with the development of modern high-performance hardware (such as the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU)) and the availability of large-scale datasets, the accuracy of image classification and object recognition keeps improving. Although network structures and operators have been continuously improved to reduce the cost of model training, model inference has gradually become one of the major bottlenecks in deep learning applications. Limited by network delay and privacy protection, models are preferably deployed independently on edge embedded devices. As the computing power of edge devices improves, low-cost, general-purpose CPUs are capable of carrying out the deep learning inference stage [9]. However, a key challenge of deploying models on edge devices is their large DRAM requirement.

The main memory overhead of a CNN model [10] involves two parts: (1) Deployment Memory: after the CNN is trained, the model structure must be deployed and the model weights loaded into memory during inference. For example, the weights of AlexNet consume more than 200 MB of DRAM space, which is a great burden for edge devices with limited resources. (2) Run-time Memory: during model inference, additional memory space is needed to store the feature maps generated by the network and the intermediate results of the parallel computation of the operators. For Deployment Memory, the network model structure and weights are often compressed for smaller memory consumption [11], [12]. On the other hand, Low-rank Decomposition [13] can be used to reduce Run-time Memory, but it does not take into account the memory management of the inference framework and thus cannot further optimize memory allocation efficacy.
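As a rough illustration of the Deployment Memory figure above, the following back-of-the-envelope sketch estimates the weight footprint of an AlexNet-sized model; the ~61 million parameter count and 32-bit floating-point storage are our assumptions for illustration, not values taken from the paper.

```cpp
// Back-of-the-envelope estimate of Deployment Memory for an AlexNet-sized model.
// The ~61 M parameter count and FP32 (4-byte) weight storage are assumptions.
#include <cstdio>

int main() {
    const double params     = 61e6;  // approximate AlexNet parameter count (assumption)
    const double bytes_each = 4.0;   // 32-bit floating-point weights
    const double mib        = params * bytes_each / (1024.0 * 1024.0);
    std::printf("Estimated weight memory: %.0f MiB\n", mib);  // roughly 233 MiB
    return 0;
}
```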

Many works have been proposed to enhance main memory management efficacy for deep learning at the system or framework layer. Peng et al. [14] explored memory swapping for GPU-based training by weighing the benefits of swapping against recomputation. Similarly, many studies [15], [16] focus on memory optimization by swapping data between GPU memory and host memory. Unfortunately, these works only target GPU-based training, whereas the memory usage of inference on edge devices is very different. Li [17] and Wu [18] enhanced the data layout to optimize the access performance of internal and external memory, but they paid little attention to the resource constraints of edge devices, e.g., insufficient memory during the inference phase.

This paper aims to optimize both the Deployment Memory and the Run-time Memory of model inference on edge embedded devices. First, it considers the data usage granularity of different layers and reuses the weight memory space in the time dimension. Second, it considers the data usage of tensors, reorganizes the layout of the runtime Blob and Workspace tensors, and reduces the total tensor memory consumption in the spatial dimension. Experimental results with a wide range of workloads show that, compared to a traditional inference framework, the proposed technique reduces memory consumption by as much as 61.05% without any time overhead.
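To make the time-dimension reuse concrete, the sketch below loads weights layer by layer into one shared buffer sized for the largest layer, so that at most one layer's weights are resident at a time. It is a minimal illustration under assumed names (Layer, weight_bytes, run); it is not the actual MDI or NCNN interface.

```cpp
// Illustrative sketch of layer-granularity weight reuse in the time dimension.
// Layer, weight_bytes, and run() are hypothetical names, not the paper's API.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Layer {
    std::size_t weight_bytes;                      // size of this layer's weights on disk
    void run(const unsigned char* /*weights*/) {}  // placeholder forward pass
};

int main() {
    std::vector<Layer> net = {{200u << 10}, {1u << 20}, {4u << 20}};  // toy 3-layer model

    // Size one shared buffer for the largest layer instead of keeping every
    // layer's weights resident at the same time.
    std::size_t max_bytes = 0;
    for (const Layer& l : net) max_bytes = std::max(max_bytes, l.weight_bytes);
    std::vector<unsigned char> weight_buf(max_bytes);

    for (Layer& l : net) {
        // A real implementation would stream this layer's weights from storage
        // into weight_buf just before the layer executes, overwriting the
        // previous layer's weights once they are no longer needed.
        l.run(weight_buf.data());
    }
    std::printf("Peak weight memory: %zu bytes\n", weight_buf.size());
    return 0;
}
```

In this toy case the peak weight memory equals the largest single layer (4 MiB) rather than the sum of all layers, which is the intuition behind reusing the weight area over time.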

This paper makes the following contributions:

  • We identify that tensors can needlessly occupy main memory for long periods during the inference phase.

  • We propose an incremental weight loading scheme that aggressively reuses allocated memory to reduce memory consumption.

  • We propose a runtime data layout reorganization scheme that avoids fragmentation into small, dispersed free memory regions.

The rest of the paper is organized as follows. Section 2 introduces the background and the related work. Section 3 presents the motivation. Section 4 describes the design details of the proposed schemes. Section 5 presents the evaluation results with some discussions. Section 6 concludes this paper.

Section snippets

Deep learning in edge systems

Deep Learning Models. Compared with traditional image classification and object detection technologies, DNNs (Deep Neural Networks) have better robustness and accuracy. A DNN is composed of Input, Hidden, and Output modules. The Hidden module is a multi-layer network, where each layer performs one or several operators, such as pooling operations, convolution operations, etc. Image classification determines the category of the input picture through the DNN model. One

Motivation

To analyze the memory utilization of model inference, we conducted an experiment to study the memory behaviors of a set of popular models in the inference stage on a real edge device (detailed experiment setup can be found in Section 5).

Methodology

Edge servers on embedded devices heavily rely on deep learning inference. Our analysis revealed that the data dependency during inference execution is almost static, which suggests that the data to be accessed is highly predictable. For resource-constrained embedded devices [31], [32] at the edge, the model needs to be serialized before the actual inference phase. This study proposes a Memory-Efficient Deep-Learning Inference (MDI) approach. First, during the model inference
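Because the data dependencies are almost static, tensor lifetimes can be derived before execution, and tensors whose lifetimes do not overlap can share the same physical buffer. The sketch below is our minimal illustration of that spatial-dimension idea under assumed names (Tensor, first_use, last_use); it is not the paper's exact reorganization algorithm.

```cpp
// Minimal sketch of lifetime-based tensor buffer reuse: two tensors whose
// lifetimes do not overlap may occupy the same buffer. Names are hypothetical.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Tensor {
    int first_use;       // index of the first layer that produces/reads the tensor
    int last_use;        // index of the last layer that needs the tensor
    std::size_t bytes;   // tensor size
};

int main() {
    // Toy schedule: lifetimes are known statically from the inference graph.
    std::vector<Tensor> tensors = {{0, 2, 1u << 20}, {1, 3, 2u << 20}, {3, 5, 1u << 20}};

    std::vector<Tensor> resident;  // one entry per physical buffer, holding its current tensor
    std::size_t peak = 0, naive = 0;
    for (const Tensor& t : tensors) naive += t.bytes;

    for (const Tensor& t : tensors) {
        bool reused = false;
        for (Tensor& slot : resident) {
            // Reuse a buffer whose current tensor dies before this one is born
            // and which is large enough to hold the new tensor.
            if (slot.last_use < t.first_use && slot.bytes >= t.bytes) {
                slot.first_use = t.first_use;  // the buffer now holds the new tensor
                slot.last_use  = t.last_use;
                reused = true;
                break;
            }
        }
        if (!reused) { resident.push_back(t); peak += t.bytes; }
    }
    std::printf("Peak tensor memory: %zu bytes (naive sum: %zu bytes)\n", peak, naive);
    return 0;
}
```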

Experiment

Experiment Platform: Our experiments are based on an embedded deep learning platform equipped with a quad-core ARM® Cortex®-A57 MPCore and 1 GB of 128-bit LPDDR4 memory. The proposed MDI approach involves two schemes, Incremental Weight Loading and Data Layout Reorganization, which are applied to the model weights and to the tensors (Blob, Workspace) generated at runtime, respectively. The two schemes are orthogonal and transparent to models. In the experiment, we use NCNN [21] as the baseline

Conclusion

This paper proposes Incremental Weight Loading and Data Layout Reorganization to improve the memory utilization of the model weights and the runtime data, respectively. The first method uses the layer as its granularity and reuses the weight memory area in the time dimension; the second uses tensors as its granularity, reorganizes the layout of the Blob and Workspace tensors generated at runtime, and reduces the total tensor memory consumption in the spatial dimension. Comprehensive use

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.sysarc.2021.102183.

Acknowledgments

This work was partially supported by the China Postdoctoral Science Foundation (No. 2020M671637), National Science Youth Fund of Jiangsu Province (No. BK20190224 and BK20200462), the Jiangsu Postdoctoral Science Foundation (No. 2019K224), and Ministry of Science and Technology, Taiwan (No. 109-2221-E-009-075).

References (35)

  • Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, C. Zhang, Learning efficient convolutional networks through network slimming, ...

  • M. Rastegari et al., XNOR-Net: ImageNet classification using binary convolutional neural networks

  • S. Han et al., Learning both weights and connections for efficient neural network

  • E.L. Denton et al., Exploiting linear structure within convolutional networks for efficient evaluation

  • X. Peng, X. Shi, H. Dai, H. Jin, W. Ma, Q. Xiong, F. Yang, X. Qian, Capuchin: Tensor-based GPU memory management for ...

  • M. Rhu et al., vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design

  • L. Wang, J. Ye, Y. Zhao, W. Wu, A. Li, S.L. Song, Z. Xu, T. Kraska, Superneurons: Dynamic GPU memory management for ...

    This paper is an extended version of [1].
