1 Introduction

In recent years, Artificial Intelligence (AI) has contributed to many research and application fields, such as image processing, voice recognition, and object detection. However, with the rapid increase in the amount of data generated, AI-based applications often fail to achieve real-time computing performance. Although the training phase of convolutional neural networks (CNNs), one of the two successful forms of deep neural networks [1], can be accelerated with graphics processing units (GPUs), the inference phase still faces performance issues [2]. Moreover, GPUs are not suitable for edge computing because they are power-hungry.

CNN models are becoming more and more accurate by using large amounts of data, and they require huge computing power [3]; e.g., the VGG19 CNN model requires more than 500 MB of parameter memory and executes more than 39 billion floating-point operations (FLOPs) to classify a single 224 × 224 image [4]. Meanwhile, edge computing devices offer only modest computing performance, limited memory, and a small energy budget [5]. Even traditional CPUs cannot provide enough computing power for CNN-based applications, delivering only up to \(\sim\)100 GFLOPS while consuming more than 1 joule (J) of energy per GOP.

With the end of Moore's Law and the Dennard scaling era [6], system architects have proposed a computing approach that concurrently optimizes both the software and hardware levels for specific applications to satisfy real-time processing demands. This hardware/software co-design approach can continue to improve the performance of computing systems, including embedded and edge computing systems. In this paradigm, the application software is implemented while the hardware that executes it is carefully designed and developed. The most promising approach in this paradigm is to perform the heaviest functions of an application with specialized hardware-based computing cores. These computing cores are optimized for both the tasks and the target technology. Currently, ASICs (Application-Specific Integrated Circuits) and FPGAs (Field-Programmable Gate Arrays) are the two technologies that scientists most commonly target. Compared to ASICs, FPGAs offer more opportunities for optimization and reuse thanks to their reconfigurability.

Although there are many proposals in the literature for accelerating CNNs on FPGAs (see Section 2), they lack software optimization, i.e., the hardware/software co-design approach is not taken into account. These studies focus only on optimizing the hardware computing cores with different techniques, for example, pruning, quantization, or data reuse. The surveys in [7,8,9] analyze these techniques in detail. However, one fundamental problem these studies have not solved is providing a framework to quickly explore the design space and develop applications without much hardware knowledge. Therefore, this paper proposes an FPGA-targeted hardware/software co-design framework to build CNN-based applications quickly. For the preliminary results, we target edge computing platforms and CNN models.

We use the MobileNet model [10] as our case study to validate the framework. CNN acceleration on our FPGA edge computing platform, a Xilinx UltraScale+ MPSoC, is compared to execution on an ARM-based processor and on a high-performance Intel CPU. Experimental results show that our FPGA-targeted edge computing platform outperforms the ARM processor, although it does not reach the performance of the high-performance Intel CPU. However, we save considerable energy compared to the Intel CPU. Reducing energy consumption is one of the primary purposes of an edge computing platform, i.e., offering good performance with reasonable energy consumption.

The main contributions of this paper can be summarized as follows.

  (i) We present our proposed hardware/software co-design framework targeting FPGAs for CNN-based edge computing systems;

  (ii) We design FPGA-based architectures for various convolutions, including standard, depthwise, and pointwise;

  (iii) We conduct several experiments and report comparison results against an embedded ARM Cortex processor and an Intel CPU.

The rest of the paper is organized as follows. Section 2 presents the background of convolutional neural networks, hardware/software co-design approach, and related work in the literature. We propose our FPGA-targeted hardware/software co-design framework for accelerating CNN-based edge computing in Section 3. Section 4 illustrates an edge computing implementation on an FPGA platform. Experimental results are then discussed in Section 5. Finally, we conclude our article in Section 6.

2 Background and related work

This section briefly presents background on convolutional neural networks and hardware/software co-design. We then conduct a literature review on CNN acceleration with hardware techniques.

2.1 Background

In this section, we first present the background of general convolutional neural networks. We then introduce the MobileNet model used in the first version of our framework. Finally, the hardware/software co-design approach is discussed.

2.1.1 Convolutional neural networks (CNNs)

A typical CNN model is depicted in Fig. 1. It includes two main layers: the Convolution layer, which computes the deep-learning-related operations, and the Fully connected layer, which classifies inputs with traditional neural networks. The Convolution main layer includes one or more sub-layers that calculate various kinds of 2D convolutions (2D Conv) and pooling. While the 2D convolution operations extract features of the inputs, pooling reduces the size of the outputs generated by the 2D convolutions. For example, a 32×32-pixel color HCMUT logo with three channels (red, green, and blue), as illustrated in the figure, is convolved with three 5 × 5 × 3 convolution kernels to create C1 feature maps, e.g., three 28×28 feature maps. This example uses three kernels because we plan to classify inputs into three categories. These operations can be repeated several times with various parameters to improve quality.

Fig. 1 The convolutional neural network model

Subsequently, max-pooling or average-pooling, referred to as x-pooling, reduces the redundancy of the feature maps generated by the convolution operations. Following the above example, as depicted in Fig. 1, the sizes of the feature maps are reduced by 2×, from 28 × 28 to 14 × 14 matrices. Finally, the outputs of the first main layer are flattened into 1D vectors for further processing with traditional neural networks in the fully connected layer, which classifies the original inputs into different categories.
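For completeness, a minimal Python sketch of 2 × 2 max-pooling (illustrative only, not the hardware implementation), which performs exactly this 28 × 28 to 14 × 14 reduction:

```python
import numpy as np

def max_pool2x2(F):
    """2 x 2 max-pooling with stride 2: each output value is the maximum of a
    non-overlapping 2 x 2 window, halving both spatial dimensions."""
    H, W = F.shape
    return F[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(28, 28)
print(max_pool2x2(fmap).shape)   # -> (14, 14)
```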

During the inference phase, the Convolution main layer, with its convolutional operations and sampling functions, is responsible for more than 99% of the processing time due to the several steps executed [7]. The typical 2D convolution operation (CONV2D) is given in Eq. 1.

$$F(x,y) = f\left( \sum\limits_{j=0}^{k-1}{\sum\limits_{i=0}^{k-1}{I(x,y,k,i,j)\times W(i,j)+b}}\right)$$
(1)

where F(x,y) is the convolution value of the point (x,y) in the output feature maps, f is the activation function used for CNN (e.g., Tanh, Sigmoid, etc.), k is the size of the kernel W (odd number), I(x,y,k,i,j) is a pixel value in the input feature maps, and b is the bias.
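To make Eq. 1 concrete, the following Python sketch evaluates it as naive nested loops for a single-channel input and one k × k kernel (a "valid" convolution with stride 1 is assumed; the activation f defaults to Tanh, one of the functions listed above). This is an illustrative software model, not the hardware implementation.

```python
import numpy as np

def conv2d_point(I, W, b, x, y, f=np.tanh):
    """F(x, y) of Eq. 1 for a single-channel input I, a k x k kernel W (k odd),
    bias b, and activation f. 'Valid' indexing with stride 1 is assumed."""
    k = W.shape[0]
    acc = 0.0
    for j in range(k):
        for i in range(k):
            acc += I[y + j, x + i] * W[i, j]   # MAC over the kernel window
    return f(acc + b)

def conv2d_naive(I, W, b, f=np.tanh):
    """Full output feature map obtained by sliding the kernel over the input."""
    k = W.shape[0]
    H, Wd = I.shape
    return np.array([[conv2d_point(I, W, b, x, y, f)
                      for x in range(Wd - k + 1)]
                     for y in range(H - k + 1)])
```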

2.1.2 MobileNet model

Along with several well-known CNN models, the MobileNet model [10], developed by Google, is an efficient CNN model for mobile vision applications. Hence, it is well suited to edge computing, and we therefore choose it as the case study for validating the proposed framework in this work. The model includes a first full convolution layer (i.e., the operations in Eq. 1) followed by 13 depthwise separable convolution layers, one average-pooling function, and a fully connected layer. A depthwise separable convolution layer consists of two operations: a depthwise and a pointwise convolution.

Depthwise convolution (DWCONV) computes the same operations as 2D convolution by applying each filter Wi to each channel mi of the feature maps F. However, DWCONV only filters the channels of a feature map to create multiple separate matrices, instead of adding them to create a new single feature map as CONV2D does. Therefore, a layer called pointwise convolution with a 1×1 kernel creates the new feature map. Compared to CONV2D, a reduction in computation of \(\left(\frac{1}{N} + \frac{1}{D_k^2}\right)\) can be obtained [10], where N is the number of feature maps generated and Dk is the size of the kernel. Taking the above example with N = 9 and Dk = 3, we can save about 4.5× computing time.
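As a behavioural illustration (not the paper's implementation), the sketch below factorizes a convolution into its depthwise and pointwise steps and computes the \(\frac{1}{N} + \frac{1}{D_k^2}\) cost ratio; the shapes and counts are illustrative.

```python
import numpy as np

def depthwise_separable(F, W_dw, W_pw):
    """Depthwise + pointwise convolution on a feature map F (H x W x M).
    W_dw: Dk x Dk x M (one filter per channel), W_pw: M x N (1x1 kernels)."""
    H, Wd, M = F.shape
    Dk = W_dw.shape[0]
    Ho, Wo = H - Dk + 1, Wd - Dk + 1
    # Depthwise: filter each channel independently, keep channels separate.
    dw = np.zeros((Ho, Wo, M))
    for m in range(M):
        for y in range(Ho):
            for x in range(Wo):
                dw[y, x, m] = np.sum(F[y:y+Dk, x:x+Dk, m] * W_dw[:, :, m])
    # Pointwise: 1x1 convolution combines the M channels into N output maps.
    return dw @ W_pw                     # (Ho, Wo, M) x (M, N) -> (Ho, Wo, N)

def cost_ratio(M, N, Dk, Ho, Wo):
    """MAC-count ratio of depthwise-separable vs. standard convolution."""
    standard = Dk * Dk * M * N * Ho * Wo
    separable = Dk * Dk * M * Ho * Wo + M * N * Ho * Wo
    return separable / standard          # equals 1/N + 1/Dk**2

print(cost_ratio(M=3, N=9, Dk=3, Ho=26, Wo=26))  # ~0.22, i.e. ~4.5x fewer MACs
```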

2.1.3 Hardware/Software co-design approach

One of the most prominent approaches to continue improving the performance of computing systems at the end of Moore's law is hardware acceleration, where specific hardware computing cores are used for computationally intensive functions [11]. To exploit hardware accelerator computing cores well, hardware/software co-design is a promising approach in which the hardware and software levels are optimized concurrently for a particular application or domain. The design flow of this approach is illustrated in Fig. 2. Hardware/software co-design can offer many advantages at the system level compared to other techniques, such as better performance, shorter time-to-market, and faster and better integration [12]. The approach can target various platforms, including embedded systems, application-specific instruction-set processors (ASIPs), and reconfigurable computing (FPGAs). FPGA-based computing provides better flexibility with good performance; thus, it is more suitable for accelerating computationally intensive applications like CNNs.

Fig. 2 The hardware/software co-design model [13]

On the other hand, FPGA accelerators cannot offer as high performance as ASICs due to their lower working frequency. However, the FPGA-based approach is the best candidate for frequently updated computing systems because of its flexibility. Hence, in this work, we target FPGA platforms for accelerating CNN-based applications in edge computing.

2.2 Related work

In this section, we summarize FPGA-based CNN accelerators in the literature. Based on this survey, we then present the design challenges in this topic that the ultimate design framework must take into account.

One of the most significant advantages of FPGAs is their high level of parallelism. Therefore, many proposals in the literature build multiple processing elements (PEs) or unroll time-consuming loops to improve computing throughput. We refer to this technique as increasing parallelism. The FPGA-based Aristotle architecture for accelerating CNNs was introduced by K. Guo et al. [14]. In the Aristotle architecture, several CONV2D operations are conducted by multiple PEs connected in an array in the FPGA fabric. With the support of a pipeline model and several PEs, the system can achieve a throughput of 1 pixel per cycle. The prototype system is built on an SoC platform, including a host processor, FPGA-based PEs for CONV2D, and external memory. Shared memory interconnects the hardwired host CPU and the PEs in the reconfigurable fabric to transfer data. Similar studies using the same optimization approach have also been reported. H. Li et al. [15] presented an accelerator core for CNNs that allows all layers to be computed concurrently in a pipeline. X. Lin et al. designed an FPGA-based CNN core with a mapping method for multiple layer clusters [16]. A CNN calculation core on reconfigurable computing with parallel structures was proposed in [17]. To achieve the highest possible performance, Y. Ma et al. used all available hardware resources to synthesize their FPGA-based CNN architecture [18]. Y. Li et al. [19] presented a novel binarized architecture for parallel convolution. X. Zhang et al. [20] proposed an FPGA-based CNN accelerator with a fine-grained layer-based pipeline architecture. A. Podili et al. [21] developed a convolution computing engine with the highly parallelized Winograd algorithm. M. Motamedi et al. [22] exploited all hardware resources to build a deep CNN parallel core.

Although offering better performance than traditional processors, FPGAs usually suffer from low working frequencies, on the order of a few hundred MHz. To further improve the throughput of computing systems, reducing the complexity of the convolution operations is a good approach, e.g., quantization, pruning, shift operators, or Binarized Neural Networks (BNNs). Y. Yang et al. [23] proposed an FPGA-based convolution core and a novel ConvNet model with 1 × 1 kernels called Synetgy. With 1-bit weights and 4-bit inputs, outputs, and activation functions, Synetgy obtains a remarkable performance increase. M. Ghasemzadeh et al. [24] designed ReBNet, which uses the XNOR operation to compute binary CNNs on FPGA devices. L. Jiao et al. [25] presented a low-bit-width CNN computing core for FPGA platforms. D. Moss et al. [26] proposed FPGA-based CNN cores using binary values for weights and activation functions. N. Hiroki et al. [27] used a streaming XNOR-multiply-accumulate circuit to design an FPGA-based BNN architecture. E. Nurvitadhi et al. [28] introduced a hardware-based architecture for BNNs. B. Prost et al. [29] proposed an FPGA-based architecture to accelerate convolutional ternary neural networks. Y. Umuroglu et al. [30] presented a novel optimization for building BNNs on FPGAs called FINN. Z. Ritchie designed FPGA-based CNN cores with very low floating-point precision. S. Liang et al. [31] proposed the FP-BNN architecture, in which XNOR and SHIFT operators are used instead of multipliers. S. Cao et al. [32] offered a novel and efficient FPGA accelerator for CNNs called Bank-Balanced Sparsity. S. Kala et al. [33] presented FPGA-based CNN PEs to process the Winograd filtering algorithm and general element-wise matrix multiplications. J. Wang et al. [34] designed a hybrid extremely-low-bit-width CNN FPGA architecture called ELB-NN. Caiwen Ding et al. [35] introduced the REQ-YOLO FPGA-based framework for developing efficient object detection applications.

One of the most significant disadvantages of FPGA devices is the limited amount of on-chip memory (Block RAM). The data-reuse optimization approach can reduce the communication overhead when transferring data between the on-chip buffers and external memory. In modern FPGA-based computing systems, data communication overhead is usually responsible for up to 50% of the overall processing time [36]. Therefore, system performance can be improved by reducing the communication overhead. J. Wang et al. [37] proposed a ping-pong data-reuse scheme to avoid data communication between external and internal memory. L. Lu et al. [38] accelerated matrix multiplication in CNNs by using an FPGA-based architecture that employs line-buffer structures. Y. Guan et al. [39] designed the FP-DNN framework to optimize the communication bandwidth of FPGA-based CNN cores. Q. Jiantao et al. [40] reduced the memory footprint of FPGA-based CNN cores to improve performance. X. Wei et al. [41] accelerated a CNN core by using a systolic array architecture for low global data transfer. Y. Shen et al. [42] introduced an FPGA-based CNN acceleration architecture called Escher, focusing on data buffered on-chip. K. Guo et al. [43] designed an FPGA-based CNN architecture called Angle-Eye to fully utilize the on-chip buffer. Besides the above proposals, other studies also exist in the literature; however, we focus only on recent publications in top conferences and journals.

Although many approaches have been proposed in recent years, there are still five open issues that an FPGA-targeted framework for CNNs should consider. Figure 3 summarizes these research topics and their difficulty levels.

Fig. 3 Challenges of CNN acceleration with hardware techniques [13]

Among the five open issues, design space exploration is the most difficult because most current FPGA-based CNN accelerators are designed manually. Therefore, an automated design framework is needed to explore the design space quickly.

3 Proposed framework

This section introduces our proposed framework for accelerating CNN-based edge computing on FPGA platforms. As mentioned above, we use the MobileNet model as our case study for our preliminary results. Therefore, we present our FPGA-based CNN computing core suitable for the MobileNet model.

3.1 FPGA-targeted hardware/software co-design framework

Figure 4 depicts the abstract logical architecture of our proposed framework, which consists of services and tools at the software level and architectures and cores at the hardware level. At the bottom of the framework, FPGA-based CNN cores and other IP cores, such as encryption/decryption, security, or cryptography cores, are built on the FPGA fabric. The cores communicate via an interconnect infrastructure such as a bus, a network-on-chip, or a shared buffer. Along with those accelerators, the Host processor is responsible for processing the other parts of the applications and for controlling the computing cores. The host processor communicates with the computing cores in the reconfigurable fabric through shared memory. The Configuration registers allow user applications to handle the computing cores, e.g., updating the CNN cores' parameters. Direct memory access (DMA) transfers data between user-space addresses and the local buffers of the CNN and IP cores. We explain the details and purposes of these buffers in the next section.

Fig. 4 The hardware/software co-design framework for accelerating CNN-based applications

At the software level, the framework is equipped with operating services and instruction set extensions for managing the computing cores (CNN cores and other IP cores). The services help user applications handle the computing cores: starting them, transferring data, checking completion, etc. Services can be developed as Application Programming Interfaces (APIs) or as ISA extensions of the host processor. On top of the OS layer with APIs and ISA extensions, several tools are provided, including a Software Development Kit (SDK) for developing software, an Integrated Development Environment (IDE) for configuring the host processor and computing cores, and High-Level Synthesis (HLS) for building IP cores from high-level programming languages. One of the essential tools in this layer is the Modelling tool for performance evaluation. With the support of Modelling, designers can evaluate the performance, accuracy, and many other aspects of the deployed systems to explore the design space. Finally, the application layer allows programmers to quickly develop CNN-based applications and related graphical user interfaces.

All OS services, whether APIs or ISA extensions, and the SDK/IDE tools are processed in software. IP cores such as hashing, security, or cryptography cores depend on the application and are synthesized from high-level programming language code with HLS. However, we implement the CNN cores with hardware description languages to further improve processing performance. Therefore, the next section introduces our MobileNet CNN cores used for the preliminary results.
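To make the service layer more concrete, here is a minimal, hypothetical Python sketch of the kind of API such OS services could expose for handling a computing core (configure, start, check completion). The class and register names are illustrative assumptions, not the framework's actual interface.

```python
class RegisterFile:
    """Stub for the memory-mapped Configuration registers (illustrative only)."""
    def __init__(self):
        self._regs = {}
    def write(self, name, value):
        self._regs[name] = value
    def read(self, name):
        return self._regs.get(name, 0)

class CnnCoreService:
    """Hypothetical OS-level service for one FPGA-based CNN core, wrapping the
    operations described above: configure, start, and check completion."""
    def __init__(self, regs):
        self.regs = regs
    def configure_layer(self, kernel_size, mode):
        # 'mode' would select the conv2d / depthwise / pointwise behaviour.
        self.regs.write("kernel_size", kernel_size)
        self.regs.write("mode", mode)
    def start(self):
        self.regs.write("start", 1)
    def is_done(self):
        # In hardware this bit would be set by the core; here it is a stub.
        return self.regs.read("status") == 1

svc = CnnCoreService(RegisterFile())
svc.configure_layer(kernel_size=3, mode="depthwise")
svc.start()
```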

3.2 CNN accelerating core architecture

In this section, we present our CNN accelerating core based on the MobileNet model for convolution computing mentioned in Section 2.1.2. Figure 5 illustrates the generic architecture of our FPGA-based MobileNet CNN computing core.

Fig. 5 The FPGA-based CNN accelerator core

As depicted in the figure, the CNN accelerator core includes three sub-cores to compute 2D convolution (Conv2d core), depthwise convolution (Depthwise Conv2d core), and pointwise convolution (Pointwise Conv2d core) for the MobileNet model presented before. Details of these cores are explained in the following sections. Along with the computing cores for calculating the various convolutions, the CNN accelerator core also hosts several buffers for different purposes.

  • Input global buffer: storing input images for the MobileNet model. The buffer should hold an image of size 224 × 224 × 3 (resolution 224 × 224 with three color channels). In other words, the buffer size should be at least 147 KB.

  • Weight global buffer: keeping the kernels for filtering inputs. According to the MobileNet model, the size of this buffer varies, ranging from hundreds of bytes to some megabytes, because the number of kernels can reach 1,024.

  • Bias global buffer: storing bias values for each layer. That is the b value in Eq. 1.

  • Output global buffers: there are two output buffers, as illustrated in Fig. 5. They are used to switch data between the Depthwise and Pointwise cores at different layers. Based on the MobileNet model, the size of these buffers is usually two times larger than the input buffer.
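A quick sanity check of the buffer sizes listed above (a small sketch; only the 147 KB input-buffer figure and the roughly-2× rule for the output buffers come from the text, and 8-bit data is assumed):

```python
# Input global buffer: a 224 x 224 image with three color channels,
# assuming one byte (8-bit) per value.
input_buffer_bytes = 224 * 224 * 3
print(input_buffer_bytes / 1024)            # ~147 KB, as stated above

# Output global buffers: per the text, roughly twice the input buffer each.
output_buffer_bytes = 2 * input_buffer_bytes
print(output_buffer_bytes / 1024)           # ~294 KB
```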

3.2.1 Convolution 2D core

Figure 6 illustrates the architecture of our FPGA-based Conv2d core for 3 × 3 kernels. In general, the core includes several KCPEs (Kernel-Convolution Processing Elements) for computing the multiply-accumulate (MAC) operations. The ideal number of KCPEs per core equals the number of elements in a kernel; however, it depends on the available hardware resources. If the number of kernel elements is greater than the number of available KCPEs, the core needs to perform multiple iterations. The KCPEs are organized in matrix form like the kernels, e.g., 3 × 3 KCPEs for 3 × 3 kernels.

Fig. 6 The FPGA-based convolution 2D core

When processing, the weight values of the kernels are loaded into the KCPEs first. The core then collects the elements of an input image row by row to reuse data temporally and spatially [44]. The number of collected elements equals the kernel matrix dimension (e.g., three elements at a time for 3 × 3 kernels). These values are delivered to all rows of the KCPE matrix. For example, KCPE #0 to KCPE #2 in the first row of Fig. 6 receive the same set of three values as KCPE #3 to KCPE #5 and KCPE #6 to KCPE #8. The KCPEs multiply the input values by the stored weights to create a partial sum of the MAC. These results are cached in the Psum local buffer. When the next row of the input is computed, the values in the Psum local buffer are added to the new KCPE results. We call this mechanism a row-oriented pipeline computing model, illustrated in Fig. 7.

Fig. 7 The proposed row-oriented pipeline computing model for the CNN core
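The row-oriented pipeline computing model can be captured by the following behavioural Python sketch (a software model, not the RTL): input rows are streamed once, each incoming row contributes one kernel-row MAC to the partial sums of up to k output rows, and those partial sums play the role of the Psum local buffer.

```python
import numpy as np

def conv2d_row_pipeline(I, W):
    """Row-oriented 'valid' 2D convolution with a k x k kernel: input rows are
    streamed once, and each incoming row contributes one kernel-row MAC to the
    partial sums of up to k output rows (the role of the Psum local buffer)."""
    H, Wd = I.shape
    k = W.shape[0]
    Ho, Wo = H - k + 1, Wd - k + 1
    psum = np.zeros((Ho, Wo))             # partial sums, completed row by row
    for r in range(H):                    # stream input rows one by one
        for j in range(k):                # kernel row j is held by one KCPE row
            y = r - j                     # output row this contribution updates
            if 0 <= y < Ho:
                for x in range(Wo):
                    psum[y, x] += np.dot(I[r, x:x + k], W[j])
    return psum                           # psum now holds the finished outputs

# Sanity check against a direct sliding-window convolution
I = np.random.rand(6, 6)
W = np.random.rand(3, 3)
ref = np.array([[np.sum(I[y:y+3, x:x+3] * W) for x in range(4)] for y in range(4)])
assert np.allclose(conv2d_row_pipeline(I, W), ref)
```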

3.2.2 Depthwise convolution core

As mentioned in Section 2.1.2, depthwise convolution is quite similar to the standard convolution except that the outputs for the channels are kept separate. Therefore, we can reuse the architecture of the Conv2d core by disabling the output-adding behavior.

According to the MobileNet model, after the standard convolution in the first layer, multiple depthwise convolution operations with many kernels, up to 1,024, are conducted. Thanks to the large amount of hardware resources in FPGA devices, we can replicate the Conv2d core k times (\(k = 2^n\)) so that up to k kernels can be processed concurrently. Hence, we can improve computing performance dramatically. Figure 8 illustrates the architecture of the Depthwise Conv2d core with 8 replications (n = 3). The number of replications depends on the hardware resources available.

Fig. 8 The FPGA-based depthwise computing core
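A trivial scheduling sketch of the replication idea discussed above: with \(k = 2^n\) replicated Depthwise Conv2d cores, the channels of a layer are processed k at a time. The helper below is purely illustrative.

```python
def schedule_depthwise(num_channels, k=8):
    """Assign channels of a feature map to k replicated Depthwise Conv2d cores;
    in each round, up to k channels are filtered concurrently (k = 2**n, n = 3 here)."""
    rounds = []
    for start in range(0, num_channels, k):
        rounds.append(list(range(start, min(start + k, num_channels))))
    return rounds

# e.g. a 64-channel layer finishes in 8 rounds with 8 replicated cores
print(len(schedule_depthwise(64, k=8)))   # -> 8
```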

3.2.3 Pointwise convolution core

Unlike the Depthwise Conv2d core, the Conv2d core cannot be reused to build the Pointwise Conv2d core because 1 × 1 kernels are applied. Therefore, we use the output-stationary dataflow [45] to design the Pointwise Conv2d core. In this approach, the weight and input values are updated every computing cycle. Partial sums are accumulated in an array of registers until the final sum is reached. The final results are then quantized to 8-bit integers and stored in the Output global buffer. Figure 9 depicts an 8 × 8 PE FPGA-based Pointwise computing core. In general, the number of PEs in the Pointwise Conv2d core is k × k (\(k = 2^n\)), where k is the number of kernels the Depthwise Conv2d core can process in every computing cycle.

Fig. 9 The FPGA-based pointwise computing core

When processing, weight values are fetched from the Weight global buffer into the local FIFO buffers Wij of the PEs (at column i and row j). Input values are streamed from the Input global buffer and multicast to all columns of PEs. After every computing round (multiplying weight and input values), the partial sums are stored in registers as mentioned above, while the weights of the next kernels are fetched for the next round.
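Behaviourally, the output-stationary dataflow of the Pointwise Conv2d core can be sketched as follows (a software model with integer data assumed, not the RTL): every PE keeps one partial sum stationary in a register while a new weight and the multicast input value arrive each cycle.

```python
import numpy as np

def pointwise_output_stationary(F, W_pw):
    """Pointwise (1x1) convolution with an output-stationary dataflow.
    F: (H, W, M) integer feature map, W_pw: (M, N) 1x1 kernels (integer weights).
    Each 'PE' holds one partial sum stationary while weights and inputs are
    streamed in every cycle; the final sum is written out once."""
    H, Wd, M = F.shape
    N = W_pw.shape[1]
    out = np.zeros((H, Wd, N), dtype=np.int32)
    for y in range(H):
        for x in range(Wd):
            acc = np.zeros(N, dtype=np.int32)   # register array of partial sums
            for m in range(M):                  # stream one input channel per cycle
                acc += F[y, x, m] * W_pw[m]     # the same input value is multicast
            out[y, x] = acc
    return out
```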

4 System implementation

This section presents our system implementation targeting a particular FPGA platform. For the prototype version with preliminary results, we use the Ultra96v2 board [46] with the Zynq UltraScale+ MPSoC FPGA platform [47]. The MPSoC platform includes a hardwired quad-core Arm Cortex-A53 used as the host processor, while the FPGA fabric is used to build the CNN cores, buffers, and DMA core. Figure 10 depicts the implementation of our prototype system. Off-chip memory stores the input images, the trained weight values, and the computed outputs. An AXI bus is used as the backbone interconnect between the computing cores and the buffers.

Fig. 10 An FPGA-based testing implementation for the MobileNet model

As mentioned above, the Conv2d core and the Depthwise Conv2d core share the same architecture. Therefore, we build a single computing core for both, as depicted in Fig. 11. The Configuration registers block is used to switch the behavior of the core. The outputs of the cores are quantized by the Quantization module, which scales 32-bit integer numbers down to 8-bit integer numbers for optimized processing on mobile platforms.

Fig. 11 The implementation of the Conv2d and Depthwise Conv2d cores
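The paper does not spell out the exact quantization rule, so the following is only a plausible sketch of how the Quantization module might scale 32-bit accumulators down to 8-bit integers; a per-layer power-of-two scale with saturation is assumed for illustration.

```python
import numpy as np

def quantize_to_int8(acc32, shift):
    """Scale 32-bit accumulator values down to signed 8-bit integers.
    A simple arithmetic right-shift (power-of-two scale) with saturation is
    assumed here; the actual Quantization module may use a different scheme."""
    scaled = np.right_shift(acc32.astype(np.int32), shift)
    return np.clip(scaled, -128, 127).astype(np.int8)

acc = np.array([70000, -1500, 42], dtype=np.int32)
print(quantize_to_int8(acc, shift=9))   # -> [127  -3   0] after saturation
```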

In the MobileNet model, the first-layer kernels are 3 × 3 × 3 for 3-channel input images. Hence, we build 9 KCPEs for the Conv2d core. Due to hardware resource limitations, the number of replications for the Conv2d core and Depthwise Conv2d core is k = 8 (n = 3). Therefore, the PE matrix in the Pointwise core is 8 × 8. We use one 64 KB BRAM for the Input global buffer to store images of 128 × 128 × 3. The Weight buffer is built with four standard 64 KB block RAMs (32-bit width) in an interleaved organization that allows 128-bit-wide accesses. The interleaving, in turn, optimizes the kernel-fetching phases of the computing cores. The same interleaving method is used for Output buffer #0, with two standard 128 KB block RAMs, and Output buffer #1, with two standard 64 KB block RAMs, so that the Depthwise Conv2d core and Pointwise Conv2d core can fetch 8 channels in parallel. Along with those global buffers, the Conv2d core requires Psum local buffers built with one 256-byte standard block RAM per buffer. In total, we need approximately 740 KB of on-chip Block RAM for the buffers. These buffers and the off-chip memory are connected to a DMA (direct memory access) core via the AXI bus to transfer data among them.
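As a behavioural illustration of the interleaved weight-buffer organization described above, the sketch below spreads consecutive 32-bit words across four banks so that one row can be read from all banks at once as a 128-bit access; the addressing scheme is our assumption.

```python
NUM_BANKS = 4          # four 32-bit-wide block RAMs
WORDS = 16             # illustrative depth per bank

# Interleave: consecutive 32-bit words go to consecutive banks, so words
# {4a, 4a+1, 4a+2, 4a+3} sit at the same row 'a' of the four banks and can
# be fetched together as one 128-bit access.
banks = [[0] * WORDS for _ in range(NUM_BANKS)]

def write_word(addr, value):
    banks[addr % NUM_BANKS][addr // NUM_BANKS] = value

def read_128(row):
    """One 128-bit access: the same row from all four banks in parallel."""
    return [banks[b][row] for b in range(NUM_BANKS)]

for a in range(8):
    write_word(a, a + 100)
print(read_128(0))     # -> [100, 101, 102, 103], i.e. words 0..3 in one access
print(read_128(1))     # -> [104, 105, 106, 107]
```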

5 Experiments

This section evaluates our implemented prototype with the MobileNet model. First, we present the synthesis results of the system. We then report the processing time of our approach with different inputs and kernels and compare it with software-only processing.

5.1 Experimental setup

The system presented in Section 4 is implemented on the Ultra96v2 board, which includes an xczu3eg Xilinx FPGA with 70K+ LUTs, 950 KB of Block RAM, and a quad-core ARM Cortex-A53 MPCore running at up to 1.5 GHz. We use Verilog-HDL to develop all the computing cores and buffers in the programmable logic area of the FPGA chip. In addition, the ARM Cortex is used as the host processor, communicating with the computing cores through a register file.

To analyze and compare the performance of our system, we conduct the following experiments with the MobileNet model.

  (i) Software with ARM Cortex-A53 processor (SW-ARM): we execute only the convolution processing on the ARM processor at its maximum working frequency. We compare our work with ARM processors because both can be used for edge computing with low energy consumption and reasonable cost.

  (ii) Software with Intel processor (SW-INTEL): to analyze the performance of our system, we also compare with an Intel Core i7-9750H 2.4 GHz CPU with six physical cores (12 threads). The PyTorch library is used with this CPU to fully utilize the threads. Although this is not a fair comparison, because the Intel CPU targets high-performance computing and consumes much more energy than our platform, the results help demonstrate how well our system performs.

  (iii) Our system with the FPGA platform (HW-FPGA): we use the ARM Cortex-A53 processor as the host and our computing cores in the programmable logic (PL) area as accelerators. The host runs at 1.5 GHz while our accelerator cores work at 150 MHz. The ARM processor executes all parts of the MobileNet model except the standard, depthwise, and pointwise convolutions, which are done by our computing cores.

In terms of accuracy, all three systems produce the same output values without any difference.

5.2 Experimental results

In this section, we analyze our experimental results. First, the system is synthesized to extract the synthesis results, including the working frequency and resource usage. We then simulate the system with cycle-accurate simulation tools. Finally, we run the design on the target platform and measure execution time to analyze system performance.

5.2.1 Synthesis results

As mentioned above, we synthesize our system with the Xilinx Vivado Design Suite for the Ultra96v2 board with a target clock frequency of 150 MHz. Table 1 summarizes our hardware resource usage.

Table 1 Hardware resources usage of the proposed system

According to the table, our system almost fully uses the on-chip memory. This limitation prevents the system from having more Conv2d replications. However, a large number of LUTs, FFs, and DSPs are still available. Therefore, we have room to optimize the communication channels and the pipelining model in the next version.

The synthesis report also indicates that our system can function at a maximum working frequency of 159 MHz. The system-on-chip power is around 3.179 W, which is efficient enough for battery-powered edge computing applications.

5.2.2 Simulation results

We conduct a ModelSim simulation to validate our register-transfer-level (RTL) design before programming and executing the FPGA device. Figure 12 illustrates part of the execution waveform. The full_sum signal indicates the outputs of the computing core. As depicted in the waveform, the core can compute one output every two cycles.

Fig. 12 Simulation waveform of the proposed system

5.3 Performance analysis

Finally, to compare the performance of our FPGA-based system (HW-FPGA) with the ARM Cortex-A53 and the Intel processor, as mentioned above, we run different experiments with various channels, kernels, and inputs/outputs. Table 2 shows the results. To make a fair comparison, we only take the execution time of the convolutions into account (standard, depthwise, and pointwise convolutions). The seventh column depicts the processing time of our core, including the time for computing and the time for data transfer. The data transfer time (by DMA) is mostly hidden thanks to the pipelined transfer proposed in [48].

Table 2 Execution time (seconds) of convolutions on different platforms

This table reports five different experiments, including one standard convolution, two depthwise convolutions, and one pointwise convolution. The third column (Types) displays the data type, and its quantity is shown in the next column. The fifth column gives the size of the data in terms of elements (e.g., 128 × 128 × 3 means 3 channels with 128 × 128 values per channel). The next column reports the size of the data. In the 3rd and 5th experiments, the sizes of the input and output exceed the buffer sizes mentioned above; in these cases, we must use the DMA core to swap data between the on-chip buffers and off-chip memory. The last three columns represent the processing times on the three experimental platforms: HW-FPGA, SW-ARM, and SW-Intel. According to the table, both our system (HW-FPGA) and the Intel CPU (SW-Intel) outperform the ARM Cortex-A53 processor. Figure 13 depicts the speed-ups of our system and the Intel CPU compared to the ARM processor.

Fig. 13 Speedup of HW-FPGA and SW-Intel compared to SW-ARM

Our system outperforms the Intel CPU when processing the standard and depthwise convolutions. However, in the case of the pointwise convolution, our system needs more processing time than the Intel CPU. The reason is that many small kernels must be loaded from off-chip memory into our on-chip buffers. In addition, our data communication infrastructure works at a lower frequency than the Intel CPU.

6 Conclusion and future work

This paper proposes a framework for accelerating CNN computing cores with the hardware/software co-design approach for CNN-based edge computing applications. The method optimizes not only the hardware computing cores according to the targeted technology, FPGA in this case, but also the software services for handling the computing cores and the data communication. Thanks to this software handling, we can reuse convolution computing cores, more precisely the Conv2d core in this work, for different types of convolution (standard and depthwise). We then develop an efficient architecture for building a convolution computing core on FPGA, reuse its structure for the depthwise convolution core, and design a separate architecture for pointwise convolution. A prototype version built around the MobileNet model is implemented on an FPGA MPSoC platform. Experimental results show that we consistently outperform the quad-core ARM Cortex-A53 processor. In two experimental cases, we are slower than the Intel Core i7 CPU, but we beat the CPU in the other three. Future work will focus on buffer optimization and pipelining because the current version has almost run out of on-chip memory.