1 Introduction

In recent years, Artificial Intelligence (AI) has contributed to many research and application fields, such as image processing, voice recognition, and object detection. However, with the rapid increase in the amount of data generated, AI-based applications often fail to achieve real-time computing performance. Although the training phase of convolutional neural networks (CNNs), one of the two successful forms of deep neural networks [1], can be accelerated with graphics processing units (GPUs), the inference phase still faces performance issues [2]. Moreover, GPUs are not suitable for edge computing because they are power-hungry.

CNN models are becoming more and more accurate by using large amounts of data, and they require huge computing power [3]; e.g., the VGG19 CNN model requires more than 500 MB of parameter memory and executes more than 39 billion floating-point operations (FLOPs) to classify a single 224 × 224 image [4]. Meanwhile, edge computing devices offer only modest computing performance, limited memory, and a small energy budget [5]. Even traditional CPUs cannot provide enough computing power for CNN-based applications, delivering only up to \(\sim\)100 GFLOPS while consuming more than 1 joule (J) of energy per GOP.

With the end of Moore's Law and the Dennard scaling era [6], system architects have proposed a computing approach that concurrently optimizes both the software and hardware levels for specific applications to satisfy real-time processing demands. This hardware/software co-design approach can continue to improve the performance of computing systems, including embedded and edge computing systems. In this paradigm, the application software is implemented while the hardware that executes it is carefully designed and developed. The most promising approach in this paradigm is to perform the heaviest functions of an application with specialized hardware-based computing cores. These computing cores are optimized for both the tasks and the target technology. Currently, ASICs (Application-Specific Integrated Circuits) and FPGAs (Field-Programmable Gate Arrays) are the two technologies that scientists most commonly target. Compared to ASICs, FPGAs offer more opportunities for optimization and reuse thanks to their reconfigurability.

Although there are many proposals in the literature for accelerating CNNs on FPGAs (see Section 2), they lack software optimization, i.e., the hardware/software co-design approach is not taken into account. These studies focus only on optimizing the hardware computing cores with different techniques, for example, pruning, quantization, or data reuse. The surveys in [7,8,9] analyze these techniques in detail. However, one fundamental problem these studies have not solved is providing a framework to quickly explore the design space and develop applications without much hardware knowledge. Therefore, this paper proposes an FPGA-targeted hardware/software co-design framework to build CNN-based applications quickly. For the preliminary results, we target edge computing platforms and CNN models.

We use the MobileNet model [10] as our case study to validate the framework. CNN acceleration on our FPGA edge computing platform, a Xilinx UltraScale+ MPSoC, is compared to execution on an ARM-based processor and on a high-performance Intel CPU. Experimental results show that our FPGA-targeted edge computing platform outperforms the ARM processor, although it does not reach the performance of the high-performance Intel CPU. However, we save considerable energy compared to the Intel CPU. Reducing energy consumption is one of the primary purposes of an edge computing platform, i.e., offering good performance with reasonable energy consumption.

The main contributions of this paper can be summarized as follows.

  (i) We present our proposed hardware/software co-design framework targeting FPGAs for CNN-based edge computing systems;

  (ii) We design FPGA-based architectures for various convolutions, including standard, depthwise, and pointwise;

  (iii) We conduct several experiments and report comparison results against an embedded ARM Cortex processor and an Intel CPU.

The rest of the paper is organized as follows. Section 2 presents the background of convolutional neural networks, hardware/software co-design approach, and related work in the literature. We propose our FPGA-targeted hardware/software co-design framework for accelerating CNN-based edge computing in Section 3. Section 4 illustrates an edge computing implementation on an FPGA platform. Experimental results are then discussed in Section 5. Finally, we conclude our article in Section 6.

2 Background and related work

This section briefly presents background on convolutional neural networks and hardware/software co-design. We then conduct a literature review on CNN acceleration with hardware techniques.

2.1 Background

In this section, we first present the background of general convolutional neural networks. We then introduce the MobileNet model used in the first version of our framework. Finally, the hardware/software co-design approach is discussed.

2.1.1 Convolutional neural networks (CNNs)

A typical CNN model is depicted in Fig. 1. It includes two main layers: the Convolution layer, which computes the deep-learning-related operations, and the Fully connected layer, which classifies inputs with traditional neural networks. The Convolution main layer includes one or more sub-layers that calculate various kinds of 2D convolutions (2D Conv) and pooling. While the 2D convolution operations extract features of the inputs, pooling reduces the size of the outputs generated by the 2D convolutions. For example, a 32×32-pixel color HCMUT logo with three channels (red, green, and blue), as illustrated in the figure, is convolved with three 5 × 5 × 3 convolution kernels to create C1 feature maps, e.g., three 28×28 feature maps. This example uses three kernels because we plan to classify inputs into three categories. These operations can be repeated several times with various parameters to improve quality.

Fig. 1 The convolutional neural network model

Subsequently, max-pooling or average-pooling, referred to as x-pooling, reduces the redundancy of the feature maps generated by the convolution operations. Following the above example, as depicted in Fig. 1, the sizes of the feature maps are reduced by 2×, from 28 × 28 to 14 × 14 matrices. Finally, the outputs of the first main layer are flattened into 1D vectors for further processing with traditional neural networks in the fully connected layer, which classifies the original inputs into different categories.
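For completeness, a minimal Python sketch of 2 × 2 max-pooling (illustrative only, not the hardware implementation), which performs exactly this 28 × 28 to 14 × 14 reduction:

```python
import numpy as np

def max_pool2x2(F):
    """2 x 2 max-pooling with stride 2: each output value is the maximum of a
    non-overlapping 2 x 2 window, halving both spatial dimensions."""
    H, W = F.shape
    return F[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(28, 28)
print(max_pool2x2(fmap).shape)   # -> (14, 14)
```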

During the inference phase, the Convolution main layer, with its convolutional operations and sampling functions, is responsible for more than 99% of the processing time due to the several steps executed [7]. The typical 2D convolution operation (CONV2D) is given in Eq. 1.

$$F(x,y) = f\left( \sum\limits_{j=0}^{k-1}{\sum\limits_{i=0}^{k-1}{I(x,y,k,i,j)\times W(i,j)+b}}\right)$$
(1)

where F(x,y) is the convolution value of the point (x,y) in the output feature maps, f is the activation function used for CNN (e.g., Tanh, Sigmoid, etc.), k is the size of the kernel W (odd number), I(x,y,k,i,j) is a pixel value in the input feature maps, and b is the bias.
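To make Eq. 1 concrete, the following Python sketch evaluates it as naive nested loops for a single-channel input and one k × k kernel (a "valid" convolution with stride 1 is assumed; the activation f defaults to Tanh, one of the functions listed above). This is an illustrative software model, not the hardware implementation.

```python
import numpy as np

def conv2d_point(I, W, b, x, y, f=np.tanh):
    """F(x, y) of Eq. 1 for a single-channel input I, a k x k kernel W (k odd),
    bias b, and activation f. 'Valid' indexing with stride 1 is assumed."""
    k = W.shape[0]
    acc = 0.0
    for j in range(k):
        for i in range(k):
            acc += I[y + j, x + i] * W[i, j]   # MAC over the kernel window
    return f(acc + b)

def conv2d_naive(I, W, b, f=np.tanh):
    """Full output feature map obtained by sliding the kernel over the input."""
    k = W.shape[0]
    H, Wd = I.shape
    return np.array([[conv2d_point(I, W, b, x, y, f)
                      for x in range(Wd - k + 1)]
                     for y in range(H - k + 1)])
```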

2.1.2 MobileNet model

Along with several well-known CNN models, the MobileNet model [10], developed by Google, is an efficient CNN model for mobile vision applications. Hence, it is well suited to edge computing, and we therefore choose it as the case study for validating the proposed framework in this work. The model includes a first full convolution layer (i.e., the operations in Eq. 1) followed by 13 depthwise separable convolution layers, one average-pooling function, and a fully connected layer. A depthwise separable convolution layer consists of two operations: a depthwise and a pointwise convolution.

Depthwise convolution (DWCONV) computes the same operations as 2D convolution by applying each filter Wi to each channel mi of the feature maps F. However, DWCONV only filters the channels of a feature map to create multiple separate matrices, instead of adding them to create a new single feature map as CONV2D does. Therefore, a layer called pointwise convolution with a 1×1 kernel creates the new feature map. Compared to CONV2D, a reduction in computation of \(\left(\frac{1}{N} + \frac{1}{D_k^2}\right)\) can be obtained [10], where N is the number of feature maps generated and Dk is the size of the kernel. Taking the above example with N = 9 and Dk = 3, we can save about 4.5× computing time.
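As a behavioural illustration (not the paper's implementation), the sketch below factorizes a convolution into its depthwise and pointwise steps and computes the \(\frac{1}{N} + \frac{1}{D_k^2}\) cost ratio; the shapes and counts are illustrative.

```python
import numpy as np

def depthwise_separable(F, W_dw, W_pw):
    """Depthwise + pointwise convolution on a feature map F (H x W x M).
    W_dw: Dk x Dk x M (one filter per channel), W_pw: M x N (1x1 kernels)."""
    H, Wd, M = F.shape
    Dk = W_dw.shape[0]
    Ho, Wo = H - Dk + 1, Wd - Dk + 1
    # Depthwise: filter each channel independently, keep channels separate.
    dw = np.zeros((Ho, Wo, M))
    for m in range(M):
        for y in range(Ho):
            for x in range(Wo):
                dw[y, x, m] = np.sum(F[y:y+Dk, x:x+Dk, m] * W_dw[:, :, m])
    # Pointwise: 1x1 convolution combines the M channels into N output maps.
    return dw @ W_pw                     # (Ho, Wo, M) x (M, N) -> (Ho, Wo, N)

def cost_ratio(M, N, Dk, Ho, Wo):
    """MAC-count ratio of depthwise-separable vs. standard convolution."""
    standard = Dk * Dk * M * N * Ho * Wo
    separable = Dk * Dk * M * Ho * Wo + M * N * Ho * Wo
    return separable / standard          # equals 1/N + 1/Dk**2

print(cost_ratio(M=3, N=9, Dk=3, Ho=26, Wo=26))  # ~0.22, i.e. ~4.5x fewer MACs
```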

2.1.3 Hardware/Software co-design approach

One of the most prominent approaches to continue improving the performance of computing systems at the end of Moore's law is hardware acceleration, where specific hardware computing cores are used for computationally intensive functions [11]. To exploit hardware accelerator computing cores well, hardware/software co-design is a promising approach in which the hardware and software levels are optimized concurrently for a particular application or domain. The design flow of this approach is illustrated in Fig. 2. Hardware/software co-design can offer many advantages at the system level compared to other techniques, such as better performance, shorter time-to-market, and faster and better integration [12]. The approach can target various platforms, including embedded systems, application-specific instruction-set processors (ASIPs), and reconfigurable computing (FPGAs). FPGA-based computing provides better flexibility with good performance; thus, it is more suitable for accelerating computationally intensive applications like CNNs.

Fig. 2 The hardware/software co-design model [13]

On the other hand, FPGA accelerators cannot offer as high performance as ASICs due to their lower working frequency. However, the FPGA-based approach is the best candidate for frequently updated computing systems because of its flexibility. Hence, in this work, we target FPGA platforms for accelerating CNN-based applications in edge computing.

2.2 Related work

In this section, we summarize FPGA-based CNN accelerators in the literature. Based on this survey, we then present the design challenges in this topic that the ultimate design framework must take into account.

One of the most significant advantages of FPGAs is their high level of parallelism. Therefore, many proposals in the literature build multiple processing elements (PEs) or unroll time-consuming loops to improve computing throughput. We refer to this technique as increasing parallelism. The FPGA-based Aristotle architecture for accelerating CNNs was introduced by K. Guo et al. [14]. In the Aristotle architecture, several CONV2D operations are conducted by multiple PEs connected in an array in the FPGA fabric. With the support of a pipeline model and several PEs, the system can achieve a throughput of 1 pixel per cycle. The prototype system is built on an SoC platform, including a host processor, FPGA-based PEs for CONV2D, and external memory. Shared memory interconnects the hardwired host CPU and the PEs in the reconfigurable fabric to transfer data. Similar studies using the same optimization approach have also been reported. H. Li et al. [15] presented an accelerator core for CNNs that allows all layers to be computed concurrently in a pipeline. X. Lin et al. designed an FPGA-based CNN core with a mapping method for multiple layer clusters [16]. A CNN calculation core on reconfigurable computing with parallel structures was proposed in [17]. To achieve the highest possible performance, Y. Ma et al. used all available hardware resources to synthesize their FPGA-based CNN architecture [18]. Y. Li et al. [19] presented a novel binarized architecture for parallel convolution. X. Zhang et al. [20] proposed an FPGA-based CNN accelerator with a fine-grained layer-based pipeline architecture. A. Podili et al. [21] developed a convolution computing engine with the highly parallelized Winograd algorithm. M. Motamedi et al. [22] exploited all hardware resources to build a deep CNN parallel core.

Although offering better performance than traditional processors, FPGAs usually suffer from low working frequencies, on the order of a few hundred MHz. To further improve the throughput of computing systems, reducing the complexity of the convolution operations is a good approach, e.g., quantization, pruning, shift operators, or Binarized Neural Networks (BNNs). Y. Yang et al. [23] proposed an FPGA-based convolution core and a novel ConvNet model with 1 × 1 kernels called Synetgy. With 1-bit weights and 4-bit inputs, outputs, and activation functions, Synetgy obtains a remarkable performance increase. M. Ghasemzadeh et al. [24] designed ReBNet, which uses the XNOR operation to compute binary CNNs on FPGA devices. L. Jiao et al. [25] presented a low-bit-width CNN computing core for FPGA platforms. D. Moss et al. [26] proposed FPGA-based CNN cores using binary values for weights and activation functions. N. Hiroki et al. [27] used a streaming XNOR-multiply-accumulate circuit to design an FPGA-based BNN architecture. E. Nurvitadhi et al. [28] introduced a hardware-based architecture for BNNs. B. Prost et al. [29] proposed an FPGA-based architecture to accelerate convolutional ternary neural networks. Y. Umuroglu et al. [30] presented a novel optimization for building BNNs on FPGAs called FINN. Z. Ritchie designed FPGA-based CNN cores with very low floating-point precision. S. Liang et al. [31] proposed the FP-BNN architecture, in which XNOR and SHIFT operators are used instead of multipliers. S. Cao et al. [32] offered a novel and efficient FPGA accelerator for CNNs called Bank-Balanced Sparsity. S. Kala et al. [33] presented FPGA-based CNN PEs to process the Winograd filtering algorithm and general element-wise matrix multiplications. J. Wang et al. [34] designed a hybrid extremely-low-bit-width CNN FPGA architecture called ELB-NN. Caiwen Ding et al. [35] introduced the REQ-YOLO FPGA-based framework for developing efficient object detection applications.

One of the most significant disadvantages of FPGA devices is the limited amount of on-chip memory (Block RAM). The data-reuse optimization approach can reduce the communication overhead when transferring data between the on-chip buffers and external memory. In modern FPGA-based computing systems, data communication overhead is usually responsible for up to 50% of the overall processing time [36]. Therefore, system performance can be improved by reducing the communication overhead. J. Wang et al. [37] proposed a ping-pong data-reuse scheme to avoid data communication between external and internal memory. L. Lu et al. [38] accelerated matrix multiplication in CNNs by using an FPGA-based architecture that employs line-buffer structures. Y. Guan et al. [39] designed the FP-DNN framework to optimize the communication bandwidth of FPGA-based CNN cores. Q. Jiantao et al. [40] reduced the memory footprint of FPGA-based CNN cores to improve performance. X. Wei et al. [41] accelerated a CNN core by using a systolic array architecture for low global data transfer. Y. Shen et al. [42] introduced an FPGA-based CNN acceleration architecture called Escher, focusing on data buffered on-chip. K. Guo et al. [43] designed an FPGA-based CNN architecture called Angle-Eye to fully utilize the on-chip buffer. Besides the above proposals, other studies also exist in the literature; however, we focus only on recent publications in top conferences and journals.

Although many approaches have been proposed in recent years, there are still five open issues that an FPGA-targeted framework for CNNs should consider. Figure 3 summarizes these research topics and their difficulty levels.

Fig. 3 Challenges of CNN acceleration with hardware techniques [13]

Among the five open issues, design space exploration is the most difficult because most current FPGA-based CNN accelerators are designed manually. Therefore, an automated design framework is needed to explore the design space quickly.

3 Proposed framework

This section introduces our proposed framework for accelerating CNN-based edge computing on FPGA platforms. As mentioned above, we use the MobileNet model as our case study for our preliminary results. Therefore, we present our FPGA-based CNN computing core suitable for the MobileNet model.

3.1 FPGA-targeted hardware/software co-design framework

Figure 4 depicts the abstract logical architecture of our proposed framework, which consists of services and tools at the software level and architectures and cores at the hardware level. At the bottom of the framework, FPGA-based CNN cores and other IP cores, such as encryption/decryption, security, or cryptography cores, are built on the FPGA fabric. The cores communicate via an interconnect infrastructure such as a bus, a network-on-chip, or a shared buffer. Along with those accelerators, the Host processor is responsible for processing the other parts of the applications and for controlling the computing cores. The host processor communicates with the computing cores in the reconfigurable fabric through shared memory. The Configuration registers allow user applications to handle the computing cores, e.g., updating the CNN cores' parameters. Direct memory access (DMA) transfers data between user-space addresses and the local buffers of the CNN and IP cores. We explain the details and purposes of these buffers in the next section.

Fig. 4 The hardware/software co-design framework for accelerating CNN-based applications

At the software level, the framework is equipped with operating services and instruction set extensions for managing the computing cores (CNN cores and other IP cores). The services help user applications handle the computing cores: starting them, transferring data, checking completion, etc. Services can be developed as Application Programming Interfaces (APIs) or as ISA extensions of the host processor. On top of the OS layer with APIs and ISA extensions, several tools are provided, including a Software Development Kit (SDK) for developing software, an Integrated Development Environment (IDE) for configuring the host processor and computing cores, and High-Level Synthesis (HLS) for building IP cores from high-level programming languages. One of the essential tools in this layer is the Modelling tool for performance evaluation. With the support of Modelling, designers can evaluate the performance, accuracy, and many other aspects of the deployed systems to explore the design space. Finally, the application layer allows programmers to quickly develop CNN-based applications and related graphical user interfaces.

All OS services, whether APIs or ISA extensions, and the SDK/IDE tools are processed in software. IP cores such as hashing, security, or cryptography cores depend on the application and are synthesized from high-level programming language code with HLS. However, we implement the CNN cores with hardware description languages to further improve processing performance. Therefore, the next section introduces our MobileNet CNN cores used for the preliminary results.
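To make the service layer more concrete, here is a minimal, hypothetical Python sketch of the kind of API such OS services could expose for handling a computing core (configure, start, check completion). The class and register names are illustrative assumptions, not the framework's actual interface.

```python
class RegisterFile:
    """Stub for the memory-mapped Configuration registers (illustrative only)."""
    def __init__(self):
        self._regs = {}
    def write(self, name, value):
        self._regs[name] = value
    def read(self, name):
        return self._regs.get(name, 0)

class CnnCoreService:
    """Hypothetical OS-level service for one FPGA-based CNN core, wrapping the
    operations described above: configure, start, and check completion."""
    def __init__(self, regs):
        self.regs = regs
    def configure_layer(self, kernel_size, mode):
        # 'mode' would select the conv2d / depthwise / pointwise behaviour.
        self.regs.write("kernel_size", kernel_size)
        self.regs.write("mode", mode)
    def start(self):
        self.regs.write("start", 1)
    def is_done(self):
        # In hardware this bit would be set by the core; here it is a stub.
        return self.regs.read("status") == 1

svc = CnnCoreService(RegisterFile())
svc.configure_layer(kernel_size=3, mode="depthwise")
svc.start()
```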

3.2 CNN accelerating core architecture

In this section, we present our CNN accelerating core based on the MobileNet model for convolution computing mentioned in Section 2.1.2. Figure 5 illustrates the generic architecture of our FPGA-based MobileNet CNN computing core.

Fig. 5 The FPGA-based CNN accelerator core

As depicted in the figure, the CNN accelerator core includes three sub-cores to compute 2D convolution (Conv2d core), depthwise convolution (Depthwise Conv2d core), and pointwise convolution (Pointwise Conv2d core) for the MobileNet model presented before. Details of these cores are explained in the following sections. Along with the computing cores for calculating the various convolutions, the CNN accelerator core also hosts several buffers for different purposes.

  • Input global buffer: storing input images for the MobileNet model. The buffer should hold an image of size 224 × 224 × 3 (resolution 224 × 224 with three color channels). In other words, the buffer size should be at least 147 KB.

  • Weight global buffer: keeping the kernels for filtering inputs. According to the MobileNet model, the size of this buffer varies, ranging from hundreds of bytes to some megabytes, because the number of kernels can reach 1,024.

  • Bias global buffer: storing bias values for each layer. That is the b value in Eq. 1.

  • Output global buffers: there are two output buffers, as illustrated in Fig. 5. They are used to switch data between the Depthwise and Pointwise cores at different layers. Based on the MobileNet model, the size of these buffers is usually two times larger than the input buffer.
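A quick sanity check of the buffer sizes listed above (a small sketch; only the 147 KB input-buffer figure and the roughly-2× rule for the output buffers come from the text, and 8-bit data is assumed):

```python
# Input global buffer: a 224 x 224 image with three color channels,
# assuming one byte (8-bit) per value.
input_buffer_bytes = 224 * 224 * 3
print(input_buffer_bytes / 1024)            # ~147 KB, as stated above

# Output global buffers: per the text, roughly twice the input buffer each.
output_buffer_bytes = 2 * input_buffer_bytes
print(output_buffer_bytes / 1024)           # ~294 KB
```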

3.2.1 Convolution 2D core

Figure 6 illustrates the architecture of our FPGA-based Conv2d core for 3 × 3 kernels. In general, the core includes several KCPEs (Kernel-Convolution Processing Elements) for computing the multiply-accumulate (MAC) operations. The ideal number of KCPEs per core equals the number of elements in a kernel; however, it depends on the available hardware resources. If the number of kernel elements is greater than the number of available KCPEs, the core needs to perform multiple iterations. The KCPEs are organized in matrix form like the kernels, e.g., 3 × 3 KCPEs for 3 × 3 kernels.

Fig. 6 The FPGA-based convolution 2D core

When processing, the weight values of the kernels are loaded into the KCPEs first. The core then collects the elements of an input image row by row to reuse data temporally and spatially [44]. The number of collected elements equals the kernel matrix dimension (e.g., three elements at a time for 3 × 3 kernels). These values are delivered to all rows of the KCPE matrix. For example, KCPE #0 to KCPE #2 in the first row of Fig. 6 receive the same set of three values as KCPE #3 to KCPE #5 and KCPE #6 to KCPE #8. The KCPEs multiply the input values by the stored weights to create a partial sum of the MAC. These results are cached in the Psum local buffer. When the next row of the input is computed, the values in the Psum local buffer are added to the new KCPE results. We call this mechanism a row-oriented pipeline computing model, illustrated in Fig. 7.

Fig. 7 The proposed row-oriented pipeline computing model for the CNN core
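The row-oriented pipeline computing model can be captured by the following behavioural Python sketch (a software model, not the RTL): input rows are streamed once, each incoming row contributes one kernel-row MAC to the partial sums of up to k output rows, and those partial sums play the role of the Psum local buffer.

```python
import numpy as np

def conv2d_row_pipeline(I, W):
    """Row-oriented 'valid' 2D convolution with a k x k kernel: input rows are
    streamed once, and each incoming row contributes one kernel-row MAC to the
    partial sums of up to k output rows (the role of the Psum local buffer)."""
    H, Wd = I.shape
    k = W.shape[0]
    Ho, Wo = H - k + 1, Wd - k + 1
    psum = np.zeros((Ho, Wo))             # partial sums, completed row by row
    for r in range(H):                    # stream input rows one by one
        for j in range(k):                # kernel row j is held by one KCPE row
            y = r - j                     # output row this contribution updates
            if 0 <= y < Ho:
                for x in range(Wo):
                    psum[y, x] += np.dot(I[r, x:x + k], W[j])
    return psum                           # psum now holds the finished outputs

# Sanity check against a direct sliding-window convolution
I = np.random.rand(6, 6)
W = np.random.rand(3, 3)
ref = np.array([[np.sum(I[y:y+3, x:x+3] * W) for x in range(4)] for y in range(4)])
assert np.allclose(conv2d_row_pipeline(I, W), ref)
```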

3.2.2 Depthwise convolution core

As mentioned in Section 2.1.2, depthwise convolution is quite similar to the standard convolution except that the outputs for the channels are kept separate. Therefore, we can reuse the architecture of the Conv2d core by disabling the output-adding behavior.

According to the MobileNet model, after the standard convolution in the first layer, multiple depthwise convolution operations with many kernels, up to 1,024, are conducted. Thanks to the large amount of hardware resources in FPGA devices, we can replicate the Conv2d core k times (\(k = 2^n\)) so that up to k kernels can be processed concurrently. Hence, we can improve computing performance dramatically. Figure 8 illustrates the architecture of the Depthwise Conv2d core with 8 replications (n = 3). The number of replications depends on the hardware resources available.

Fig. 8 The FPGA-based depthwise computing core
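A trivial scheduling sketch of the replication idea discussed above: with \(k = 2^n\) replicated Depthwise Conv2d cores, the channels of a layer are processed k at a time. The helper below is purely illustrative.

```python
def schedule_depthwise(num_channels, k=8):
    """Assign channels of a feature map to k replicated Depthwise Conv2d cores;
    in each round, up to k channels are filtered concurrently (k = 2**n, n = 3 here)."""
    rounds = []
    for start in range(0, num_channels, k):
        rounds.append(list(range(start, min(start + k, num_channels))))
    return rounds

# e.g. a 64-channel layer finishes in 8 rounds with 8 replicated cores
print(len(schedule_depthwise(64, k=8)))   # -> 8
```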

3.2.3 Pointwise convolution core

Unlike the Depthwise Conv2d core, the Conv2d core cannot be reused to build the Pointwise Conv2d core because 1 × 1 kernels are applied. Therefore, we use the output-stationary dataflow [45] to design the Pointwise Conv2d core. In this approach, the weight and input values are updated every computing cycle. Partial sums are accumulated in an array of registers until the final sum is reached. The final results are then quantized to 8-bit integers and stored in the Output global buffer. Figure 9 depicts an 8 × 8 PE FPGA-based Pointwise computing core. In general, the number of PEs in the Pointwise Conv2d core is k × k (\(k = 2^n\)), where k is the number of kernels the Depthwise Conv2d core can process in every computing cycle.

Fig. 9 The FPGA-based pointwise computing core

When processing, weight values are fetched from the Weight global buffer into the local FIFO buffers Wij of the PEs (at column i and row j). Input values are streamed from the Input global buffer and multicast to all columns of PEs. After every computing round (multiplying weight and input values), the partial sums are stored in registers as mentioned above, while the weights of the next kernels are fetched for the next round.
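Behaviourally, the output-stationary dataflow of the Pointwise Conv2d core can be sketched as follows (a software model with integer data assumed, not the RTL): every PE keeps one partial sum stationary in a register while a new weight and the multicast input value arrive each cycle.

```python
import numpy as np

def pointwise_output_stationary(F, W_pw):
    """Pointwise (1x1) convolution with an output-stationary dataflow.
    F: (H, W, M) integer feature map, W_pw: (M, N) 1x1 kernels (integer weights).
    Each 'PE' holds one partial sum stationary while weights and inputs are
    streamed in every cycle; the final sum is written out once."""
    H, Wd, M = F.shape
    N = W_pw.shape[1]
    out = np.zeros((H, Wd, N), dtype=np.int32)
    for y in range(H):
        for x in range(Wd):
            acc = np.zeros(N, dtype=np.int32)   # register array of partial sums
            for m in range(M):                  # stream one input channel per cycle
                acc += F[y, x, m] * W_pw[m]     # the same input value is multicast
            out[y, x] = acc
    return out
```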

4 System implementation

This section presents our system implementation targeting a particular FPGA platform. For the prototype version with preliminary results, we use the Ultra96v2 board [46] with the Zynq UltraScale+ MPSoC FPGA platform [47]. The MPSoC platform includes a hardwired quad-core Arm Cortex-A53 used as the host processor, while the FPGA fabric is used to build the CNN cores, buffers, and DMA core. Figure 10 depicts the implementation of our prototype system. Off-chip memory stores the input images, the trained weight values, and the computed outputs. An AXI bus is used as the backbone interconnect between the computing cores and the buffers.

Fig. 10 An FPGA-based testing implementation for the MobileNet model

As mentioned above, the Conv2d core and the Depthwise Conv2d core share the same architecture. Therefore, we build a single computing core for both, as depicted in Fig. 11. The Configuration registers block is used to switch the behavior of the core. The outputs of the cores are quantized by the Quantization module, which scales 32-bit integer numbers down to 8-bit integer numbers for optimized processing on mobile platforms.

Fig. 11 The implementation of the Conv2d and Depthwise Conv2d cores
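The paper does not spell out the exact quantization rule, so the following is only a plausible sketch of how the Quantization module might scale 32-bit accumulators down to 8-bit integers; a per-layer power-of-two scale with saturation is assumed for illustration.

```python
import numpy as np

def quantize_to_int8(acc32, shift):
    """Scale 32-bit accumulator values down to signed 8-bit integers.
    A simple arithmetic right-shift (power-of-two scale) with saturation is
    assumed here; the actual Quantization module may use a different scheme."""
    scaled = np.right_shift(acc32.astype(np.int32), shift)
    return np.clip(scaled, -128, 127).astype(np.int8)

acc = np.array([70000, -1500, 42], dtype=np.int32)
print(quantize_to_int8(acc, shift=9))   # -> [127  -3   0] after saturation
```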

In the MobileNet model, the first-layer kernels are 3 × 3 × 3 for 3-channel input images. Hence, we build 9 KCPEs for the Conv2d core. Due to hardware resource limitations, the number of replications for the Conv2d core and Depthwise Conv2d core is k = 8 (n = 3). Therefore, the PE matrix in the Pointwise core is 8 × 8. We use one 64 KB BRAM for the Input global buffer to store images of 128 × 128 × 3. The Weight buffer is built with four standard 64 KB block RAMs (32-bit width) in an interleaved organization that allows 128-bit-wide accesses. The interleaving, in turn, optimizes the kernel-fetching phases of the computing cores. The same interleaving method is used for Output buffer #0, with two standard 128 KB block RAMs, and Output buffer #1, with two standard 64 KB block RAMs, so that the Depthwise Conv2d core and Pointwise Conv2d core can fetch 8 channels in parallel. Along with those global buffers, the Conv2d core requires Psum local buffers built with one 256-byte standard block RAM per buffer. In total, we need approximately 740 KB of on-chip Block RAM for the buffers. These buffers and the off-chip memory are connected to a DMA (direct memory access) core via the AXI bus to transfer data among them.
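As a behavioural illustration of the interleaved weight-buffer organization described above, the sketch below spreads consecutive 32-bit words across four banks so that one row can be read from all banks at once as a 128-bit access; the addressing scheme is our assumption.

```python
NUM_BANKS = 4          # four 32-bit-wide block RAMs
WORDS = 16             # illustrative depth per bank

# Interleave: consecutive 32-bit words go to consecutive banks, so words
# {4a, 4a+1, 4a+2, 4a+3} sit at the same row 'a' of the four banks and can
# be fetched together as one 128-bit access.
banks = [[0] * WORDS for _ in range(NUM_BANKS)]

def write_word(addr, value):
    banks[addr % NUM_BANKS][addr // NUM_BANKS] = value

def read_128(row):
    """One 128-bit access: the same row from all four banks in parallel."""
    return [banks[b][row] for b in range(NUM_BANKS)]

for a in range(8):
    write_word(a, a + 100)
print(read_128(0))     # -> [100, 101, 102, 103], i.e. words 0..3 in one access
print(read_128(1))     # -> [104, 105, 106, 107]
```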

5 Experiments

This section evaluates our implemented prototype with the MobileNet model. First, we present the synthesis results of the system. We then report the processing time of our approach with different inputs and kernels and compare it with software-only processing.

5.1 Experimental setup

The system presented in Section 4 is implemented on the Ultra96v2 board, which includes an xczu3eg Xilinx FPGA with 70K+ LUTs, 950 KB of Block RAM, and a quad-core ARM Cortex-A53 MPCore running at up to 1.5 GHz. We use Verilog-HDL to develop all the computing cores and buffers in the programmable logic area of the FPGA chip. In addition, the ARM Cortex is used as the host processor, communicating with the computing cores through a register file.

To analyze and compare the performance of our system, we conduct the following experiments with the MobileNet model.

  (i) Software with ARM Cortex-A53 processor (SW-ARM): we execute only the convolution processing on the ARM processor at its maximum working frequency. We compare our work with ARM processors because both can be used for edge computing with low energy consumption and reasonable cost.

  (ii) Software with Intel processor (SW-INTEL): to analyze the performance of our system, we also compare with an Intel Core i7-9750H 2.4 GHz CPU with six physical cores (12 threads). The PyTorch library is used with this CPU to fully utilize the threads. Although this is not a fair comparison, because the Intel CPU targets high-performance computing and consumes much more energy than our platform, the results help demonstrate how well our system performs.

  (iii) Our system with the FPGA platform (HW-FPGA): we use the ARM Cortex-A53 processor as the host and our computing cores in the programmable logic (PL) area as accelerators. The host runs at 1.5 GHz while our accelerator cores work at 150 MHz. The ARM processor executes all parts of the MobileNet model except the standard, depthwise, and pointwise convolutions, which are done by our computing cores.

In terms of accuracy, all three systems produce the same output values without any difference.

5.2 Experimental results

In this section, we analyze our experimental results. First, the system is synthesized to extract the synthesis results, including the working frequency and resource usage. We then simulate the system with cycle-accurate simulation tools. Finally, we run the design on the target platform and measure execution time to analyze system performance.

5.2.1 Synthesis results

As mentioned above, we synthesize our system with the Xilinx Vivado Design Suite for the Ultra96v2 board with a target clock frequency of 150 MHz. Table 1 summarizes our hardware resource usage.

Table 1 Hardware resources usage of the proposed system

According to the table, our system almost fully uses the on-chip memory. This limitation prevents the system from having more Conv2d replications. However, a large number of LUTs, FFs, and DSPs are still available. Therefore, we have room to optimize the communication channels and the pipelining model in the next version.

The synthesis report also indicates that our system can function at a maximum working frequency of 159 MHz. The system-on-chip power is around 3.179 W, which is efficient enough for battery-powered edge computing applications.

5.2.2 Simulation results

We conduct a ModelSim simulation to validate our register-transfer-level (RTL) design before programming and executing the FPGA device. Figure 12 illustrates part of the execution waveform. The full_sum signal indicates the outputs of the computing core. As depicted in the waveform, the core can compute one output every two cycles.

Fig. 12 Simulation waveform of the proposed system

5.3 Performance analysis

Finally, to compare the performance of our FPGA-based system (HW-FPGA) with the ARM Cortex-A53 and the Intel processor, as mentioned above, we run different experiments with various channels, kernels, and inputs/outputs. Table 2 shows the results. To make a fair comparison, we only take the execution time of the convolutions into account (standard, depthwise, and pointwise convolutions). The seventh column depicts the processing time of our core, including the time for computing and the time for data transfer. The data transfer time (by DMA) is mostly hidden thanks to the pipelined transfer proposed in [48].

Table 2 Execution time (seconds) of convolutions on different platforms

This table reports five different experiments, including one standard convolution, two depthwise convolutions, and one pointwise convolution. The third column (Types) displays the data type, and its quantity is shown in the next column. The fifth column gives the size of the data in terms of elements (e.g., 128 × 128 × 3 means 3 channels with 128 × 128 values per channel). The next column reports the size of the data. In the 3rd and 5th experiments, the sizes of the input and output exceed the buffer sizes mentioned above; in these cases, we must use the DMA core to swap data between the on-chip buffers and off-chip memory. The last three columns represent the processing times on the three experimental platforms: HW-FPGA, SW-ARM, and SW-Intel. According to the table, both our system (HW-FPGA) and the Intel CPU (SW-Intel) outperform the ARM Cortex-A53 processor. Figure 13 depicts the speed-ups of our system and the Intel CPU compared to the ARM processor.

Fig. 13 Speedup of HW-FPGA and SW-Intel compared to SW-ARM

Our system outperforms the Intel CPU when processing the standard and depthwise convolutions. However, in the case of the pointwise convolution, our system needs more processing time than the Intel CPU. The reason is that many small kernels must be loaded from off-chip memory into our on-chip buffers. In addition, our data communication infrastructure works at a lower frequency than the Intel CPU.

6 Conclusion and future work

This paper proposes a framework for accelerating CNN computing cores with the hardware/software co-design approach for CNN-based edge computing applications. The method optimizes not only the hardware computing cores according to the targeted technology, FPGA in this case, but also the software services for handling the computing cores and the data communication. Thanks to this software handling, we can reuse convolution computing cores, more precisely the Conv2d core in this work, for different types of convolution (standard and depthwise). We then develop an efficient architecture for building a convolution computing core on FPGA, reuse its structure for the depthwise convolution core, and design a separate architecture for pointwise convolution. A prototype version built around the MobileNet model is implemented on an FPGA MPSoC platform. Experimental results show that we consistently outperform the quad-core ARM Cortex-A53 processor. In two experimental cases, we are slower than the Intel Core i7 CPU, but we beat the CPU in the other three. Future work will focus on buffer optimization and pipelining because the current version has almost run out of on-chip memory.