Abstract

Graphics processing units (GPUs) have a strong floating-point capability and a high memory bandwidth in data parallelism and have been widely used in high-performance computing (HPC). Compute unified device architecture (CUDA) is used as a parallel computing platform and programming model for the GPU to reduce the complexity of programming. The programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm of the message passing interface and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+UP upwind scheme and the three-step Runge–Kutta method are used for spatial discretization and time discretization, respectively. The turbulent solution is solved by the SST two-equation model. The CPU only manages the execution of the GPU and communication, and the GPU is responsible for data processing. Parallel execution and memory access optimizations are used to optimize the GPU-based CFD codes. We propose a nonblocking communication method to fully overlap GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams. Furthermore, the one-dimensional domain decomposition method is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm with the compressible turbulent flow over a flat plate. The performance of a single GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization can achieve a speedup of more than 36 times with respect to CPU-based parallel computing, and the parallel algorithm has good scalability.

1. Introduction

The developments of computer technology and numerical schemes over the past few decades have made computational fluid dynamics (CFD) an important tool in the optimal design of aircraft and the analysis of complex flow mechanisms [1, 2]. A large number of CFD applications can reduce development costs and provide technical support for research on aircraft. The scope and complexity of flow problems in CFD simulation are constantly expanding, and the grid size required for simulation is increasing. The rapid increase in grid size makes processing these huge data sets a challenge in engineering activities and scientific research. Traditionally, multi-CPU parallelization has been used to accelerate computation. However, the low degree of parallelism and the power inefficiency may limit the parallel performance of such clusters. Furthermore, the computing time depends largely on CPU upgrades, and in recent years the development of the CPU has become a bottleneck due to limitations in power consumption and heat dissipation [3, 4].

In CFD applications, a large amount of computing resources is required for complex flow problems, such as turbulent flow, reactive flow, and multiphase flow. High-performance computing (HPC) platforms, such as the graphics processing unit (GPU), the many integrated core (MIC) architecture, and the field programmable gate array (FPGA), exhibit more efficient parallel data processing than the CPU [5–8]. Faster and better numerical solutions can be obtained by executing CFD codes on these heterogeneous accelerators. In this paper, we discuss GPU parallelization in CFD applications.

The GPU has a strong floating-point capability and a high memory bandwidth in data parallelism. The latest Volta architecture Tesla V100 GPU delivers peak single-precision and double-precision floating-point performance of 14 and 7 TFLOP/s, respectively, which is much higher than the computing performance of the CPU. In 2006, NVIDIA introduced the compute unified device architecture (CUDA), which reduces the complexity of GPU programming [9]. The programmable GPU has evolved into a highly parallel, multithreaded, many-core processor. Therefore, GPU acceleration is becoming popular in general-purpose computing areas such as molecular dynamics (MD), direct simulation Monte Carlo (DSMC), CFD, artificial intelligence (AI), and deep learning (DL) [10–13].

In this work, we focus on the design and optimization of a hybrid parallel algorithm of the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The compressible Navier–Stokes equations are discretized based on the cell-centered finite volume method. The AUSM+UP upwind scheme and the three-step Runge–Kutta method are used for spatial discretization and time discretization, respectively. Moreover, the turbulent solution is solved by the shear stress transport (SST) two-equation model. In some previous work, the CPU was designed to perform data processing together with the GPU [14, 15]. However, for the latest Pascal or Volta architecture GPUs, the computing ability of the GPU far exceeds that of the CPU. In our design of the hybrid parallel algorithm, the CPU only manages the execution of the GPU and communication, and the GPU is responsible for data processing. Performance optimization involves three basic strategies: maximizing parallel execution to achieve maximum utilization, optimizing memory usage to achieve maximum memory throughput, and optimizing instruction usage to achieve maximum instruction throughput. In this study, parallel execution and memory access optimizations are investigated.

In a multi-GPU HPC cluster, ghost and singularity data are exchanged between GPUs. Some scholars use CUDA-aware MPI technology to accelerate data exchange [13, 16–18]. This technology not only makes it easier to develop a CUDA+MPI application but also allows acceleration technologies such as GPUDirect to be utilized by the MPI library. However, the hardware and software configurations used in this paper do not support this technology. Moreover, the use of the SST two-equation turbulence model increases the complexity of multi-GPU parallel programming. In our design of the hybrid parallel algorithm, we therefore stage GPU buffers through host memory. Two kinds of communication approaches are considered: blocking and nonblocking communication methods. The first approach uses the blocking functions MPI_Bsend and MPI_Recv without overlapping communication and computation. The second approach uses the nonblocking functions MPI_Isend and MPI_Irecv and fully overlaps GPU computation, CPU-CPU communication, and CPU-GPU data transfer. The nonblocking communication method can improve computational efficiency to some extent.

Multi-GPU parallelization can achieve the maximum performance by balancing the workload among GPUs based on domain decomposition [3, 19]. One-dimensional (1D), two-dimensional (2D), or three-dimensional (3D) domain decomposition is commonly used for GPU implementations. In this study, we design a 1D domain decomposition algorithm based on the idea of dichotomy so that each GPU is loaded with approximately the same grid scale. Although the 1D method needs to transfer more data, the 2D and 3D methods cannot achieve coalesced memory access in the global memory, which results in considerable performance loss when CFD applications are performed on multi-GPU HPC clusters.

In this paper, the design and optimizations of the parallel algorithm are closely related to the hardware configuration, the numerical schemes, and the computational grids so as to obtain the optimal parallel performance on multi-GPU HPC clusters. The main contributions of this work are summarized as follows:
(i) A hybrid parallel algorithm of MPI and CUDA for CFD applications implemented on multi-GPU HPC clusters is proposed, and optimization methods are adopted to improve the computational efficiency
(ii) Considering the CFD numerical schemes, a nonblocking communication mode is proposed to fully overlap GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams
(iii) A 1D domain decomposition method based on the idea of dichotomy is used to distribute the problem among GPUs to balance the workload
(iv) The proposed algorithm is evaluated with the flat plate flow application, and the parallel performance is analyzed in detail

The remainder of this paper is organized as follows. Section 2 discusses the related work on GPU-based parallelization and optimizations. Section 3 introduces the governing equations and numerical schemes. Section 4 describes the hybrid parallel algorithm and the optimizations in detail. Section 5 presents the performance evaluation results with the compressible turbulent flow over a flat plate. Section 6 provides the conclusion of this work and a plan for future work.

2. Related Work

In the field of CFD, GPU parallelization for CFD applications has achieved numerous remarkable results. Brandvik and Pullan [20, 21] developed 2D and 3D GPU solvers for the compressible, inviscid Euler equations. This was the first CFD application to use CUDA for the 2D and 3D solutions. Ren et al. [22] and Tran et al. [23] proposed a GPU-accelerated solver for turbulent flow based on the lattice Boltzmann method, and the solver can achieve a good acceleration performance. Khajeh-Saeed and Blair Perot [24] and Salvadore et al. [25] accomplished direct numerical simulation (DNS) of turbulence using GPU-accelerated supercomputers, which demonstrated that scientific problems could benefit significantly from advanced hardware. Ma et al. [26] and Zhang et al. [27] performed GPU computing of compressible flow problems by a meshless method. Xu et al. [14] and Cao et al. [28] described hybrid OpenMP + MPI + CUDA parallel computing of CFD codes. The results showed that the GPU-accelerated algorithm had sustainably improved efficiency and scalability. Liu et al. [29] proposed a hybrid solution method for CFD applications on CPU + GPU platforms, and a domain decomposition method based on the functional performance model was used to guarantee a balanced workload.

The optimization techniques are used to enhance the performance of GPU acceleration. Memory access and communication are the most critical parts of performance optimization. A review of optimization techniques and the specific improvement factors for each technique is shown in [4].

In a multi-GPU HPC cluster, GPUs cannot communicate directly for data exchange. Meanwhile, the blocking communication mode is simple to implement but has low efficiency. Several researchers have therefore designed nonblocking communication modes for GPU parallelization to overlap computation and communication. Mininni et al. [30] used the nonblocking communication method to overlap GPU computation and CPU-CPU communication. Thibault and Senocak [31], Jacobsen et al. [32], Castonguay et al. [33], and Ma et al. [34] implemented the nonblocking communication method with CUDA streams to fully overlap GPU computation, CPU-CPU communication, and CPU-GPU data transfer. Their results showed that fully overlapping computation and communication is the most efficient approach.

Domain decomposition is a commonly used method in parallel computing of CFD simulations to balance the workload on each processor. Jacobsen et al. [32] used 1D domain decomposition to decompose 3D structured meshes into 1D layers. Wang et al. [35] studied the HPC of atmospheric general circulation models (AGCMs) in Earth science research on multi-CPU cores. They indicated that an AGCM with 1D domain decomposition can only run on dozens of CPU cores; therefore, they proposed a 2D domain decomposition parallel algorithm for this large-scale problem. Baghapour et al. [36] executed CFD codes on heterogeneous platforms with 16 Tesla C2075 GPUs, where the solver works up to 190 times faster than a single core of a Xeon E5645 processor. They pointed out that 3D domain decomposition performs best in bandwidth-bound communication but not in latency-bound communication, in which 1D domain decomposition is preferred. Given that the GPU can execute many threads simultaneously and that the communication between the CPU and the GPU becomes a source of high latency for highly noncontiguous data transfers, the 1D domain decomposition method is the most suitable for balancing the workload on GPUs.

3. Governing Equations and Numerical Schemes

The simulation of a compressible turbulent flow is considered. All volume sources due to body forces and volumetric heating are ignored, and the integral form of the 3D Navier–Stokes equations for a compressible, viscous, heat-conducting gas can be expressed as follows [37]:

$$\frac{\partial}{\partial t}\int_{\Omega}\vec{W}\,d\Omega+\oint_{\partial\Omega}\left(\vec{F}_{c}-\vec{F}_{v}\right)dS=0, \qquad (1)$$

where $\vec{W}$ is the vector of conservative variables, $\vec{F}_{c}$ is the vector of convective fluxes, and $\vec{F}_{v}$ is the vector of viscous fluxes:

$$\vec{W}=\begin{bmatrix}\rho\\ \rho u\\ \rho v\\ \rho w\\ \rho E\end{bmatrix},\quad
\vec{F}_{c}=\begin{bmatrix}\rho V\\ \rho uV+n_{x}p\\ \rho vV+n_{y}p\\ \rho wV+n_{z}p\\ \rho HV\end{bmatrix},\quad
\vec{F}_{v}=\begin{bmatrix}0\\ n_{x}\tau_{xx}+n_{y}\tau_{xy}+n_{z}\tau_{xz}\\ n_{x}\tau_{yx}+n_{y}\tau_{yy}+n_{z}\tau_{yz}\\ n_{x}\tau_{zx}+n_{y}\tau_{zy}+n_{z}\tau_{zz}\\ n_{x}\Theta_{x}+n_{y}\Theta_{y}+n_{z}\Theta_{z}\end{bmatrix}, \qquad (2)$$

where $\rho$ is the density, $u$, $v$, and $w$ are the local Cartesian velocity components, $p$ is the static pressure, $E$ is the total energy, $H$ is the total enthalpy, $n_{x}$, $n_{y}$, and $n_{z}$ are the components of the unit normal vector of the cell surface, $V=n_{x}u+n_{y}v+n_{z}w$ is the velocity normal to the surface element $dS$, and $\Theta_{x}$, $\Theta_{y}$, and $\Theta_{z}$ describe the work of the viscous stresses and the heat conduction.

The shear stress transport (SST) turbulence model [38] merges Wilcox's $k$–$\omega$ model [39] with a high Reynolds number $k$–$\varepsilon$ model [40]. The SST two-equation turbulence model can be written in integral form as follows:

$$\frac{\partial}{\partial t}\int_{\Omega}\vec{W}_{T}\,d\Omega+\oint_{\partial\Omega}\left(\vec{F}_{c,T}-\vec{F}_{v,T}\right)dS=\int_{\Omega}\vec{Q}_{T}\,d\Omega, \qquad (3)$$

where $\vec{W}_{T}$ is the vector of conservative variables, $\vec{F}_{c,T}$ and $\vec{F}_{v,T}$ represent the convective fluxes and viscous fluxes, respectively, and $\vec{Q}_{T}$ is the source term:

$$\vec{W}_{T}=\begin{bmatrix}\rho k\\ \rho\omega\end{bmatrix}, \qquad (4)$$

where $k$ is the turbulent kinetic energy and $\omega$ is the specific dissipation rate. The components of the source term $\vec{Q}_{T}$ and the model constants are introduced in [37].

The spatial discretization of equation (1) on structured meshes is based on the cell-centered finite volume method. The upwind AUSM+UP scheme [41], which has high resolution and computational efficiency at all speeds, is used to compute the convective fluxes. Second-order accuracy is achieved through the monotone upstream-centered schemes for conservation laws (MUSCL) [42] with the Van Albada et al. limiter function [43]. The viscous fluxes are solved by the central scheme, and turbulence is modeled with the SST two-equation model.
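
For reference, one common one-dimensional form of the limited MUSCL reconstruction with a Van Albada-type limiter (a textbook form; the exact variant implemented in the present solver follows [42, 43]) is

$$u_{i+1/2}^{L}=u_{i}+\frac{1}{2}\,\varphi(r_{i})\,(u_{i+1}-u_{i}),\qquad r_{i}=\frac{u_{i}-u_{i-1}}{u_{i+1}-u_{i}},\qquad \varphi(r)=\frac{r^{2}+r}{r^{2}+1},$$

so that the reconstruction approaches second order in smooth regions ($\varphi(1)=1$) and falls back toward first order near extrema and discontinuities, where $\varphi$ tends to zero.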

The solution of equation (1) employs a separate discretization in space and in time, so that space and time integration can be treated separately. For the control volume $\Omega_{I}$, equation (1) is written in the following semi-discrete form:

$$\frac{d\vec{W}_{I}}{dt}=-\frac{1}{\Omega_{I}}\vec{R}_{I}, \qquad (5)$$

where $\vec{R}_{I}$ is the residual, that is, the net convective and viscous flux over the faces of cell $I$.

The three-step Runge–Kutta method [44] with third-order accuracy has good data parallelism and low storage overhead and is used for the temporal discretization of equation (5):

$$\vec{W}_{I}^{(0)}=\vec{W}_{I}^{n},\qquad
\vec{W}_{I}^{(k)}=\vec{W}_{I}^{(0)}-\alpha_{k}\,\frac{\Delta t_{I}}{\Omega_{I}}\,\vec{R}_{I}^{(k-1)},\quad k=1,2,3,\qquad
\vec{W}_{I}^{n+1}=\vec{W}_{I}^{(3)}, \qquad (6)$$

where $\Delta t_{I}$ is the time step of control volume $\Omega_{I}$, $\vec{R}_{I}$ stands for the residual, and the stage coefficients $\alpha_{k}$ are those given in [44].

4. Parallel Algorithm

4.1. Algorithm Description

The GPU implementation uses CUDA. CUDA is a general-purpose parallel computing platform and programming model for GPUs. CUDA provides a programming environment for high-level languages such as C/C++, Fortran, and Python. For NVIDIA GPUs, CUDA has wider applicability than other general programming models, such as OpenCL and OpenACC [9, 45–47]. In this study, we choose CUDA as the heterogeneous model to design GPU-accelerated parallel codes for CFD on GTX 1070 and Tesla V100 GPUs. A complete GPU code consists of seven parts, namely, getting the device, allocating memory, transferring data from the host to the device, executing kernels, transferring data from the device to the host, freeing memory, and resetting the device. In the CUDA programming framework, the execution of a GPU code can be divided into host code and device code, which are executed on the CPU and the GPU, respectively. The device code consists of kernel functions that are launched from the host and execute on the GPU. A kernel corresponds to a thread grid, which consists of several thread blocks. One thread block contains multiple threads, and a thread is the smallest execution unit of a kernel. Threads within a block can cooperate by sharing data through shared memory. Although the same instructions are executed on the threads, the processed data are different; this mode of execution is called single instruction multiple thread (SIMT).

The GPU is rich in computing power but poor in memory capacity, whereas the CPU is the opposite. To run CFD applications on a GPU, we must fully utilize both the CPU and the GPU. Thus, the GPU is responsible for executing kernel functions, and the CPU only manages the execution of the GPU and communication. In a GPU parallel computing program, a thread is the smallest unit of kernel execution. Threads are executed concurrently on the streaming multiprocessor (SM) in groups called warps, and a warp contains 32 threads. Therefore, the optimal number of threads in each thread block is an integer multiple of 32. For current devices, the maximum number of threads in a thread block is 1024. The more threads each thread block contains, the fewer thread blocks an SM can schedule concurrently; when thread synchronization is involved, this diminishes the GPU's ability to hide memory access and instruction latencies through multithreading. Too few threads in the thread block will result in idle threads and thereby insufficient device utilization. In this study, a thread block size of 256 is used. The number of thread blocks is determined by the scale of the workload to ensure that each thread is loaded with the computation of one grid cell.
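
As a minimal sketch of this structure (hypothetical kernel and array names, not the solver's actual code), the following CUDA program walks through the seven parts listed above with a 256-thread block size and one thread per cell:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical SIMT kernel: every thread updates one cell of a 1D array.
__global__ void scaleKernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;                   // guard threads beyond the workload
}

int main()
{
    const int n = 1 << 20;                          // number of cells (example value)
    const size_t bytes = n * sizeof(float);
    float* h_a = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

    cudaSetDevice(0);                                       // 1. get the device
    float* d_a = NULL;
    cudaMalloc(&d_a, bytes);                                // 2. allocate device memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // 3. host-to-device transfer

    int Thread_size = 256;                                  // thread block size
    int Block_size = (n + Thread_size - 1) / Thread_size;   // one thread per grid cell
    scaleKernel<<<Block_size, Thread_size>>>(d_a, 2.0f, n); // 4. kernel execution

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    // 5. device-to-host transfer
    cudaFree(d_a);                                          // 6. free device memory
    cudaDeviceReset();                                      // 7. reset the device

    printf("h_a[0] = %f\n", h_a[0]);
    free(h_a);
    return 0;
}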

MPI and OpenMP are two application programming interfaces that are widely used for running parallel codes on a multi-GPU platform. OpenMP can be utilized within a node for fine-grained parallelization using shared memory, whereas MPI works on shared and distributed memories and is widely used for massive parallel computations. In general, MPI exhibits some performance loss compared with OpenMP but is relatively easy to run on different types of hardware and has good scalability [48, 49]. In this work, MPI-based communication for shared or distributed memory is hybridized with CUDA to implement large-scale computation on multi-GPU HPC clusters. The parallel algorithm for CFD on multi-GPU clusters based on MPI and CUDA is shown in Algorithm 1. Note that "Block_size" and "Thread_size" stand for the number of thread blocks and the thread block size, respectively. At the beginning of the calculation, the MPI_Init function is called to enter the MPI environment, and the Device_Query function is called to query and select the GPU. Then, data are transferred from the CPU to the GPU using the cudaMemcpy function with cudaMemcpyHostToDevice, and a series of kernel functions, including the processing of boundary conditions, the computation of the local time step, the calculation of the gradients of the primitive variables, the calculation of the fluxes, and the update of the primitive variables, is executed on the GPU. Among these kernel functions, the calculation of gradients and fluxes consumes the largest amount of time. For multi-GPU parallel computing, data exchange is required between GPUs, and the Primitive_Variables_Exchange and Grad_Primitive_Variables_Exchange functions are used to exchange the primitive variables and their gradients, respectively. After the kernel iteration ends, the cudaMemcpy function with cudaMemcpyDeviceToHost is called to transfer the data from the GPU to the CPU for postprocessing. Finally, the MPI_Finalize function is called to exit the MPI environment.

(1)MPI_Init (&argc, &argv);
(2)Device_Query ( );
(3)cudaMemcpy (d_a, h_a, sizeof(float) * n, cudaMemcpyHostToDevice);
(4)//Kernel execution start
(5)for i = 0; i < max_step; i++ do
(6) Boundary_Processing_GPU<<<Block_size, Thread_size>>> ( );
(7) Time_Step_GPU<<<Block_size, Thread_size>>> ();
(8) Primitive_Variables_Exchange ();
(9) Grad_Primitive_Variables_GPU<<<Block_size, Thread_size>>> ( );
(10) Grad_Primitive_Variables_Exchange ( );
(11) Flux_GPU<<<Block_size, Thread_size>>> ( );
(12) Primitive_Variables_Update_GPU<<<Block_size, Thread_size>>> ( );
(13)end for
(14)//kernel execution end
(15)cudaMemcpy (h_a, d_a, sizeof(float) * n, cudaMemcpyDeviceToHost);
(16)Flow_post-processing ( );
(17)MPI_Finalize ( );

Data transfer is essential in multi-GPU parallel computing. In this paper, the exchange of data is done by the CPUs controlling the GPUs. The data transfer process between GPUs is shown in Figure 1. First, we call the cudaMemcpy function with cudaMemcpyDeviceToHost to transfer data from the device to the relevant host through the PCI-e bus. Then, the data are transferred between CPUs with MPI. Finally, we call the cudaMemcpy function with cudaMemcpyHostToDevice to transfer the data from the host to the target device through the PCI-e bus.
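
A minimal sketch of this staged exchange (blocking variant, with hypothetical buffer names; the combined MPI_Sendrecv is used here only for brevity in place of the separate send and receive calls of Algorithm 3) could look as follows:

#include <mpi.h>
#include <cuda_runtime.h>

// Exchange one ghost layer of n floats with a neighboring rank by staging
// the GPU buffers through host memory (device -> host -> MPI -> host -> device).
void exchange_ghost_layer(const float* d_send, float* d_recv,
                          float* h_send, float* h_recv,
                          int n, int neighbor_rank, int tag)
{
    size_t bytes = n * sizeof(float);

    // 1. Copy the boundary data from the device to the host through the PCI-e bus.
    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);

    // 2. Exchange the host buffers between the two MPI processes.
    MPI_Sendrecv(h_send, n, MPI_FLOAT, neighbor_rank, tag,
                 h_recv, n, MPI_FLOAT, neighbor_rank, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 3. Copy the received ghost data from the host to the target device.
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);
}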

4.2. Algorithm Optimizations

The performance of the GPU parallel program can be optimized using two main methods: parallel execution optimization for the highest degree of parallelism utilization and memory access optimization for maximum memory throughput.

4.2.1. Parallel Execution Optimization

CUDA programs have two execution modes: synchronous and asynchronous. In synchronous mode, control does not return to the host until the current kernel function has finished executing. In asynchronous mode, control returns to the host immediately once the kernel function is launched; the host can therefore launch new kernel functions and perform data exchange simultaneously. A stream is a sequence of commands that is managed by the CUDA program to control device-level parallelism. Commands within one stream are executed in order, but commands from different streams can be executed concurrently. Thus, the concurrent execution of multiple kernel functions, which is called asynchronous concurrent execution, can be implemented via streams. The massive parallel computing of CFD on a GPU requires a series of kernel functions to be executed, and the concurrent execution of these kernel functions can be implemented via streams. CUDA creates and destroys streams through the cudaStreamCreate and cudaStreamDestroy functions, and the host can wait for all previously issued device work to complete through the cudaDeviceSynchronize function. The implementation of the asynchronous concurrent execution algorithm of CFD on the GPU is shown in Algorithm 2. The initialization of the gradients and residuals, the boundary condition processing, and the time step calculation are executed concurrently by creating separate streams. In addition, the calculation of the inviscid fluxes can be executed concurrently with the calculation of the gradients of the primitive variables.

(1)cudaStreamCreate (&stream[j]); //create streams j = 0, 1, 2, 3
(2) Boundary_Processing_GPU<<<Block_size, Thread_size, 0, stream[0]>>>( );
(3) Time_Step_GPU<<<Block_size, Thread_size, 0, stream[1]>>>( );
(4) Grad_Initial<<<Block_size, Thread_size, 0, stream[2]>>>( );
(5) RHS_Initial<<<Block_size, Thread_size, 0, stream[3]>>>( );
(6)cudaDeviceSynchronize ( );
(7) Grad_Primitive_Variables_GPU<<<Block_size, Thread_size, 0, stream[0]>>>( );
(8) Convective_Flux_GPU<<<Block_size, Thread_size, 0, stream[1]>>>( );
(9)cudaDeviceSynchronize ( );
(10) Viscous_Flux_GPU<<<Block_size, Thread_size, 0, stream[0]>>>( );
(11) RHS_GPU<<<Block_size, Thread_size, 0, stream[0]>>>( );
(12) Primitive_Variables_Update_GPU<<<Block_size, Thread_size, 0, stream[0]>>>( );
(13)cudaDeviceSynchronize ( );
(14)cudaStreamDestroy (stream[j]); //destroy streams j = 0, 1, 2, 3
4.2.2. Memory Access Optimization

The data transfer between the host and the device is realized via the PCI-e bus, whose transmission speed is considerably lower than the GPU memory bandwidth. Therefore, the application should reduce the data exchange between the host and the device as much as possible. In this work, the kernel iteration is performed entirely on the GPU, and data transmission only occurs at the beginning and the end of the kernel iteration. Intermediate variables can be created in the device memory and released after the calculation is completed. However, for multi-GPU parallel computing, data transfer between the host and the device is inevitable because of the way GPUs communicate.

The GPU provides different levels of memory: global memory, texture memory, shared memory, and registers. This memory hierarchy allows a kernel to reduce the number of accesses to the global memory. The global memory resides in the video memory and has a large access latency; the maximum bandwidth can only be obtained by employing coalesced memory access. Texture memory is also part of the video memory. Compared with the global memory, the texture memory can use a cache to improve the data access speed and obtain a high bandwidth without strictly observing the conditions of coalesced memory access. The shared memory has a bandwidth much higher than that of the global and texture memories; data sharing among the threads of a block can be realized by storing frequently used data in the shared memory. The registers are the exclusive storage of a thread and hold the variables declared in the kernel to accelerate data processing. The GPU computing efficiency can be improved by properly using the texture memory, the shared memory, and the registers and by reducing the number of accesses to the global memory.
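
The following illustrative kernel (a generic 1D stencil with hypothetical names, not one of the solver's flux kernels) sketches how these levels are typically combined: coalesced global loads and stores, a shared-memory tile reused by all threads of a block, and register variables for intermediate results. It assumes the 256-thread block size used in this work.

// Illustrative 1D stencil kernel: neighboring values are staged once in shared
// memory, intermediate results live in registers, and the global memory is read
// and written with consecutive (coalesced) indices. Assumes blockDim.x == 256.
__global__ void smoothKernel(const float* __restrict__ in, float* out, int n)
{
    __shared__ float tile[258];                       // 256 cells plus 2 halo values
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // coalesced global index
    int lid = threadIdx.x + 1;                        // local index inside the tile
    int g = min(gid, n - 1);                          // clamp so every thread loads valid data

    tile[lid] = in[g];                                // coalesced global load into shared memory
    if (threadIdx.x == 0)                             // left halo value of the block
        tile[0] = in[max(g - 1, 0)];
    if (threadIdx.x == blockDim.x - 1)                // right halo value of the block
        tile[lid + 1] = in[min(g + 1, n - 1)];
    __syncthreads();                                  // make the tile visible to all threads

    if (gid < n) {
        float center = tile[lid];                     // register variable
        out[gid] = 0.25f * tile[lid - 1] + 0.5f * center + 0.25f * tile[lid + 1];
    }
}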

4.3. Nonblocking Communication Mode

When the parallel computing of CFD is performed on multi-GPU HPC clusters, the kernel iteration needs to exchange data on the boundary, including the primitive variables and their gradients. The primitive variables are chosen so that the amount of exchanged data is as small as possible. In this process, both CPU-CPU communication and CPU-GPU data transfer are involved. Optimizing the communication between GPUs therefore significantly affects the performance of multi-GPU parallel systems.

The traditional method is the blocking communication mode, as shown in Figure 2. Algorithm 3 shows the algorithm of the blocking communication mode, which calls the MPI_Bsend and MPI_Recv functions for data transmission and reception, respectively. In the blocking communication mode, GPU computing, CPU-CPU communication, and CPU-GPU data transfer are completely separated. The communication time in this mode is pure overhead, which seriously reduces the efficiency of the parallel system.

(1)if device_count>1 then
(2) cudaMemcpy (h_a, d_a, sizeof(float) * n, cudaMemcpyDeviceToHost);
(3)MPI_Bsend (buf, int count, MPI_Datatype, int dest, int tag, MPI_COMM_WORLD);
(4)MPI_Recv (buf, int count, MPI_Datatype, int source, int tag, MPI_COMM_WORLD, MPI_Status *status);
(5) cudaMemcpy (d_a, h_a, sizeof(float) * n, cudaMemcpyHostToDevice);
(6)end if

The nonblocking communication mode hides the communication time behind the computing time by overlapping computation and communication, as shown in Figure 3. Algorithm 4 shows the algorithm of the nonblocking communication mode. The overlaps among GPU computing, CPU-CPU communication, and CPU-GPU data transfer are achieved by creating two CUDA streams. When the primitive variables are exchanged, stream 0 is used for the CPU-GPU data transfer while stream 1 is used for the boundary condition processing and the time step calculation. When the gradients of the primitive variables are exchanged, stream 1 is used to calculate the inviscid fluxes; this is possible because the computation of the inviscid fluxes does not depend on the values being transferred among GPUs. The transfer of the gradients to the host can start as soon as the data are packed, using stream 0. Then, the data transmission between CPUs is carried out with MPI. At the same time, the Convective_Flux_GPU function is executed on stream 1. Finally, the target device receives the data from the host on stream 0. In this way, the overlaps among GPU computing, CPU-CPU communication, and CPU-GPU data transfer are realized. In this work, the nonblocking communication mode based on multistream computing is used to optimize the communication of the GPU parallel programs. The nonblocking communication mode calls the MPI_Isend and MPI_Irecv functions for data transmission and reception, respectively, and the MPI_Waitall function to wait for the communication to complete and to query its completion status. The cudaMemcpyAsync function is used for asynchronous data transfer.

(1)if device_count>1 then
(2) cudaMemcpyAsync (h_a, d_a, sizeof(float) * n, cudaMemcpyDeviceToHost, stream[0]);
(3)MPI_Isend (buf, int count, MPI_Datatype, int dest, int tag, MPI_COMM_WORLD, MPI_Request *request);
(4)//Primitive_Variables_Exchange;
(5) Boundary_Processing_GPU<<<Block_size, Thread_size, 0, stream[1]>>> ( );
(6) Time_Step_GPU<<<Block_size, Thread_size, 0, stream[1]>>> ( );
(7)//Grad_Primitive_Variables_Exchange;
(8) Convective_Flux_GPU<<<Block_size, Thread_size, 0, stream[1]>>> ( );
(9)MPI_Irecv (buf, int count, MPI_Datatype, int source, int tag, MPI_COMM_WORLD, MPI_Request *request);
(10)MPI_Waitall (int count, MPI_Request *requests, MPI_Status *statuses);
(11) cudaMemcpyAsync (d_a, h_a, sizeof(float) * n, cudaMemcpyHostToDevice, stream[0]);
(12)end if
4.4. Domain Decomposition and Load Balancing

In multi-GPU parallel computing, the computational grid needs to be partitioned. Considering the load balancing problem, this work uses the 1D domain decomposition method to load each GPU with approximately the same number of grid cells. The 1D domain decomposition is shown in Figure 4. The method adopts the idea of bisection (dichotomy), which is simple to implement and facilitates load balancing. The dichotomy algorithm is shown in Algorithm 5.

(1)for i = 1; i < partition_number; i++
(2)a = min_XYZ; b = max_XYZ; xyz_current = 0.5 * (a + b); dis = 1.0;
(3)while fabs (dis) > 1.0e-10 do
(4)  dis = (current_count - average_count)/average_count; //current_count: cells below xyz_current
(5)  if dis > 0.0 then
(6)   b = xyz_current;
(7)  else
(8)   a = xyz_current;
(9)  end if
(10)  xyz_current = 0.5 * (a + b);
(11)end while
(12)end for

Meanwhile, coalesced memory access is easy to implement with 1D decomposition because the boundary data are aligned, which effectively improves the efficiency of the boundary data communication. When the nonblocking communication mode is used, the data on each GPU are divided into three parts (top, middle, and bottom). The top and bottom parts need to be exchanged with other devices, and their transfer can take place simultaneously with the computation of the middle part.
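
A hedged sketch of the dichotomy in plain C (hypothetical helper names; the production code operates on the actual mesh data structures) finds the cut coordinates along one axis so that each of the resulting subdomains holds roughly the same number of cells:

#include <math.h>

// Count the cell centers whose coordinate lies below the candidate cut.
static int count_below(const double* x, int n, double cut)
{
    int c = 0;
    for (int i = 0; i < n; ++i)
        if (x[i] < cut) ++c;
    return c;
}

// x: cell-center coordinates along the decomposition axis (n cells);
// parts: number of GPUs; cuts: output array of parts - 1 cut coordinates.
void decompose_1d(const double* x, int n, int parts, double* cuts)
{
    double xmin = x[0], xmax = x[0];
    for (int i = 1; i < n; ++i) {
        if (x[i] < xmin) xmin = x[i];
        if (x[i] > xmax) xmax = x[i];
    }
    for (int p = 1; p < parts; ++p) {
        double a = xmin, b = xmax, cut = 0.5 * (a + b);
        double target = (double)n * p / parts;   // cumulative average load below this cut
        for (int it = 0; it < 100; ++it) {       // bisection on the cut coordinate
            double dis = (count_below(x, n, cut) - target) / target;
            if (fabs(dis) < 1.0e-10) break;      // load below the cut matches the target
            if (dis > 0.0) b = cut; else a = cut;
            cut = 0.5 * (a + b);
        }
        cuts[p - 1] = cut;
    }
}

Because the cell count below a cut is an integer-valued step function, the bisection is capped at a fixed number of iterations; in practice it converges to the cut that splits the load most evenly.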

5. Results and Discussion

5.1. Test Case: Flat Plate Flow

The supersonic flow over a flat plate is a well-known benchmark problem for compressible turbulent flow of CFD applications [50, 51]. This test case has been studied by many researchers and is widely used to verify and validate CFD codes.

The free stream Ma number is 4.5, the Reynolds number based on the unit length is , the static temperature is 61.1 K, and the angle of attack is 0°. No-slip boundary condition is applied at the stationary flat plate surface, which is also assumed to be adiabatic.

The supersonic flat plate boundary layer problem is solved on various meshes, namely, mesh 1 (0.72 million), mesh 2 (1.44 million), mesh 3 (2.88 million), mesh 4 (5.76 million), mesh 5 (11.52 million), mesh 6 (23.04 million), mesh 7 (46.08 million), and mesh 8 (92.16 million).

5.2. Hardware Environment and Benchmarking

In this work, two types of devices, namely, the GTX 1070 GPU and the Tesla V100 GPU, are used. Both devices are manufactured by NVIDIA Corporation. Table 1 shows the main performance parameters of the GPUs. For these devices, the single-precision floating-point performance far exceeds the double-precision one; therefore, single-precision data are used for the GPU parallelization. The performance of the Volta architecture Tesla V100 GPU is greatly improved compared with GPUs of previous architectures.

In this study, we use CUDA version 10.1, Visual Studio 2017 for the C code, and MPICH2 1.4.1 for MPI communication. In addition, each node contains one Intel Xeon E5-2670 CPU at 2.6 GHz with eight cores and four GPUs.

5.3. Performance Analysis

Speedup (SP) and parallel efficiency (PE) are important parameters for measuring the performance of a hybrid parallel algorithm. Speedup and parallel efficiency are defined as follows [9, 52]:

$$SP=\frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}},\qquad PE_{\mathrm{strong}}=\frac{N_{1}T_{1}}{N_{2}T_{2}},\qquad PE_{\mathrm{weak}}=\frac{T_{1}}{T_{2}},$$

where $T_{\mathrm{CPU}}$ is the runtime of one iteration step for one CPU with eight cores, $T_{\mathrm{GPU}}$ is the runtime of one iteration step for the GPUs, $W_{1}$ and $W_{2}$ are the problem sizes, $N_{1}$ and $N_{2}$ are the numbers of GPUs, and $T_{1}$ and $T_{2}$ are the computation times. If $W_{1}=W_{2}$, the problem size remains constant when the number of GPUs is increased, and the parallel efficiency of strong scalability is $PE_{\mathrm{strong}}$. If $W_{1}/N_{1}=W_{2}/N_{2}$, the problem size grows proportionally with the number of GPUs, so the problem size on each GPU remains constant, and the parallel efficiency of weak scalability is $PE_{\mathrm{weak}}$. Here, subscript 1 denotes a single GPU and subscript 2 stands for the multi-GPU cluster. The runtime of one iteration step is obtained by averaging the execution time over ten thousand time steps.

5.3.1. Single GPU Implementation

For a single GPU implementation, the time required for the calculation of one iteration step is provided in Table 2 (time is given in milliseconds). Figure 5 shows that the speedup of a single GPU increases with the grid size. For the Tesla V100 GPU, the speedup reaches 135.39 for mesh 1 and 182.11 for mesh 8; thus, the Tesla V100 GPU achieves a considerably greater speedup than the GTX 1070 GPU. GPU parallelization can greatly improve the computational efficiency compared with CPU-based parallel computing. For meshes 7 and 8, the GTX 1070 GPU cannot afford such a large amount of calculation because of the limitation of its device memory; for our GPU codes, 1 GB of device memory can hold approximately 3 million grid cells. As the grid size increases, the speedup of the GPU code increases gradually, because the proportion of time spent in kernel execution grows relative to that spent on data arrangement and communication. Meanwhile, the growth rate of the speedup gradually decreases because of the limited number of SMs and CUDA cores.

5.3.2. Scalability

In this section, the blocking communication mode is used to study the scalability of GPU codes. Here, the performance of GTX 1070 and Tesla V100 multi-GPU HPC clusters is discussed.

Strong scaling tests are performed for meshes 1 to 8 on multi-GPU HPC clusters. Tables 3 and 4 provide the time required for the calculation of one iteration step for the multi-GPU implementation (time is given in milliseconds). Figures 6 and 7 show the strong speedup and the parallel efficiency, respectively. In Figure 6, the strong speedup is shown for different grid sizes; evidently, a large grid size reaches a high speedup. For the GTX 1070 and Tesla V100 multi-GPU clusters, the speedups of four GPUs reach 172.59 and 576.49, respectively. These values are far greater than the speedups achieved by a single GPU, so a high degree of strong scaling performance is maintained. Multi-GPU parallelization can considerably improve the computational efficiency as the grid size increases. However, multi-GPU parallel computing does not show obvious advantages when the grid size is small. This can be explained by the fact that the relative weight of data exchange is inversely proportional to the grid size; the GPU is specialized for compute-intensive, highly parallel computation. In Figure 7, the strong parallel efficiency is shown for different grid sizes. This result is consistent with the trend of the speedup; that is, the parallel efficiency increases with the grid size, and the strong parallel efficiency of the GTX 1070 multi-GPU clusters is slightly better than that of the Tesla V100 GPUs. For mesh 8, the strong parallel efficiency of four Tesla V100 GPUs is close to 80%. In addition, a larger number of GPUs leads to a lower parallel efficiency because of the increase in the amount of data transfer. Figure 8 shows the amount of memory communication for parallel computing with four GPUs; as the grid size increases, the amount of memory communication increases proportionally.
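
These figures are mutually consistent; for example, assuming the peak four-GPU speedup of 576.49 corresponds to mesh 8, the definition of strong parallel efficiency gives

$$PE_{\mathrm{strong}}=\frac{N_{1}T_{1}}{N_{2}T_{2}}=\frac{SP_{4}}{4\,SP_{1}}=\frac{576.49}{4\times 182.11}\approx 0.79,$$

which matches the value of close to 80% quoted above.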

The weak scaling tests for parallel efficiency are shown in Figure 9, where the grid size loaded on each GPU remains constant. As expected, the parallel efficiency decreases as the number of GPUs increases because the amount of data exchange grows with the number of GPUs. For the grid size of mesh 6 on each GTX 1070 GPU and Tesla V100 GPU, the weak parallel efficiency of four GPUs reaches 88.05% and 79.68%, respectively. Thus, a high degree of weak scaling performance is maintained.

5.3.3. Performance of Nonblocking Mode between GPUs

In this section, the nonblocking communication mode is used to study the performance of GTX 1070 GPUs and Tesla V100 multi-GPU HPC clusters. As an example, mesh 6 is investigated on one node with four GPUs to compare the performance of the two communication modes.

Figures 10 and 11 show the strong scalability performance of the nonblocking communication mode compared with the blocking communication mode, and Figure 12 shows the weak parallel efficiency of the two methods. The results show that the nonblocking communication mode hides the communication time behind the computing time by overlapping communication and computation. For four GPUs, the strong speedups of the GTX 1070 and Tesla V100 multi-GPU clusters increase by 15.15% and 19.63%, respectively. In addition, the weak parallel efficiency remains above 87% with the nonblocking communication mode. Moreover, GPU-based parallelization with Tesla V100 GPUs exhibits a larger performance improvement than the GTX 1070 multi-GPU HPC clusters, because the relative weight of the communication time is higher for the Tesla V100 GPUs.

6. Conclusions

In this work, a hybrid parallel algorithm of MPI and CUDA for CFD applications on multi-GPU HPC clusters has been proposed. Optimization methods have been adopted to achieve the highest degree of parallelism utilization and the maximum memory throughput. A thread block size of 256 is used, and the number of thread blocks is determined by the scale of the workload to ensure that each thread is loaded with the computation of one grid cell. For a single GPU implementation, two types of devices have been discussed. We obtain an acceleration ratio of more than 36 times, which indicates that GPU parallelization can greatly improve the computational efficiency compared with CPU-based parallel computing. Meanwhile, a large grid size reaches a high speedup because the proportion of kernel execution increases relative to those of data arrangement and communication. The speedups of four GPUs reach 172.59 and 576.49 for the GTX 1070 and Tesla V100 multi-GPU HPC clusters, respectively. The strong and weak parallel efficiencies are maintained at a high level when the grid size is large; thus, the parallel algorithm has good strong and weak scalability. A nonblocking communication mode has been proposed to fully overlap GPU computing, CPU-CPU communication, and CPU-GPU data transfer. For four GPUs, the strong speedups of the GTX 1070 and Tesla V100 multi-GPU clusters increase by 15.15% and 19.63%, respectively. In addition, the weak parallel efficiency remains above 87% with the nonblocking communication mode.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation (NNSF of China) project (no. 11472004). Also, the first author would like to thank Dr. Feng Liu.