A multi-GPU implementation of a full-field crystal plasticity solver for efficient modeling of high-resolution microstructures☆
Introduction
Polycrystal plasticity models can be used to predict material behavior in simulations of metal forming and to evaluate component performance under service conditions. In metal forming, the material typically undergoes large plastic strains while developing spatially heterogeneous stress–strain fields [1], [2], [3], [4], [5], [6]. Crystallographic slip accommodating plastic strain induces anisotropy in the material response through the evolution of texture and microstructure, which play important roles in the local and overall deformation of a material. Local deformation behavior can be captured using full-field crystal plasticity models, in which the constituent grains explicitly interact with each other. Performing such complex numerical simulations is recognized as a large computational challenge, because the material models must account for a large number of physical details at multiple length and time scales [7], [8], [9], [10], [11]. Indeed, one of the main deterrents to using crystal plasticity theories in place of the continuum plasticity theories presently used in practice is that implementing crystal plasticity theories in a full-field modeling framework requires a prohibitive increase in computational effort. This paper is concerned with the development of an efficient computational framework emphasizing cutting-edge, high-performance algorithms for full-field crystal plasticity models. The efficient numerical schemes presented here operate at the level of a microstructural cell serving as a representative volume element (RVE) of a polycrystalline aggregate; they are aimed at enabling future accurate multi-level simulations of deformation in metallic materials, in which this microstructural cell constitutive model is embedded at each integration point of a macro-scale finite element (FE) framework.
Effective properties of a microstructural cell embedding crystal plasticity can be solved for using finite elements with sub-grain mesh resolution [12], [13], [14], [15], [16], [17], [18], [19]. Subsequently, these FE calculations of microstructure-sensitive material behavior can be embedded within a macroscopic FE model [20]. Since both the cell and macro-scale calculations are carried out simultaneously, the strategy is known as the FE2 method. The FE2 method is impractical because it is extremely computationally intensive. A Green's function method has been developed as an alternative to FE for solving the field equations over the spatial domain of a microstructural cell [21], [22], [23], [24]. It relies on the efficient fast Fourier transform (FFT) algorithm to evaluate the convolution integral representing stress equilibrium under the strain compatibility constraint over a voxel-based microstructural cell, as opposed to a finite element mesh. The elasto-viscoplastic FFT (EVPFFT) formulation is the most advanced of several known Green's function-based solvers for crystal plasticity simulations [25]. Nevertheless, numerical implementations of EVPFFT within FE would also demand substantial computational resources. Thus, acceleration of the full-field FFT-based computations is an essential task.
Several approaches involving efficient numerical schemes and high-performance computational platforms have been explored to accelerate numerical procedures [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. Some of the most promising approaches rely on building databases of pre-computed single crystal solutions in the form of a spectral representation [36], [37], [38], [39], [40], [41], [42], [43], [44], or on storing the polycrystal responses calculated during the actual simulation and reusing them in an adaptive manner [45], [46]. The single crystal solutions are used within homogenization models as well as FE full-field models to represent the overall behavior of the polycrystal [41], [42], [47], [48], [49], [50], [51], [52], [53], [54], [55]. In recent work [56], we reported a message passing interface (MPI)-based domain-decomposition parallel implementation of the EVPFFT model. The domain decomposition was performed over the voxels of a microstructural cell. Moreover, we evaluated the efficiency of several FFT libraries, such as FFTW [57]. Depending on the hardware at hand, significant speedups were achieved using MPI-EVPFFTW.
In this work, we present major extensions to the previously reported implementation to take full advantage of graphics processing units (GPUs). GPUs can perform floating point arithmetic computations much faster than traditional central processing units (CPUs). With the advent of GPUs, the era of high performance computing (HPC) has been revolutionized [58], [59], [60]. While there are many large clusters using conventional CPUs to run a job in parallel, the operating cost of CPU-only clusters is significantly higher compared to GPU-based ones [61]. As an example, an exaFLOP supercomputer operating on CPUs alone was estimated to demand electric power equal to the amount needed to operate the Bay Area power system [62]. GPUs are accelerators originally designed for 3D visualization and optimized for parallel processing of millions of polygons in massive data sets [63]. Hardware-wise, GPUs are much more computationally powerful than CPUs when it comes to massive parallelism. While the memory bandwidth of a CPU is no more than 68 GB/s for systems with PC3-17000 DDR3 modules and a quad-channel architecture, the NVIDIA Tesla K80 and Tesla P100 offer memory bandwidths of up to 480 and 720 GB/s, respectively. While the cutting-edge Intel Xeon Phi 7250 processor comprises up to 68 cores, the Tesla K80 and Tesla P100 comprise 4992 (i.e. 2 × 2496) and 3584 computing cores, respectively, resulting in computing power of up to 2910 (i.e. 2 × 1455) and 4670 GFLOPS. It is notable that individual GPU cores (Compute Unified Device Architecture (CUDA) cores) are weaker than conventional CPU cores; however, thousands of them working together result in significantly higher computational power.
The implementation developed here combines OpenACC [64] and MPI. First, the single crystal Newton–Raphson (NR) solver is accelerated using OpenACC to run on single and multiple GPUs; the multi-GPU variant combines MPI with OpenACC. Next, the FFT calculations are performed using the CUDA FFT library, CUFFT [65]. OpenACC–CUDA interoperability is used to control the data transfer between CPU and GPU when interfacing with native CUDA code. Finally, the remaining subroutines, except read and write, are ported to the GPU for maximum speedup. The overall implementation is termed ACC-EVPCUFFT for execution on a single GPU and MPI-ACC-EVPCUFFT for execution on multiple GPUs. We present speedups obtained using NVIDIA Tesla K80 and P100 GPUs relative to the original serial implementation and a recent MPI implementation of the code [56]. The proposed computational algorithms have been successfully applied to crystal plasticity modeling of pure copper and a dual-phase steel.
Summary of the EVPFFT model
In our notation, tensors are denoted by non-italic bold letters while scalars are italic and not bold. In the crystal plasticity framework, the viscoplastic strain rate ε̇^p(x) is related to the stress σ(x) at a single-crystal material point through a sum over the N active slip systems, of the form [66], [67]

ε̇^p(x) = Σ_s m^s(x) γ̇^s(x) = γ̇_0 Σ_s m^s(x) ( |m^s(x) : σ(x)| / τ_c^s(x) )^n sgn( m^s(x) : σ(x) )

where γ̇^s(x), τ_c^s(x) and m^s(x) are, respectively, the shear rate, the critical resolved shear stress (CRSS) and the symmetric Schmid tensor of slip system s, γ̇_0 is a reference shear rate, and n is the rate-sensitivity exponent.
Simple compression and plane strain compression of oxygen free high conductivity (OFHC) copper
In order to compare the accuracy of the developed parallel implementations, we performed a plane strain compression (PSC) case study, in which the deformation behavior and texture evolution of an oxygen free high conductivity (OFHC) polycrystalline copper are simulated. The sample RVE underwent PSC with an applied strain rate of 0.001 (1/s) up to an accumulated equivalent strain of 0.5. The copper polycrystal is composed of face-centered cubic (FCC) grains with random orientation
EVPFFT intensive computations — the hotspot analysis
The EVPFFT code was profiled using the Portland Group Inc. (PGI) performance profiler 2018 v18.30 [79]. This was done for four different problem sizes of 16³, 32³, 64³, and 128³ voxels, representing the total number of FFT points in the representative volume element (RVE). Fig. 3 shows the distribution of hotspots (i.e. intensive computations) throughout different parts of the code. Evidently, the NR iterative solver for Eq. (17) including the elasto-plastic decomposition and the Jacobian
Background on GPU, OpenACC, and CUDA
OpenACC, originally developed by the major vendors CAPS [80], CRAY [81], and NVIDIA PGI [82], is a high-level, performance-portable parallel programming model based on directives/pragmas that enables scientists and programmers to accelerate their codes without changing the code structure significantly [64]. The main reason for using OpenACC is to maintain the performance portability of a given code. In some cases, OpenACC can result in better efficiency than its peers CUDA and OpenCL [83], [84]
Porting NR to GPU using OpenACC
The NR solver was ported to the GPU using OpenACC directives. Fig. 4 shows pseudocode for the NR solver, representing the GPU implementation using OpenACC parallel and data directives. Readers can refer to [84] for detailed information about OpenACC and how to employ it efficiently for porting a code to run on GPU-based hardware.
Note that the NR solver includes several subroutine calls inside the three nested loops iterating over all the voxels. Calling subroutines from device code necessitates a
FFT libraries: GPU vs. multicore CPU
One of the common FFT solvers used in a wide variety of scientific codes, including the original EVPFFT solver, is the “FOURN” routine, presented in Numerical Recipes in FORTRAN and C++ [94], [95]. Although this routine is commonly used, more advanced libraries have been recently developed to perform FFTs with a higher level of efficiency. FFTW and its MPI version [57], [96], [97], [98], [99] and CUFFT [65], [100], [101] are currently among the fastest FFT libraries, running on a single CPU
GPU acceleration of remaining routines
With NR and the FFTs running on the GPU using OpenACC and CUDA, the main portion of the code is GPU-accelerated according to the workload distribution provided earlier in Fig. 3. However, as long as the remaining routines do not run on the GPU, back-and-forth data transfer between CPU and GPU is unavoidable. This becomes more important when the invocations occur frequently due to the highly nested NR iterations inside the FFT equilibrium field iterations. The same applies to the data copy
Multi-GPU implementation combining OpenACC with MPI (MPI-ACC-EVPCUFFT)
In order to run ACC-EVPCUFFT on multiple GPUs, we leverage our previous work, which used the domain decomposition approach and the message passing interface (MPI) standard [105], [106], [107], to provide the capability of utilizing many GPUs in a GPU cluster. To this end, the outermost loops over the FFT voxels (see Fig. 4) are split into chunks of data (i.e. domain decomposition). Fig. 15 shows a schematic view of this domain decomposition for 4 GPUs in one direction.
In order to enable the code
MPI-ACC-EVPCUFFT benchmark on Cray Titan: Cluster of distributed GPU nodes
In order to benchmark the code on distributed GPU nodes, the Titan supercomputer [108] located at Oak Ridge National Laboratory (ORNL) is used to run our crystal plasticity simulations. This supercomputer is equipped with NVIDIA Tesla K20X GPUs. Table 2 provides the hardware specs for the NVIDIA Tesla K20X.
Note that the NVIDIA Tesla K20X is older than its more recent peers K80 and P100, resulting in noticeably lower performance due to its hardware architecture. It is notable that a new
Application of ACC-EVPCUFFT to resolving fields in a dual phase steel DP 590 microstructure
To demonstrate another utility of the implementation, we use it to resolve fine microstructural features. Large RVE sizes are desirable for understanding the behavior of multi-phase alloys. Understanding the strain and stress gradients varying at interfaces is crucial for the performance design of polycrystalline multi-phase alloys. In a ferritic–martensitic dual-phase steel, the primary phase is the ferrite matrix, in which the martensitic phase acts as a reinforcement. Phase fractions depend on the alloy type.
A flowchart summarizing the developed parallel implementations of the EVPFFT solver
To sum up all of the parallel implementations of the EVPFFT solver developed so far, a comprehensive flowchart is provided in Fig. 20, illustrating the flow of the code for all parallel implementations, including the features from previous work [56]. This schematic gives the reader a review of all the performance improvements presented here in a nutshell.
Conclusions
In this work, we developed a computationally efficient implementation of a full-field crystal plasticity solver for predicting the micromechanical behavior of crystalline materials by taking advantage of GPUs. While porting the NR subroutine to the GPU, it was found that using GJE (Gauss–Jordan elimination) in the NR solver results in an improvement over LU decomposition, because the GJE linear equation solver avoids serialized execution across GPU threads. Next, the GPU implementation executes the FFT calculations using the CUFFT library
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was sponsored by the U.S. National Science Foundation and was accomplished under CAREER Grant No. CMMI-1650641.
☆ The review of this paper was arranged by Prof. D.P. Landau.