A multi-GPU implementation of a full-field crystal plasticity solver for efficient modeling of high-resolution microstructures

https://doi.org/10.1016/j.cpc.2020.107231

Highlights

  • A multi-GPU implementation of the EVPFFT model is developed by coupling MPI and OpenACC.

  • OpenACC-CUDA interoperability is used to facilitate CUDA implementation within OpenACC.

  • The Compute Unified Device Architecture (CUDA) FFT library, CUFFT, is used to accelerate FFT calculations.

  • P100 GPU outperforms 64 Intel Xeon 2695 v4 processes for an RVE size of 128³ or greater.

  • The GPU implementation facilitates massive crystal plasticity simulations using EVPFFT.

Abstract

In a recent publication (Eghtesad et al., 2018), we reported a message passing interface (MPI)-based domain decomposition parallel implementation of an elasto-viscoplastic fast Fourier transform-based (EVPFFT) micromechanical solver to facilitate computationally efficient crystal plasticity modeling of polycrystalline materials. In this paper, we present major extensions to the previously reported implementation to take advantage of graphics processing units (GPUs), which can perform floating point arithmetic operations much faster than traditional central processing units (CPUs). In particular, the applications are developed to utilize a single GPU and multiple GPUs within one computer as well as a large number of GPUs across nodes of a supercomputer. To this end, the implementation combines the OpenACC programming model for GPU acceleration with MPI for distributed computing. Moreover, the FFT calculations are performed using the efficient Compute Unified Device Architecture (CUDA) FFT library, CUFFT. Finally, to maintain performance portability, OpenACC-CUDA interoperability is used for data transfers between CPU and GPUs. The overall implementations are termed ACC-EVPCUFFT for a single GPU and MPI-ACC-EVPCUFFT for multiple GPUs. To facilitate performance evaluation studies of the developed computational framework, deformation of a single-phase copper is simulated, while, to further demonstrate the utility of the implementation for resolving fine microstructures, deformation of a dual-phase DP590 steel is simulated. The implementations and results are presented and discussed in this paper.

Introduction

Polycrystal plasticity models can be used for predicting material behavior in simulations of metal forming and for evaluating component performance under service conditions. In metal forming, the material typically undergoes large plastic strains while developing spatially heterogeneous stress–strain fields [1], [2], [3], [4], [5], [6]. Crystallographic slip, while accommodating plastic strains, induces anisotropy in the material response through the evolution of texture and microstructure, which play important roles in the local and overall deformation processes of a material. Local deformation behavior can be captured using complex full-field crystal plasticity models, in which the constituent grains explicitly interact with each other. The ability to perform such complex numerical simulations is recognized as a large computational challenge, because the material models must take into account a large number of physical details at multiple length and temporal scales [7], [8], [9], [10], [11]. Indeed, one of the main deterrents to the use of crystal plasticity theories in place of the continuum plasticity theories presently used in practice is that implementing crystal plasticity theories in a full-field modeling framework requires a prohibitive increase in computational effort. This paper is concerned with the development of an efficient computational framework emphasizing cutting-edge, high-performance algorithms for full-field crystal plasticity models. The efficient numerical schemes presented here, operating at the level of a microstructural cell serving as a representative volume element (RVE) of a polycrystalline aggregate, are aimed at making accurate multi-level simulations of deformation in metallic materials possible in the future by embedding these microstructural cell constitutive models within macro-scale finite element (FE) frameworks at each FE integration point.

Effective properties of a microstructural cell embedding crystal plasticity can be solved for using finite elements with sub-grain mesh resolution [12], [13], [14], [15], [16], [17], [18], [19]. Subsequently, these FE calculations of microstructure-sensitive material behavior can be embedded within a macroscopic FE model [20]. Since both the cell and macro-scale calculations are carried out simultaneously, the strategy is known as the FE² method. The FE² method is not practical because it is extremely computationally intensive. A Green's function method has been developed as an alternative to FE for solving the field equations over a spatial microstructural cell domain [21], [22], [23], [24]. It relies on the efficient fast Fourier transform (FFT) algorithm to solve the convolution integral representing stress equilibrium under the strain compatibility constraint over a voxel-based microstructural cell, as opposed to a finite element mesh. The elasto-viscoplastic FFT (EVPFFT) formulation is the most advanced of several known Green's function-based solvers for crystal plasticity simulations [25]. Nevertheless, numerical implementations of EVPFFT within FE would also demand substantial computational resources. Thus, acceleration of the full-field FFT-based computations is an essential task.

Several approaches involving efficient numerical schemes and high-performance computational platforms have been explored to accelerate the numerical procedures [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. Some of the most promising approaches rely on building databases of pre-computed single crystal solutions in the form of a spectral representation [36], [37], [38], [39], [40], [41], [42], [43], [44], or on storing the polycrystal responses calculated during the actual simulation and reusing them in an adaptive manner [45], [46]. The single crystal solutions are used within homogenization models as well as FE full-field models to represent the overall behavior of the polycrystal [41], [42], [47], [48], [49], [50], [51], [52], [53], [54], [55]. In a recent work [56], we reported a message passing interface (MPI)-based domain decomposition parallel implementation of the EVPFFT model. The domain decomposition was performed over the voxels of a microstructural cell. Moreover, we evaluated the efficiency of several FFT libraries, such as FFTW [57]. Depending on the hardware at hand, significant speedups have been achieved using MPI-EVPFFTW.

In this work, we present major extensions to the previously reported implementation to take full advantage of graphics processing units (GPUs). GPUs can perform floating point arithmetic computations much faster than traditional central processing units (CPUs). With the advent of GPUs, the era of high performance computing (HPC) has been revolutionized [58], [59], [60]. While there are many large clusters that use conventional CPUs to run jobs in parallel, the operating cost of running CPU-only clusters is significantly higher compared to GPU-based clusters [61]. As an example, an exaFLOP supercomputer operating solely on CPUs was estimated to demand electric power comparable to the amount needed to power the Bay Area power system [62]. GPUs are accelerators originally designed for 3D visualization and optimized for parallel processing of millions of polygons with massive data sets [63]. Hardware-wise, GPUs are much more computationally powerful than CPUs when it comes to massive parallelism. While the memory bandwidth of a CPU system with quad-channel PC3-17000 DDR3 modules does not exceed 68 GB/s, the NVIDIA Tesla K80 and Tesla P100 offer memory bandwidths of up to 480 GB/s and 720 GB/s, respectively. While the cutting-edge Intel Xeon Phi 7250 processor is composed of up to 68 cores, the Tesla K80 and Tesla P100 are composed of 4992 (i.e., 2 × 2496) and 3584 computing cores, respectively, resulting in computing power of up to 2910 (i.e., 2 × 1455) and 4670 GFLOPS. It is notable that individual GPU cores (Compute Unified Device Architecture (CUDA) cores) are weaker than conventional CPU cores; however, thousands of them working together result in significantly higher computational power.

The implementation developed here combines OpenACC [64] and MPI. First, the single-crystal Newton–Raphson (NR) solver is accelerated using OpenACC to run on a single GPU and, in combination with MPI, on multiple GPUs. Next, the FFT calculations are performed using the CUDA FFT library, CUFFT [65]. OpenACC-CUDA interoperability is used to control the data transfer between CPU and GPU when interfacing with native CUDA code. Finally, the remaining subroutines, except those for reading and writing, are ported to the GPU for further speedup gains. The overall implementation is termed ACC-EVPCUFFT for execution on a single GPU and MPI-ACC-EVPCUFFT for execution on multiple GPUs. We present speedups obtained using NVIDIA Tesla K80 and P100 GPUs relative to the original serial implementation and a recent MPI implementation of the code [56]. The proposed computational algorithms have been successfully applied to crystal plasticity modeling of pure copper and a dual-phase steel.

Section snippets

Summary of the EVPFFT model

In our notation, tensors are denoted by non-italic bold letters while scalars are italic and not bold. In the crystal plasticity framework, the viscoplastic strain rate $\dot{\boldsymbol{\varepsilon}}^{p}(\mathbf{x})$ is related to the stress $\boldsymbol{\sigma}(\mathbf{x})$ at a single-crystal material point $\mathbf{x}$ through a sum over the $N$ active slip systems, of the form [66], [67]

$$\dot{\boldsymbol{\varepsilon}}^{p}(\mathbf{x}) = \sum_{s=1}^{N} \mathbf{m}^{s}(\mathbf{x})\,\dot{\gamma}^{s}(\mathbf{x}) = \dot{\gamma}_{0} \sum_{s=1}^{N} \mathbf{m}^{s}(\mathbf{x}) \left( \frac{\left| \mathbf{m}^{s}(\mathbf{x}) : \boldsymbol{\sigma}(\mathbf{x}) \right|}{\tau_{c}^{s}(\mathbf{x})} \right)^{n} \operatorname{sgn}\!\left( \mathbf{m}^{s}(\mathbf{x}) : \boldsymbol{\sigma}(\mathbf{x}) \right),$$

where $\dot{\gamma}^{s}(\mathbf{x})$, $\tau_{c}^{s}(\mathbf{x})$ and $\mathbf{m}^{s}(\mathbf{x})$ are, respectively, the shear rate, the critical resolved shear stress (CRSS) and the symmetric Schmid tensor of slip system $s$, while $\dot{\gamma}_{0}$ is a reference shear rate and $n$ is the rate-sensitivity exponent.
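A minimal C++ sketch of evaluating this flow rule at a single material point is given below. The data layout, function names and variable names are hypothetical (the EVPFFT solver itself is a Fortran code), and the sketch is for illustration only.

    // Illustrative sketch (not the authors' implementation): evaluates the
    // rate-sensitive flow rule at one material point x.
    #include <array>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Tensor2 = std::array<std::array<double, 3>, 3>;

    // Double contraction a : b = sum_ij a_ij * b_ij
    double doubleContraction(const Tensor2& a, const Tensor2& b) {
        double s = 0.0;
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                s += a[i][j] * b[i][j];
        return s;
    }

    // eps_dot_p = gamma_dot_0 * sum_s m^s * (|m^s : sigma| / tau_c^s)^n * sgn(m^s : sigma)
    Tensor2 viscoplasticStrainRate(const std::vector<Tensor2>& schmid,   // m^s(x), one per slip system
                                   const std::vector<double>& tauCrss,   // tau_c^s(x)
                                   const Tensor2& sigma,                 // stress at x
                                   double gammaDot0,                     // reference shear rate
                                   double nExp) {                        // rate-sensitivity exponent
        Tensor2 epsDotP{};  // zero-initialized
        for (std::size_t s = 0; s < schmid.size(); ++s) {
            const double rss = doubleContraction(schmid[s], sigma);      // resolved shear stress
            const double gammaDot = gammaDot0
                * std::pow(std::fabs(rss) / tauCrss[s], nExp)
                * (rss >= 0.0 ? 1.0 : -1.0);                             // shear rate on system s
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    epsDotP[i][j] += schmid[s][i][j] * gammaDot;
        }
        return epsDotP;
    }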

Simple compression and plane strain compression of oxygen free high conductivity (OFHC) copper

In order to compare the accuracy of the developed parallel implementations, we performed a plane strain compression (PSC) case study, in which the deformation behavior and texture evolution of an oxygen free high conductivity (OFHC) polycrystalline copper are simulated. The sample RVE underwent PSC with an applied strain rate of 0.001 s⁻¹ up to an accumulated equivalent strain level of 0.5. The copper polycrystal is composed of face-centered cubic (FCC) grains with random orientation
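For reference, one common way to impose PSC in an FFT-based solver is through a macroscopic strain-rate tensor of the form below; the specific axis assignment is an assumption made here for illustration, since it is not shown in this snippet:

$$\dot{\boldsymbol{\varepsilon}}^{\mathrm{app}} = \dot{\varepsilon}_{0}\begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & -1 \end{pmatrix}, \qquad \dot{\varepsilon}_{0} = 0.001\ \mathrm{s}^{-1},$$

applied until the accumulated von Mises equivalent strain reaches 0.5.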

EVPFFT intensive computations — the hotspot analysis

The EVPFFT code was profiled using the Portland Group Inc. (PGI) performance profiler 2018 v18.30 [79]. This was done for four different problem sizes of 16³, 32³, 64³, and 128³ representing the total number of FFT points (voxels) in the representative volume element (RVE). Fig. 3 represents the distribution of hotspots (i.e. intensive computations) throughout different parts of the code. Evidently, the NR iterative solver for Eq. (17) including the elasto-plastic decomposition and the Jacobian

Background on GPU, OpenACC, and CUDA

OpenACC, originally developed by the major vendors CAPS [80], CRAY [81], and NVIDIA PGI [82], is a high-level, performance-portable parallel programming model based on directives/pragmas that enables scientists and programmers to accelerate their codes without changing the code structure significantly [64]. The main reason for using OpenACC is to maintain performance portability of a given code. In some cases, OpenACC can result in better efficiency compared to its peers CUDA and OpenCL [83], [84]
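As a brief illustration of the directive-based style, the following hypothetical loop (in C/C++ syntax; the EVPFFT solver itself is Fortran, where the equivalent directives use !$acc sentinels) is offloaded to the GPU without changing its body:

    // Minimal OpenACC illustration; names are hypothetical.
    void scaleAndAdd(int n, double a, const double* x, double* y) {
        // The data clauses manage CPU<->GPU transfers; the loop body is unchanged.
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }
    }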

Porting NR to GPU using OpenACC

The NR solver was ported to the GPU using OpenACC directives. Fig. 4 shows pseudocode for the NR solver, representing the GPU implementation using OpenACC parallel and data directives. Readers can refer to [84] for detailed information about OpenACC and how to employ it efficiently for porting a code to run on GPU-based hardware.

Note that the NR solver includes several subroutine calls inside the three nested loops iterating over all the voxels. Calling subroutines from device code necessitates a
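In OpenACC, device-callable subroutines are typically handled with the routine directive. A hypothetical C/C++ sketch of the resulting loop structure (not the authors' Fortran code) is shown below:

    // Hypothetical sketch of the voxel loop structure: three nested loops over
    // the RVE grid with a per-voxel Newton-Raphson solve callable from device
    // code via the routine directive.
    #pragma acc routine seq
    void newtonRaphsonVoxel(double* stressVoxel /*, other per-voxel state ... */)
    {
        // single-crystal NR iterations for one voxel would go here (omitted)
    }

    void solveAllVoxels(int n1, int n2, int n3, double* stress)
    {
        // collapse(3) exposes n1*n2*n3-way parallelism; present() assumes the
        // stress array (6 components per voxel) already resides on the GPU.
        #pragma acc parallel loop collapse(3) present(stress[0:n1*n2*n3*6])
        for (int k = 0; k < n3; ++k)
            for (int j = 0; j < n2; ++j)
                for (int i = 0; i < n1; ++i)
                    newtonRaphsonVoxel(&stress[((k*n2 + j)*n1 + i)*6]);
    }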

FFT libraries: GPU vs. multicore CPU

One of the common FFT solvers used in a wide variety of scientific codes, including the original EVPFFT solver, is the “FOURN” routine, presented in Numerical Recipes in FORTRAN and C++ [94], [95]. Although this routine is commonly used, more advanced libraries have been recently developed to perform FFTs with a higher level of efficiency. FFTW and its MPI version [57], [96], [97], [98], [99] and CUFFT [65], [100], [101] are currently among the fastest FFT libraries, running on a single CPU
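As an illustration of how CUFFT is invoked, the following minimal C/C++ sketch plans and executes a 3D double-precision complex-to-complex transform on data already resident in GPU memory; the precision and transform type are assumptions made for illustration, and error checking is omitted:

    // Minimal CUFFT sketch: in-place 3D Z2Z transform on device-resident data.
    #include <cufft.h>

    void forwardFFT3D(cufftDoubleComplex* devData, int nx, int ny, int nz) {
        cufftHandle plan;
        cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);           // plan a 3D transform
        cufftExecZ2Z(plan, devData, devData, CUFFT_FORWARD); // execute in place
        cufftDestroy(plan);
    }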

GPU acceleration of remaining routines

With NR and the FFTs running on the GPU using OpenACC and CUDA, the main portion of the code is GPU-accelerated according to the workload distribution shown earlier in Fig. 3. However, since the other routines of the code do not run on the GPU, data transfer back and forth between CPU and GPU is unavoidable. This becomes more important when the invocations occur frequently due to the highly nested iterations of NR inside the FFT equilibrium field iterations. The same reasoning applies to the data copy
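A minimal sketch of this pattern, assuming a hypothetical work array and C/C++ syntax, keeps the data resident on the GPU across an equilibrium iteration and exposes the device pointer to CUFFT through the OpenACC host_data use_device construct:

    // Sketch of the OpenACC-CUDA interoperability pattern: keep the data on the
    // GPU across OpenACC kernels and hand its device address to CUFFT,
    // avoiding round trips through host memory.
    #include <cufft.h>

    void equilibriumIteration(cufftDoubleComplex* work, int nx, int ny, int nz)
    {
        cufftHandle plan;
        cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);

        #pragma acc data copy(work[0:nx*ny*nz])
        {
            // ... OpenACC kernels updating `work` in device memory ...

            #pragma acc host_data use_device(work)
            {
                cufftExecZ2Z(plan, work, work, CUFFT_FORWARD); // CUFFT receives the device pointer
            }

            // ... further OpenACC kernels on the transformed data, still on the GPU ...
        }
        cufftDestroy(plan);
    }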

Multi-GPU implementation combining OpenACC with MPI (MPI-ACC-EVPCUFFT)

In order to run ACC-EVPCUFFT on multiple GPUs, we leverage our previous work, which used the domain decomposition approach and the message passing interface (MPI) standard [105], [106], [107], to provide the capability of utilizing many GPUs in a GPU cluster. To this end, the outermost loops over the FFT voxels (see Fig. 4) are split into chunks of data (i.e., domain decomposition). Fig. 15 shows a schematic view of this domain decomposition for 4 GPUs in one direction.

In order to enable the code
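A minimal sketch of such a setup, with hypothetical names and a 1D slab decomposition assumed for illustration, binds each MPI rank to a GPU and assigns it a contiguous slab of the outermost voxel dimension:

    // Hypothetical sketch of the multi-GPU setup: each MPI rank binds to one
    // GPU on its node and owns a contiguous slab of the outermost (k) dimension.
    #include <mpi.h>
    #include <openacc.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        // Round-robin assignment of ranks to the GPUs visible on this node.
        const int ngpus = acc_get_num_devices(acc_device_nvidia);
        if (ngpus > 0)
            acc_set_device_num(rank % ngpus, acc_device_nvidia);

        // 1D slab decomposition of the outermost voxel dimension (assumed divisible).
        const int n3 = 128;                 // global grid size in the decomposed direction
        const int kLocal = n3 / nranks;     // slab thickness owned by this rank
        const int kStart = rank * kLocal;   // first global k index of this rank's slab
        (void)kStart;
        // ... allocate local arrays of thickness kLocal, run the ACC-EVPCUFFT
        //     kernels on the slab, and exchange/reduce fields across ranks with MPI ...

        MPI_Finalize();
        return 0;
    }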

MPI-ACC-EVPCUFFT benchmark on Cray Titan: Cluster of distributed GPU nodes

In order to benchmark the code on distributed GPU nodes, the Titan supercomputer [108] located at Oak Ridge National Laboratory (ORNL) is used to facilitate our crystal plasticity simulations. This supercomputer is equipped with NVIDIA Tesla K20X GPUs. Table 2 provides the hardware specifications for the NVIDIA Tesla K20X.

Note that the NVIDIA Tesla K20X is older than its more recent peers, the K80 and P100, resulting in noticeably lower performance due to the hardware architecture. It is notable that a new

Application of ACC-EVPCUFFT to resolving fields in a dual-phase steel DP590 microstructure

To demonstrate another utility of the implementation, we use it for resolving fine microstructural features. Large RVE sizes are desirable for understanding the behavior of multi-phase alloys. Understanding the strain and stress gradients that develop at interfaces is crucial for the performance design of polycrystalline multi-phase alloys. In a ferrite-martensitic dual-phase steel, the primary phase is the ferrite matrix, in which the martensitic phase acts as a reinforcement. Phase fractions depend on the alloy type.

A flowchart summarizing the developed parallel implementations of the EVPFFT solver

To summarize all of the parallel implementations of the EVPFFT solver developed so far, a comprehensive flowchart is provided in Fig. 20, illustrating the flow of the code for all parallel implementations, including the features from previous work [56]. This schematic helps the reader review, in a nutshell, all the performance improvements presented here.

Conclusions

In this work, we develop a computationally efficient implementation of a full-field crystal plasticity solver for predicting the micromechanical behavior of crystalline materials taking advantage of GPUs. While porting the NR subroutine to the GPU, it was found that using Gauss-Jordan elimination (GJE) in the NR solver results in an improvement over LU decomposition because the GJE linear equation solver avoids sequential execution on GPU threads. Next, the GPU implementation executes the FFT calculations using the CUFFT library
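To illustrate this point, a hypothetical C/C++ sketch of a Gauss-Jordan solve for a small dense system is given below (not the authors' implementation); unlike LU, the elimination sweep leaves no separate forward/back-substitution stage, which maps naturally onto a device-callable routine:

    // Illustrative Gauss-Jordan elimination for a small dense system A x = b.
    // Partial pivoting is omitted for brevity; names and sizes are assumptions.
    #pragma acc routine seq
    void gaussJordanSolve(double A[][6], double b[], int n)   // n <= 6 assumed
    {
        for (int p = 0; p < n; ++p) {
            const double piv = A[p][p];
            for (int j = 0; j < n; ++j) A[p][j] /= piv;       // normalize pivot row
            b[p] /= piv;
            for (int i = 0; i < n; ++i) {                     // eliminate pivot column
                if (i == p) continue;
                const double f = A[i][p];
                for (int j = 0; j < n; ++j) A[i][j] -= f * A[p][j];
                b[i] -= f * b[p];
            }
        }
        // On exit, A has been reduced to the identity and b holds the solution x.
    }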

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was sponsored by the U.S. National Science Foundation and was accomplished under the CAREER, USDA Grant No. CMMI-1650641.

References (113)

  • Barrett, T.J., et al., Int. J. Mech. Sci. (2020)
  • Poulin, C.M., et al., Mater. Des. (2019)
  • Knezevic, M., et al., Mater. Sci. Eng. A (2019)
  • Panchal, J.H., et al., Comput. Aided Des. (2013)
  • Barrett, T.J., et al., Comput. Mater. Sci. (2018)
  • Jahedi, M., et al., Mater. Charact. (2015)
  • Barrett, T.J., et al., Materialia (2019)
  • Diard, O., et al., Comput. Mater. Sci. (2002)
  • Ardeljan, M., et al., Comput. Methods Appl. Mech. Engrg. (2015)
  • Zhao, Z., et al., Int. J. Plast. (2008)
  • Ardeljan, M., et al., Int. J. Plast. (2017)
  • Ardeljan, M., et al., Acta Mater. (2018)
  • Ardeljan, M., et al., Int. J. Plast. (2015)
  • Knezevic, M., et al., Scr. Mater. (2016)
  • Feyel, F., Comput. Methods Appl. Mech. Engrg. (2003)
  • Lebensohn, R.A., et al., Int. J. Plast. (2012)
  • Knezevic, M., et al., Comput. Mater. Sci. (2007)
  • Knezevic, M., et al., Int. J. Plast. (2008)
  • Knezevic, M., et al., Int. J. Plast. (2008)
  • Fromm, B.S., et al., Acta Mater. (2009)
  • Shaffer, J.B., et al., Int. J. Plast. (2010)
  • Kalidindi, S.R., et al., Acta Mater. (2006)
  • Knezevic, M., et al., Mech. Mater. (2015)
  • Knezevic, M., et al., Acta Mater. (2009)
  • Zecevic, M., et al., Mech. Mater. (2015)
  • Zecevic, M., et al., Int. J. Plast. (2015)
  • Knezevic, M., et al., Comput. Mater. Sci. (2014)
  • Jahedi, M., et al., Powder Technol. (2017)
  • Barton, N.R., et al., Int. J. Plast. (2008)
  • Barton, N.R., et al., Comput. Methods Appl. Mech. Engrg. (2015)
  • Ardeljan, M., et al., Int. J. Plast. (2016)
  • Feather, W.G., et al., Int. J. Plast. (2019)
  • Knezevic, M., et al., J. Mech. Phys. Solids (2013)
  • Knezevic, M., et al., Mater. Sci. Eng. A (2013)
  • Knezevic, M., et al., Acta Mater. (2010)
  • Zecevic, M., et al., Int. J. Plast. (2017)
  • Zecevic, M., et al., Mech. Mater. (2019)
  • Barrett, T.J., et al., Comput. Methods Appl. Mech. Engrg. (2019)
  • Eghtesad, A., et al., Adv. Eng. Softw. (2018)
  • Eghtesad, A., et al., Adv. Eng. Softw. (2018)
  • Bauer, P., et al., Comput. Phys. Comm. (2016)
  • Lebensohn, R.A., et al., Int. J. Plast. (2012)
  • Knezevic, M., et al., Comput. Methods Appl. Mech. Engrg. (2016)
  • Knezevic, M., et al., Comput. Methods Appl. Mech. Engrg. (2014)
  • Eghtesad, A., et al., Acta Mater. (2018)
  • Eghtesad, A., et al., J. Mech. Phys. Solids (2020)
  • Ghorbanpour, S., et al., Int. J. Plast. (2017)
  • Ghorbanpour, S., et al., Int. J. Plast. (2020)
  • Melhem, R., Parallel Comput. (1987)
  • Stanimirović, P.S., et al., Appl. Math. Comput. (2013)
The review of this paper was arranged by Prof. D.P. Landau.