A multi-GPU implementation of a full-field crystal plasticity solver for efficient modeling of high-resolution microstructures☆
Introduction
Polycrystal plasticity models can be used to predict material behavior in simulations of metal forming and to evaluate component performance under service conditions. In metal forming, the material typically undergoes large plastic strains while developing spatially heterogeneous stress–strain fields [1], [2], [3], [4], [5], [6]. Crystallographic slip accommodating plastic strain induces anisotropy in the material response through the evolution of texture and microstructure, which play important roles in the local and overall deformation of a material. Local deformation behavior can be captured using full-field crystal plasticity models, in which the constituent grains explicitly interact with each other. Performing such complex numerical simulations is recognized as a large computational challenge, because the material models must account for a large number of physical details at multiple length and time scales [7], [8], [9], [10], [11]. Indeed, one of the main deterrents to using crystal plasticity theories in place of the continuum plasticity theories presently used in practice is that implementing crystal plasticity theories in a full-field modeling framework requires a prohibitive increase in computational effort. This paper is concerned with the development of an efficient computational framework emphasizing cutting-edge, high-performance algorithms for full-field crystal plasticity models. The efficient numerical schemes presented here operate at the level of a microstructural cell serving as a representative volume element (RVE) of a polycrystalline aggregate; they are aimed at enabling future accurate multi-level simulations of deformation in metallic materials, in which this microstructural cell constitutive model is embedded at each integration point of a macro-scale finite element (FE) framework.
Effective properties of a microstructural cell embedding crystal plasticity can be solved for using finite elements with sub-grain mesh resolution [12], [13], [14], [15], [16], [17], [18], [19]. Subsequently, these FE calculations of microstructure-sensitive material behavior can be embedded within a macroscopic FE model [20]. Since both the cell and macro-scale calculations are carried out simultaneously, the strategy is known as the FE2 method. The FE2 method is impractical because it is extremely computationally intensive. A Green's function method has been developed as an alternative to FE for solving the field equations over the spatial domain of a microstructural cell [21], [22], [23], [24]. It relies on the efficient fast Fourier transform (FFT) algorithm to evaluate the convolution integral representing stress equilibrium under the strain compatibility constraint over a voxel-based microstructural cell, as opposed to a finite element mesh. The elasto-viscoplastic FFT (EVPFFT) formulation is the most advanced of several known Green's function-based solvers for crystal plasticity simulations [25]. Nevertheless, numerical implementations of EVPFFT within FE would also demand substantial computational resources. Thus, acceleration of the full-field FFT-based computations is an essential task.
Several approaches involving efficient numerical schemes and high-performance computational platforms have been explored to accelerate numerical procedures [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. Some of the most promising approaches rely on building databases of pre-computed single crystal solutions in the form of a spectral representation [36], [37], [38], [39], [40], [41], [42], [43], [44], or on storing the polycrystal responses calculated during the actual simulation and reusing them in an adaptive manner [45], [46]. The single crystal solutions are used within homogenization models as well as FE full-field models to represent the overall behavior of the polycrystal [41], [42], [47], [48], [49], [50], [51], [52], [53], [54], [55]. In recent work [56], we reported a message passing interface (MPI)-based domain-decomposition parallel implementation of the EVPFFT model. The domain decomposition was performed over the voxels of a microstructural cell. Moreover, we evaluated the efficiency of several FFT libraries, such as FFTW [57]. Depending on the hardware at hand, significant speedups were achieved using MPI-EVPFFTW.
In this work, we present major extensions to the previously reported implementation to take full advantage of graphics processing units (GPUs). GPUs can perform floating point arithmetic computations much faster than traditional central processing units (CPUs). With the advent of GPUs, the era of high performance computing (HPC) has been revolutionized [58], [59], [60]. While there are many large clusters using conventional CPUs to run a job in parallel, the operating cost of CPU-only clusters is significantly higher compared to GPU-based ones [61]. As an example, an exaFLOP supercomputer operating on CPUs alone was estimated to demand electric power equal to the amount needed to operate the Bay Area power system [62]. GPUs are accelerators originally designed for 3D visualization and optimized for parallel processing of millions of polygons in massive data sets [63]. Hardware-wise, GPUs are much more computationally powerful than CPUs when it comes to massive parallelism. While the memory bandwidth of a CPU is no more than 68 GB/s for systems with PC3-17000 DDR3 modules and a quad-channel architecture, the NVIDIA Tesla K80 and Tesla P100 offer memory bandwidths of up to 480 and 720 GB/s, respectively. While the cutting-edge Intel Xeon Phi 7250 processor comprises up to 68 cores, the Tesla K80 and Tesla P100 comprise 4992 (i.e. 2 × 2496) and 3584 computing cores, respectively, resulting in computing power of up to 2910 (i.e. 2 × 1455) and 4670 GFLOPS. It is notable that individual GPU cores (Compute Unified Device Architecture (CUDA) cores) are weaker than conventional CPU cores; however, thousands of them working together result in significantly higher computational power.
The implementation developed here combines OpenACC [64] and MPI. First, the single crystal Newton–Raphson (NR) solver is accelerated using OpenACC to run on single and multiple GPUs; the multi-GPU variant combines MPI with OpenACC. Next, the FFT calculations are performed using the CUDA FFT library, CUFFT [65]. OpenACC–CUDA interoperability is used to control the data transfer between CPU and GPU when interfacing with native CUDA code. Finally, the remaining subroutines, except read and write, are ported to the GPU for maximum speedup. The overall implementation is termed ACC-EVPCUFFT for execution on a single GPU and MPI-ACC-EVPCUFFT for execution on multiple GPUs. We present speedups obtained using NVIDIA Tesla K80 and P100 GPUs relative to the original serial implementation and a recent MPI implementation of the code [56]. The proposed computational algorithms have been successfully applied to crystal plasticity modeling of pure copper and a dual-phase steel.
Summary of the EVPFFT model
In our notation, tensors are denoted by non-italic bold letters while scalars are italic and not bold. In the crystal plasticity framework, the viscoplastic strain rate ε̇^p(x) is related to the stress σ(x) at a single-crystal material point through a sum over the N active slip systems, of the form [66], [67]

ε̇^p(x) = Σ_s m^s(x) γ̇^s(x) = γ̇_0 Σ_s m^s(x) ( |m^s(x) : σ(x)| / τ_c^s(x) )^n sgn( m^s(x) : σ(x) )

where γ̇^s(x), τ_c^s(x) and m^s(x) are, respectively, the shear rate, the critical resolved shear stress (CRSS) and the symmetric Schmid tensor of slip system s, γ̇_0 is a reference shear rate, and n is the rate-sensitivity exponent.
Simple compression and plane strain compression of oxygen free high conductivity (OFHC) copper
In order to compare the accuracy of the developed parallel implementations, we performed a plane strain compression (PSC) case study, in which the deformation behavior and texture evolution of an oxygen free high conductivity (OFHC) polycrystalline copper are simulated. The sample RVE underwent PSC with an applied strain rate of 0.001 (1/s) up to an accumulated equivalent strain of 0.5. The copper polycrystal is composed of face-centered cubic (FCC) grains with random orientation
EVPFFT intensive computations — the hotspot analysis
The EVPFFT code was profiled using the Portland Group Inc. (PGI) performance profiler 2018 v18.30 [79]. This was done for four different problem sizes of 16³, 32³, 64³, and 128³ voxels, representing the total number of FFT points in the representative volume element (RVE). Fig. 3 shows the distribution of hotspots (i.e. intensive computations) throughout different parts of the code. Evidently, the NR iterative solver for Eq. (17) including the elasto-plastic decomposition and the Jacobian
Background on GPU, OpenACC, and CUDA
OpenACC, originally developed by the major vendors CAPS [80], CRAY [81], and NVIDIA PGI [82], is a high-level, performance-portable parallel programming model based on directives/pragmas that enables scientists and programmers to accelerate their codes without changing the code structure significantly [64]. The main reason for using OpenACC is to maintain the performance portability of a given code. In some cases, OpenACC can result in better efficiency than its peers CUDA and OpenCL [83], [84]
Porting NR to GPU using OpenACC
The NR solver was ported to the GPU using OpenACC directives. Fig. 4 shows pseudocode for the NR solver, representing the GPU implementation using OpenACC parallel and data directives. Readers can refer to [84] for detailed information about OpenACC and how to employ it efficiently for porting a code to run on GPU-based hardware.
Note that the NR solver includes several subroutine calls inside the three nested loops iterating over all the voxels. Calling subroutines from device code necessitates a
FFT libraries: GPU vs. multicore CPU
One of the common FFT solvers used in a wide variety of scientific codes, including the original EVPFFT solver, is the “FOURN” routine, presented in Numerical Recipes in FORTRAN and C++ [94], [95]. Although this routine is commonly used, more advanced libraries have been recently developed to perform FFTs with a higher level of efficiency. FFTW and its MPI version [57], [96], [97], [98], [99] and CUFFT [65], [100], [101] are currently among the fastest FFT libraries, running on a single CPU
GPU acceleration of remaining routines
With NR and the FFTs running on the GPU using OpenACC and CUDA, the main portion of the code is GPU-accelerated according to the workload distribution provided earlier in Fig. 3. However, as long as the remaining routines do not run on the GPU, back-and-forth data transfer between CPU and GPU is unavoidable. This becomes more important when the invocations occur frequently due to the highly nested NR iterations inside the FFT equilibrium field iterations. The same applies to the data copy
Multi-GPU implementation combining OpenACC with MPI (MPI-ACC-EVPCUFFT)
In order to run ACC-EVPCUFFT on multiple GPUs, we leverage our previous work, which used the domain decomposition approach and the message passing interface (MPI) standard [105], [106], [107], to provide the capability of utilizing many GPUs in a GPU cluster. To this end, the outermost loops over the FFT voxels (see Fig. 4) are split into chunks of data (i.e. domain decomposition). Fig. 15 shows a schematic view of this domain decomposition for 4 GPUs in one direction.
In order to enable the code
MPI-ACC-EVPCUFFT benchmark on Cray Titan: Cluster of distributed GPU nodes
In order to benchmark the code on distributed GPU nodes, the Titan supercomputer [108] located at Oak Ridge National Laboratory (ORNL) is used to run our crystal plasticity simulations. This supercomputer is equipped with NVIDIA Tesla K20X GPUs. Table 2 provides the hardware specs for the NVIDIA Tesla K20X.
Note that the NVIDIA Tesla K20X is older than its more recent peers K80 and P100, resulting in noticeably lower performance due to its hardware architecture. It is notable that a new
Application of ACC-EVPCUFFT to resolving fields in a dual phase steel DP 590 microstructure
To demonstrate another utility of the implementation, we use it to resolve fine microstructural features. Large RVE sizes are desirable for understanding the behavior of multi-phase alloys. Understanding the strain and stress gradients varying at interfaces is crucial for the performance design of polycrystalline multi-phase alloys. In a ferritic–martensitic dual-phase steel, the primary phase is the ferrite matrix, in which the martensitic phase acts as a reinforcement. Phase fractions depend on the alloy type.
A flowchart summarizing the developed parallel implementations of the EVPFFT solver
To sum up all of the parallel implementations of the EVPFFT solver developed so far, a comprehensive flowchart is provided in Fig. 20, illustrating the flow of the code for all parallel implementations, including the features from previous work [56]. This schematic gives the reader a review of all the performance improvements presented here in a nutshell.
Conclusions
In this work, we developed a computationally efficient implementation of a full-field crystal plasticity solver for predicting the micromechanical behavior of crystalline materials by taking advantage of GPUs. While porting the NR subroutine to the GPU, it was found that using GJE (Gauss–Jordan elimination) in the NR solver results in an improvement over LU decomposition, because the GJE linear equation solver avoids serialized execution across GPU threads. Next, the GPU implementation executes the FFT calculations using the CUFFT library
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was sponsored by the U.S. National Science Foundation and was accomplished under CAREER Grant No. CMMI-1650641.
☆ The review of this paper was arranged by Prof. D.P. Landau.