
Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime

Published in: International Journal of Parallel Programming

Abstract

The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive mesh refinement grids on modern supercomputers. Uintah is structured with an application layer and a separate runtime system. Uintah is based on a distributed directed acyclic graph of computational tasks, with a task scheduler that efficiently schedules and executes these tasks on both CPU cores and on-node accelerators. The runtime system identifies task dependencies, creates a task graph prior to the execution of these tasks, automatically generates MPI message tags, and automatically performs halo transfers for simulation variables. Automating halo transfers in a heterogeneous environment poses significant challenges when tasks compute within a few milliseconds, as runtime overhead directly affects wall-clock execution time, or when simulation variables require large halos spanning most or all of the computational domain, as task dependencies become expensive to process. These challenges are magnified at production scale, when application developers require that each compute node perform thousands of distinct halo transfers among thousands of simulation variables. The principal contribution of this work is to (1) identify and address inefficiencies that arise when mapping tasks onto the GPU in the presence of automated halo transfers, (2) implement new schemes to reduce runtime system overhead, (3) minimize application developer involvement with the runtime, and (4) show overhead reduction results from these improvements.
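The automated halo transfers described above can be illustrated with a minimal sketch. The following Python example is a simplification for intuition only, not Uintah's actual API (the function and parameter names here are hypothetical): given two patches' cell extents on a structured grid and a halo width, it computes the box of cells the sending patch must ship so the receiver's ghost cells are filled.

```python
# Minimal sketch (assumed, not Uintah's API): each patch owns a half-open
# box of cells [low, high) per dimension. A halo transfer copies the
# sender's cells that fall inside the receiver's box extended outward by
# the halo width.

def halo_region(sender_low, sender_high, recv_low, recv_high, halo):
    """Return the (low, high) box the sender must ship to the receiver,
    or None if the patches are farther than `halo` cells apart."""
    region_low, region_high = [], []
    for d in range(len(sender_low)):
        lo = max(sender_low[d], recv_low[d] - halo)
        hi = min(sender_high[d], recv_high[d] + halo)
        if lo >= hi:
            return None  # empty in this dimension: nothing to send
        region_low.append(lo)
        region_high.append(hi)
    return tuple(region_low), tuple(region_high)

# Two adjacent 8x8 patches sharing the face x = 8, with a halo width of 1:
# the left patch sends its rightmost column of cells, [7,8) x [0,8).
print(halo_region((0, 0), (8, 8), (8, 0), (16, 8), 1))
```

A runtime automating this step computes such regions for every (sender, receiver, variable) triple implied by the task graph, which is why large halos or thousands of variables per node make dependency processing expensive.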



Acknowledgements

Funding from NSF and DOE is gratefully acknowledged. This material is based upon work supported by the National Science Foundation under Grant No. 1337145. This material is based upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number(s) DE-NA0002375. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. We would also like to acknowledge Oak Ridge Leadership Computing Facility ALCC award CSC188, “Demonstration of the Scalability of Programming Environments By Simulating Multi-Scale Applications” for time on Titan. We would also like to thank all those involved with Uintah past and present.

Author information

Correspondence to Brad Peterson.


Cite this article

Peterson, B., Humphrey, A., Sunderland, D. et al. Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime. Int J Parallel Prog 47, 1086–1116 (2019). https://doi.org/10.1007/s10766-018-0619-1
