Abstract
Recent advances in GPUs have opened a new opportunity for harnessing their computing power for general-purpose computing. CUDA, an extension to the C programming language, was developed for programming NVIDIA GPUs. However, programming GPUs efficiently with CUDA is tedious and error prone even for expert programmers: the programmer must optimize resource occupancy and manage data transfers between the host and the GPU, and across the GPU memory hierarchy. This paper presents the basic architectural optimizations and explores their implementations in research and industry compilers. The review focuses on accelerating computational science applications such as the class of structured grid computations (SGCs). It also discusses the mismatch between current compiler techniques and the requirements for implementing efficient iterative linear solvers, and it explores the approaches computational scientists use to program SGCs. Finally, a set of tools with the main optimization functionalities for an integrated library is proposed to ease the process of defining complex SGC data structures and optimizing solver code using an intelligent high-level interface and domain-specific annotations.
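To make concrete the kind of structured grid computation and manual data management the abstract refers to, the following is a minimal illustrative CUDA sketch of a 5-point Jacobi stencil on a square grid. All names here (`N`, `jacobi5pt`, `run_iteration`, the 16×16 block shape) are hypothetical choices for illustration, not taken from the reviewed tools.

```cuda
// Illustrative sketch only: one Jacobi iteration of a 5-point stencil
// on an N x N structured grid, with explicit host<->device transfers.
#include <cuda_runtime.h>

#define N 512  // assumed grid size for this example

__global__ void jacobi5pt(const float *in, float *out) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1 && j > 0 && j < N - 1) {
        // Each interior point is updated from its four neighbors.
        out[i * N + j] = 0.25f * (in[(i - 1) * N + j] + in[(i + 1) * N + j] +
                                  in[i * N + (j - 1)] + in[i * N + (j + 1)]);
    }
}

void run_iteration(float *h_grid) {
    float *d_in, *d_out;
    size_t bytes = (size_t)N * N * sizeof(float);

    // The programmer is responsible for every allocation and transfer.
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_grid, bytes, cudaMemcpyHostToDevice);

    // The block shape affects occupancy; 16x16 = 256 threads is a common choice.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    jacobi5pt<<<grid, block>>>(d_in, d_out);

    cudaMemcpy(h_grid, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Even in this small sketch, the thread-block shape, the boundary guard, and the placement of the two `cudaMemcpy` calls are all performance-critical decisions left to the programmer, which is precisely the burden the surveyed compilers and tools aim to automate.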
Acknowledgements
The authors would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science and Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM) for funding this work through project No. 12-INF3008-04 as part of the National Science, Technology and Innovation Plan.
Cite this article
Al-Mouhamed, M.A., Khan, A.H. & Mohammad, N. A review of CUDA optimization techniques and tools for structured grid computing. Computing 102, 977–1003 (2020). https://doi.org/10.1007/s00607-019-00744-1
Keywords
- Scientific simulations
- Structured grid computing (SGC)
- CUDA
- Massively parallel programming
- Kernel optimizations