
A review of CUDA optimization techniques and tools for structured grid computing


Abstract

Recent advances in GPUs have opened new opportunities for harnessing their computing power for general-purpose computing. CUDA, an extension to the C programming language, was developed for programming NVIDIA GPUs. However, programming GPUs efficiently with CUDA is tedious and error prone, even for expert programmers: the programmer must optimize resource occupancy and manage data transfers between the host and the GPU, as well as across the GPU memory system. This paper presents the basic architectural optimizations and explores their implementations in research and industry compilers. The review focuses on accelerating computational science applications, in particular the class of structured grid computations (SGCs). It also discusses the mismatch between current compiler techniques and the requirements for implementing efficient iterative linear solvers, and surveys the approaches computational scientists use to program SGCs. Finally, a set of tools providing the main optimization functionalities of an integrated library is proposed to ease the process of defining complex SGC data structures and optimizing solver code through an intelligent high-level interface and domain-specific annotations.




Acknowledgements

The authors would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science and Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM) for funding this work through project No. 12-INF3008-04 as part of the National Science, Technology and Innovation Plan.

Author information


Corresponding author

Correspondence to Ayaz H. Khan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Al-Mouhamed, M.A., Khan, A.H. & Mohammad, N. A review of CUDA optimization techniques and tools for structured grid computing. Computing 102, 977–1003 (2020). https://doi.org/10.1007/s00607-019-00744-1

