
IterML: Iterative Machine Learning for Intelligent Parameter Pruning and Tuning in Graphics Processing Units

Published in: Journal of Signal Processing Systems

Abstract

With the rise of graphics processing units (GPUs), the parallel computing community needs better tools to productively extract performance from the GPU. While modern compilers provide flags to activate different optimizations to improve performance, the effectiveness of such automated optimization has been limited at best. As a consequence, extracting the best performance from an algorithm on a GPU requires significant expertise and manual effort to exploit both spatial and temporal sharing of computing resources. In particular, maximizing the performance of an algorithm on a GPU requires extensive hyperparameter (e.g., thread-block size) selection and tuning. Given the myriad of hyperparameter dimensions to optimize across, the search space of optimizations is extremely large, making it infeasible to exhaustively evaluate. This paper proposes an approach that uses statistical analysis with iterative machine learning (IterML) to prune and tune hyperparameters to achieve better performance. During each iteration, we leverage machine-learning models to guide the pruning and tuning for subsequent iterations. We evaluate our IterML approach on the GPU thread-block size across many benchmarks running on an NVIDIA P100 or V100 GPU. Our experimental results show that our automated IterML approach reduces search effort by 40% to 80% when compared to traditional (non-iterative) ML and that the performance of our (unmodified) GPU applications can improve significantly — between 67% and 95% — simply by changing the thread-block size.
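The iterative prune-and-tune loop described in the abstract can be sketched in a few lines. This is a hypothetical, simplified sketch, not the authors' implementation: `measure_runtime` is a synthetic stand-in for timing a GPU kernel at a given thread-block size, and a nearest-measured-neighbor lookup stands in for the regression models the paper trains each iteration.

```python
import random

random.seed(0)  # make this illustrative run reproducible

def measure_runtime(block_size):
    # Hypothetical stand-in for launching and timing a GPU kernel;
    # a synthetic curve whose best thread-block size is near 256.
    return abs(block_size - 256) / 256.0 + 1.0

def iter_ml_search(candidates, samples_per_iter=4, keep_fraction=0.5, iterations=3):
    """Iteratively sample a few configurations, fit a simple predictor,
    and prune the search space toward the predicted-fastest region."""
    observed = {}          # block size -> measured runtime
    space = list(candidates)
    for _ in range(iterations):
        # 1. Measure a few not-yet-measured configurations in the surviving space.
        unmeasured = [c for c in space if c not in observed]
        for c in random.sample(unmeasured, min(samples_per_iter, len(unmeasured))):
            observed[c] = measure_runtime(c)

        # 2. "Model": predict a configuration's runtime from its nearest
        #    measured neighbor (a toy stand-in for an ML regressor).
        def predict(c):
            nearest = min(observed, key=lambda m: abs(m - c))
            return observed[nearest]

        # 3. Prune: keep only the fraction of the space predicted to be fastest;
        #    the next iteration samples within this smaller region.
        space.sort(key=predict)
        space = space[:max(1, int(len(space) * keep_fraction))]
    best = min(observed, key=observed.get)
    return best, observed

candidates = [32 * i for i in range(1, 33)]   # thread-block sizes 32..1024
best, observed = iter_ml_search(candidates)
```

Because each iteration prunes the space before sampling again, the search measures far fewer configurations than an exhaustive sweep, which is the effect the abstract quantifies as a 40% to 80% reduction in search effort.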


Notes

  1. A well-known computational fluid dynamics (CFD) problem for viscous incompressible fluid flow.


Acknowledgments

This work was supported in part by the Air Force Office of Scientific Research (AFOSR) Computational Mathematics Program via AFOSR Grant No. FA9550-17-1-0205, as well as by Virginia Tech's Advanced Research Computing (ARC) via access to its high-performance computing resources with graphics processing units (GPUs).

Author information

Corresponding author

Correspondence to Xuewen Cui.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cui, X., & Feng, W.-c. IterML: Iterative Machine Learning for Intelligent Parameter Pruning and Tuning in Graphics Processing Units. J Sign Process Syst 93, 391–403 (2021). https://doi.org/10.1007/s11265-020-01604-4

