Abstract
With the rise of graphics processing units (GPUs), the parallel computing community needs better tools to productively extract performance from the GPU. While modern compilers provide flags to activate different optimizations to improve performance, the effectiveness of such automated optimization has been limited at best. As a consequence, extracting the best performance from an algorithm on a GPU requires significant expertise and manual effort to exploit both spatial and temporal sharing of computing resources. In particular, maximizing the performance of an algorithm on a GPU requires extensive hyperparameter (e.g., thread-block size) selection and tuning. Given the myriad hyperparameter dimensions to optimize across, the search space of optimizations is extremely large, making it infeasible to evaluate exhaustively. This paper proposes an approach that uses statistical analysis with iterative machine learning (IterML) to prune and tune hyperparameters to achieve better performance. During each iteration, we leverage machine-learning models to guide the pruning and tuning for subsequent iterations. We evaluate our IterML approach on GPU thread-block size across many benchmarks running on an NVIDIA P100 or V100 GPU. Our experimental results show that our automated IterML approach reduces search effort by 40% to 80% compared to traditional (non-iterative) ML, and that the performance of our (unmodified) GPU applications can improve significantly (between 67% and 95%) simply by changing the thread-block size.
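To make the iterative prune-and-tune idea concrete, the following is a minimal sketch of an IterML-style search loop over GPU thread-block sizes: measure a few configurations, fit a surrogate model on the measurements, prune the candidates the model predicts to be slow, and repeat. The synthetic `runtime()` function, the 1-nearest-neighbor surrogate, and the halving pruning policy are illustrative assumptions for this sketch, not the paper's exact method.

```python
# Sketch of an iterative ML (IterML) search over thread-block sizes.
# Assumptions: runtime() stands in for timing a real kernel launch, and a
# 1-nearest-neighbor surrogate stands in for the paper's ML models.
import random

def runtime(block_size):
    # Synthetic stand-in for timing a kernel: fastest near 256 threads.
    return abs(block_size - 256) / 256.0 + 0.1

def predict(block_size, measured):
    # 1-nearest-neighbor surrogate: predicted runtime is that of the
    # closest configuration measured so far.
    nearest = min(measured, key=lambda m: abs(m - block_size))
    return measured[nearest]

def iter_ml_search(candidates, iters=3, samples_per_iter=4, keep_frac=0.5, seed=0):
    rng = random.Random(seed)
    measured = {}                 # block size -> observed runtime
    pool = list(candidates)
    for _ in range(iters):
        # 1. Measure a few not-yet-measured configurations from the pool.
        fresh = [c for c in pool if c not in measured]
        for c in rng.sample(fresh, min(samples_per_iter, len(fresh))):
            measured[c] = runtime(c)
        # 2. Prune: keep only the fraction the surrogate predicts fastest.
        ranked = sorted(pool, key=lambda c: predict(c, measured))
        pool = ranked[:max(1, int(len(ranked) * keep_frac))]
    best = min(measured, key=measured.get)
    return best, len(measured)

# Thread-block sizes in multiples of the 32-thread warp, a typical CUDA
# tuning dimension; far fewer than the 32 candidates are ever timed.
best, evaluated = iter_ml_search(list(range(32, 1025, 32)))
```

With three iterations of four measurements each, at most 12 of the 32 candidate block sizes are ever timed, which mirrors the abstract's claim of reduced search effort relative to exhaustive evaluation.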
Acknowledgments
This work was supported in part by the Air Force Office of Scientific Research (AFOSR) Computational Mathematics Program via AFOSR Grant No. FA9550-17-1-0205, as well as by Virginia Tech's Advanced Research Computing (ARC) via access to their high-performance computing resources with graphics processing units (GPUs).
Cite this article
Cui, X., Feng, Wc. IterML: Iterative Machine Learning for Intelligent Parameter Pruning and Tuning in Graphics Processing Units. J Sign Process Syst 93, 391–403 (2021). https://doi.org/10.1007/s11265-020-01604-4