Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors

Abstract

In high-performance computing, the general matrix-matrix multiplication (xGEMM) routine is the core of the Level 3 BLAS kernel for efficient matrix-matrix multiplication. The performance of parallel xGEMM (PxGEMM) is determined largely by two factors: the flop rate achieved in the local block computations and the communication cost of broadcasting submatrices among processes. In this study, an approach is proposed to improve and tune the parallel double-precision general matrix-matrix multiplication (PDGEMM) routine for modern Intel processors such as Knights Landing (KNL) and Xeon Scalable Processors (SKL). The proposed approach comprises two methods that address these factors. First, the computational part of PDGEMM is improved with a blocked GEMM algorithm whose block sizes are chosen to fit the KNL and SKL architectures. Second, the communication routine is adjusted to use the message passing interface (MPI) directly, bypassing the default settings of the basic linear algebra communication subprograms (BLACS) and reducing communication time. As a result, performance improvements are demonstrated for smaller matrix multiplications on SKL clusters.
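
To make the two methods concrete, the sketch below illustrates the kind of AVX-512 inner kernel that a blocked DGEMM relies on. This is a minimal illustration of the general technique, not the authors' kernel; the loop bounds stand in for the tuned block sizes, and n is assumed to be a multiple of 8 so that one 512-bit register holds a full row strip of C.

    /* Minimal AVX-512 blocked DGEMM inner kernel (illustrative sketch).
     * Computes C += A * B for one block, row-major storage.
     * Assumes n is a multiple of 8 (one __m512d holds 8 doubles). */
    #include <immintrin.h>
    #include <stddef.h>

    static void dgemm_block_avx512(size_t m, size_t n, size_t k,
                                   const double *A, size_t lda,
                                   const double *B, size_t ldb,
                                   double *C, size_t ldc)
    {
        for (size_t i = 0; i < m; ++i) {
            for (size_t j = 0; j < n; j += 8) {
                __m512d c = _mm512_loadu_pd(&C[i * ldc + j]);        /* load C(i, j..j+7)   */
                for (size_t p = 0; p < k; ++p) {
                    __m512d a = _mm512_set1_pd(A[i * lda + p]);      /* broadcast A(i,p)    */
                    __m512d b = _mm512_loadu_pd(&B[p * ldb + j]);    /* load B(p, j..j+7)   */
                    c = _mm512_fmadd_pd(a, b, c);                    /* fused multiply-add  */
                }
                _mm512_storeu_pd(&C[i * ldc + j], c);                /* store updated C row */
            }
        }
    }

For the communication side, the adjustment described in the abstract amounts to broadcasting submatrix panels with plain MPI collectives rather than the BLACS broadcast routines. A hedged sketch, assuming a row communicator has already been created; the names here (bcast_panel, row_comm) are hypothetical and used only for illustration:

    #include <mpi.h>

    /* Broadcast one panel of doubles along a process row; a single
     * MPI_Bcast call takes the place of the BLACS broadcast pair. */
    void bcast_panel(double *panel, int panel_elems, int root, MPI_Comm row_comm)
    {
        MPI_Bcast(panel, panel_elems, MPI_DOUBLE, root, row_comm);
    }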

Acknowledgements

This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2015M3C4A7065662), and partially supported by the Supercomputer Development Leading Program of the National Research Foundation of Korea (NRF) funded by the Korean government (MSIT) (No. 2020M3H6A1084853). This work was also supported by the National Supercomputing Center with supercomputing resources including technical support (No. KSC-2020-CRE-0195).

Author information

Corresponding author

Correspondence to Jaeyoung Choi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Park, Y., Kim, R., Nguyen, T.M.T. et al. Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors. Cluster Comput 26, 2539–2549 (2023). https://doi.org/10.1007/s10586-021-03274-8
