Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors

Abstract

In high-performance computing, the general matrix-matrix multiplication (xGEMM) routine is the core of the Level 3 BLAS kernel for efficient matrix-matrix multiplication. The performance of parallel xGEMM (PxGEMM) is determined largely by two factors: the flop rate achieved in the local block computations and the communication cost of broadcasting submatrices among processes. In this study, an approach is proposed to improve and tune the parallel double-precision general matrix-matrix multiplication (PDGEMM) routine for modern Intel processors such as Knights Landing (KNL) and Xeon Scalable Processors (SKL). The proposed approach comprises two methods that address these factors. First, the computational part of PDGEMM is improved with a blocked GEMM algorithm whose block sizes are chosen to fit the KNL and SKL architectures. Second, the communication routine is adjusted to use the message passing interface (MPI) directly, bypassing the default settings of the basic linear algebra communication subprograms (BLACS) and reducing communication time. As a result, performance improvements are demonstrated for smaller matrix multiplications on SKL clusters.
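
To make the two methods concrete, the sketch below illustrates the kind of AVX-512 inner kernel that a blocked DGEMM relies on. This is a minimal illustration of the general technique, not the authors' kernel; the loop bounds stand in for the tuned block sizes, and n is assumed to be a multiple of 8 so that one 512-bit register holds a full row strip of C.

    /* Minimal AVX-512 blocked DGEMM inner kernel (illustrative sketch).
     * Computes C += A * B for one block, row-major storage.
     * Assumes n is a multiple of 8 (one __m512d holds 8 doubles). */
    #include <immintrin.h>
    #include <stddef.h>

    static void dgemm_block_avx512(size_t m, size_t n, size_t k,
                                   const double *A, size_t lda,
                                   const double *B, size_t ldb,
                                   double *C, size_t ldc)
    {
        for (size_t i = 0; i < m; ++i) {
            for (size_t j = 0; j < n; j += 8) {
                __m512d c = _mm512_loadu_pd(&C[i * ldc + j]);        /* load C(i, j..j+7)   */
                for (size_t p = 0; p < k; ++p) {
                    __m512d a = _mm512_set1_pd(A[i * lda + p]);      /* broadcast A(i,p)    */
                    __m512d b = _mm512_loadu_pd(&B[p * ldb + j]);    /* load B(p, j..j+7)   */
                    c = _mm512_fmadd_pd(a, b, c);                    /* fused multiply-add  */
                }
                _mm512_storeu_pd(&C[i * ldc + j], c);                /* store updated C row */
            }
        }
    }

For the communication side, the adjustment described in the abstract amounts to broadcasting submatrix panels with plain MPI collectives rather than the BLACS broadcast routines. A hedged sketch, assuming a row communicator has already been created; the names here (bcast_panel, row_comm) are hypothetical and used only for illustration:

    #include <mpi.h>

    /* Broadcast one panel of doubles along a process row; a single
     * MPI_Bcast call takes the place of the BLACS broadcast pair. */
    void bcast_panel(double *panel, int panel_elems, int root, MPI_Comm row_comm)
    {
        MPI_Bcast(panel, panel_elems, MPI_DOUBLE, root, row_comm);
    }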

Acknowledgements

This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2015M3C4A7065662), and partially supported by the Supercomputer Development Leading Program of the National Research Foundation of Korea (NRF) funded by the Korean government (MSIT) (No. 2020M3H6A1084853). This work was also supported by the National Supercomputing Center with supercomputing resources including technical support (No. KSC-2020-CRE-0195).

Author information

Corresponding author

Correspondence to Jaeyoung Choi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Park, Y., Kim, R., Nguyen, T.M.T. et al. Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors. Cluster Comput 26, 2539–2549 (2023). https://doi.org/10.1007/s10586-021-03274-8
