Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors
Cluster Computing (IF 4.4), Pub Date: 2021-04-12, DOI: 10.1007/s10586-021-03274-8
Yoosang Park, Raehyun Kim, Thi My Tuyen Nguyen, Jaeyoung Choi

In high-performance computing, the general matrix-matrix multiplication (xGEMM) routine is the core of the Level 3 BLAS kernels. The performance of parallel xGEMM (PxGEMM) depends on two main factors: the flop rate achieved in the local computation and the communication cost of broadcasting submatrices to other processes. This study proposes an approach to improve and tune the parallel double-precision general matrix-matrix multiplication (PDGEMM) routine for modern Intel processors such as Knights Landing (KNL) and Xeon Scalable Processors (SKL). The approach consists of two methods, one for each factor. First, the computational part of PDGEMM is improved with a blocked GEMM algorithm whose block sizes are chosen to better fit the KNL and SKL architectures. Second, the communication routine is adjusted using the Message Passing Interface (MPI) to work around the settings of the Basic Linear Algebra Communication Subprograms (BLACS) and reduce communication time. As a result, performance improvements are shown for smaller matrix multiplications on the SKL clusters.
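
To make the idea of a blocked GEMM concrete, here is a minimal sketch in plain Python. It is not the authors' kernel: the function name `blocked_gemm`, the square tile shape, and the default `block=4` are assumptions chosen for illustration, and nothing in it uses AVX-512. It only shows the loop structure in which C is updated tile by tile, so that each tile of A and B is reused while it is still in fast memory.

```python
def blocked_gemm(A, B, C, block=4):
    """Accumulate C += A * B on nested lists, one square tile at a time.

    A is m x k, B is k x n, C is m x n. `block` is a hypothetical tile
    size; a tuned kernel would pick it to match the cache hierarchy.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, block):          # tile rows of C
        for p0 in range(0, k, block):      # tiles along the shared dimension
            for j0 in range(0, n, block):  # tile columns of C
                # Multiply one tile of A by one tile of B into one tile of C.
                # min(...) handles edge tiles when sizes are not multiples
                # of the block size.
                for i in range(i0, min(i0 + block, m)):
                    row_a, row_c = A[i], C[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], B[p]
                        for j in range(j0, min(j0 + block, n)):
                            row_c[j] += a_ip * row_b[j]
    return C
```

Calling `blocked_gemm(A, B, C)` with `C` initialized to zeros yields the plain product; a non-zero `C` is accumulated into, which mirrors the update form of GEMM with both scaling factors set to 1.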




Updated: 2021-04-12