TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2021-02-17, DOI: 10.1016/j.jpdc.2021.02.013
Cody Rivera, Jieyang Chen, Nan Xiong, Jing Zhang, Shuaiwen Leon Song, Dingwen Tao

Linear algebra operations are widely used in big data analytics and scientific computing. Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped input, but few efforts focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not fully exploit memory bandwidth and computing power; therefore, they achieve only sub-optimal performance. In this paper, we propose two efficient algorithms, TSM2R and TSM2L, for two classes of tall-and-skinny matrix–matrix multiplications on GPUs. Both focus on optimizing multiplications in which at least one of the input matrices is tall-and-skinny. Specifically, TSM2R is designed for a large regular-shaped matrix multiplying a tall-and-skinny matrix, while TSM2L is designed for a tall-and-skinny matrix multiplying a small regular-shaped matrix. We implement the proposed algorithms and evaluate them on several modern NVIDIA GPU microarchitectures. Experiments show that, compared to the current state of the art, (1) TSM2R speeds up the computation by 1.6x on average and improves memory bandwidth utilization and computing power utilization by 18.1% and 20.5% on average, respectively, when the regular-shaped matrix is relatively large or medium-sized; and (2) TSM2L speeds up the computation by 1.9x on average and improves memory bandwidth utilization by up to 9.3% on average when the regular-shaped matrix is relatively small.
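To make the TSM2R problem shape concrete, below is a minimal CUDA sketch of multiplying a large n x n matrix A by an n x K tall-and-skinny matrix B with K much smaller than n. It follows the general idea of keeping the skinny matrix on-chip and streaming the large matrix, but the kernel name tsm2r_sketch, the tile size, and the fixed compile-time K are illustrative assumptions for this sketch, not the authors' tuned implementation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int K = 8;       // columns of the skinny matrix B (fixed here for simplicity)
constexpr int TILE = 128;  // rows of B staged in shared memory per iteration

// Computes C = A * B with A (n x n, row-major), B (n x K), C (n x K).
// Each thread owns one row of C, accumulating its K partial sums in registers,
// so each element of the large matrix A is read from global memory only once.
__global__ void tsm2r_sketch(const float *A, const float *B, float *C, int n) {
    __shared__ float Bs[TILE][K];
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row of C
    float acc[K] = {0.0f};

    for (int t = 0; t < n; t += TILE) {
        // Cooperatively stage a TILE x K slab of B into shared memory.
        for (int i = threadIdx.x; i < TILE * K; i += blockDim.x) {
            int r = t + i / K;
            Bs[i / K][i % K] = (r < n) ? B[(size_t)r * K + i % K] : 0.0f;
        }
        __syncthreads();
        if (row < n) {
            // Stream this thread's row of A; each loaded element is reused K times.
            for (int j = 0; j < TILE && t + j < n; ++j) {
                float a = A[(size_t)row * n + (t + j)];
                for (int c = 0; c < K; ++c) acc[c] += a * Bs[j][c];
            }
        }
        __syncthreads();  // reached by all threads before the slab is overwritten
    }
    if (row < n)
        for (int c = 0; c < K; ++c) C[(size_t)row * K + c] = acc[c];
}

int main() {
    const int n = 4096;
    float *A, *B, *C;
    cudaMallocManaged(&A, (size_t)n * n * sizeof(float));
    cudaMallocManaged(&B, (size_t)n * K * sizeof(float));
    cudaMallocManaged(&C, (size_t)n * K * sizeof(float));
    for (size_t i = 0; i < (size_t)n * n; ++i) A[i] = 1.0f;
    for (size_t i = 0; i < (size_t)n * K; ++i) B[i] = 1.0f;

    tsm2r_sketch<<<(n + TILE - 1) / TILE, TILE>>>(A, B, C, n);
    cudaDeviceSynchronize();
    printf("C[0][0] = %.0f (expected %d)\n", C[0], n);  // 1-filled inputs: every entry is n

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

With 1-filled inputs every output entry equals n, giving a quick correctness check. A production kernel would additionally tune the tile size, register blocking, and prefetching per GPU microarchitecture, which is where the paper's optimizations come in.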




Updated: 2021-02-28