TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2021-02-17, DOI: 10.1016/j.jpdc.2021.02.013
Cody Rivera, Jieyang Chen, Nan Xiong, Jing Zhang, Shuaiwen Leon Song, Dingwen Tao

Linear algebra operations are widely used in big data analytics and scientific computing. Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped input, but few efforts focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not fully exploit memory bandwidth and computing power; therefore, they achieve only sub-optimal performance. In this paper, we propose two efficient algorithms, TSM2R and TSM2L, for two classes of tall-and-skinny matrix–matrix multiplications on GPUs. Both focus on optimizing multiplications in which at least one of the input matrices is tall-and-skinny. Specifically, TSM2R is designed for a large regular-shaped matrix multiplying a tall-and-skinny matrix, while TSM2L is designed for a tall-and-skinny matrix multiplying a small regular-shaped matrix. We implement the proposed algorithms and evaluate them on several modern NVIDIA GPU microarchitectures. Experiments show that, compared to the current state of the art, (1) TSM2R speeds up the computation by 1.6x on average and improves memory bandwidth utilization and computing power utilization by 18.1% and 20.5% on average, respectively, when the regular-shaped matrix is relatively large or medium-sized; and (2) TSM2L speeds up the computation by 1.9x on average and improves memory bandwidth utilization by up to 9.3% on average when the regular-shaped matrix is relatively small.
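To make the TSM2R problem shape concrete, below is a minimal CUDA sketch of multiplying a large n x n matrix A by an n x K tall-and-skinny matrix B with K much smaller than n. It follows the general idea of keeping the skinny matrix on-chip and streaming the large matrix, but the kernel name tsm2r_sketch, the tile size, and the fixed compile-time K are illustrative assumptions for this sketch, not the authors' tuned implementation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int K = 8;       // columns of the skinny matrix B (fixed here for simplicity)
constexpr int TILE = 128;  // rows of B staged in shared memory per iteration

// Computes C = A * B with A (n x n, row-major), B (n x K), C (n x K).
// Each thread owns one row of C, accumulating its K partial sums in registers,
// so each element of the large matrix A is read from global memory only once.
__global__ void tsm2r_sketch(const float *A, const float *B, float *C, int n) {
    __shared__ float Bs[TILE][K];
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row of C
    float acc[K] = {0.0f};

    for (int t = 0; t < n; t += TILE) {
        // Cooperatively stage a TILE x K slab of B into shared memory.
        for (int i = threadIdx.x; i < TILE * K; i += blockDim.x) {
            int r = t + i / K;
            Bs[i / K][i % K] = (r < n) ? B[(size_t)r * K + i % K] : 0.0f;
        }
        __syncthreads();
        if (row < n) {
            // Stream this thread's row of A; each loaded element is reused K times.
            for (int j = 0; j < TILE && t + j < n; ++j) {
                float a = A[(size_t)row * n + (t + j)];
                for (int c = 0; c < K; ++c) acc[c] += a * Bs[j][c];
            }
        }
        __syncthreads();  // reached by all threads before the slab is overwritten
    }
    if (row < n)
        for (int c = 0; c < K; ++c) C[(size_t)row * K + c] = acc[c];
}

int main() {
    const int n = 4096;
    float *A, *B, *C;
    cudaMallocManaged(&A, (size_t)n * n * sizeof(float));
    cudaMallocManaged(&B, (size_t)n * K * sizeof(float));
    cudaMallocManaged(&C, (size_t)n * K * sizeof(float));
    for (size_t i = 0; i < (size_t)n * n; ++i) A[i] = 1.0f;
    for (size_t i = 0; i < (size_t)n * K; ++i) B[i] = 1.0f;

    tsm2r_sketch<<<(n + TILE - 1) / TILE, TILE>>>(A, B, C, n);
    cudaDeviceSynchronize();
    printf("C[0][0] = %.0f (expected %d)\n", C[0], n);  // 1-filled inputs: every entry is n

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

With 1-filled inputs every output entry equals n, giving a quick correctness check. A production kernel would additionally tune the tile size, register blocking, and prefetching per GPU microarchitecture, which is where the paper's optimizations come in.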




Updated: 2021-02-28