HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs
Computing (IF 3.7) Pub Date: 2020-10-11, DOI: 10.1007/s00607-020-00846-1
Homin Kang, Hyuck Chan Kwon, Duksu Kim

We present a novel heterogeneous parallel matrix multiplication algorithm that utilizes both central processing units (CPUs) and graphics processing units (GPUs) for large-scale matrices. Based on Strassen's method, we represent the matrix multiplication work as a set of matrix addition and multiplication tasks among the sub-matrices of the input matrices. Then, we distribute the tasks to CPUs and GPUs while considering the characteristics of the tasks and computing resources, to minimize the data communication overhead and fully utilize the available computing power. To handle a large matrix efficiently with limited GPU memory, we also propose a block-based work decomposition method. We then further improve the performance of our method by exploiting the concurrent execution abilities of a heterogeneous parallel computing system. We implemented our method on five different heterogeneous systems and applied it to matrices of various sizes. Our method generally shows higher performance than prior GPU-based matrix multiplication methods. Moreover, compared with the state-of-the-art GPU matrix multiplication library (i.e., CUBLAS), our method achieved up to 1.97 times higher performance using the same GPUs and CPU cores. In some cases, our method using a low-performance GPU (e.g., GTX 1060, 3 GB) achieved performance comparable to that of CUBLAS using a high-performance GPU (e.g., RTX 2080, 8 GB). Also, our method continues to improve its performance as more computing resources, such as additional CPU cores and GPUs, are used. We could achieve such high performance because our approach fully utilizes the capacities of the given heterogeneous parallel computing system while employing Strassen's method, which has a lower asymptotic complexity. These results demonstrate the efficiency and robustness of our algorithm.
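To make the Strassen-based task decomposition described above concrete, the following is a minimal sketch of one level of Strassen's scheme, in which the eight naive sub-matrix multiplications are replaced by seven multiplication tasks plus a set of addition tasks over the sub-matrices. It illustrates only the underlying Strassen decomposition, not the authors' HPMaX implementation; the function name strassen_once, the single-threaded NumPy formulation, and the 512 x 512 test size are assumptions made for clarity.

# Minimal one-level Strassen sketch (illustration only, not the HPMaX code).
import numpy as np

def strassen_once(A, B):
    """One level of Strassen's recursion: 7 sub-matrix multiplications
    and 18 additions/subtractions instead of the naive 8 multiplications.
    Assumes square matrices with an even dimension."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # The seven multiplication tasks of Strassen's method.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Addition tasks that assemble the four result blocks.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

    return np.block([[C11, C12], [C21, C22]])

if __name__ == "__main__":
    # Verify one recursion level against NumPy's built-in matmul.
    A = np.random.rand(512, 512)
    B = np.random.rand(512, 512)
    assert np.allclose(strassen_once(A, B), A @ B)

In a heterogeneous setting such as the one the paper targets, each of the seven multiplication tasks and the surrounding addition tasks would be the work units scheduled onto CPUs and GPUs; the sketch above simply evaluates them sequentially on the CPU.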

Updated: 2020-10-11