Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication
International Journal of Parallel Programming (IF 1.5), Pub Date: 2019-01-01, DOI: 10.1007/s10766-018-0604-8
Junhong Liu, Xin He, Weifeng Liu, Guangming Tan

General sparse matrix–matrix multiplication (SpGEMM) is a fundamental building block of a number of high-level algorithms and real-world applications. In recent years, several efficient SpGEMM algorithms have been proposed for many-core processors such as GPUs. However, their implementations of sparse accumulators, the core component of SpGEMM, mostly use low-speed on-chip shared memory and global memory, while high-speed registers are seriously underutilised. In this paper, we propose three novel register-aware SpGEMM algorithms for three representative sparse accumulators: sort, merge and hash. We fully utilise GPU registers to fetch data, perform computations and store results. In the experiments, our algorithms deliver excellent performance on a benchmark suite of 205 sparse matrices from the SuiteSparse Matrix Collection. Specifically, on an Nvidia Pascal P100 GPU, our three register-aware sparse accumulators achieve on average 2.0× (up to 5.4×), 2.6× (up to 10.5×) and 1.7× (up to 5.2×) speedups over their original implementations in the libraries bhSPARSE, RMerge and NSPARSE, respectively.
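A minimal sketch of the idea the abstract describes, under assumptions not taken from the paper: the common trait of register-aware accumulation is keeping partial products in registers and exchanging them with warp shuffle intrinsics (here __shfl_down_sync) instead of staging them in shared or global memory. The hypothetical CUDA kernel below is not the paper's sort, merge or hash accumulator; it only illustrates the register-resident pattern by having one warp accumulate the partial products of a single output entry entirely in registers.

    // Hypothetical sketch, not the paper's kernels: a warp sums partial
    // products for one output entry, keeping all intermediate values in
    // registers and communicating only through warp shuffles.
    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ float warp_reduce_sum(float v) {
        // Tree reduction entirely in registers: after five shuffle steps,
        // lane 0 holds the sum of all 32 lanes' values.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;
    }

    __global__ void register_accumulate(const float* a, const float* b,
                                        float* c_out, int k) {
        int lane = threadIdx.x & 31;
        float partial = 0.0f;
        // Each lane multiplies a strided slice of the k partial products;
        // the per-lane running sum lives in a register, never in shared memory.
        for (int i = lane; i < k; i += 32)
            partial += a[i] * b[i];
        float total = warp_reduce_sum(partial);
        if (lane == 0) *c_out = total;  // one global write for the whole warp
    }

    int main() {
        const int k = 1000;
        float ha[k], hb[k];
        for (int i = 0; i < k; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }
        float *da, *db, *dc;
        cudaMalloc(&da, k * sizeof(float));
        cudaMalloc(&db, k * sizeof(float));
        cudaMalloc(&dc, sizeof(float));
        cudaMemcpy(da, ha, k * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, k * sizeof(float), cudaMemcpyHostToDevice);
        register_accumulate<<<1, 32>>>(da, db, dc, k);
        float hc;
        cudaMemcpy(&hc, dc, sizeof(float), cudaMemcpyDeviceToHost);
        printf("accumulated value: %f (expected %f)\n", hc, 2.0f * k);
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }

In the paper's accumulators, the data exchanged through shuffles would presumably be (column index, value) pairs rather than a single scalar, but the register-resident communication pattern the abstract describes is the same.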

Updated: 2019-01-01