Segmented Merge: A New Primitive for Parallel Sparse Matrix Computations, International Journal of Parallel Programming


Segmented Merge: A New Primitive for Parallel Sparse Matrix Computations
International Journal of Parallel Programming (IF 0.9), Pub Date: 2021-03-26, DOI: 10.1007/s10766-021-00695-1
Haonan Ji, Shibo Lu, Kaixi Hou, Hao Wang, Zhou Jin, Weifeng Liu, Brian Vinter

Segmented operations, such as segmented sum, segmented scan and segmented sort, are important building blocks for parallel irregular algorithms. In this work, we propose a new parallel primitive called segmented merge. It merges q sub-segments into p segments in parallel, where both the segments and the sub-segments may have nonuniform lengths; this irregularity easily causes load-balancing and vectorization problems on massively parallel processors such as GPUs. Our algorithm resolves these problems by first recording the boundaries of the segments and sub-segments, then assigning roughly the same number of elements to each GPU thread, and finally iteratively merging the sub-segments within each segment in a binary-tree fashion until only one sub-segment remains in each segment. We implement the segmented merge primitive on GPUs and demonstrate its efficiency on parallel sparse matrix transposition (SpTRANS) and sparse matrix–matrix multiplication (SpGEMM) operations. We conduct a comparative experiment against the NVIDIA vendor library on two GPUs. The experimental results show that our algorithm achieves on average 3.94× (up to 13.09×) and 2.89× (up to 109.15×) speedup on SpTRANS and SpGEMM, respectively.
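
To make the semantics of the primitive concrete, here is a minimal sequential sketch in Python. It is not the authors' GPU implementation: it omits the per-thread load balancing and vectorization entirely and only illustrates the binary-tree merging of sub-segments within each segment. The input layout (a flat `values` list plus two offset arrays, `sub_ptr` and `seg_ptr`) and all function names are assumptions chosen for illustration, and each sub-segment is assumed to be already sorted.

```python
def merge_two(a, b):
    """Standard stable two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def segmented_merge(values, sub_ptr, seg_ptr):
    """Merge the sorted sub-segments of every segment into one sorted run.

    Hypothetical layout (an assumption, not taken from the paper):
      values  : flat list holding all elements
      sub_ptr : q+1 offsets into values; sub-segment k is
                values[sub_ptr[k]:sub_ptr[k+1]]
      seg_ptr : p+1 offsets into the sub-segment index range; segment s
                owns sub-segments seg_ptr[s] .. seg_ptr[s+1]-1
    """
    result = []
    for s in range(len(seg_ptr) - 1):
        # Collect the sub-segments that belong to segment s.
        runs = [values[sub_ptr[k]:sub_ptr[k + 1]]
                for k in range(seg_ptr[s], seg_ptr[s + 1])]
        # Binary-tree rounds: merge neighbouring pairs until one run is left.
        while len(runs) > 1:
            merged = [merge_two(runs[k], runs[k + 1])
                      for k in range(0, len(runs) - 1, 2)]
            if len(runs) % 2 == 1:
                merged.append(runs[-1])  # odd run out is carried to the next round
            runs = merged
        if runs:
            result.extend(runs[0])
    return result


# Two segments (p = 2) built from five sub-segments (q = 5) of uneven length.
values = [1, 4, 7, 2, 3, 5, 9, 0, 8]
sub_ptr = [0, 3, 5, 6, 7, 9]   # [1,4,7] [2,3] [5] [9] [0,8]
seg_ptr = [0, 3, 5]            # segment 0: first three; segment 1: last two
print(segmented_merge(values, sub_ptr, seg_ptr))
```

In this sketch the outer loop over segments and the pairwise merges inside one round are independent of each other, which is the parallelism a GPU version would exploit; how the work is split evenly across threads is left out here.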


Updated: 2021-03-27
