A High-Throughput Solver for Marginalized Graph Kernels on GPU
arXiv - CS - Performance, Pub Date: 2019-10-14, DOI: arxiv-1910.06310, Yu-Hang Tang, Oguz Selvitopi, Doru Popovici, Aydın Buluç
We present the design and optimization of a linear solver on General Purpose
GPUs for the efficient and high-throughput evaluation of the marginalized graph
kernel between pairs of labeled graphs. The solver implements a preconditioned
conjugate gradient (PCG) method to compute the solution to a generalized
Laplacian equation associated with the tensor product of two graphs. To cope
with the gap between the instruction throughput and the memory bandwidth of
current-generation GPUs, our solver forms the tensor product linear system
on the fly, without storing it in memory, when performing matrix-vector
products in PCG. Such on-the-fly computation is accomplished by using
threads in a warp to cooperatively stream the adjacency and edge label matrices
of individual graphs by small square matrix blocks called tiles, which are then
staged in registers and the shared memory for later reuse. Warps across a
thread block can further share tiles via the shared memory to increase data
reuse. We exploit the sparsity of the graphs hierarchically by storing only
non-empty tiles using a coordinate format and nonzero elements within each tile
using bitmaps. In addition, we propose a new partition-based reordering algorithm
for aggregating nonzero elements of the graphs into fewer but denser tiles to
improve the efficiency of the sparse format.
We carry out extensive theoretical analyses of the graph tensor product
primitives for tiles of various densities and evaluate their performance on
synthetic and real-world datasets. Our solver delivers three to four orders of
magnitude speedup over existing CPU-based solvers such as GraKeL and
GraphKernels. The capability of the solver enables kernel-based learning tasks
at unprecedented scales.
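
To make the on-the-fly idea concrete, here is a minimal CPU sketch in plain Python. It is not the authors' GPU implementation. It assumes a simplified, unlabeled system (diag(d) - A1 ⊗ A2) x = b, where A1 and A2 are symmetric adjacency matrices and d is chosen so that the system is symmetric positive definite. The Jacobi (diagonal) preconditioner is also an assumption, since the abstract does not say which preconditioner is used. The names product_diagonal, kron_matvec and pcg are hypothetical. The point of the sketch is that kron_matvec never materializes the (n1*n2) x (n1*n2) product matrix; it only reads the two small per-graph matrices.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def product_diagonal(A1, A2, shift=1.0):
    """Assumed diagonal: d[(i, ip)] = deg1[i] * deg2[ip] + shift.

    For symmetric, nonnegative weights and shift > 0, the matrix
    diag(d) - kron(A1, A2) is strictly diagonally dominant, hence SPD.
    """
    deg1 = [sum(row) for row in A1]
    deg2 = [sum(row) for row in A2]
    return [d1 * d2 + shift for d1 in deg1 for d2 in deg2]


def kron_matvec(A1, A2, diag, x):
    """Return (diag(d) - kron(A1, A2)) @ x without forming kron(A1, A2).

    Product-graph node (i, ip) is stored at flat index i * n2 + ip.
    """
    n1, n2 = len(A1), len(A2)
    y = [diag[k] * x[k] for k in range(n1 * n2)]
    for i in range(n1):
        for j in range(n1):
            a = A1[i][j]
            if a == 0.0:
                continue  # skip empty entries of the first graph
            for ip in range(n2):
                row = A2[ip]
                acc = 0.0
                for jp in range(n2):
                    acc += row[jp] * x[j * n2 + jp]
                y[i * n2 + ip] -= a * acc
    return y


def pcg(matvec, b, inv_diag, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient with a Jacobi preconditioner."""
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual for the initial guess x = 0
    z = [inv_diag[k] * r[k] for k in range(n)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xk + alpha * pk for xk, pk in zip(x, p)]
        r = [rk - alpha * apk for rk, apk in zip(r, Ap)]
        z = [inv_diag[k] * r[k] for k in range(n)]
        rz_new = dot(r, z)
        p = [zk + (rz_new / rz) * pk for zk, pk in zip(z, p)]
        rz = rz_new
    return x


if __name__ == "__main__":
    A1 = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # 3-node path
    A2 = [[0.0, 1.0], [1.0, 0.0]]                              # single edge
    d = product_diagonal(A1, A2)
    x = pcg(lambda v: kron_matvec(A1, A2, d, v), [1.0] * 6, [1.0 / k for k in d])
    print(x)
```

The memory footprint of the matrix-vector product is O(n1^2 + n2^2) for the two graphs rather than O(n1^2 * n2^2) for the explicit product, which is the trade of extra arithmetic for less memory traffic that the abstract describes. The tiling, bitmap storage and warp-level sharing mentioned above are not modeled here.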
Updated: 2020-05-20