


A High-Throughput Solver for Marginalized Graph Kernels on GPU
arXiv - CS - Performance, Pub Date: 2019-10-14, DOI: arxiv-1910.06310
Yu-Hang Tang, Oguz Selvitopi, Doru Popovici, Aydın Buluç

We present the design and optimization of a linear solver on general-purpose GPUs for the efficient and high-throughput evaluation of the marginalized graph kernel between pairs of labeled graphs. The solver implements a preconditioned conjugate gradient (PCG) method to compute the solution to a generalized Laplacian equation associated with the tensor product of two graphs. To cope with the gap between the instruction throughput and the memory bandwidth of current-generation GPUs, our solver forms the tensor product linear system on-the-fly, without storing it in memory, when performing the matrix-vector multiplications in PCG. Such on-the-fly computation is accomplished by using threads in a warp to cooperatively stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and shared memory for later reuse. Warps across a thread block can further share tiles via shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We also propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to improve the efficiency of the sparse format. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various densities and evaluate their performance on synthetic and real-world datasets. Our solver delivers a speedup of three to four orders of magnitude over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.
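The on-the-fly idea in the abstract can be illustrated with a small CPU sketch in plain Python. It is not the authors' GPU implementation. It assumes a simplified system (D1 ⊗ D2 − A1 ⊗ A2) x = b for two unlabeled, undirected graphs without self-loops, where each degree is padded by a hypothetical stopping weight q so that the system is symmetric positive definite. Edge labels, vertex kernels, tiling, and the bitmap format are all omitted. The names kron_matvec, pcg, and q are illustrative, and the diagonal (Jacobi) preconditioner is an assumption.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def kron_matvec(A1, A2, d1, d2, x):
    """Return (D1 (x) D2 - A1 (x) A2) x without building the product matrix.

    Node pair (i, k) of the product graph maps to flat index i * n2 + k.
    """
    n1, n2 = len(A1), len(A2)
    y = [0.0] * (n1 * n2)
    for i in range(n1):
        for k in range(n2):
            acc = d1[i] * d2[k] * x[i * n2 + k]      # diagonal term
            for j in range(n1):
                if A1[i][j] == 0:
                    continue                          # skip empty entries of A1
                for l in range(n2):
                    # each product-graph entry is formed on demand
                    acc -= A1[i][j] * A2[k][l] * x[j * n2 + l]
            y[i * n2 + k] = acc
    return y


def pcg(A1, A2, q, b, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned conjugate gradient on the implicit system."""
    d1 = [sum(row) + q for row in A1]                 # padded degrees, graph 1
    d2 = [sum(row) + q for row in A2]                 # padded degrees, graph 2
    inv_diag = [1.0 / (a * c) for a in d1 for c in d2]
    x = [0.0] * len(b)
    r = list(b)                                       # residual for x = 0
    z = [m * ri for m, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = kron_matvec(A1, A2, d1, d2, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Toy input: a 3-node path graph and a single edge (2 nodes).
A1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A2 = [[0, 1], [1, 0]]
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = pcg(A1, A2, q=0.5, b=b)
```

Only the two small adjacency matrices are ever stored. The 6 × 6 product system exists only implicitly inside kron_matvec, which is the memory-for-arithmetic trade that the abstract describes at GPU scale.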


Updated: 2020-05-20
