High performance GPU primitives for graph-tensor learning operations, Journal of Parallel and Distributed Computing


High performance GPU primitives for graph-tensor learning operations
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2020-11-09, DOI: 10.1016/j.jpdc.2020.10.011
Tao Zhang, Wang Kan, Xiao-Yang Liu

Graph-tensor learning operations extend tensor operations by taking the graph structure into account, and they have been applied in diverse domains such as image processing and machine learning. However, the running time of graph-tensor operations increases rapidly with the number of nodes and the dimension of the data on each node, making them impractical for real-time applications. In this paper, we propose a GPU library called cuGraph-Tensor for high-performance graph-tensor learning operations. It consists of eight key operations: graph shift (g-shift), graph Fourier transform (g-FT), inverse graph Fourier transform (inverse g-FT), graph filter (g-filter), graph convolution (g-convolution), graph-tensor product (g-product), graph-tensor SVD (g-SVD), and graph-tensor QR (g-QR). cuGraph-Tensor supports scalar, vector, and matrix data on each graph node. We propose optimization techniques for computation, memory access, and CPU–GPU communication that significantly improve the performance of the graph-tensor learning operations. Using the optimized operations, we build a graph data completion application on cuGraph-Tensor for fast and accurate reconstruction of incomplete graph data. In the experiments, the proposed graph learning operations achieve speedups of up to $142.12\times$ over the CPU-based GSPBOX and CPU MATLAB implementations running on two Xeon CPUs. The graph data completion application achieves speedups of up to $174.38\times$ over the CPU MATLAB implementation, and up to $3.82\times$ with better accuracy over the GPU-based tensor completion in the cuTensor-tubal library.
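As a rough illustration of the simplest of the eight operations, the sketch below implements a graph shift under the standard graph signal processing definition, in which the shifted data is the adjacency matrix applied to the node data. The abstract does not define g-shift, so this definition, the function name `graph_shift`, and the example graph are assumptions for illustration only. It is plain CPU Python, not the cuGraph-Tensor API, and it covers only the vector-per-node case.

```python
# Minimal CPU sketch of a graph shift, assuming the standard graph signal
# processing definition: shifted = A @ X, where A is the N x N adjacency
# matrix and X holds one length-M data vector per node (N x M).
# All names here are hypothetical; this is not cuGraph-Tensor's interface.

def graph_shift(adjacency, signal):
    """Return the product of an N x N adjacency matrix and N x M node data."""
    n = len(adjacency)
    if any(len(row) != n for row in adjacency):
        raise ValueError("adjacency must be square")
    if len(signal) != n:
        raise ValueError("signal needs one row per node")
    width = len(signal[0])
    shifted = []
    for i in range(n):
        row = [0.0] * width
        for j in range(n):
            weight = adjacency[i][j]
            if weight != 0:
                # Node i accumulates the weighted data of node j.
                for k in range(width):
                    row[k] += weight * signal[j][k]
        shifted.append(row)
    return shifted


# Directed 3-node cycle 0 -> 1 -> 2 -> 0, with adjacency[i][j] = 1
# when there is an edge from node j to node i.
CYCLE = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
SIGNAL = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
SHIFTED = graph_shift(CYCLE, SIGNAL)
```

On this cycle graph each node simply receives its predecessor's vector, so three successive shifts return the original data. A GPU version would typically express the same product as a dense or sparse matrix multiplication.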


Updated: 2020-11-18
