High performance GPU primitives for graph-tensor learning operations

https://doi.org/10.1016/j.jpdc.2020.10.011

Highlights

  • We propose a GPU-based library of eight Graph-Tensor learning operations.

  • We propose optimization techniques on computing, memory access, and communications.

  • We develop a graph data completion application to demonstrate the library.

  • The proposed library achieves up to 142.12× speedups over CPU-based implementations.

  • The application achieves up to 3.82× speedups and better accuracy over GPU-based work.

Abstract

Graph-tensor learning operations extend tensor operations by taking the graph structure into account, and have been applied to diverse domains such as image processing and machine learning. However, the running time of graph-tensor operations increases rapidly with the number of nodes and the dimension of the data on each node, making them impractical for real-time applications. In this paper, we propose a GPU library called cuGraph-Tensor for high-performance graph-tensor learning operations, which consists of eight key operations: graph shift (g-shift), graph Fourier transform (g-FT), inverse graph Fourier transform (inverse g-FT), graph filter (g-filter), graph convolution (g-convolution), graph-tensor product (g-product), graph-tensor SVD (g-SVD), and graph-tensor QR (g-QR). cuGraph-Tensor supports scalar, vector, and matrix data on each graph node. We propose optimization techniques for computing, memory access, and CPU–GPU communication that significantly improve the performance of the graph-tensor learning operations. Using the optimized operations, cuGraph-Tensor builds a graph data completion application for fast and accurate reconstruction of incomplete graph data. In our experiments, the proposed graph learning operations achieve up to 142.12× speedups over the CPU-based GSPBOX and CPU MATLAB implementations running on two Xeon CPUs. The graph data completion application achieves up to 174.38× speedups over the CPU MATLAB implementation, and up to 3.82× speedups with better accuracy than the GPU-based tensor completion in the cuTensor-tubal library.

Introduction

There is a huge amount of data generated in diverse domains, such as social networks, sensor networks, biomolecular networks, citation and authorship networks, and e-commerce networks [13]. Many aspects of daily life are being recorded at all levels, generating large-scale datasets such as phone trajectories, health data from monitoring devices, banking and financial records, shopping preferences, and so on. These data are irregular and have complex structures [9]. Graphs offer a natural way to model such complex interactions. Thus, researchers have proposed various graphs to represent real-world data, such as scale-free graphs, ring graphs, nearest-neighbor graphs, and random geometric graphs [26]. Entities such as users on Facebook or sensors in a field are modeled as nodes, and the connections among users or sensors as edges. For graphs that have a data matrix residing on each node, these data matrices form a tensor with graph structure, called a graph-tensor, where each frontal slice corresponds to a graph node and the connection between two frontal slices is an edge, as shown in Fig. 1.
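To make the graph-tensor structure above concrete, the following minimal NumPy sketch (ours for illustration, not code from the cuGraph-Tensor library) stores an m x k data matrix for each of n nodes as the frontal slices of a third-order array, with an adjacency matrix recording the edges. All sizes and the ring-graph topology are assumed purely for the example.

    import numpy as np

    # Illustrative sizes: a graph with n nodes and an m x k data matrix on each node.
    n, m, k = 4, 3, 2

    # Adjacency matrix W encodes the graph structure (edges between nodes).
    # A simple ring graph is used here purely for illustration.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0

    # Graph-tensor G: frontal slice G[:, :, j] holds the data matrix of node j.
    # Scalar data on nodes corresponds to m = k = 1, vector data to k = 1.
    G = np.random.rand(m, k, n)

    print(W.shape, G.shape)   # (4, 4) (3, 2, 4)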

Graph-tensor operations are widely used in various applications, such as graph neural networks (graph convolution, graph SVD, and graph QR), computer vision (graph convolution), image processing (graph filter), data completion (graph Fourier transform and inverse graph Fourier transform), and video compression (graph SVD). However, existing works are insufficient to support efficient large-scale graph-tensor computations. First, the running time of CPU-based graph-tensor operations increases rapidly with the number of nodes or the dimension of the data on each node, so existing CPU-based graph-tensor tools cannot meet the real-time requirements of applications such as localization and online recommendation, while graph computing on high-performance GPUs is having a growing impact [24]. Second, to the best of our knowledge, there are no high-performance graph-tensor computation libraries that support the key graph-tensor operations. Researchers have to implement and optimize their own graph-tensor operations in a case-by-case manner, which is inefficient and error-prone. For instance, GSPBOX [27] is a popular CPU-based graph processing toolbox. However, the graph operations in GSPBOX have long running times on big graphs, as revealed in our preliminary experiments (Fig. 18 shows that g-FT, inverse g-FT, and g-filter on a graph vector of length 20,000 take 3022.36 s, 3006.91 s, and 20,443.11 s, respectively). Moreover, GSPBOX supports only scalar data on nodes and lacks several key graph operations, such as graph convolution, graph shift, g-product, g-SVD, and g-QR, which have wide applications in network analysis and image processing. Although the cuTensor-tubal library [36], [37], [38] provides high-performance tensor computations that are closely related to our work, it does not support the structural properties of graphs and is unsuitable for graph analysis applications. Therefore, we are motivated to develop a library of high-performance graph-tensor operations to support diverse applications.

In this paper, we implement high-performance graph-tensor operations on the GPU for big data and IoT applications. We design, implement, and optimize a library called cuGraph-Tensor for fast and accurate graph-tensor operations. cuGraph-Tensor implements eight key graph-tensor operations and supports three data types on graph nodes: scalar, vector, and matrix. Graphs with these data types can model many real-world applications such as sensor networks, social networks, IoT wireless camera networks, and so on. A graph data completion application is presented in Section 4.3 to demonstrate the usage of the proposed graph-tensor operations. We further propose optimization techniques to improve computation and memory access efficiency and to reduce CPU–GPU communications, which contribute significant performance improvements. We build the cuGraph-Tensor library on top of highly optimized NVIDIA CUDA [7] libraries, including cuBLAS and cuSolver, as well as existing libraries such as Magma [1] and KBLAS [4], for efficient GPU computations, as shown in Fig. 2.

Our contributions are summarized as follows.

  • We develop a high-performance GPU library called cuGraph-Tensor comprising eight graph-tensor operations: graph shift (g-shift), graph Fourier transform (g-FT), inverse graph Fourier transform (inverse g-FT), graph filter (g-filter), graph convolution (g-convolution), graph-tensor product (g-product), graph-tensor SVD (g-SVD), and graph-tensor QR (g-QR). We encapsulate these operations into an open-source library and provide BLAS-like interfaces for ease of use.

  • We propose optimization techniques on batched computing, computation reduction, memory accesses, and CPU–GPU communications to improve the performance. As a demonstration, we further develop a graph data completion application using the high-performance graph-tensor operations of the cuGraph-Tensor library.

  • We perform extensive experiments to evaluate the performance of the graph-tensor operations and the graph data completion application. The g-shift, g-FT, inverse g-FT, g-filter, g-convolution, g-product, g-SVD, and g-QR operations achieve up to 133.98×, 96.09×, 90.14×, 12.64×, 130.61×, 141.71×, 51.18×, and 142.12× speedups, respectively, compared with the CPU-based GSPBOX [27] and CPU MATLAB implementations running on two Xeon CPUs. With the optimizations in Section 3.2, these graph-tensor operations are on average 35.27×, 18.54×, 18.12×, 4.47×, 38.16×, 27.60×, 7.91×, and 23.83× faster than the GPU baseline implementation. The graph data completion application achieves up to a 174.38× speedup over the CPU MATLAB implementation, and up to a 3.82× speedup with higher accuracy than the GPU-based tensor completion in the cuTensor-tubal library [36], [37], [38].

The remainder of this paper is organized as follows. In Section 2, we describe the notations and eight graph-tensor operations. Section 3 shows the implementation and optimizations of the graph-tensor operations on GPU. In Section 4, we evaluate the performance of the graph-tensor operations and the graph data completion application. Section 5 discusses the related works. The conclusions are drawn in Section 6.

Section snippets

Graph-tensor operations on GPU

The graph-tensor operations in the cuGraph-Tensor library combine the graph signal processing operations in [29], [30] with the tubal-rank tensor operations in [17], [36], [37], [38]. We introduce the key graph-tensor operations and then briefly analyze the GPU parallelism available in these operations.
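The formal definitions of these operations appear in Section 2 of the full paper and are not reproduced in this snippet. As a hedged illustration of the semantics being combined, the NumPy sketch below applies one standard graph Fourier transform convention from graph signal processing [29], [30], namely projection onto the eigenvectors of the combinatorial graph Laplacian, along the node dimension of a graph-tensor, entrywise over each node's data matrix. The library's exact g-FT definition and its GPU implementation may differ; this CPU sketch only conveys the kind of transform being accelerated.

    import numpy as np

    def graph_fourier_basis(W):
        # Eigenvectors of the combinatorial graph Laplacian L = D - W.
        # This is one common GFT convention in graph signal processing; the
        # paper's exact definition may differ (e.g., adjacency-based variants).
        L = np.diag(W.sum(axis=1)) - W
        _, U = np.linalg.eigh(L)          # L is symmetric for undirected graphs
        return U

    def g_ft(G, U):
        # Apply the transform along the node (third) dimension of an m x k x n
        # graph-tensor: every length-n fiber G[a, b, :] is projected onto U.
        return np.einsum('ij,abj->abi', U.T, G)

    def inverse_g_ft(G_hat, U):
        return np.einsum('ij,abj->abi', U, G_hat)

    # Round-trip check on a toy ring graph with 4 nodes and 3 x 2 node data.
    n, m, k = 4, 3, 2
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
    G = np.random.rand(m, k, n)
    U = graph_fourier_basis(W)
    assert np.allclose(inverse_g_ft(g_ft(G, U), U), G)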

The cuGraph-Tensor library on GPU

We design the cuGraph-Tensor library on top of CUDA libraries, including cuBLAS [7] and cuSolver [7], as well as existing libraries such as Magma [1] and KBLAS [4], as shown in Fig. 2. We present the workflow and layered design of the cuGraph-Tensor library, and then propose optimization strategies and efficient graph-tensor learning operation implementations based on these optimizations.
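Details of the layered design and batching optimizations are given in Section 3 of the full paper. To sketch where the GPU parallelism comes from, the code below assumes, by analogy with the t-product used in the cuTensor-tubal line of work [36], [37], [38], that the g-product multiplies corresponding frontal slices in the graph transform domain. The n slice-wise matrix products are mutually independent, which is the pattern a GPU implementation can map to batched GEMM (for example via cuBLAS) after a batched transform; the function below is an illustrative CPU analogue under that assumption, not the library's interface.

    import numpy as np

    def g_product(A, B, U):
        # Hypothetical g-product sketch: transform along the node dimension,
        # multiply matching frontal slices, then transform back.
        # A: m x k x n, B: k x p x n, U: n x n orthogonal graph transform.
        A_hat = np.einsum('ij,abj->abi', U.T, A)
        B_hat = np.einsum('ij,abj->abi', U.T, B)
        n = A.shape[2]
        # The n slice-wise products below are independent; a GPU implementation
        # could compute them with a single batched GEMM call (e.g., cuBLAS).
        C_hat = np.stack([A_hat[:, :, j] @ B_hat[:, :, j] for j in range(n)], axis=2)
        return np.einsum('ij,abj->abi', U, C_hat)

    # With an orthogonal U (e.g., Laplacian eigenvectors of a 4-node graph),
    # g_product of a 3 x 2 x 4 tensor and a 2 x 5 x 4 tensor yields 3 x 5 x 4.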

Performance evaluation

We first evaluate the performance of the cuGraph-Tensor library by measuring the running times and speedups of the eight key graph-tensor operations. In addition, cuGraph-Tensor provides a graph data completion application built on these graph-tensor learning operations for efficient and accurate graph data reconstruction; we then evaluate the performance of this application.
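The completion algorithm itself is described in Section 4.3 of the full paper and is not reproduced in this snippet. Purely as a hedged illustration of the kind of transform-domain pipeline such an application could use, the sketch below alternates between a truncated-SVD approximation of each frontal slice in the graph Fourier domain and re-imposing the observed entries; the rank, iteration count, and overall structure are assumptions made for illustration, not the paper's method.

    import numpy as np

    def complete_graph_tensor(G_obs, mask, U, rank=1, iters=50):
        # Illustrative completion loop (not the paper's algorithm).
        # G_obs: m x k x n with zeros at missing entries; mask: boolean array of
        # the same shape marking observed entries; U: n x n graph Fourier basis.
        X = G_obs.copy()
        for _ in range(iters):
            X_hat = np.einsum('ij,abj->abi', U.T, X)        # g-FT along nodes
            for j in range(X.shape[2]):                      # truncated SVD per slice
                u, s, vt = np.linalg.svd(X_hat[:, :, j], full_matrices=False)
                X_hat[:, :, j] = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
            X = np.einsum('ij,abj->abi', U, X_hat)           # inverse g-FT
            X[mask] = G_obs[mask]                            # keep observed entries
        return X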

Related works

We discuss the related works on graph operation libraries, GPU-based tensor computing and data completion, and graph algorithm libraries.

Conclusion and future work

In this paper, we proposed the cuGraph-Tensor library of eight key graph-tensor operations to support high-performance graph data processing. The cuGraph-Tensor library exploited the separability in the graph-tensor operations and mapped the parallelism onto GPU architectures. We proposed optimization techniques on computing, memory accesses, and CPU–GPU communications for the graph-tensor operations. Based on these efficient graph-tensor operations, we developed a graph data completion application.

CRediT authorship contribution statement

Tao Zhang: Writing - review & editing, Investigation, Supervision. Wang Kan: Software, Writing - original draft. Xiao-Yang Liu: Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the anonymous reviewers for their fruitful feedback and comments, which have helped improve the quality of this work. Tao Zhang is supported by the Science and Technology Committee of Shanghai Municipality, China, under grants No. 19511121002 and No. 19DZ2252600.


References (38)

  • NVIDIA CUDA SDK 10.1, NVIDIA CUDA software download (2019)

  • M. Defferrard et al., PyGSP: Graph signal processing in Python (2017)

  • B. Girault et al., GraSP: A MATLAB toolbox for graph signal processing

  • Intel indoor data trace (2019)

  • S. Jouili et al., Median graph shift: A new clustering algorithm for graph domain

  • H. Li et al., CUSNTF: A scalable sparse non-negative tensor factorization model for large-scale industrial applications on multi-GPU

  • H. Li et al., High-performance tensor decoder on GPUs for wireless camera networks in IoT

  • X.-Y. Liu et al., Low-tubal-rank tensor completion using alternating minimization

  • X.-Y. Liu et al., Fourth-order tensors with multidimensional discrete transforms (2017)
Tao Zhang received the Ph.D. degree in computer engineering from The University of New Mexico, USA, and the Ph.D. degree in computer science from Shanghai Jiao Tong University, China, in 2015, and received his Master's and Bachelor's degrees in computer science and technology from Xidian University, China, in 2006 and 2001, respectively. He is currently a researcher and lecturer with the School of Computer Engineering and Science at Shanghai University, China. His research interests are in the areas of big data processing, GPU heterogeneous computing, and machine learning algorithms and applications.

Wang Kan received the B.Eng. degree in computer science and technology from Huaibei Normal University, China, in 2018. He is currently a second-year graduate student in the School of Computer Engineering and Science, Shanghai University, China. His research interests include GPU parallel computing, graph processing, tensor computing, and algorithm optimization and acceleration.

Xiao-Yang Liu received the B.Eng. degree in computer science from the Huazhong University of Science and Technology, Wuhan, China, in 2010, and the M.S. degree in electrical engineering from Columbia University, New York, in 2018. He is currently working toward the Ph.D. degree with the Department of Electrical Engineering, Columbia University. He graduated from the Ph.D. program in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, in 2017. His research interests include tensor theory and high-performance tensor computation, deep learning, optimization algorithms, big data analysis, and data privacy.
