High performance GPU primitives for graph-tensor learning operations
Introduction
A huge amount of data is generated in diverse domains, such as social networks, sensor networks, biomolecular networks, citation and authorship networks, and e-commerce networks [13]. Many aspects of daily life are being recorded at all levels, generating large-scale datasets such as phone trajectories, health data from monitoring devices, banking and financial records, shopping preferences, and so on. These data are irregular and have complex structures [9]. Graphs offer a natural way to model such complex interactions. Thus, researchers have proposed various graphs to represent real-world data, such as scale-free graphs, ring graphs, nearest-neighbor graphs, and random geometric graphs [26]. Entities such as users on Facebook or sensors in a field are modeled as nodes, and the connections among users or sensors as edges. For graphs in which a data matrix resides on each node, these data matrices form a tensor with graph structure, called a graph-tensor, where each frontal slice corresponds to a graph node and the connection between two frontal slices is an edge, as shown in Fig. 1.
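To make the data layout concrete, the following is a minimal numpy sketch (not the cuGraph-Tensor API) of a graph-tensor: a graph of n nodes, each carrying an m x k data matrix, stored here with the node index first for simplicity, together with a separate adjacency matrix encoding the edges.

```python
import numpy as np

# Illustrative sketch, assuming the graph-tensor layout described in the text:
# frontal slice G[i] holds the m x k data matrix of node i.
n, m, k = 4, 3, 2                      # 4 nodes, a 3x2 data matrix per node
rng = np.random.default_rng(0)
G = rng.standard_normal((n, m, k))     # the graph-tensor

# The graph structure lives in an n x n adjacency matrix; an edge (i, j)
# connects frontal slices i and j.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

assert G[0].shape == (m, k)            # one node's data matrix
assert np.allclose(A, A.T)             # undirected graph
```

When the per-node data is a scalar or a vector, the same layout degenerates to a graph signal or a graph-matrix, which matches the three node data types (scalar, vector, matrix) supported by the library.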
Graph-tensor operations are widely used in various applications, such as graph neural networks (graph convolution, graph SVD, and graph QR), computer vision (graph convolution), image processing (graph filter), data completion (graph Fourier transform and inverse graph Fourier transform), and video compression (graph SVD). However, existing works are insufficient to support efficient large-scale graph-tensor computations. First, the running time of CPU-based graph-tensor operations increases rapidly with the number of nodes or the dimension of the data on nodes; thus, existing CPU-based graph-tensor tools cannot meet the real-time requirements of applications such as localization and online recommendation, while graph computing on high-performance GPUs is having a growing impact [24]. Second, as far as we know, there are no high-performance graph-tensor computation libraries that support key graph-tensor operations. Researchers have to implement and optimize their own graph-tensor operations case by case, which is inefficient and error-prone. For instance, GSPBOX [27] is a popular CPU-based graph processing toolbox. However, the graph operations in GSPBOX have long running times on big graphs, as revealed in our preliminary experiments (Fig. 18 shows that g-FT, inverse g-FT, and g-filter on a graph vector of length 20,000 take 3022.36 s, 3006.91 s, and 20,443.11 s, respectively). Moreover, GSPBOX supports only scalar data on nodes and lacks several key graph operations, such as graph convolution, graph shift, g-product, g-SVD, and g-QR, which have wide applications in network analysis and image processing. Although the cuTensor-tubal library [36], [37], [38] provides high-performance tensor computations that are closely related to our work, it does not support the structural properties of graphs and is unsuitable for graph-analysis applications.
Therefore, we are motivated to develop a library of high-performance graph-tensor operations to support diverse applications.
In this paper, we implement high-performance graph-tensor operations for big data and IoT applications on GPU. We design, implement, and optimize a library called cuGraph-Tensor for fast and accurate graph-tensor operations. cuGraph-Tensor implements eight key graph-tensor operations and provides three data types on graph nodes: scalar, vector, and matrix. Graphs with these data types can model many real-world applications, such as sensor networks, social networks, IoT wireless camera networks, and so on. A graph data completion application is presented in Section 4.3 to demonstrate the usage of the proposed graph-tensor operations. We further propose optimization techniques to improve the computation and memory access efficiency and reduce CPU–GPU communications, which contribute significant performance improvements. We build the cuGraph-Tensor library on top of highly optimized NVIDIA CUDA [7] libraries, including cuBLAS and cuSolver, as well as existing libraries, including Magma [1] and KBLAS [4], for efficient GPU computations, as shown in Fig. 2.
Our contributions are summarized as follows.
- We develop a high-performance GPU library called cuGraph-Tensor of eight graph-tensor operations, including graph shift (g-shift), graph Fourier transform (g-FT), inverse graph Fourier transform (inverse g-FT), graph filter (g-filter), graph convolution (g-convolution), graph-tensor product (g-product), graph-tensor SVD (g-SVD), and graph-tensor QR (g-QR). We encapsulate these operations into an open-source library and provide BLAS-like interfaces for ease of use.
- We propose optimization techniques on batched computing, computation reduction, memory accesses, and CPU–GPU communications to improve the performance. As a demonstration, we further develop a graph data completion application using the high-performance graph-tensor operations of the cuGraph-Tensor library.
- We perform extensive experiments to evaluate the performance of the graph-tensor operations and the graph data completion application. The g-shift, g-FT, inverse g-FT, g-filter, g-convolution, g-product, g-SVD, and g-QR operations achieve substantial speedups compared with CPU-based GSPBOX [27] and CPU MATLAB implementations running on two Xeon CPUs. With the optimizations in Section 3.2, these graph-tensor operations are on average significantly faster than the GPU baseline implementation. The graph data completion application achieves speedups over the CPU MATLAB implementation, and outperforms, with higher accuracy, the GPU-based tensor completion in the cuTensor-tubal library [36], [37], [38].
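To illustrate the operation at the heart of the listed g-FT and inverse g-FT, the following is a hedged numpy sketch, assuming the standard graph signal processing definition: project the graph-tensor along the node dimension onto the eigenbasis of the combinatorial Laplacian L = D - A. The function names `gft`/`igft` are illustrative, not the library's interface.

```python
import numpy as np

def gft(G, A):
    """Graph Fourier transform of a graph-tensor (illustrative sketch).
       G: (n, m, k) graph-tensor, slice G[i] = node i's data matrix.
       A: (n, n) symmetric adjacency matrix."""
    L = np.diag(A.sum(axis=1)) - A          # combinatorial Laplacian
    _, U = np.linalg.eigh(L)                # orthonormal eigenbasis of L
    # Project along the node axis: G_hat[l] = sum_i U[i, l] * G[i]
    return np.einsum('il,imk->lmk', U, G), U

def igft(G_hat, U):
    """Inverse g-FT: back-project from the graph spectral domain."""
    return np.einsum('il,lmk->imk', U, G_hat)

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)      # triangle graph
G = rng.standard_normal((3, 2, 2))
G_hat, U = gft(G, A)
assert np.allclose(igft(G_hat, U), G)       # round trip recovers G
```

Note that the per-(m, k) entry projections are independent of one another, which is the separability the paper exploits for batched GPU execution.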
The remainder of this paper is organized as follows. In Section 2, we describe the notations and eight graph-tensor operations. Section 3 shows the implementation and optimizations of the graph-tensor operations on GPU. In Section 4, we evaluate the performance of the graph-tensor operations and the graph data completion application. Section 5 discusses the related works. The conclusions are drawn in Section 6.
Graph-tensor operations on GPU
The graph-tensor operations in the cuGraph-Tensor library combine the graph signal processing operations in [29], [30] with the tubal-rank tensor operations in [17], [36], [37], [38]. We introduce key graph-tensor operations and then briefly analyze the GPU parallelisms of these operations.
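As one concrete example of the combination described above, the graph shift from graph signal processing extends naturally to graph-tensors: each node's data matrix is replaced by the adjacency-weighted sum of its neighbors' matrices. The sketch below is an assumed reading of the operation's semantics, not the library's implementation; each output slice is an independent combination, which is what makes the operation amenable to batched GPU execution.

```python
import numpy as np

def g_shift(G, A):
    """Graph shift of a graph-tensor (illustrative sketch).
       G: (n, m, k) graph-tensor; A: (n, n) adjacency matrix.
       Output slice i is sum_j A[i, j] * G[j]."""
    return np.einsum('ij,jmk->imk', A, G)

# Two nodes joined by a single edge: the shift swaps their data matrices.
A = np.array([[0., 1.],
              [1., 0.]])
G = np.stack([np.ones((2, 2)), 2 * np.ones((2, 2))])
S = g_shift(G, A)
assert np.allclose(S[0], 2 * np.ones((2, 2)))   # node 0 receives node 1's data
assert np.allclose(S[1], np.ones((2, 2)))       # and vice versa
```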
The cuGraph-Tensor library on GPU
We design the cuGraph-Tensor library on top of CUDA libraries, including cuBLAS [7] and cuSolver [7], and existing libraries, including Magma [1] and KBLAS [4], as shown in Fig. 2. We present the workflow and layered design of the cuGraph-Tensor library. Then, we propose optimization strategies and efficient graph-tensor learning operation implementations based on these optimizations.
Performance evaluation
We first evaluate the performance of the cuGraph-Tensor library by measuring the running time and speedups of the eight key graph-tensor operations. In addition, we develop a data completion application using these graph-tensor learning operations for efficient and accurate graph data reconstruction, and we then evaluate the performance of this graph data completion application.
Related works
We discuss the related works on graph operation libraries, GPU-based tensor computing and data completion, and graph algorithm libraries.
Conclusion and future work
In this paper, we proposed the cuGraph-Tensor library of eight key graph-tensor operations to support high-performance graph data processing. The cuGraph-Tensor library exploited the separability in the graph-tensor operations and mapped the parallelism onto GPU architectures. We proposed optimization techniques on computing, memory accesses, and CPU–GPU communications for graph-tensor operations. Based on these efficient graph-tensor operations, cuGraph-Tensor developed a graph data completion application.
CRediT authorship contribution statement
Tao Zhang: Writing - review & editing, Investigation, Supervision. Wang Kan: Software, Writing - original draft. Xiao-Yang Liu: Methodology.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank the anonymous reviewers for their fruitful feedback and comments, which have helped improve the quality of this work. Tao Zhang is supported by the Science and Technology Committee of Shanghai Municipality, China, under grants No. 19511121002 and No. 19DZ2252600.
References (38)
- et al., Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression, Parallel Comput. (2018)
- et al., Approaches to parallel graph-based knowledge discovery, J. Parallel Distrib. Comput. (2001)
- et al., Communication-free massively distributed graph generation, J. Parallel Distrib. Comput. (2019)
- et al., Trends in big data analytics, J. Parallel Distrib. Comput. (2014)
- et al., LS-decomposition for robust recovery of sensory big data, IEEE Trans. Big Data (2017)
- et al., Exploring big graph computing—An empirical study from architectural perspective, J. Parallel Distrib. Comput. (2017)
- et al., Performance, design, and autotuning of batched GEMM for GPUs
- et al., Graph spectral domain feature learning with application to in-air hand-drawn number and shape recognition, IEEE Access (2019)
- et al., Nonnegative tensor factorization accelerated using GPGPU, IEEE Trans. Parallel Distrib. Syst. (2011)
- NvGraph library users' guide, version 10.2 (2019)
- NVIDIA CUDA SDK 10.1, NVIDIA CUDA software download
- PyGSP: Graph signal processing in Python
- GraSP: A MATLAB toolbox for graph signal processing
- Intel indoor data trace
- Median graph shift: A new clustering algorithm for graph domain
- CUSntf: A scalable sparse non-negative tensor factorization model for large-scale industrial applications on multi-GPU
- High-performance tensor decoder on GPUs for wireless camera networks in IoT
- Low-tubal-rank tensor completion using alternating minimization
- Fourth-order tensors with multidimensional discrete transforms
Tao Zhang received the Ph.D. degree in computer engineering from The University of New Mexico, USA, and the Ph.D. degree in computer science from Shanghai Jiao Tong University, China, in 2015, and his Master's and Bachelor's degrees in computer science and technology from Xidian University, China, in 2006 and 2001, respectively. He is currently a researcher and lecturer with the School of Computer Engineering and Science at Shanghai University, China. His research interests are in the areas of big data processing, GPU heterogeneous computing, and machine learning algorithms and applications.
Wang Kan received the B.Eng. degree in computer science and technology from Huaibei Normal University, China, in 2018. He is currently a second-year graduate student at the School of Computer Engineering and Science, Shanghai University, China. His research interests include GPU parallel computing, graph processing, tensor computing, and the optimization and acceleration of algorithms.
Xiao-Yang Liu received the B.Eng. degree in computer science from the Huazhong University of Science and Technology, Wuhan, China, in 2010, and the M.S. degree in electrical engineering from Columbia University, New York, in 2018. He is currently working toward the Ph.D. degree with the Department of Electrical Engineering, Columbia University. He graduated from the Ph.D. program of the Department of Computer Science and Engineering, Shanghai Jiao Tong University, in 2017. His research interests include tensor theory and high-performance tensor computation, deep learning, optimization algorithms, big data analysis, and data privacy.