ALBUS: A method for efficiently processing SpMV using SIMD and Load balancing

https://doi.org/10.1016/j.future.2020.10.036

Highlights

  • We propose ALBUS, a new efficient load-balancing method.

  • We propose an efficient SIMD vectorization scheme built on ALBUS load balancing.

  • We propose a new evaluation model, PCS, that approximately reflects the parallel performance of SpMV.

Abstract

SpMV (sparse matrix–vector multiplication) is widely used in many fields, and improving its performance has long been a goal of researchers. Parallelizing SpMV on multi-core processors is the standard approach. In reality, however, the non-zero elements of many sparse matrices are unevenly distributed, so parallelization without preprocessing suffers substantial performance loss due to load imbalance. In this paper, we propose ALBUS (Absolute Load Balancing Using SIMD (Single Instruction Multiple Data)), a method for efficiently processing SpMV using load balancing and SIMD vectorization. On the one hand, ALBUS achieves balanced load processing across cores; on the other hand, it fully exploits the SIMD vectorization parallelism of the CPU. We selected 20 sets of regular matrices and 20 sets of irregular matrices to form the benchmark suite, and performed SpMV performance comparison tests on ALBUS, CSR5 (Compressed Sparse Row 5), Merge (merge-based SpMV), and MKL (Math Kernel Library) under the same conditions. On the E5-2670 v3 CPU platform, for the 20 sets of regular matrices, ALBUS achieves an average speedup of 1.59x, 1.32x, and 1.48x (up to 2.53x, 2.22x, and 2.31x) over CSR5, Merge, and MKL, respectively. For the 20 sets of irregular matrices, ALBUS achieves an average speedup of 1.38x, 1.42x, and 2.44x (up to 2.33x, 2.24x, and 5.37x) over CSR5, Merge, and MKL, respectively.

Introduction

SpMV (sparse matrix–vector multiplication) is one of the core operations of linear algebra and is widely used in many computing fields. In graph computing, well-known systems such as GridGraph [1], GraphChi [2], FlashGraph [3], X-Stream [4], and Ligra [5] all include system interfaces for computing PageRank, and the core of the PageRank algorithm is SpMV. In deep learning, neural network algorithms such as RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks) involve a large number of SpMV computations, and researchers have optimized the computational performance of SpMV in connection with specific deep learning algorithms [6]. In scientific computing, for example, the global simulation of atmospheric dynamics [7] also contains a large number of SpMV operations. SpMV likewise occupies a prominent position in other related application fields. However, with the advent of the big data era, the matrices involved in SpMV have become larger and more complex. Therefore, more efficient parallel SpMV has become the focus of many researchers.

In recent years, with the emergence of many new types of parallel processors, researchers have more and more choices. When processing SpMV in parallel, researchers first need to consider the choice of platform, because the computing performance obtained on different platforms varies widely. SW26010 is a many-core processor; researchers exploit the slave cores of the Sunway architecture to achieve outstanding performance improvements [8], [9], [10], [11], [12], [13]. As a general-purpose processor, the CPU is the preferred research platform for many academic and commercial research groups because it is widespread and easy to use; some researchers use multi-machine and multi-core processing to achieve efficient parallelism [14], [15], [16], [17], [18]. Compared with the CPU, the GPU has many more computing cores. Although its single-core performance cannot match the CPU, its hundreds of cores give the machine an overall performance that can exceed the CPU, making it well suited to such computations. YaSpMV is a parallel SpMV processing framework on the GPU platform; it improves the cache hit rate and solves the load-imbalance problem by dividing the matrix into slices [19]. For large-scale SpMV, GPUs can obtain higher performance [20], [21], [22], [23], [24], [25]. Besides, many researchers have chosen dedicated accelerators when studying SpMV performance; as a well-known professional accelerator, the FPGA has also proved effective for accelerating SpMV [26], [27], [28].

Secondly, choosing a suitable matrix compression format is also key to improving the computational performance of SpMV. Because most real sparse matrices consist largely of zero elements, using the original matrix for SpMV not only brings extra time overhead, but the rapid increase in memory overhead also makes large-dimensional matrices difficult to handle. As is well known, zero elements do not affect the final result when they participate in the calculation. Storing the sparse matrix efficiently, by greatly reducing the number of stored zero elements, is therefore the key to improving SpMV performance. For example, CSR [29], COO [30], HYB [31], and other traditional compression formats have improved the computing performance of SpMV to varying degrees. In recent years, with higher performance requirements for SpMV, many new matrix compression formats have emerged, such as CSR2 (Compressed Sparse Row 2) [32], CSR5 (Compressed Sparse Row 5) [33], Cvr [34], BCCOO [19], CSX [17], LSRB-CSR [35], BiELL [36], and LIL [37]; they offer good performance improvements over the traditional formats. In the current era of artificial intelligence, some researchers have used machine learning models to train on and analyze the feature points of sparse matrices, and then select the best compression format for each matrix according to the training results [38].
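To make the compression concrete, here is a small CSR illustration in C++ (our own example, not taken from the paper; CSR is introduced formally in Section 2):

    // Illustrative CSR layout for the 4x4 sparse matrix
    //     [ 5 0 0 1 ]
    //     [ 0 2 0 0 ]
    //     [ 0 0 0 0 ]
    //     [ 3 0 4 0 ]
    // Only the 5 non-zeros are stored:
    const double val[]     = {5.0, 1.0, 2.0, 3.0, 4.0}; // non-zero values, row by row
    const int    col_idx[] = {0, 3, 1, 0, 2};           // column index of each non-zero
    const int    row_ptr[] = {0, 2, 3, 3, 5};           // row i occupies [row_ptr[i], row_ptr[i+1])

The 16 dense entries shrink to 5 values plus index metadata, and the all-zero third row costs only one repeated offset in row_ptr.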

However, in reality, most sparse matrices, such as social network graphs, webpage graphs, and road network graphs, follow a power-law distribution: the number of non-zero elements varies enormously from row to row. For this type of matrix, in real multi-threaded execution, the overall computing performance is often degraded by load imbalance. Therefore, load balancing has become another critical point for improving SpMV performance. Merge (merge-based SpMV) [39] uses a novel two-dimensional balanced-segmentation method [40], [41] to split the load based on the CSR compression format; compared with MKL (Math Kernel Library), Merge has better computing performance in most cases. As a new sparse matrix compression format, CSR5 [33] divides the load evenly among threads through matrix blocking, thereby obtaining efficient parallelism. Besides, a method of transforming irregular matrices into many regular matrix subsets to achieve efficient parallel SpMV also has excellent load-balancing capabilities [42].

At the micro-architecture level, SIMD vectorization plays a vital role in improving SpMV performance. CSR5 [33] used the AVX2 and AVX512 instruction sets to optimize SpMV on Xeon CPU and Xeon Phi machines, respectively, achieving high computing performance by processing the sparse elements within each matrix block with single-instruction multiple-data operations. Cvr [34] is a vectorized SpMV optimization for the Xeon Phi using the AVX512 instruction set; through preprocessing, the matrix is transformed into a format suitable for efficient SIMD calculation, and efficient performance is obtained over several SpMV iterations. VHCC [16] is a hybrid format combining the two traditional formats COO and CSR; on Xeon Phi high-performance machines, it uses a two-dimensional matrix-segmentation method combined with SIMD vectorization to exploit the processor's computing performance effectively.
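FMA (fused multiply-add) instructions compute a*b+c in a single step, which maps directly onto SpMV's multiply-and-accumulate pattern. The following is a minimal illustrative sketch, not a kernel from any of the cited works, of accumulating one contiguous run of CSR non-zeros with AVX2 FMA intrinsics; the function name and the scalar-tail handling are our own assumptions (compile with -mavx2 -mfma):

    #include <immintrin.h>

    // Multiply-accumulate over one contiguous run of CSR non-zeros [lo, hi).
    // x is gathered through col_idx; the remainder is finished in scalar code.
    double dot_fma(const double *val, const int *col_idx,
                   const double *x, int lo, int hi) {
        __m256d acc = _mm256_setzero_pd();
        int k = lo;
        for (; k + 4 <= hi; k += 4) {
            __m256d v  = _mm256_loadu_pd(val + k);
            __m256d xv = _mm256_set_pd(x[col_idx[k + 3]], x[col_idx[k + 2]],
                                       x[col_idx[k + 1]], x[col_idx[k]]);
            acc = _mm256_fmadd_pd(v, xv, acc); // acc += v * xv, one FMA per lane
        }
        double buf[4];
        _mm256_storeu_pd(buf, acc);
        double sum = buf[0] + buf[1] + buf[2] + buf[3];
        for (; k < hi; k++)                    // scalar tail
            sum += val[k] * x[col_idx[k]];
        return sum;
    }

This kind of kernel is only fully effective when the runs of non-zeros it is fed are long and contiguous, which is exactly what block boundaries break up.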

Although SpMV with load balancing and SIMD vectorization has achieved excellent acceleration, it still does not maximize the machine's overall performance. First, most researchers achieve load balancing through matrix blocking. Blocking does balance the load among threads, but the frequent boundary-value processing between blocks brings considerable extra overhead; as the load grows, the overhead caused by the increased number of matrix blocks becomes more and more apparent. Second, blocking-based load balancing cannot fully utilize SIMD vectorization. For example, consider several consecutive matrix blocks whose elements all belong to the same row of the original sparse matrix: with block processing, the ability to process the boundary elements contiguously with SIMD is lost, so the advantages of SIMD vectorization cannot be exploited thoroughly and efficiently.

In this paper, we propose ALBUS, a new method that uses load balancing and SIMD vectorization together to optimize SpMV performance. It uses multithreading to divide the non-zero elements evenly based on the CSR compression format, so the number of boundary-value processing operations equals the number of threads; compared with matrix blocking, this dramatically reduces the amount of boundary-value processing. For SIMD, we chose the well-vectorized FMA instructions, since SpMV consists largely of element-wise multiplication and summation; built on the ALBUS load-balancing method, they significantly improve the effect of SIMD vectorization. We selected 40 sets of matrices, 20 regular and 20 irregular, as the benchmark suite for our experiments, and performed SpMV performance comparison tests on ALBUS, CSR5, Merge, and MKL on the mainstream high-performance E5-2670 v3 and E5-2680 v4 CPUs. On the E5-2670 v3 platform, for the 20 sets of regular matrices, ALBUS achieves an average speedup of 1.59x, 1.32x, and 1.48x (up to 2.53x, 2.22x, and 2.31x) over CSR5, Merge, and MKL; for the 20 sets of irregular matrices, an average speedup of 1.38x, 1.42x, and 2.44x (up to 2.33x, 2.24x, and 5.37x). On the E5-2680 v4 platform, for the 20 sets of regular matrices, ALBUS achieves an average speedup of 1.40x, 1.37x, and 1.55x (up to 2.49x, 1.71x, and 2.21x) over CSR5, Merge, and MKL; for the 20 sets of irregular matrices, an average speedup of 1.40x, 1.35x, and 2.63x (up to 1.68x, 1.89x, and 6.85x).
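To make the idea concrete, the following is a minimal sketch, under our own naming and simplifying assumptions rather than the released ALBUS code, of the kind of non-zero-balanced CSR partition described above: each of the T threads receives roughly nnz/T non-zeros, and the row containing each split point is located with one binary search in row_ptr, so each thread has at most one leading and one trailing boundary row to reconcile.

    #include <algorithm>

    // Split the nnz non-zeros of an n-row CSR matrix into nthreads even chunks.
    // start_row/start_nnz must each hold nthreads+1 entries.
    void partition_nnz(int n, const int *row_ptr, int nthreads,
                       int *start_row, int *start_nnz) {
        int nnz = row_ptr[n];
        for (int t = 0; t <= nthreads; t++) {
            int target = (int)((long long)nnz * t / nthreads); // t-th split in nnz space
            // last row whose starting offset is <= target
            const int *p = std::upper_bound(row_ptr, row_ptr + n + 1, target);
            start_row[t] = (int)(p - row_ptr) - 1;
            start_nnz[t] = target;
        }
    }

Thread t then processes non-zeros [start_nnz[t], start_nnz[t+1]), and the partial sums it produces for its two boundary rows are combined with the neighboring threads' partial sums after the parallel region, so cross-thread fixups occur only once per thread rather than once per block.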

In this paper, we make the following three contributions:

(1) ALBUS, a new method for efficient load balancing;

(2) an optimization method that fully exploits the SIMD vectorization capability of the processor;

(3) a new performance evaluation model, PCS (Performance Core Speedup), that approximately reflects the parallel performance of SpMV.

The rest of this paper is arranged as follows. Section 2 is the background: we introduce the standard matrix format, the CSR compression format, and the corresponding SpMV algorithms. Section 3 is related work: we introduce current mainstream work with excellent performance, CSR5 and merge-based SpMV, as well as the widely used high-performance math library MKL and other open-source BLAS libraries. Section 4 presents the SpMV algorithm based on ALBUS: we describe the specific steps of ALBUS load balancing, the SIMD vectorization optimization, and the realization of the SpMV algorithm. Section 5 is the SpMV performance evaluation model: we propose the PCS model and explain its principle in detail. Section 6 is the experimental evaluation: first, we introduce the test platforms; secondly, we analyze the benchmark suite consisting of 20 sets of regular matrices and 20 sets of irregular matrices; next, we compare the performance of ALBUS with CSR5, Merge, and MKL on the E5-2670 v3 and E5-2680 v4 CPUs and give a specific analysis; then, we use the PCS model to demonstrate the superiority of ALBUS; finally, we explain in detail the evaluation metrics used in the experiments: the calculation of the number of iterations, the performance index GFlops, and the speedup. Section 7 concludes.


Sparse matrix–vector multiplication

Given a sparse matrix A and a dense vector x, we can use Eq. (1) to multiply the two and obtain the result vector B:

$B = A \times x$    (1)

For a given n×m sparse matrix A and an m×1 dense vector x, Algorithm 1 uses multithreading to solve SpMV in parallel.
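Algorithm 1 itself is not reproduced in this snippet; the following is a minimal sketch of the multithreaded standard-format SpMV it describes, assuming row-major dense storage and OpenMP (names are illustrative; compile with -fopenmp):

    // Every element of the n x m matrix A, zero or not, takes part in the product.
    void spmv_dense(int n, int m, const double *A, const double *x, double *B) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < m; j++)
                sum += A[(long long)i * m + j] * x[j]; // zero elements still multiply
            B[i] = sum;
        }
    }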

The CSR storage format and SpMV parallel algorithm based on CSR

In Algorithm 1, we can see that storing a sparse matrix in the standard matrix format forces every element of the matrix to participate in the calculation. However, in actual calculation, a large number of zero elements
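For reference, here is a minimal sketch of the row-parallel CSR SpMV discussed in this section, assuming the usual three-array layout (row_ptr with n+1 offsets, col_idx and val with one entry per non-zero); only non-zeros are traversed:

    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *B) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            B[i] = sum;
        }
    }

Note that distributing whole rows statically, as above, is exactly what creates load imbalance on power-law matrices: a thread that happens to own the heavy rows does far more work, which motivates the non-zero-based partition of ALBUS.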

Related work

SpMV optimization model

SpMV performance evaluation model

Load balancing and efficient vectorization are the keys to improving performance. Because the SpMV algorithm itself has good load-balancing characteristics, we propose the novel evaluation method PCS, which reflects the performance of the SpMV model by combining single-core performance with multi-threaded speedup:

$P = C \times S$    (3)

$S = aT + b \quad (T \ge 2,\ a > 0)$

$E(a,b) = \sum_{i=1}^{n} (S_i - aT_i - b)^2$

$a = \dfrac{\sum_{i=1}^{n} S_i (T_i - \bar{T})}{\sum_{i=1}^{n} T_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} T_i\right)^2}$

$b = \dfrac{1}{n} \sum_{i=1}^{n} (S_i - aT_i)$

$\bar{T} = \dfrac{1}{n} \sum_{i=1}^{n} T_i$

As shown in Eq. (3), P represents the overall performance of the
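As a worked illustration of the least-squares fit behind the PCS model (our own sketch, not the authors' code): given measured speedups S[i] at thread counts T[i], the closed-form expressions above recover the slope a and intercept b of S = aT + b.

    void fit_pcs(int n, const double *T, const double *S, double *a, double *b) {
        double sumT = 0.0, sumS = 0.0, sumTT = 0.0, sumST = 0.0;
        for (int i = 0; i < n; i++) {
            sumT  += T[i];
            sumS  += S[i];
            sumTT += T[i] * T[i];
            sumST += S[i] * T[i];
        }
        double Tbar = sumT / n;                                 // T-bar
        *a = (sumST - sumS * Tbar) / (sumTT - sumT * sumT / n); // slope
        *b = (sumS - *a * sumT) / n;                            // intercept
    }

The PCS score is then P = C × S with C the measured single-core performance; for example, the model predicts P ≈ C × (aT + b) at T threads.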

Experiment platform

To thoroughly verify the performance of processing SpMV based on ALBUS, we chose two Intel CPUs of different architectures, fourth-generation Haswell and fifth-generation Broadwell, as our experimental platforms. For the source-code compilation environment, we chose Intel's compiler, because the source code of CSR5 [46] and MKL [47] only supports the Intel compiler, and Merge [47] performs excellently with the Intel compiler but poorly with the GNU compiler. In addition to the Intel

Conclusions

Multithreading and SIMD vectorization technology play an increasingly important role in SpMV calculations. However, in reality, the performance of SpMV is often inextricably linked to the data: through MKL, we can see that the performance fluctuations caused by different matrices are enormous. Our research found that a balanced load often brings higher computing performance, which is also the inspiration that CSR5 brought us. After overcoming the boundary value

CRediT authorship contribution statement

Haodong Bian: Writing - review & editing, Writing - original draft, Methodology, Software, Conceptualization. Jianqiang Huang: Writing - review & editing, Supervision, Project administration, Funding acquisition. Lingbin Liu: Visualization, Validation, Data curation. Dongqiang Huang: Investigation, Validation, Data curation. Xiaoying Wang: Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors are grateful to the reviewers for valuable comments that have greatly improved the paper. This paper is partially supported by the National Natural Science Foundation of China (No. 62062059, No. 61762074, No. 61962051), National Natural Science Foundation of Qinghai Province, China (No. 2019-ZJ-7034). “Qinghai Province High-end Innovative Thousand Talents Program - Leading Talents” Project Support. The Open Project of State Key Laboratory of Plateau Ecology and Agriculture, Qinghai


References (48)

  • C. Liu, et al., Towards efficient SpMV on Sunway manycore architectures, in: Proceedings of the 32nd International...
  • G. Xiao, CASpMV: A customized and accelerative SpMV framework for the Sunway TaihuLight, IEEE Trans. Parallel Distrib. Syst. (2021)
  • Q. Sun, C. Zhang, C. Wu, J. Zhang, L. Li, Bandwidth reduced parallel SpMV on the SW26010 many-core platform, in:...
  • G. Xiao, et al., AhSpMV: An autotuning hybrid computing scheme for SpMV on the Sunway architecture, IEEE Internet Things J. (2020)
  • J. Huang, et al., Heterogeneous parallel algorithm design and performance optimization for WENO on the Sunway TaihuLight supercomputer, Tsinghua Sci. Technol. (2020)
  • F. Ye, et al., A study of SpMV implementation using MPI and OpenMP on Intel many-core architecture
  • W.T. Tang, et al., Optimizing and auto-tuning scale-free sparse matrix–vector multiplication on Intel Xeon Phi, in:...
  • K. Kourtis, et al., CSX: An extended compression format for SpMV on shared memory systems, in: Proceedings of the 16th...
  • M.O. Karsavuran, et al., Reduce operations: Send volume balancing while minimizing latency, IEEE Trans. Parallel Distrib. Syst. (2020)
  • S. Yan, C. Li, Y. Zhang, H. Zhou, yaSpMV: Yet another SpMV framework on GPUs, in: ACM SIGPLAN Symposium on Principles...
  • H. Jeljeli, Accelerating iterative SpMV for the discrete logarithm problem using GPUs, in: Arithmetic of Finite Fields...
  • K. Ahmad, et al., Data-driven mixed precision sparse matrix vector multiplication for GPUs, ACM Trans. Archit. Code Optim. (2020)
  • H. Anzt, S. Tomov, J. Dongarra, Energy efficiency and performance frontiers for sparse computations on GPU...
  • M. Steinberger, R. Zayer, H. Seidel, Globally homogeneous, locally adaptive sparse matrix–vector multiplication on the...

Haodong Bian is a Master's student in the Department of Computer Technology and Applications, Qinghai University, China. His research interests include high-performance computing and graph computing systems.

Jianqiang Huang is an associate professor at Qinghai University, China. He is currently a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University. His research interests include high-performance computing and graph computing systems.

Lingbin Liu is an undergraduate student in the Department of Computer Technology and Applications, Qinghai University, China. His research interests include graph computing systems.

Dongqiang Huang is a graduate student at Qinghai University, China. His research direction is large-scale high-performance computing.

Xiaoying Wang is a Professor in the Department of Computer Technology and Applications, Qinghai University, China. She received her Ph.D. from Tsinghua University in 2008. Her research interests include cloud computing and parallel computing.
