ALBUS: A method for efficiently processing SpMV using SIMD and load balancing
Introduction
SpMV (sparse matrix–vector multiplication) is one of the core kernels of linear algebra and is widely used across many computing fields. In graph computing, well-known systems such as GridGraph [1], GraphChi [2], FlashGraph [3], X-Stream [4], and Ligra [5] all provide interfaces to compute PageRank, and the core of the PageRank algorithm is SpMV. In deep learning, large numbers of SpMV operations appear in neural network algorithms such as RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks), and researchers have optimized SpMV performance in the context of specific deep learning algorithms [6]. In scientific computing, applications such as the global simulation of atmospheric dynamics [7] also contain a large number of SpMV operations, and SpMV occupies a prominent position in many other application fields as well. However, with the advent of the big data era, the matrices involved in SpMV have become larger and more complex, so more efficient parallel SpMV has become a research focus for many researchers.
In recent years, with the emergence of many new types of parallel processors, researchers have more and more platform choices. When processing SpMV in parallel, the first consideration is therefore the choice of platform, because the computing performance obtained on different platforms differs greatly. SW26010 is a many-core processor; researchers exploit the slave cores of the Sunway architecture to achieve outstanding performance improvements [8], [9], [10], [11], [12], [13]. As a general-purpose processor, the CPU is the preferred research platform for many academic and commercial research groups because it is widespread and easy to use, and some researchers achieve efficient parallelism through multi-machine and multi-core processing [14], [15], [16], [17], [18]. Compared with the CPU, the GPU has far more computing cores; although its single-core performance cannot match the CPU's, its hundreds of cores give it enough aggregate performance to exceed the CPU, making it well suited to this computation. YaSpMV is a parallel SpMV framework on the GPU platform that improves the cache hit rate and resolves load imbalance by dividing the matrix into slices [19]. For large-scale SpMV, GPUs can deliver higher performance [20], [21], [22], [23], [24], [25]. In addition, many researchers have chosen dedicated accelerators when studying SpMV performance; as a well-known accelerator, the FPGA has also been used effectively to accelerate SpMV [26], [27], [28].
Secondly, choosing a suitable matrix compression format is also key to improving SpMV performance. Because most real sparse matrices consist largely of zero elements, computing SpMV on the uncompressed matrix not only brings extra time overhead, but the rapid growth in memory overhead also makes large-dimensional matrices difficult to handle. Since zero elements do not affect the final result, storing the sparse matrix efficiently by eliminating most of its zero elements is essential to improving SpMV performance. Traditional compression formats such as CSR [29], COO [30], and HYB [31] have improved SpMV performance to varying degrees. In recent years, as performance requirements have risen, many new compression formats tailored to SpMV have emerged, such as CSR2 (Compressed Sparse Row 2) [32], CSR5 (Compressed Sparse Row 5) [33], Cvr [34], BCCOO [19], CSX [17], LSRB-CSR [35], BiELL [36], and LIL [37], which offer substantial improvements over the traditional formats. In the current era of artificial intelligence, some researchers have used machine learning models to train on and analyze the features of sparse matrices, and then select the best compression format for each matrix according to the training results [38].
However, in reality, most sparse matrices, such as social network graphs, webpage graphs, and road network graphs, follow a power-law distribution: the number of non-zero elements varies enormously from row to row. For this type of matrix, multi-threaded execution often suffers reduced overall performance due to load imbalance, so load balancing has become another critical factor in SpMV performance. Merge (merge-based SpMV) [39] uses a novel two-dimensional balanced-partitioning method [40], [41] to split the load on top of the CSR format; compared with MKL (Math Kernel Library), Merge delivers better computing performance in most cases. As a new sparse matrix compression format, CSR5 [33] divides the load evenly across threads through matrix blocking, thereby obtaining efficient parallelism. In addition, a method that transforms irregular matrices into many regular matrix subsets also achieves excellent load balancing for parallel SpMV [42].
At the micro-architecture level, SIMD vectorization plays a vital role in SpMV performance. CSR5 [33] uses the AVX2 and AVX-512 instruction sets to optimize SpMV on Xeon CPUs and Xeon Phi machines, respectively, achieving high performance by processing the divided matrix blocks with single-instruction, multiple-data operations. Cvr [34] is a vectorization-oriented SpMV method for Xeon Phi machines using AVX-512; a preprocessing step transforms the matrix into a format suited to efficient SIMD computation, and the cost is amortized over several SpMV iterations. VHCC [16] is a hybrid format combining the two traditional formats COO and CSR; on Xeon Phi machines it uses two-dimensional matrix partitioning combined with SIMD vectorization to exploit the processor's computing power effectively.
Although SpMV with load balancing and SIMD vectorization achieves excellent acceleration, it still does not maximize the machine's overall performance. First, most researchers achieve load balancing through matrix blocking. Blocking does balance the load among threads, but the frequent boundary-value processing between blocks brings considerable extra overhead, and as the load grows, the overhead of an increasing number of blocks becomes more and more apparent. Second, block-based load balancing cannot fully exploit SIMD vectorization. For example, several consecutive blocks may contain elements that all belong to the same row of the original matrix; with block processing, the opportunity to process these boundary elements in one continuous SIMD stream is lost, so the advantages of SIMD vectorization cannot be used thoroughly and efficiently.
In this paper, we propose ALBUS, a new method that optimizes SpMV performance by combining load balancing and SIMD vectorization. It uses multithreading to divide the non-zero elements evenly on top of the CSR format, so the number of boundary-value processing steps equals the number of threads; compared with matrix blocking, this dramatically reduces boundary-value processing. For SIMD, we chose the FMA instruction set, which vectorizes well: SpMV consists largely of element-wise multiplication and summation, and the ALBUS load-balancing scheme significantly amplifies the benefit of SIMD vectorization. We selected 40 matrices, 20 regular and 20 irregular, as the benchmark suite for our experiments, and compared the SpMV performance of ALBUS, CSR5, Merge, and MKL on two mainstream high-performance machines, the E5-2670 v3 and E5-2680 v4 CPUs. On the E5-2670 v3, for the 20 regular matrices ALBUS achieves average speedups of 1.59x, 1.32x, and 1.48x (up to 2.53x, 2.22x, and 2.31x) over CSR5, Merge, and MKL, respectively; for the 20 irregular matrices, average speedups of 1.38x, 1.42x, and 2.44x (up to 2.33x, 2.24x, and 5.37x). On the E5-2680 v4, for the regular matrices ALBUS achieves average speedups of 1.40x, 1.37x, and 1.55x (up to 2.49x, 1.71x, and 2.21x); for the irregular matrices, average speedups of 1.40x, 1.35x, and 2.63x (up to 1.68x, 1.89x, and 6.85x).
This paper makes the following three contributions:
(1) A new method ALBUS for efficient load balancing;
(2) An optimization method that fully exploits the machine's SIMD vectorization capability;
(3) A new performance evaluation model, PCS (Performance Core Speedup), which approximately reflects the parallel behavior of SpMV.
The rest of this paper is organized as follows. Section 2 gives the background: we introduce the standard matrix format, the CSR compression format, and the corresponding SpMV algorithms. Section 3 covers related work: we introduce the current mainstream high-performance approaches, CSR5 and merge-based SpMV, as well as the widely used high-performance math library MKL and other open-source BLAS libraries. Section 4 presents the SpMV algorithm based on ALBUS: the specific steps of ALBUS load balancing, the SIMD vectorization optimization, and the implementation of the SpMV algorithm. Section 5 presents the SpMV performance evaluation model: we propose the PCS model and explain its principle in detail. Section 6 is the experimental evaluation: first we introduce the test platforms; secondly, we analyze the benchmark suite of 20 regular and 20 irregular matrices; next, we compare the performance of ALBUS with CSR5, Merge, and MKL on the E5-2670 v3 and E5-2680 v4 CPUs and give a detailed analysis; then, we use the PCS model to demonstrate the superiority of ALBUS; finally, we explain the evaluation metrics used in the experiments: the calculation of the iteration count, the performance index GFlops, and the speedup. Section 7 concludes.
Section snippets
Sparse matrix–vector multiplication
Given a sparse matrix A and a dense vector x, we can use Eq. (1), y = Ax, to multiply the two and obtain the result vector y. For a given m × n sparse matrix A and a dense vector x of length n, Algorithm 1 shows how multithreading can be used to solve SpMV in parallel.
The CSR storage format and SpMV parallel algorithm based on CSR
In Algorithm 1, we can see that storing the sparse matrix in the standard (dense) matrix format makes all elements of the matrix participate in the calculation. However, in actual computation, a large number of zero elements
Related work
SpMV optimization model
SpMV performance evaluation model
Load balancing and efficient vectorization are key to improving performance. Because the SpMV algorithm itself exhibits good load-balancing characteristics, we propose the novel evaluation method PCS, which reflects the performance of an SpMV implementation by combining single-core performance with multi-threaded speedup.
As shown in Eq. (3), P represents the overall performance of the
Experiment platform
To thoroughly verify the performance of processing SpMV based on ALBUS, we chose two Intel CPUs of different architectures, fourth-generation Haswell and fifth-generation Broadwell, as our experimental platforms. For the compilation environment, we chose Intel's compiler, because the source code of CSR5 [46] and MKL [47] supports only the Intel compiler, and Merge [47] performs excellently under the Intel compiler but poorly under the GNU compiler. In addition to the Intel
Conclusions
Multithreading and SIMD vectorization are playing an increasingly important role in SpMV computation. In reality, however, the performance of SpMV is inextricably linked to the data: through MKL, we can see that the performance fluctuations caused by different matrices are enormous. Our research found that a balanced load often brings higher computing performance, which is also the inspiration that CSR5 gave us. After overcoming the boundary value
CRediT authorship contribution statement
Haodong Bian: Writing - review & editing, Writing - original draft, Methodology, Software, Conceptualization. Jianqiang Huang: Writing - review & editing, Supervision, Project administration, Funding acquisition. Lingbin Liu: Visualization, Validation, Data curation. Dongqiang Huang: Investigation, Validation, Data curation. Xiaoying Wang: Writing - review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors are grateful to the reviewers for valuable comments that have greatly improved the paper. This paper is partially supported by the National Natural Science Foundation of China (No. 62062059, No. 61762074, No. 61962051), National Natural Science Foundation of Qinghai Province, China (No. 2019-ZJ-7034). “Qinghai Province High-end Innovative Thousand Talents Program - Leading Talents” Project Support. The Open Project of State Key Laboratory of Plateau Ecology and Agriculture, Qinghai
References (48)
- et al., TpSpMV: A two-phase large-scale sparse matrix–vector multiplication kernel for manycore architectures, Inform. Sci. (2020)
- et al., Parallel symmetric sparse matrix–vector product on scalar multi-core CPUs, Parallel Comput. (2010)
- et al., BiELL: A bisection ELLPACK-based storage format for optimizing SpMV on GPUs, J. Parallel Distrib. Comput. (2014)
- X. Zhu, W. Han, W. Chen, GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical...
- A. Kyrola, G. Blelloch, C. Guestrin, GraphChi: Large-scale graph computation on just a PC, in: 10th USENIX Symposium on...
- D. Zheng, et al., FlashGraph: Processing billion-node graphs on an array of commodity SSDs, in: 13th USENIX Conference...
- A. Roy, I. Mihailovic, W. Zwaenepoel, X-Stream: Edge-centric graph processing using streaming partitions, in: ACM...
- J. Shun, G.E. Blelloch, Ligra: A lightweight graph processing framework for shared memory, in: ACM SIGPLAN Symposium on...
- Y. Zhao, et al., Bridging the gap between deep learning and sparse matrix format selection, in: Proceedings of the 23rd...
- W. Xue, et al., Enabling and scaling a global shallow-water atmospheric model on Tianhe-2, in: IEEE 28th International...
- CASpMV: A customized and accelerative SpMV framework for the Sunway TaihuLight, IEEE Trans. Parallel Distrib. Syst.
- AhSpMV: An autotuning hybrid computing scheme for SpMV on the Sunway architecture, IEEE Internet Things J.
- Heterogeneous parallel algorithm design and performance optimization for WENO on the Sunway TaihuLight supercomputer, Tsinghua Sci. Technol.
- A study of SpMV implementation using MPI and OpenMP on Intel many-core architecture
- Reduce operations: Send volume balancing while minimizing latency, IEEE Trans. Parallel Distrib. Syst.
- Data-driven mixed precision sparse matrix vector multiplication for GPUs, ACM Trans. Archit. Code Optim.
Cited by (19)
- RedMule: A mixed-precision matrix–matrix operation engine for flexible and energy-efficient on-chip linear algebra and TinyML training acceleration, 2023, Future Generation Computer Systems
- DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication, 2023, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023
- A Survey of Accelerating Parallel Sparse Linear Algebra, 2023, ACM Computing Surveys
- GTLB: A Load-Balanced SpMV Computation Method on GPU, 2023, ACM International Conference Proceeding Series
- Algorithm-Oriented SIMD Computer Mathematical Model and Its Application, 2023, International Journal of Information and Communication Technology Education
Haodong Bian is a Master's student in the Department of Computer Technology and Applications, Qinghai University, China. His research interests include high-performance computing and graph computing systems.
Jianqiang Huang is an associate professor at Qinghai University, China. He is currently a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University. His research interests include high-performance computing and graph computing systems.
Lingbin Liu is an undergraduate student in the Department of Computer Technology and Applications, Qinghai University, China. His research interests include graph computing systems.
Dongqiang Huang is a graduate student of Qinghai University, China. His research direction is large-scale high-performance computing.
Xiaoying Wang is a Professor in Department of Computer Technology and Applications, Qinghai University, China. She received her Ph.D. from Tsinghua University in 2008. Her research interests include cloud computing, parallel computing.