
Parallel Computing

Volume 100, December 2020, 102710

CCF: An efficient SpMV storage format for AVX512 platforms

https://doi.org/10.1016/j.parco.2020.102710

Abstract

We present a sparse matrix vector multiplication (SpMV) kernel that uses a novel sparse matrix storage format and delivers superior performance for unstructured matrices on Intel x86 processors. Our kernel exploits the properties of our storage format to enhance load balancing, SIMD efficiency, and data locality. We evaluate the performance of our kernel on a dual 24-core Skylake Xeon Platinum 8160 using 82 HPC and 36 scale-free unstructured matrices from 42 application areas. For HPC matrices, our kernel achieves a speedup of up to 19.5x over the MKL Inspector–executor SpMV kernel (1.6x on average). For scale-free matrices, the speedup is up to 2.6x (1.3x on average).

Introduction

Sparse matrix–vector multiplication (SpMV) is a fundamental performance bottleneck in solving sparse linear systems and eigenvalue problems. Moreover, SpMV is one of the most important and time-consuming computations in many applications such as graph analytics [1], [2], [3], [4] and machine learning [5], [6]. In SpMV, the operation y = A*x+y is performed, where A is a sparse matrix and x, y are dense vectors. Sparse matrices use special data structures that store only the nonzero elements and hence eliminate unnecessary storage and computation. SpMV is memory bandwidth bound and has low computational intensity. Moreover, the emergence of modern processors with high thread counts and wide vector units has introduced new performance bottlenecks that any new storage format must mitigate to improve performance. These performance bottlenecks are: (a) low SIMD efficiency [7], [8], [9], (b) load imbalance [8], [10], and (c) irregular memory access patterns [7], [8], [11]. SIMD efficiency refers to the fraction of the vector units' peak throughput that is actually delivered during the computation. Load balancing is the attempt to divide a workload evenly across threads. Memory access pattern refers to the pattern with which the dense vector x is accessed; this is governed by the sparsity pattern of the sparse matrix.

The Compressed Sparse Row (CSR) format is commonly used as a general-purpose storage format due to its compact memory requirements. Parallel and vectorized SpMV CSR kernels divide the rows evenly between threads, and each thread processes its assigned rows in row-major order. Each row is processed by a single vector unit. Fig. 1 illustrates how a row is processed by an 8-lane vector unit. A CSR kernel can suffer from low SIMD efficiency and load imbalance when it is parallelized and vectorized. Low SIMD efficiency is caused by processing rows whose number of nonzero elements is less than the SIMD width (SIMDW), the number of available lanes in the vector unit. Dividing rows evenly between threads can lead to load imbalance when the number of nonzero elements per row (nnzr) is irregular across rows. In such cases, threads have different numbers of nonzero elements to process.
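The row-major CSR processing described above can be sketched in pure Python; real kernels vectorize the inner loop across SIMD lanes (the source of the short-row inefficiency) and parallelize the outer loop across threads:

```python
# Sketch of y = A*x + y over a CSR matrix. The three arrays are the
# standard CSR representation: row_ptr[i]:row_ptr[i+1] delimits the
# nonzeros of row i in col_idx/vals.

def spmv_csr(row_ptr, col_idx, vals, x, y):
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]   # gather from x, multiply-add
        y[i] += acc                           # per-row reduction result

# 3x4 example matrix:
# [[5, 0, 0, 1],
#  [0, 2, 0, 0],
#  [0, 0, 3, 4]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 3, 1, 2, 3]
vals    = [5.0, 1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0, 1.0]
y = [0.0, 0.0, 0.0]
spmv_csr(row_ptr, col_idx, vals, x, y)
print(y)  # [6.0, 2.0, 7.0]
```

Note that the inner loop of row 1 has a single nonzero: a vector unit processing it would use one of its SIMDW lanes, which is exactly the low-SIMD-efficiency case discussed above.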

The ELLPACK (ELL) format is particularly well suited to vector architectures [12]. ELL converts an H x W matrix with a maximum nnzr of M into a dense H x M matrix by zero-padding every row that is shorter than the longest row. A vectorized ELL kernel loads elements from consecutive rows, performs a vector fused multiply-add, and accumulates into a temporary vector as shown in Fig. 4. With transposition (or column-major ordering), ELL eliminates the reduction operation that a CSR kernel requires at the end of each row (Fig. 1). Each entry of the temporary vector holds the final value of a different y element. The fundamental challenge in ELL is excessive zero-padding when a matrix has irregular nnzr across rows or when some rows are very long. Thus, many storage formats were developed to process the matrix in a transposed form with minimal zero-padding [13], [14]. Nevertheless, these formats still incur zero-padding overhead for unstructured matrices.
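The padding and column-major processing described above can be illustrated with a small sketch. Each outer iteration touches one element from every row, so on hardware those H inner iterations map onto SIMD lanes and accumulate directly into y with no per-row reduction:

```python
# Convert a list of per-row (col, val) pairs to ELL (column-major),
# padding short rows with a zero value at column 0 (safe: 0.0 * x[0]).

def to_ell(rows):
    m = max(len(r) for r in rows)  # maximum nnzr, M
    cols = [[r[j][0] if j < len(r) else 0 for r in rows] for j in range(m)]
    vals = [[r[j][1] if j < len(r) else 0.0 for r in rows] for j in range(m)]
    return cols, vals

def spmv_ell(cols, vals, x, y):
    for j in range(len(vals)):        # one "column" of the dense H x M matrix
        for i in range(len(y)):       # these iterations map to SIMD lanes
            y[i] += vals[j][i] * x[cols[j][i]]

# Same 3x4 matrix as before; row 1 is padded from 1 to 2 entries.
rows = [[(0, 5.0), (3, 1.0)], [(1, 2.0)], [(2, 3.0), (3, 4.0)]]
cols, vals = to_ell(rows)
x = [1.0, 1.0, 1.0, 1.0]
y = [0.0, 0.0, 0.0]
spmv_ell(cols, vals, x, y)
print(y)  # [6.0, 2.0, 7.0]
```

The padded entry wastes one multiply-add here; with one very long row among many short ones, the wasted work grows with H*(M - nnzr), which is the excessive-padding problem noted above.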

As large and highly irregular sparse matrices (e.g. scale-free matrices) emerge in application areas such as data analytics and social and transportation networks [15], [16], [17], performing SpMV efficiently on unstructured matrices is becoming a compelling problem. Several formats have been proposed to deal with scale-free matrices [9], [17]. High performance computing (HPC) matrices are more regular in nature. For HPC unstructured matrices, several storage formats have been proposed that allow the design of high-performance SpMV kernels [8], [10], [13], [14]. However, no single kernel/storage format achieves the best performance for every unstructured matrix.

In this paper, we present our novel sparse matrix compressed chunk storage format (CCF) and our heuristics-based SpMV CCF kernel. SpMV CCF enhances load balancing, SIMD efficiency, and data locality for unstructured matrices.

For performance evaluation, we use 82 HPC and 36 scale-free unstructured matrices from 42 application areas that exhibit a low fill ratio per block, and a dual 24-core Skylake Intel Xeon Platinum 8160. We compare the performance of our SpMV CCF with two Intel MKL kernels: SpMV CSR and Inspector–executor SpMV CSR [18]. We also compare with two recent storage formats designed specifically for AVX512 architectures without zero-padding. The first is CVR [9], which is designed for unstructured matrices and outperforms five state-of-the-art kernels: Intel MKL SpMV CSR, Intel MKL SpMV CSR(I) [18], SpMV CSR5 [10], SpMV ESB [8], and SpMV VHCC [17]. The second is SPC5 [19], a block-based format that exploits the AVX512 instruction set. Our results show that our CCF kernel significantly outperforms all of the above SpMV implementations for highly unstructured matrices.

The following is a summary of our contributions:

  • We introduce our novel compressed chunk sparse matrix storage format (CCF) and SpMV CCF kernel and describe how the properties of CCF enhance load balancing, SIMD efficiency, and data locality.

  • We present a heuristic-based approach to estimate the parameters that deliver the highest kernel performance with minimal overhead to the preprocessing phase.

  • We define an interesting set of 118 matrices: highly unstructured with a low block fill ratio. We then show that CCF, with its combination of lightweight but effective techniques, delivers substantial performance improvements over four state-of-the-art kernels on a recent Intel HPC platform.

  • We show that CCF has low preprocessing and storage overheads. Moreover, CCF has the lowest end-to-end application time, including overheads, when compared to the other four formats.

This paper is organized as follows. In Section 2, we present related work. In Section 3, we introduce our sparse matrix compressed chunk storage format (CCF) and our SpMV CCF kernel. In Section 4, we analyze the design of CCF, evaluate SpMV CCF performance, and discuss preprocessing and storage overheads. We present our conclusions in Section 6.


Related work

The introduction of multicore, integrated many-core, and graphics processing units (GPU) triggered substantial research on the development and evaluation of SpMV algorithms for such platforms [13], [20], [21], [22], [23], [24], [25], [26], [27].

A widely used optimization technique for high performance SpMV on CPUs is matrix blocking [19], [20], [21], [23], [28], [29]. This is because matrices with block sub-structures are encountered in important applications [30]. Furthermore, blocking

Mapping a sparse matrix into CCF

CCF is designed for use when executing SpMV on multi-core vector processors and aims at enhancing load balancing and SIMD efficiency. To store a matrix in CCF for a given processor and runtime system, the values of the following parameters are used:

  1. The number of nonzero elements in the matrix, NNZ.

  2. The number of threads performing SpMV, T.

  3. The width of the cores' vector units, SIMDW.

We make the following definitions, as depicted in Fig. 2:

  1. A set is a collection of consecutive rows. The nonzero

Platform

The SpMV kernels are evaluated using a dual socket machine with two 24-core Skylake Intel Xeon Platinum 8160 CPUs (Table 1) and the Intel C++ v19.0.3 compiler with OpenMP 4.5. Thread scheduling is static. The OpenMP API automatically detects the number of sets for SpMV CCF and divides them evenly across threads. For all the SpMV kernels in this paper, we report the highest performance for 1 and 2 sockets using 1 and 2 threads per core (this evaluation methodology is adopted in CVR [9] as well).

Extra contributions to the community

The CCF storage format will be publicly available online at https://github.com/ssmoha7/spmv-ccf.

Conclusion

This paper presented our SpMV kernel that exploits the properties of our novel compressed chunk storage format, CCF. Our kernel delivers superior performance for HPC and scale-free unstructured matrices on Intel platforms. CCF improves the SIMD efficiency and load balancing using four techniques. For SIMD efficiency, CCF collects rows with the same nnzr in multi-row chunks and separates long tail rows into single-row chunks. The CCF kernel uses a hybrid execution strategy to process multi-row
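The chunking idea stated above (same-nnzr rows gathered into multi-row chunks, long tail rows separated into single-row chunks) can be sketched as follows. This is a speculative illustration only: the long-row threshold, its name, and the exact grouping policy are assumptions for the sketch, not the paper's definitions.

```python
from collections import defaultdict

SIMDW = 8       # vector lanes, e.g. AVX512 with double precision
LONG_ROW = 64   # assumed cutoff for "long tail" rows (illustrative)

def chunk_rows(row_lengths):
    """Group row indices by nnzr into SIMDW-row multi-row chunks;
    long rows become single-row chunks processed CSR-style."""
    by_nnzr, single = defaultdict(list), []
    for i, nnzr in enumerate(row_lengths):
        if nnzr > LONG_ROW:
            single.append(i)            # one long row per chunk
        else:
            by_nnzr[nnzr].append(i)     # same-nnzr rows share chunks
    chunks = [rows[k:k + SIMDW]
              for nnzr, rows in sorted(by_nnzr.items())
              for k in range(0, len(rows), SIMDW)]
    return chunks, single

chunks, single = chunk_rows([3, 3, 3, 200, 3, 5, 5, 3])
print(chunks)   # [[0, 1, 2, 4, 7], [5, 6]]
print(single)   # [3]
```

Because every row in a multi-row chunk has the same nnzr, the chunk can be processed ELL-style (one element per lane per step) with no zero-padding, while isolating the 200-nonzero row keeps it from imbalancing the chunked workload.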

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Intel Corporation supported this work through Grant No. 552147-239012-191100. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [35], Stampede2 at the Texas Advanced Computing Center, through an allocation supported by National Science Foundation grant number ACI-1548562. We thank Wen-mei Hwu for hosting the grants and Ahmed Sameh for comments and suggestions.

References

  • Zhang, H., et al. Vectorized parallel sparse matrix-vector multiplication in PETSc using AVX-512.

  • Williams, S., et al. Optimization of sparse matrix–vector multiplication on emerging multicore platforms. Parallel Comput. (2009).

  • Slota, G.M., et al. High-performance graph analytics on manycore processors.

  • Agarwal, V., et al. Scalable graph exploration on multicore processors.

  • Sundaram, N., et al. GraphMat: High performance graph analytics made productive. (2015).

  • Anderson, M.J., et al. GraphPad: Optimized graph primitives for parallel and distributed platforms.

  • Cui, H., et al. A machine learning-based approach for selecting SpMV kernels and matrix storage formats. IEICE Trans. Inf. Syst. (2018).

  • Wang, S., et al. Coded sparse matrix multiplication. (2018).

  • Liu, X., et al. Efficient sparse matrix-vector multiplication on x86-based many-core processors.

  • Xie, B., et al. CVR: Efficient vectorization of SpMV on x86 processors.

  • Liu, W., et al. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication.

  • De Gonzalo, S.G., et al. Revisiting online autotuning for sparse-matrix vector multiplication kernels on next-generation architectures.

  • Saad, Y. Krylov subspace methods on supercomputers. SIAM J. Sci. Stat. Comput. (1989).

  • Jain, A. pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures. (2009).

  • Kreutzer, M., et al. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM J. Sci. Comput. (2014).

  • Wang, L., et al. BigDataBench: A big data benchmark suite from internet services.

  • Barabási, A.-L., et al. Emergence of scaling in random networks. Science (1999).

  • Tang, W.T., et al. Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi.

  • Wang, E., et al. Intel Math Kernel Library.