
Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture

International Journal of Parallel Programming

Abstract

Sparse matrix–vector multiplications (SpMV) are common in scientific and HPC applications but are hard to optimize. While ARMv8-based processor IP is emerging as an alternative to the traditional x64 HPC processor design, there has been little study of SpMV performance on such new many-cores. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach for characterizing SpMV performance on a recent ARMv8-based many-core architecture, the Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We show experimentally that picking the optimal sparse matrix storage format and parameters is non-trivial, as the correct decision requires expert knowledge of both the input matrix and the hardware. We address this problem by proposing a machine-learning-based model that predicts the best storage format and parameters from features of the input matrix. The model automatically specializes to the many-core architectures we consider. The experimental results show that our approach achieves, on average, 93% of the best-available performance without incurring runtime profiling overhead.
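As a rough illustration of the prediction step described in the abstract, the sketch below trains a classifier to map simple matrix features to a storage format. This is a minimal sketch, not the authors' model: the random-forest classifier, the feature set (rows, columns, non-zeros, and per-row non-zero statistics), the format labels, and the toy training data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-matrix features: rows, cols, nnz, and the mean,
# max, and variance of non-zeros per row. The paper's real feature
# set and training corpus differ.
X_train = np.array([
    [1e4, 1e4, 5e4,  5.0,   12,   2.1],
    [2e5, 2e5, 4e6, 20.0, 2000, 900.0],
    [5e3, 5e3, 3e5, 60.0,   64,   1.5],
])
# Labels: the best-performing storage format, measured offline.
y_train = ["CSR", "HYB", "ELL"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# At run time, extract the same features from an unseen matrix and
# predict a format directly, with no per-matrix profiling runs.
new_matrix = [[3e4, 3e4, 2e5, 7.0, 30, 4.2]]
print(model.predict(new_matrix))
```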




Notes

  1. An SpMV operation – \(\mathbf {y}=\mathbf {Ax}\) – multiplies a sparse matrix \(\mathbf {A}\) of size \(m \times n\) by a dense vector \(\mathbf {x}\) of size \(n\), producing a dense vector \(\mathbf {y}\) of size \(m\).
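For concreteness, here is a minimal sketch of this operation for a matrix held in CSR (compressed sparse row) form, one widely used sparse storage format; the function spmv_csr and the toy matrix are illustrative, not code from the paper.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A x for a sparse matrix A stored in CSR form.

    values  -- the non-zero entries of A, stored row by row
    col_idx -- the column index of each entry in `values`
    row_ptr -- values[row_ptr[i]:row_ptr[i+1]] are row i's entries
    """
    m = len(row_ptr) - 1          # number of rows of A
    y = np.zeros(m)
    for i in range(m):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A 3x3 example: A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]], x = ones
values  = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.ones(3)))  # [5. 2. 8.]
```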


Acknowledgements

This work was partially funded by the National Key R&D Program of China under Grant No. 2017YFB0202003; the National Natural Science Foundation of China under Grant Agreements 61602501, 11502296, 61772542, 61561146395, and 61872294; the Open Research Program of the China State Key Laboratory of Aerodynamics under Grant Agreement SKLA20160104; the UK Engineering and Physical Sciences Research Council under Grants EP/M01567X/1 (SANDeRs) and EP/M015793/1 (DIVIDEND); and the Royal Society International Collaboration Grant (IE161012).

Author information

Corresponding author: Jianbin Fang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Chen, D., Fang, J., Chen, S. et al. Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture. Int J Parallel Prog 47, 418–432 (2019). https://doi.org/10.1007/s10766-018-00625-8
