
Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture

International Journal of Parallel Programming

Abstract

Sparse matrix–vector multiplications (SpMV) are common in scientific and HPC applications but are hard to optimize. While ARMv8-based processor IP is emerging as an alternative to the traditional x64 HPC processor design, there has been little study of SpMV performance on such new many-cores. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach for characterizing SpMV performance on a recent ARMv8-based many-core architecture, the Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We show experimentally that picking the optimal sparse matrix storage format and parameters is non-trivial, as the correct decision requires expert knowledge of both the input matrix and the hardware. We address this problem by proposing a machine-learning-based model that predicts the best storage format and parameters from features of the input matrix. The model automatically specializes to the many-core architectures we consider. The experimental results show that our approach achieves, on average, 93% of the best-available performance without incurring runtime profiling overhead.
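As a rough illustration of the prediction step described in the abstract, the sketch below trains a classifier to map simple matrix features to a storage format. This is a minimal sketch, not the authors' model: the random-forest classifier, the feature set (rows, columns, non-zeros, and per-row non-zero statistics), the format labels, and the toy training data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-matrix features: rows, cols, nnz, and the mean,
# max, and variance of non-zeros per row. The paper's real feature
# set and training corpus differ.
X_train = np.array([
    [1e4, 1e4, 5e4,  5.0,   12,   2.1],
    [2e5, 2e5, 4e6, 20.0, 2000, 900.0],
    [5e3, 5e3, 3e5, 60.0,   64,   1.5],
])
# Labels: the best-performing storage format, measured offline.
y_train = ["CSR", "HYB", "ELL"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# At run time, extract the same features from an unseen matrix and
# predict a format directly, with no per-matrix profiling runs.
new_matrix = [[3e4, 3e4, 2e5, 7.0, 30, 4.2]]
print(model.predict(new_matrix))
```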




Notes

  1. An SpMV operation – \(\mathbf {y}=\mathbf {Ax}\) – multiplies a sparse matrix \(\mathbf {A}\) of size \(m \times n\) by a dense vector \(\mathbf {x}\) of size \(n\), producing a dense vector \(\mathbf {y}\) of size \(m\).
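For concreteness, here is a minimal sketch of this operation for a matrix held in CSR (compressed sparse row) form, one widely used sparse storage format; the function spmv_csr and the toy matrix are illustrative, not code from the paper.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A x for a sparse matrix A stored in CSR form.

    values  -- the non-zero entries of A, stored row by row
    col_idx -- the column index of each entry in `values`
    row_ptr -- values[row_ptr[i]:row_ptr[i+1]] are row i's entries
    """
    m = len(row_ptr) - 1          # number of rows of A
    y = np.zeros(m)
    for i in range(m):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A 3x3 example: A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]], x = ones
values  = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.ones(3)))  # [5. 2. 8.]
```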


Acknowledgements

This work was partially funded by the National Key R&D Program of China under Grant No. 2017YFB0202003; the National Natural Science Foundation of China under Grant Agreements 61602501, 11502296, 61772542, 61561146395, and 61872294; the Open Research Program of the China State Key Laboratory of Aerodynamics under Grant Agreement SKLA20160104; the UK Engineering and Physical Sciences Research Council under Grants EP/M01567X/1 (SANDeRs) and EP/M015793/1 (DIVIDEND); and the Royal Society International Collaboration Grant (IE161012).

Author information

Corresponding author: Jianbin Fang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Chen, D., Fang, J., Chen, S. et al. Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture. Int J Parallel Prog 47, 418–432 (2019). https://doi.org/10.1007/s10766-018-00625-8
