
Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+

International Journal of Parallel Programming

Abstract

Understanding the scalability of parallel programs is crucial for software optimization and hardware architecture design. As HPC hardware moves towards many-core designs, it becomes increasingly difficult for a parallel program to make effective use of all available processor cores, which makes scalability analysis increasingly important. This paper presents a quantitative study that characterizes the scalability of sparse matrix–vector multiplication (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture. We choose SpMV because it is a common operation in scientific and HPC applications. Because ARM-based many-core architectures are relatively new, there is little work on understanding SpMV scalability on such hardware designs. To close this gap, we carry out a large-scale empirical evaluation involving over 1,000 representative SpMV datasets. We show that, while many computation-intensive SpMV applications contain extensive parallelism, achieving a linear speedup is non-trivial on Phytium FT-2000+. To better understand which software and hardware parameters are most important for determining the scalability of a given SpMV kernel, we develop an analytical performance model based on regression trees. We show that our model is highly effective in characterizing SpMV scalability, offering useful insights that help application developers better optimize SpMV on an emerging HPC architecture.
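For readers unfamiliar with the kernel, the sketch below illustrates SpMV over the widely used compressed sparse row (CSR) format. It is written in Python purely for readability and is not the paper's FT-2000+ implementation, which would be a parallel C/OpenMP kernel; the array names (values, col_idx, row_ptr) follow common CSR conventions rather than the paper's code.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR format.

    values  : non-zero entries of A, stored row by row
    col_idx : column index of each non-zero entry
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the non-zeros of row i
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    # Rows are independent, so a many-core implementation would
    # distribute this outer loop across the available cores.
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# Example: the 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
values = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(values, col_idx, row_ptr, x))  # [5. 2. 8.]
```

The irregular per-row work and the indirect accesses through col_idx are what make SpMV scalability sensitive to both the matrix structure and the memory system, which is the behaviour the study measures.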
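The regression-tree model mentioned above can be realized, for example, with scikit-learn's DecisionTreeRegressor. The sketch below is only illustrative: the feature set (matrix size, non-zero count, mean and variance of non-zeros per row) and the speedup targets are invented for this example and are not the features or measurements used in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical matrix-level features, one row per training matrix:
# [n_rows, n_nonzeros, mean nnz per row, nnz-per-row variance]
X = np.array([
    [1_000,     10_000, 10.0,   2.0],
    [50_000,   500_000, 10.0, 150.0],
    [200_000,  400_000,  2.0,   1.0],
    [80_000, 4_000_000, 50.0,  30.0],
])
# Illustrative target: observed speedup on 64 cores over a single core.
y = np.array([6.0, 28.0, 12.0, 45.0])

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X, y)

# Predict the scalability of an unseen matrix and inspect which
# features the learned tree relies on most.
print(model.predict([[120_000, 1_200_000, 10.0, 40.0]]))
print(model.feature_importances_)
```

A tree-based model is attractive for this kind of characterization because its splits and feature importances can be read off directly, pointing to which matrix or hardware parameters limit scalability.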



Acknowledgements

This work was partially funded by the National Key R&D Program of China under Grant No. 2017YFB0202003; the National Science Foundation of China under Grant Agreements 61602501, 61772542, and 61872294; and the Royal Society International Collaboration Grant (IE161012).

Author information


Correspondence to Jianbin Fang or Chuanfu Xu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, D., Fang, J., Xu, C. et al. Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+. Int J Parallel Prog 48, 80–97 (2020). https://doi.org/10.1007/s10766-019-00646-x

