Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

Fang, Jian-Bin; Liao, Xiang-Ke; Huang, Chun; Dong, De-Zun

doi:10.1007/s11390-020-0741-6

Regular Paper
Published: 30 January 2021

Volume 36, pages 33–43, (2021)
Journal of Computer Science and Technology

Abstract

This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that affect high-performance computing applications. We provide insights into the memory-relevant performance behaviours of the Phytium 2000+ system through micro-benchmarking. With the help of the well-known roofline model, we analyze the Phytium 2000+ system, taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
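As a rough illustration of how the roofline model combines memory accesses and computation, the sketch below computes the attainable performance as the minimum of the compute peak and the product of memory bandwidth and arithmetic intensity. The function names and the machine parameters are hypothetical placeholders chosen for illustration; they are not measured Phytium 2000+ figures and are not taken from the article.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbs: float, intensity: float) -> float:
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity).

    intensity is the arithmetic intensity in flops per byte of memory traffic.
    """
    return min(peak_gflops, peak_bw_gbs * intensity)


def ridge_point(peak_gflops: float, peak_bw_gbs: float) -> float:
    """Arithmetic intensity (flops/byte) at which a kernel stops being memory-bound."""
    return peak_gflops / peak_bw_gbs


# Placeholder machine parameters (assumptions, NOT Phytium 2000+ measurements).
PEAK_GFLOPS = 500.0   # hypothetical peak floating-point rate
PEAK_BW_GBS = 100.0   # hypothetical sustained memory bandwidth

# A triad-style kernel a[i] = b[i] + s * c[i] performs 2 flops per element
# and moves three 8-byte doubles (ignoring write-allocate traffic).
triad_intensity = 2 / 24

if __name__ == "__main__":
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, triad_intensity)
    print(f"ridge point: {ridge_point(PEAK_GFLOPS, PEAK_BW_GBS):.2f} flops/byte")
    print(f"triad bound: {bound:.2f} GFLOP/s")
```

A kernel whose intensity lies below the ridge point is limited by memory bandwidth under this model, and one above it is limited by the compute peak.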

References

Laurenzano M A, Tiwari A, Cauble-Chantrenne A et al. Characterization and bottleneck analysis of a 64-bit ARMv8 platform. In Proc. the 2016 IEEE International Symposium on Performance Analysis of Systems and Software, April 2016, pp.36-45. https://doi.org/10.1109/ISPASS.2016.7482072.
Stephens N. ARMv8-a next-generation vector architecture for HPC. In Proc. the 2016 IEEE Hot Chips 28 Symposium, August 2016. https://doi.org/10.1109/HOTCHIPS.2016.7936203.
Zhang C. Mars: A 64-core ARMv8 processor. In Proc. the 2015 IEEE Hot Chips 27 Symposium, Aug. 2015. https://doi.org/10.1109/HOTCHIPS.2015.7477454.
You X, Yang H, Luan Z, Liu Y, Qian D. Performance evaluation and analysis of linear algebra kernels in the prototype Tianhe-3 cluster. In Proc. the 5th Asian Conference on Supercomputing Frontiers, March 2019, pp.86-105. https://doi.org/10.1007/978-3-030-18645-6_6.
Dongarra J. Report on the Fujitsu Fugaku system. Technical Report, University of Tennessee, 2020. https://www.icl.utk.edu/files/publications/2020/icl-utk-1379-2020.pdf, Nov. 2020.
Molka D, Hackenberg D, Schöne R, Müller M S. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In Proc. the 18th International Conference on Parallel Architectures and Compilation Techniques, September 2009, pp.261-270. https://doi.org/10.1109/PACT.2009.22.
McCalpin J. Memory bandwidth and machine balance in current high performance computers. https://www.cs.virginia.edu/stream/analyses.html, Dec. 2020.
Kamil S, Husbands P, Oliker L, Shalf J, Yelick K A. Impact of modern memory subsystems on cache optimizations for stencil computations. In Proc. the 2005 Workshop on Memory System Performance, June 2005, pp.36-43. https://doi.org/10.1145/1111583.1111589.
Williams S, Waterman A, Patterson D A. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM, 2009, 52(4): 65-76. https://doi.org/10.1145/1498765.1498785.
Ilic A, Pratas F, Sousa L. Cache-aware roofline model: Upgrading the loft. IEEE Comput. Archit. Lett., 2014, 13(1): 21-24. https://doi.org/10.1109/L-CA.2013.6.
Liu X, Buono D, Checconi F, Choi J W, Que X, Petrini F, Gunnels J A, Stuecheli J. An early performance study of large-scale POWER8 SMP systems. In Proc. the 2016 IEEE International Parallel and Distributed Processing Symposium, May 2016, pp.263-272. https://doi.org/10.1109/IPDPS.2016.14.
Goto K, van de Geijn R A. Anatomy of high performance matrix multiplication. ACM Trans. Math. Softw., 2008, 34(3): Article No. 12. https://doi.org/10.1145/1356052.1356053.
Frison G, Kouzoupis D, Sartor T, Zanelli A, Diehl M. BLASFEO: Basic linear algebra subroutines for embedded optimization. ACM Trans. Math. Softw., 2018, 44(4): Article No. 42. https://doi.org/10.1145/3210754.
Su X, Liao X, Jiang H, Yang C, Xue J. SCP: Shared cache partitioning for high-performance GEMM. ACM Transactions on Architecture and Code Optimization, 2019, 15(4): Article No. 43. https://doi.org/10.1145/3274654.
Hollowell C, Caramarcu C, Strecker-Kellogg W, Wong A, Zaytsev A. The effect of NUMA tunings on CPU performance. Journal of Physics: Conference Series, 2015, 664(9): Article No. 092010. https://doi.org/10.1088/1742-6596/664/9/092010.
Liu W, Vinter B. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proc. the 29th ACM on International Conference on Supercomputing, June 2015, pp.339-350. https://doi.org/10.1145/2751205.2751209.
Grimes R, Kincaid D, Young D. ITPACK 2.0 user’s guide. Technical Report, Center for Numerical Analysis, University of Texas, 1979.
Kreutzer M, Hager G, Wellein G, Fehske H, Bishop A R. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM J. Sci. Comput., 2014, 36(5): 401-423. https://doi.org/10.1137/130930352.
Bell N, Garland M. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proc. the ACM/IEEE Conference on High Performance Computing, November 2009. https://doi.org/10.1145/1654059.1654078.
Chen D, Fang J, Xu C, Chen S, Wang Z. Characterizing scalability of sparse matrix-vector multiplications on Phytium FT-2000+. Int. J. Parallel Program., 2020, 48(1): 80-97. https://doi.org/10.1007/s10766-019-00646-x.
Chen D, Fang J, Chen S, Xu C, Wang Z. Optimizing sparse matrix-vector multiplications on an ARMv8-based many-core architecture. Int. J. Parallel Program., 2019, 47(3): 418-432. https://doi.org/10.1007/s10766-018-00625-8.
Chen S, Fang J, Chen D, Xu C, Wang Z. Adaptive optimization of sparse matrix-vector multiplication on emerging many-core architectures. In Proc. the 20th IEEE International Conference on High Performance Computing, June 2018, pp.649-658. https://doi.org/10.1109/HPCC/SmartCity/DSS.2018.00116.
Babka V, Tuma P. Investigating cache parameters of x86 family processors. In Proc. the 2009 SPEC Benchmark Workshop, January 2009, pp.77-96. https://doi.org/10.1007/978-3-540-93799-9_5.
Fang J, Sips H J, Zhang L, Xu C, Che Y, Varbanescu A L. Test-driving Intel Xeon Phi. In Proc. the 5th ACM/SPEC International Conference on Performance Engineering, March 2014, pp.137-148. https://doi.org/10.1145/2568088.2576799.
Ramos S, Hoefler T. Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi. In Proc. the 22nd International Symposium on High-Performance Parallel and Distributed Computing, June 2013, pp.97-108. https://doi.org/10.1145/2462902.2462916.


Acknowledgements

We would like to thank the anonymous reviewers for their valuable and constructive comments. We thank Wei-Ling Yang and Wan-Rong Gao from the National University of Defense Technology for their support with the experiments.

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, 410073, China
Jian-Bin Fang, Xiang-Ke Liao, Chun Huang & De-Zun Dong

Corresponding author

Correspondence to De-Zun Dong.

Supplementary Information

ESM 1

(PDF 399 kb)

About this article

Cite this article

Fang, JB., Liao, XK., Huang, C. et al. Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+. J. Comput. Sci. Technol. 36, 33–43 (2021). https://doi.org/10.1007/s11390-020-0741-6

Received: 24 June 2020
Accepted: 09 December 2020
Published: 30 January 2021
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11390-020-0741-6
