Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

  • Regular Paper
Journal of Computer Science and Technology

Abstract

This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that affect high-performance computing applications. Through micro-benchmarking, we provide insights into the memory-related performance behaviours of the Phytium 2000+ system. Using the well-known roofline model, we then analyze the system while taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
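As a rough illustration of the roofline model mentioned above, the sketch below computes the attainable performance of a kernel as the minimum of the compute roof and the product of memory bandwidth and arithmetic intensity. The function names and all numeric values are hypothetical placeholders chosen for easy arithmetic; they are not measurements of Phytium 2000+ and are not taken from the article.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbs: float, intensity: float) -> float:
    """Roofline bound in GFLOP/s.

    peak_gflops: peak compute throughput (GFLOP/s)
    peak_bw_gbs: peak memory bandwidth (GB/s)
    intensity:   arithmetic intensity of the kernel (FLOP per byte moved)
    """
    if intensity <= 0:
        raise ValueError("arithmetic intensity must be positive")
    # A kernel is limited either by compute or by how fast data can be fed to it.
    return min(peak_gflops, peak_bw_gbs * intensity)


def ridge_point(peak_gflops: float, peak_bw_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / peak_bw_gbs


# Hypothetical machine parameters (placeholders, NOT Phytium 2000+ data).
PEAK_GFLOPS = 500.0
PEAK_BW_GBS = 100.0

if __name__ == "__main__":
    for name, ai in [("low-intensity kernel", 0.25), ("high-intensity kernel", 10.0)]:
        bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, ai)
        side = "memory-bound" if ai < ridge_point(PEAK_GFLOPS, PEAK_BW_GBS) else "compute-bound"
        print(f"{name}: AI={ai} FLOP/byte, bound={bound} GFLOP/s ({side})")
```

Kernels whose arithmetic intensity falls below the ridge point are limited by memory bandwidth, which is why an evaluation of this kind pairs bandwidth micro-benchmarks with compute measurements.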

Acknowledgements

We would like to thank the anonymous reviewers for their valuable and constructive comments. We also thank Wei-Ling Yang and Wan-Rong Gao from the National University of Defense Technology for their support with the experiments.

Author information

Corresponding author

Correspondence to De-Zun Dong.

About this article

Cite this article

Fang, JB., Liao, XK., Huang, C. et al. Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+. J. Comput. Sci. Technol. 36, 33–43 (2021). https://doi.org/10.1007/s11390-020-0741-6

  • DOI: https://doi.org/10.1007/s11390-020-0741-6
