Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture
International Journal of Parallel Programming (IF 1.5), Pub Date: 2019-01-01, DOI: 10.1007/s10766-018-00625-8
Donglin Chen, Jianbin Fang, Shizhao Chen, Chuanfu Xu, Zheng Wang

Sparse matrix–vector multiplications (SpMV) are common in scientific and HPC applications but are hard to optimize. While ARMv8-based processor IP is emerging as an alternative to the traditional x64 HPC processor design, there has been little study of SpMV performance on such new many-cores. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach to characterize SpMV performance on a recent ARMv8-based many-core architecture, Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We show experimentally that picking the optimal sparse matrix storage format and parameters is non-trivial, as the correct decision requires expert knowledge of both the input matrix and the hardware. We address the problem by proposing a machine-learning-based model that predicts the best storage format and parameters from input matrix features. The model automatically specializes to the many-core architectures we considered. The experimental results show that our approach achieves on average 93% of the best-available performance without incurring runtime profiling overhead.
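
For readers unfamiliar with the kernel being studied, the following is a minimal sketch of SpMV over a matrix held in compressed sparse row (CSR) form. The abstract does not name its five storage formats, so the choice of CSR here is an assumption made for illustration, and the function name `spmv_csr` and its arguments are hypothetical rather than taken from the paper. The sketch is plain, unoptimized Python and is not the authors' implementation.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    (Illustrative names; CSR is assumed, not confirmed by the abstract.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Accumulate only over the stored nonzeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access into x: this irregular pattern is a
            # well-known reason SpMV is memory-bound and hard to tune.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example 3x3 matrix:
#   [[4, 0, 1],
#    [0, 2, 0],
#    [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

Other storage formats rearrange the same nonzeros differently, which changes the memory access pattern of the inner loop. That is why the best format depends on both the matrix structure and the hardware.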

Last updated: 2019-01-01