High-Performance Partial Spectrum Computation for Symmetric eigenvalue problems and the SVD
arXiv - CS - Mathematical Software, Pub Date: 2021-04-29, DOI: arxiv-2104.14186, Authors: D. Keyes, H. Ltaief, Y. Nakatsukasa, D. Sukkari
Current dense symmetric eigenvalue (EIG) and singular value decomposition
(SVD) implementations may suffer from the lack of concurrency during the
tridiagonal and bidiagonal reductions, respectively. This performance
bottleneck is typical for the two-sided transformations due to the Level-2 BLAS
memory-bound calls. Therefore, the current state-of-the-art EIG and SVD
implementations may achieve only a small fraction of the system's sustained
peak performance. The QR-based Dynamically Weighted Halley (QDWH) algorithm may
be used as a pre-processing step toward the EIG and SVD solvers, while
mitigating the aforementioned bottleneck. QDWH-EIG and QDWH-SVD expose more
parallelism, while relying on compute-bound matrix operations. Both run closer
to the sustained peak performance of the system, but at the expense of
performing more FLOPS than the standard EIG and SVD algorithms. In this paper,
we introduce a new QDWH-based solver for computing the partial spectrum for EIG
(QDWHpartial-EIG) and SVD (QDWHpartial-SVD) problems. By optimizing the
rational function underlying the algorithms only in the desired part of the
spectrum, QDWHpartial-EIG and QDWHpartial-SVD algorithms efficiently compute a
fraction (say 1-20%) of the corresponding spectrum. We develop high-performance
implementations of QDWHpartial-EIG and QDWHpartial-SVD on distributed-memory
manycore systems and demonstrate their numerical robustness. Experimental
results using up to 36K MPI processes show performance speedups for
QDWHpartial-SVD up to 6X and 2X against PDGESVD from ScaLAPACK and KSVD,
respectively. QDWHpartial-EIG outperforms PDSYEVD from ScaLAPACK up to 3.5X but
remains slower compared to ELPA. QDWHpartial-EIG achieves, however, a better
occupancy of the underlying hardware by extracting higher sustained peak
performance than ELPA, which is critical moving forward with accelerator-based
supercomputers.
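
The idea of using QDWH as a pre-processing step can be illustrated on a small dense matrix. The sketch below is not the authors' QDWHpartial implementation: it does not optimize the rational function for a chosen part of the spectrum, and it is neither distributed nor tuned. It only shows, under stated assumptions, how the QR-based dynamically weighted Halley iteration yields the polar factor of a shifted symmetric matrix, and how that factor gives a projector onto the invariant subspace for the eigenvalues above the shift. The function names `qdwh_polar` and `eigs_above`, the tolerance, and the test matrix are hypothetical, and the singular-value bounds are obtained with a plain SVD purely for brevity (a real solver would estimate them cheaply).

```python
import numpy as np


def qdwh_polar(A, tol=1e-12, max_iter=20):
    """Orthogonal polar factor U of a square nonsingular A (A = U H).

    Illustrative QR-based dynamically weighted Halley iteration.
    """
    n = A.shape[0]
    # Demo shortcut (assumption): exact extreme singular values via SVD.
    sv = np.linalg.svd(A, compute_uv=False)
    alpha = sv[0]
    X = A / alpha              # singular values of X lie in (0, 1]
    l = sv[-1] / alpha         # lower bound on the smallest singular value of X
    eye = np.eye(n)
    for _ in range(max_iter):
        l = min(l, 1.0)        # guard against rounding slightly above 1
        # Dynamically chosen weights (a, b, c) of the Halley-type iteration.
        d = (4.0 * (1.0 - l**2) / l**4) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # Inverse-free update: QR factorization of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, eye]))
        Q1, Q2 = Q[:n], Q[n:]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        l = l * (a + b * l**2) / (1.0 + c * l**2)
        converged = np.linalg.norm(X_new - X) <= tol * np.linalg.norm(X_new)
        X = X_new
        if converged:
            break
    return X


def eigs_above(A, sigma, seed=0):
    """Eigenvalues of symmetric A larger than sigma (sigma must not be an eigenvalue)."""
    n = A.shape[0]
    U = qdwh_polar(A - sigma * np.eye(n))   # equals sign(A - sigma I) for symmetric A
    P = 0.5 * (U + np.eye(n))               # projector onto the wanted invariant subspace
    k = int(round(np.trace(P)))             # number of eigenvalues above sigma
    if k == 0:
        return np.empty(0)
    # Orthonormal basis of range(P) from a random sketch (assumption: generic G).
    G = np.random.default_rng(seed).standard_normal((n, k))
    V, _ = np.linalg.qr(P @ G)
    # Rayleigh-Ritz on the small k-by-k projected problem.
    return np.linalg.eigvalsh(V.T @ A @ V)


# Small symmetric test matrix with known eigenvalues 1, 2, ..., 8.
rng = np.random.default_rng(42)
Q0, _ = np.linalg.qr(rng.standard_normal((8, 8)))
A = Q0 @ np.diag(np.arange(1.0, 9.0)) @ Q0.T

# Ritz values should match the two eigenvalues above 6.5 (7 and 8) up to rounding.
print(eigs_above(A, 6.5))
```

Every step in the loop is a QR factorization or a matrix-matrix product, which is the compute-bound behaviour the abstract contrasts with the Level-2 BLAS calls of tridiagonal and bidiagonal reduction. The price is visible as well: several QR factorizations of a 2n-by-n matrix cost more floating-point operations than a single reduction.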
Updated: 2021-04-30