On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations
arXiv - CS - Performance. Pub Date: 2021-08-20. DOI: arxiv-2108.09337
Grzegorz Kwasniewski, Marko Kabić, Tal Ben-Nun, Alexandros Nikolaos Ziogas, Jens Eirik Saethre, André Gaillard, Timo Schneider, Maciej Besta, Anton Kozhevnikov, Joost VandeVondele, Torsten Hoefler

Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating N^3/(P*sqrt(M)) elements per processor, where M is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 262,144 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.
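To give a feel for the per-processor communication volume quoted above, here is a minimal sketch that evaluates only the leading-order expression N^3/(P*sqrt(M)) from the abstract. The function name `per_processor_io` and the sample values are hypothetical, M is assumed to be measured in matrix elements, and any constant factors or lower-order terms from the paper are not modeled.

```python
import math


def per_processor_io(n: int, p: int, m: float) -> float:
    """Evaluate N^3 / (P * sqrt(M)), the expression quoted in the abstract.

    n: matrix dimension N
    p: number of processors P
    m: local memory size M (assumed here to be counted in elements)

    Constant factors and lower-order terms are deliberately omitted.
    """
    if n <= 0 or p <= 0 or m <= 0:
        raise ValueError("n, p, and m must be positive")
    return n**3 / (p * math.sqrt(m))


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    n, p = 16384, 64
    for m in (2**20, 2**22):
        print(f"M = {m:>8} elements -> {per_processor_io(n, p, m):.0f} elements per processor")
```

Because M enters under a square root, quadrupling the local memory halves the value of this expression, while doubling P at fixed M halves it directly.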

Updated: 2021-08-24