Evaluating Abstract Asynchronous Schwarz solvers on GPUs, arXiv - CS - Mathematical Software

Evaluating Abstract Asynchronous Schwarz solvers on GPUs
arXiv - CS - Mathematical Software; Pub Date: 2020-03-11; DOI: arxiv-2003.05361
Pratik Nayak, Terry Cojean, Hartwig Anzt

With the commencement of the exascale computing era, the majority of leadership supercomputers are heterogeneous and massively parallel even within a single node, combining multiple co-processors such as GPUs with multi-core CPUs. For example, each node of ORNL's Summit has six NVIDIA Tesla V100 GPUs and 42 IBM Power9 cores. Synchronizing across all these compute resources within a single node, let alone across multiple nodes, is prohibitively expensive. Hence it is necessary to develop and study asynchronous algorithms that circumvent the bulk-synchronous computing model at massive parallelism. In this study, we examine an asynchronous version of the abstract Restricted Additive Schwarz method as a solver in which we do not explicitly synchronize, but instead let the communication of data between the subdomains be completely asynchronous, thereby removing the bulk-synchronous nature of the algorithm. We accomplish this using the one-sided RMA (remote memory access) functions of the MPI standard. We study the benefits of such an asynchronous solver over its synchronous counterpart on both multi-core architectures and multiple GPUs. We also study how the communication patterns and the choice of local solver affect the global solver. Finally, we show that this concept can deliver attractive runtime benefits over the synchronous counterpart.
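To make the difference between the bulk-synchronous and asynchronous variants concrete, here is a minimal single-process sketch in Python. It is not the authors' implementation. It applies Restricted Additive Schwarz (RAS) to a hypothetical 1D Poisson test problem, A = tridiag(-1, 2, -1), with exact tridiagonal local solves. In the synchronous variant, every subdomain reads the same snapshot per iteration, which plays the role of the global barrier. The asynchronous variant is only an emulation under stated assumptions. There is no barrier: in each tick a seeded random subset of subdomains updates, and each one reads whatever neighbour values currently sit in a shared array standing in for memory exposed through MPI one-sided RMA. The function names, problem size, overlap and the 0.5 activation probability are illustrative choices. The global residual check is an oracle that a real distributed asynchronous solver would have to replace with its own convergence detection.

```python
import random


def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system (Thomas algorithm); used as the exact local solver."""
    n = len(rhs)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def make_subdomains(n, parts, overlap):
    """Split 0..n-1 into `parts` owned blocks, each extended by `overlap` points."""
    subs = []
    for p in range(parts):
        own_lo, own_hi = p * n // parts, (p + 1) * n // parts
        subs.append((own_lo, own_hi, max(0, own_lo - overlap), min(n, own_hi + overlap)))
    return subs


def restricted_update(x, f, sub, source):
    """One RAS step for one subdomain: solve the local block of A = tridiag(-1, 2, -1)
    with Dirichlet data taken from `source`, then write back only the owned entries."""
    own_lo, own_hi, ext_lo, ext_hi = sub
    m = ext_hi - ext_lo
    rhs = f[ext_lo:ext_hi]  # list slice, so f itself is not modified
    if ext_lo > 0:
        rhs[0] += source[ext_lo - 1]
    if ext_hi < len(source):
        rhs[-1] += source[ext_hi]
    u = thomas([-1.0] * m, [2.0] * m, [-1.0] * m, rhs)
    for k in range(own_lo, own_hi):  # "restricted": overlap values are discarded
        x[k] = u[k - ext_lo]


def residual(x, f):
    """Max-norm of f - A x."""
    n = len(x)
    worst = 0.0
    for k in range(n):
        ax = 2.0 * x[k] - (x[k - 1] if k > 0 else 0.0) - (x[k + 1] if k < n - 1 else 0.0)
        worst = max(worst, abs(f[k] - ax))
    return worst


def sync_ras(f, subs, tol=1e-10, max_iter=10000):
    """Bulk-synchronous RAS: all subdomains read the same snapshot each iteration."""
    x = [0.0] * len(f)
    for it in range(1, max_iter + 1):
        snapshot = list(x)  # plays the role of the global barrier + halo exchange
        for sub in subs:
            restricted_update(x, f, sub, snapshot)
        if residual(x, f) < tol:
            return x, it
    return x, max_iter


def async_ras(f, subs, tol=1e-10, max_ticks=100000, seed=0, p_active=0.5):
    """Emulated asynchronous RAS (assumption, not real MPI): no barrier. Each tick,
    a random subset of subdomains reads whatever neighbour values are currently
    in `window` (a stand-in for RMA-exposed memory) and overwrites its owned entries."""
    rng = random.Random(seed)
    window = [0.0] * len(f)
    for tick in range(1, max_ticks + 1):
        active = [s for s in subs if rng.random() < p_active]
        rng.shuffle(active)
        for sub in active:
            restricted_update(window, f, sub, window)
        # Oracle check for this sketch only; a distributed solver needs its
        # own convergence detection.
        if residual(window, f) < tol:
            return window, tick
    return window, max_ticks


if __name__ == "__main__":
    n = 64
    h = 1.0 / (n + 1)
    f = [h * h] * n  # -u'' = 1 on (0, 1), u(0) = u(1) = 0, scaled by h^2
    subs = make_subdomains(n, parts=4, overlap=2)
    _, iters = sync_ras(f, subs)
    _, ticks = async_ras(f, subs)
    print("synchronous iterations:", iters, "| asynchronous ticks:", ticks)
```

Both variants should converge to the same discrete solution here. For this right-hand side, that solution is u(t) = t(1 - t)/2 sampled at the grid points. A tick is not comparable in cost to a synchronous iteration. The emulation therefore cannot reproduce the runtime benefit the abstract reports, because that benefit comes from avoiding synchronization costs on real hardware.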

Updated: 2020-05-06
