Taming next‐generation HPC systems: Run‐time system and algorithmic advancements
Concurrency and Computation: Practice and Experience (IF 1.5). Pub Date: 2020-12-25. DOI: 10.1002/cpe.6153
Roman Wyrzykowski, Boleslaw K. Szymanski

This special issue of Concurrency and Computation: Practice and Experience contains revised and extended versions of selected papers presented at the 13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019, which was held on September 8–11, 2019, in Bialystok, Poland. PPAM 2019 was organized by the Department of Computer and Information Science of the Czestochowa University of Technology together with the Bialystok University of Technology, under the patronage of the Committee of Informatics of the Polish Academy of Sciences, in technical cooperation with the IEEE Computer Society and the IEEE Computational Intelligence Society.

PPAM is a biennial series of international conferences dedicated to exchanging ideas between researchers involved in parallel and distributed computing, including theory and applications, as well as applied and computational mathematics. Twelve previous events have been held at different universities in Poland since 1994, when the first PPAM took place in Czestochowa. Thus, the event in Bialystok was an opportunity to celebrate the 25th anniversary of PPAM. The focus of PPAM 2019 was on models, algorithms, and software tools that facilitate efficient and convenient use of modern parallel and distributed computing systems, as well as on large‐scale modern applications, including advances in machine learning and artificial intelligence.

This meeting gathered more than 170 participants from 26 countries. The accepted papers were presented in the regular tracks of the PPAM 2019 conference and during the workshops. With each submission evaluated by at least three reviewers, the strict review process resulted in the acceptance of 91 contributed papers for publication in the conference proceedings, while approximately 43% of the submissions were rejected. The Program Committee selected 41 papers for presentation in the regular conference track, for an acceptance rate of about 46%.

Based on the review results, 10 papers (11% of submissions) were selected for this special journal issue. Besides quality, another important selection criterion was each paper's contribution to the thematic consistency of the issue. The focus of this special issue is on algorithmic advancements in matching software properties to parallel architectures, including GPU accelerators and clusters. These advancements are crucial for successfully parallelizing such complex applications as simulation of geophysical flows, solution of ordinary differential equations (ODEs), structural analysis of nuclear reactor containment buildings, solution of generalized eigenvalue problems, and modeling of material science phenomena, among others. A complementary topic of this issue is advances in run‐time systems, since increasing levels of parallelism in multi‐ and many‐core chips and the emerging heterogeneity of computational resources, coupled with energy, resilience, and data movement constraints, radically increase the importance of efficient run‐time scheduling and execution control. After the conference, the Program Committee invited the authors of selected papers to submit revised and extended versions of their works. These new versions were again reviewed independently by at least three reviewers. Finally, nine contributions were accepted for publication. They are summarized below.

Paper [1] focuses on the accurate assembly of the system matrix, an essential step in any code that solves partial differential equations on a mesh. This step can become costly in multigrid codes, which require cascades of matrices that depend upon each other, or under dynamic adaptive mesh refinement. To reduce the time to solution, the authors propose performing these constructions concurrently with the multigrid cycles. Furthermore, they desynchronize the assembly from the solution process. This non‐trivial increase in the concurrency level improves scalability. As assembly routines are notoriously memory‐ and bandwidth‐demanding, the final algorithmic enhancement uses a hierarchical, lossy compression scheme that aggressively reduces the memory footprint when the system matrix entries carry little information or are not yet available at high accuracy.
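
To make the compression idea concrete, the following is a minimal sketch, assuming for illustration a fixed 27-point stencil, a hypothetical reference stencil, and a user-chosen tolerance; it is not the authors' actual scheme. A stencil that deviates little from the reference is not stored at all, and one that must be kept is stored in reduced precision.

    /* Illustrative threshold-based lossy compression of assembled stencils.
       STENCIL, the reference stencil, and the tolerance are assumptions. */
    #include <math.h>

    #define STENCIL 27                /* 3x3x3 stencil entries per mesh cell */

    typedef struct {
        int   stored;                 /* 1 if the full stencil is kept */
        float entries[STENCIL];       /* single instead of double precision */
    } compressed_stencil;

    void compress(const double *s, const double *ref, double tol,
                  compressed_stencil *out)
    {
        double dev = 0.0;
        for (int i = 0; i < STENCIL; i++)
            dev = fmax(dev, fabs(s[i] - ref[i]));
        out->stored = dev > tol;      /* close to the reference: store nothing */
        if (out->stored)
            for (int i = 0; i < STENCIL; i++)
                out->entries[i] = (float)s[i];   /* lossy: drop low-order bits */
    }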

An efficient algorithm for the parallel solution of indefinite saddle point systems with iterative solvers based on the Golub–Kahan bidiagonalization is presented in Reference [2]. Such systems arise in many application fields, for example, in structural mechanics. A scalability study of the generalized solver shows improved performance for the two‐dimensional (2D) Stokes equations compared to previous works. Furthermore, the authors investigate the performance of different parallel inner solvers in the outer Golub–Kahan iteration for a three‐dimensional (3D) Stokes problem. When the number of cores increases for a fixed problem size, the solver exhibits good speedups of up to 50% on 1024 cores. For the tests in which the problem size grows while the workload on each core stays constant, the performance of the solver scales almost linearly with the number of cores.
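
For reference, the classical Golub–Kahan bidiagonalization that underlies such solvers generates, for a matrix A and starting vector b, two sequences of vectors via the coupled recurrence below; the generalized variant used in the paper replaces the Euclidean inner products with inner products induced by the blocks of the saddle point system.

    \begin{aligned}
    \beta_1 u_1 &= b, \qquad \alpha_1 v_1 = A^{\mathsf{T}} u_1, \\
    \beta_{k+1} u_{k+1} &= A v_k - \alpha_k u_k, \\
    \alpha_{k+1} v_{k+1} &= A^{\mathsf{T}} u_{k+1} - \beta_{k+1} v_k,
    \end{aligned}
    \qquad \alpha_k, \beta_k \ge 0
    \text{ chosen so that } \|u_k\| = \|v_k\| = 1.

Each outer iteration thus requires one product with A and one with its transpose, plus the inner solves whose parallel performance the paper studies.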

Paper [3] proposes a locality optimization technique for the parallel solution on GPUs of large systems of ODEs by explicit one‐step methods. This technique is based on tiling across the stages of a one‐step method and is enabled by the special structure of a class of ODE systems with limited access distance. The paper focuses on increasing the range of access distances for which the tiling technique can provide a speedup by joining the memory resources and computational power of multiple workgroups for the computations in one tile. Consequently, a much wider range of applications can benefit from tiling across the stages, in particular stencils on 2D and 3D grids. The experiments show the novel technique's speedup over traditional single‐workgroup tiling for two different Runge–Kutta methods on NVIDIA Kepler and Volta architectures.
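
A minimal sequential sketch of the underlying idea of tiling across stages is shown below for Heun's method (a two-stage Runge–Kutta scheme) and a 1D right-hand side with access distance D; the function and array names are illustrative, and the paper's actual contribution, the cooperation of several workgroups on one tile, is not captured by this single-threaded version.

    /* One step of Heun's method for y' = f(y), tiled across both stages.
       f(y, i) is assumed to read only y[i-D..i+D] and to handle domain
       boundaries itself; k1 and yt are scratch arrays. */
    #define D 1                          /* access distance of the RHS */

    double f(const double *y, int i);    /* assumed right-hand side */

    void step_tiled(const double *y, double *yn, double *k1, double *yt,
                    int n, double h, int T)
    {
        for (int t0 = 0; t0 < n; t0 += T) {            /* one tile at a time */
            int lo = t0 - D > 0 ? t0 - D : 0;          /* halo for stage 2 */
            int hi = t0 + T + D < n ? t0 + T + D : n;
            for (int i = lo; i < hi; i++) {            /* stage 1, extended tile */
                k1[i] = f(y, i);
                yt[i] = y[i] + h * k1[i];
            }
            for (int i = t0; i < t0 + T && i < n; i++) /* stage 2, tile core */
                yn[i] = y[i] + 0.5 * h * (k1[i] + f(yt, i));
        }
    }

Because stage 2 of a tile reads only the halo-extended stage-1 results computed by the same tile, both stages can stay in fast local memory; the redundant halo computations are the price paid for this locality.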

The StarNEig library for solving dense nonsymmetric standard and generalized eigenvalue problems is presented in paper [4]. This task‐based library is built on top of the StarPU runtime system and targets both shared and distributed memory machines. Some components of the library support GPU acceleration. The library currently applies to real matrices with real and complex eigenvalues, and all calculations are performed in real arithmetic; support for complex matrices is planned for a future release. The library's design choices and capabilities are described and compared to existing software such as LAPACK and ScaLAPACK. StarNEig implements a ScaLAPACK compatibility layer, which should assist new users in the transition to StarNEig. The authors demonstrate the performance of the library with a sample of computational experiments.

Paper [5] is part of the global trend of using high performance computing (HPC) systems for modeling phase‐field phenomena. This work aims to improve the performance of a parallel application for solidification modeling in which the intensity of computations changes dynamically across successive time steps, as calculations are performed only in selected nodes of the grid. A two‐step method is proposed to optimize the application for multi‐ and many‐core architectures. In the first step, loop fusion is used to reduce the number of conditional operators, while the second step introduces an algorithm for dynamic workload prediction and load balancing across cores. Two versions of the algorithm are proposed, one with a one‐dimensional map and the other with a 2D map for predicting the computation domain within the grid. These optimizations increase the application performance significantly for all tested configurations of computing resources, including one and two Intel Xeon Platinum 8180 CPUs and a single KNL accelerator.
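
The first optimization step can be illustrated with a toy sketch in which the node structure and updates are hypothetical: two sweeps over the grid, each guarded by the same activity test, are fused so that the conditional is evaluated once per node.

    /* Illustrative loop fusion: one branch per node instead of two. */
    typedef struct { int active; double phi, u; } node;   /* hypothetical */

    void sweeps_unfused(node *g, int n) {
        for (int i = 0; i < n; i++) if (g[i].active) g[i].phi += 0.1;
        for (int i = 0; i < n; i++) if (g[i].active) g[i].u   += 0.2;
    }

    void sweep_fused(node *g, int n) {
        for (int i = 0; i < n; i++)
            if (g[i].active) {            /* condition tested only once */
                g[i].phi += 0.1;
                g[i].u   += 0.2;
            }
    }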

Paper [6] presents how advanced parallel graph partitioning algorithms can be designed using global application states monitoring of distributed programs. The proposed algorithms are implemented within PEGASUS DA, a novel distributed program design framework. Two strategies for designing the control of parallel/distributed partitioning algorithms are investigated. In the first, the parallel algorithm control runs on top of the popular METIS graph partitioning tool. The second strategy is based on genetic programming. The framework allows easy design and testing of different graph optimization strategies. Experiments with benchmark graphs illustrate the presented partitioning methods. These experiments comparatively assess graph partitioning quality, for which a clear improvement is observed, and identify the benefits of the proposed approach for programmers.
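
For readers unfamiliar with the METIS layer on which the first strategy builds, a minimal sketch of a k-way partitioning call (METIS 5 API, a 4-cycle graph in CSR form, error handling omitted) looks as follows; how the paper's control algorithms drive such calls is, of course, their contribution.

    #include <metis.h>
    #include <stdio.h>

    int main(void)
    {
        /* 4-cycle 0-1-2-3-0 in CSR form, to be split into 2 parts */
        idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
        idx_t xadj[]   = {0, 2, 4, 6, 8};
        idx_t adjncy[] = {1, 3, 0, 2, 1, 3, 0, 2};
        idx_t part[4];

        METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                            NULL, NULL, NULL,   /* vwgt, vsize, adjwgt */
                            &nparts, NULL, NULL, NULL,
                            &objval, part);
        for (int i = 0; i < 4; i++)
            printf("vertex %d -> part %d\n", i, (int)part[i]);
        return 0;
    }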

Profiling and tuning of parallel applications is an essential part of high performance computing. Analysis and elimination of application hotspots can be performed with many available tools, which also provide resource consumption measurements for instrumented parts of the code. Since each tool can bring different insights into an application's behavior, it is valuable to analyze and optimize an application using a variety of them. Paper [7] presents a C/C++ API that simplifies manual instrumentation for the most common open‐source HPC performance analysis tools. At the same time, profiling libraries provide different instrumentation methods, with binary patching being a universal mechanism that greatly improves a tool's user‐friendliness. The authors analyze the most commonly used binary patching tools and present a workflow for using them to implement binary instrumentation for any profiler or autotuner.
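
The flavor of such an instrumentation API can be sketched as follows; the macro names are hypothetical, and the simple wall-clock backend stands in for the real profiler calls to which a unified API would map the same macros.

    /* Hypothetical unified instrumentation macros with a timer backend. */
    #include <stdio.h>
    #include <time.h>

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    #define REGION_BEGIN(name) double _t_##name = now()
    #define REGION_END(name)   printf(#name ": %.6f s\n", now() - _t_##name)

    int main(void) {
        REGION_BEGIN(compute);
        volatile double s = 0;                     /* instrumented hotspot */
        for (int i = 0; i < 1000000; i++) s += i * 0.5;
        REGION_END(compute);
        return 0;
    }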

Paper [8] introduces DiPOSH, a multi‐network, distributed implementation of the OpenSHMEM standard. The core idea behind DiPOSH is to keep the API‐to‐network software stack as slim as possible to minimize software overhead. Following the heritage of its non‐distributed parent POSH, DiPOSH's communication engine is organized around the processes' shared heaps, and remote communications move data directly from and to these shared heaps. The paper presents DiPOSH's architecture and several communication drivers, including one that takes advantage of a helper process for inter‐process communications. This architecture allows exploring different options for implementing the communication drivers, from high‐level, portable, optimized libraries to low‐level, close‐to‐the‐hardware communication routines. The authors present the perspectives opened by this additional component in terms of communication scheduling between and within nodes.
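
A minimal OpenSHMEM program illustrates the model that DiPOSH implements; this sketch uses only standard OpenSHMEM calls, not anything DiPOSH-specific. Every processing element (PE) writes a value directly into a neighbor's symmetric heap with a one-sided put.

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        static long slot = -1;        /* symmetric: exists on every PE */
        shmem_long_p(&slot, (long)me, (me + 1) % npes);   /* one-sided put */
        shmem_barrier_all();

        printf("PE %d received %ld\n", me, slot);
        shmem_finalize();
        return 0;
    }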

Multicore NUMA systems feature memory hierarchies and communication networks that influence the performance of parallel codes. Understanding the effect of particular hardware configurations on various codes is therefore of paramount importance. In paper [9], monitoring information from hardware counters at runtime is used to characterize the behavior of each thread of multithreaded processes running in a NUMA environment. The authors propose a runtime tool, executed in user space, that uses this information to guide two different thread migration strategies for improving execution efficiency by increasing locality and affinity, without requiring any modification of the codes. The benefits of this tool are validated on the OpenMP NAS Parallel Benchmarks under various locality and affinity scenarios. In more than 95% of them, the novel tool outperforms the scheduling of the operating system (OS), producing up to 38% improvement in execution time over the OS.
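
The migration primitive on which such a tool can build is the standard Linux affinity interface, sketched below; the actual decision logic, choosing a target core from hardware-counter data, is the paper's contribution and is not reproduced here.

    /* Pin the calling thread to one CPU; a migration strategy would call
       this with a core chosen from hardware-counter measurements. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int migrate_self(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this thread */
    }

    int main(void)
    {
        if (migrate_self(1) == 0)
            printf("thread now restricted to CPU 1\n");
        return 0;
    }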


