Graph-based multi-core higher-order time integration of linear autonomous partial differential equations
Journal of Computational Science (IF 3.1), Pub Date: 2021-04-01, DOI: 10.1016/j.jocs.2021.101349
Dominik Huber, Martin Schreiber, Martin Schulz

Modern high-performance computing (HPC) systems rely on increasingly complex nodes with a steadily growing number of cores and correspondingly deep memory hierarchies. To exploit these features fully, algorithms must be designed for them explicitly. In this work, we address this challenge for a widely used class of application kernels: polynomial-based time integration of linear autonomous partial differential equations.
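To make the kernel class concrete: for a linear autonomous system du/dt = L u, one common polynomial-based integrator truncates the Taylor series of the matrix exponential, u(t + dt) ≈ sum_{k=0}^{p} (dt L)^k / k! u(t). The sketch below is only an illustration under that assumption; the abstract does not state which polynomial or which spatial discretization the authors use. The function names, the periodic central-difference stencil, and the parameters `c`, `dx` and `dt` are all hypothetical.

```python
import math


def apply_L(u, c, dx):
    """Central-difference approximation of -c * du/dx on a periodic grid (assumed stencil)."""
    n = len(u)
    # u[i - 1] wraps to the last element for i == 0, which gives periodicity.
    return [-c * (u[(i + 1) % n] - u[i - 1]) / (2.0 * dx) for i in range(n)]


def taylor_step(u, dt, order, c, dx):
    """Advance u by one step with sum_{k=0}^{order} (dt*L)^k / k! applied to u."""
    result = list(u)
    term = list(u)
    for k in range(1, order + 1):
        # Recurrence: term_k = (dt / k) * L * term_{k-1}
        term = [dt / k * v for v in apply_L(term, c, dx)]
        result = [r + t for r, t in zip(result, term)]
    return result


def advect_sine(n, steps, dt, order, c=1.0):
    """Advect sin(x) on [0, 2*pi) and return the max error against sin(x - c*t)."""
    dx = 2.0 * math.pi / n
    u = [math.sin(i * dx) for i in range(n)]
    for _ in range(steps):
        u = taylor_step(u, dt, order, c, dx)
    t = steps * dt
    return max(abs(u[i] - math.sin(i * dx - c * t)) for i in range(n))
```

Each Taylor term is obtained from the previous one by a single stencil sweep over the whole field, so a straightforward implementation streams the full array through memory once per term. That repeated traffic is what a cache-aware formulation tries to reduce.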

We build on prior work [1] on a cache-aware, yet sequential, solution and provide an innovative way to parallelize it while preserving cache-awareness across a large number of cores. To this end, we introduce a dependency-graph-driven view of the algorithm and then use both static graph partitioning and dynamic scheduling to map the execution efficiently onto the underlying platform. We implement our approach on top of the widely available Intel Threading Building Blocks (TBB) library, although the concepts are programming-model agnostic and apply to any task-driven parallel programming approach.
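The dependency-graph view can be sketched as follows. This is a minimal illustration and not the authors' TBB implementation: it assumes a Taylor-type polynomial step, a periodic 3-point stencil, and a decomposition of the grid into equal blocks, none of which the abstract specifies. Each task (k, j) computes polynomial term k on block j and depends on the neighbouring blocks of term k - 1. The names `build_task_graph` and `taylor_step_blocked` are hypothetical, and the tasks run sequentially in a topological order obtained from Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def build_task_graph(num_blocks, order):
    """Map each task (stage k, block j) to the set of tasks it depends on."""
    graph = {}
    for k in range(1, order + 1):
        for j in range(num_blocks):
            if k == 1:
                graph[(k, j)] = set()  # stage 1 reads only the input state
            else:
                # A periodic 3-point stencil reads the block itself and both neighbours.
                graph[(k, j)] = {(k - 1, (j + d) % num_blocks) for d in (-1, 0, 1)}
    return graph


def taylor_step_blocked(u, dt, order, c, dx, block_size):
    """One Taylor-type step where every (stage, block) pair is a separate task."""
    n = len(u)
    num_blocks = n // block_size  # assumes block_size divides n evenly
    # terms[k] holds (dt*L)^k u / k!; terms[0] is the input state.
    terms = [list(u)] + [[0.0] * n for _ in range(order)]
    graph = build_task_graph(num_blocks, order)
    # Any topological order respects the dependencies; a parallel runtime
    # could execute all currently ready tasks concurrently instead.
    for k, j in TopologicalSorter(graph).static_order():
        prev = terms[k - 1]
        for i in range(j * block_size, (j + 1) * block_size):
            terms[k][i] = dt / k * (-c) * (prev[(i + 1) % n] - prev[i - 1]) / (2.0 * dx)
    return [sum(t[i] for t in terms) for i in range(n)]
```

Once the dependencies are explicit, the execution order becomes a free choice. A scheduler may, for example, run several stages of neighbouring blocks back to back while their data is still in cache, or assign groups of tasks to cores through a graph partition. The plain topological order used here makes no such attempt.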

We demonstrate the performance of our approach for 2nd-, 4th- and 6th-order time integration of the linear advection equation on three different architectures with widely varying memory systems and achieve up to a 60% reduction in wall-clock time compared to a conventional, state-of-the-art, non-cache-aware approach.




Updated: 2021-06-08