Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory
arXiv - CS - Mathematical Software, Pub Date: 2020-06-30, DOI: arxiv-2006.16465
Mohammad Shafaet Islam, Qiqi Wang

High-fidelity scientific simulations of physical phenomena typically require solving large linear systems of equations that arise from discretizing a partial differential equation (PDE) with some numerical method. This step often takes a vast amount of computational time and is therefore a bottleneck in simulation work. Solving these linear systems efficiently requires massively parallel hardware with high computational throughput, as well as algorithms that respect the memory hierarchy of that hardware in order to achieve high memory bandwidth. In this paper, we present an algorithm that accelerates Jacobi iteration for structured problems on graphics processing units (GPUs) using a hierarchical approach in which multiple iterations are performed within on-chip shared memory every cycle. We adopt a domain-decomposition-style procedure: the problem domain is partitioned into subdomains, and each subdomain's data is copied to the shared memory of a GPU block. Jacobi iterations are then performed within each block's shared memory, which avoids expensive global memory accesses at every step. We test the algorithm on the linear systems arising from discretization of Poisson's equation in 1D and 2D and compare it against a traditional GPU Jacobi implementation that uses only global memory. Relative to that conventional approach, the shared memory approach gives an 8x speedup in convergence on the 1D problem and a nearly 6x speedup on the 2D problem.
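
The abstract does not give implementation details such as whether subdomains overlap or how boundary values are exchanged between blocks, so the following is only a minimal CPU sketch of the general idea, written in plain Python rather than CUDA. It assumes the 1D Poisson problem -u'' = f with fixed end values, non-overlapping subdomains, and a one-point halo on each side that is held fixed during the inner iterations. The function names `jacobi_sweep` and `hierarchical_cycle` are hypothetical, and the local list merely stands in for a GPU block's shared memory.

```python
def jacobi_sweep(u, rhs, h2):
    """One Jacobi sweep for -u'' = f on a uniform 1D grid.

    u[0] and u[-1] are treated as fixed values (domain boundary or halo).
    h2 is the squared grid spacing.
    """
    new = u[:]  # copy, so every update reads only old values
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1] + h2 * rhs[i])
    return new


def hierarchical_cycle(u, rhs, h2, block_size, inner_iters):
    """One cycle: each subdomain runs inner_iters Jacobi sweeps locally.

    Assumption (not from the paper): subdomains do not overlap, and each
    one sees a single halo point per side, frozen for the whole cycle.
    """
    n = len(u)
    result = u[:]
    for start in range(1, n - 1, block_size):
        stop = min(start + block_size, n - 1)
        # "Load" the subdomain and its halo into a local buffer,
        # standing in for the copy from global to shared memory.
        local = u[start - 1:stop + 1]
        local_rhs = rhs[start - 1:stop + 1]
        for _ in range(inner_iters):
            local = jacobi_sweep(local, local_rhs, h2)
        # "Store" only the interior points back to the global array.
        result[start:stop] = local[1:-1]
    return result
```

With `inner_iters=1` a cycle reduces to an ordinary global Jacobi sweep. Larger values perform more work per load and store of the subdomain data, which is the trade-off the abstract describes. On a GPU the loop over subdomains would instead run as concurrent blocks.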

Updated: 2020-07-01