Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture
arXiv - CS - Hardware Architecture Pub Date : 2020-11-06 , DOI: arxiv-2011.05160
Jesmin Jahan Tithi, Fabrizio Petrini, Hongbo Rong, Andrei Valentin, Carl Ebeling

Stencils are a class of computational patterns in which each output grid point depends on a fixed-shape neighborhood of points in an input grid. Stencil computations are prevalent in scientific applications that consume a significant portion of supercomputing resources, so optimizing stencil programs for performance has always been important. A rich body of research has focused on optimizing stencil computations on almost all parallel architectures. Stencil applications have regular dependency patterns, inherent pipeline parallelism, and plenty of data reuse. This makes them a perfect match for a coarse-grained reconfigurable spatial architecture (CGRA). A CGRA consists of many simple, small processing elements (PEs) connected by an on-chip network. Each PE can be configured to execute part of a stencil computation, and all PEs run in parallel; the network can also be configured so that loaded data is passed directly from one PE to a neighboring PE and thus reused by many PEs without register spilling or memory traffic. Efficiently mapping a stencil computation to a CGRA is the key to performance. In this paper, we show a few unique and generalizable ways of mapping one-dimensional and multidimensional stencil computations to a CGRA that fully exploit the available data reuse and parallelism. Our simulation experiments demonstrate that these mappings are efficient and enable the CGRA to outperform state-of-the-art GPUs.
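The data-reuse idea in the abstract can be illustrated with a small Python sketch. This is not the paper's mapping: the function names, the 3-point weights, and the use of a `deque` as a stand-in for a chain of PE-to-PE links are all assumptions made for illustration. The first function re-reads every neighbor from the input for each output; the second loads each input element once and forwards it through a short register chain, which is roughly what neighbor-to-neighbor forwarding on a CGRA aims to achieve.

```python
from collections import deque


def stencil_1d_reference(x, weights):
    """Direct evaluation: every output re-reads all of its neighbors from x."""
    k = len(weights)
    return [
        sum(w * x[i + j] for j, w in enumerate(weights))
        for i in range(len(x) - k + 1)
    ]


def stencil_1d_streaming(x, weights):
    """Streaming evaluation: each input element is loaded exactly once and then
    shifted through a chain of k registers (a hypothetical stand-in for
    PE-to-PE forwarding), so no element is fetched from memory twice."""
    k = len(weights)
    window = deque(maxlen=k)  # register chain; the oldest value drops off
    outputs = []
    loads = 0
    for value in x:
        loads += 1            # one memory load per input element
        window.append(value)
        if len(window) == k:  # chain is full: one output per new input
            outputs.append(sum(w * v for w, v in zip(weights, window)))
    return outputs, loads


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    weights = [1, 1, 1]       # illustrative 3-point stencil
    print(stencil_1d_reference(data, weights))
    print(stencil_1d_streaming(data, weights))
```

In this sketch the direct version performs `k` reads per output, while the streaming version performs one load per input element regardless of the stencil width; both produce the same outputs.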

Updated: 2020-11-11