PFACC: An OpenACC-like programming model for irregular nested parallelism, Software: Practice and Experience



Software: Practice and Experience (IF 3.5), Pub Date: 2020-07-09, DOI: 10.1002/spe.2868
Ming Hsiang Huang, Wuu Yang

OpenACC is a directive-based programming model that allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are a natural way to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or placed inside functions that are called (possibly recursively) from other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Within a thread block, loop iterations are dynamically organized into batches, and the batches are executed in depth-first order. Across thread blocks, iterations are shared through an iteration-stealing mechanism. Because of the depth-first execution order, PFACC generates CUDA programs with reasonable memory usage. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.
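To make "irregular nested parallel loops" concrete, the following is a minimal sketch in C of the kind of loop nest the abstract refers to. The abstract does not show PFACC's directive syntax, so the annotations below are standard OpenACC directives, used only as a stand-in. The function name `csr_row_sums` and the compressed sparse row (CSR) layout are assumptions chosen for illustration and do not come from the paper.

```c
#include <stddef.h>

/* Hypothetical example (not from the paper): sum each row of a sparse
 * matrix stored in CSR form. row_ptr has n_rows + 1 entries, and row i
 * owns the values vals[row_ptr[i]] .. vals[row_ptr[i + 1] - 1]. */
void csr_row_sums(size_t n_rows, const size_t *row_ptr,
                  const double *vals, double *out)
{
    /* Outer parallel loop: one iteration per row. The data clauses
     * describe which arrays move to and from device memory. */
    #pragma acc parallel loop copyin(row_ptr[0:n_rows+1], vals[0:row_ptr[n_rows]]) copyout(out[0:n_rows])
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;

        /* Inner parallel loop: its trip count depends on i, so some rows
         * have many iterations and others have none. This imbalance is
         * what makes the nest "irregular". */
        #pragma acc loop reduction(+:sum)
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += vals[k];
        }
        out[i] = sum;
    }
}
```

When compiled without OpenACC support, the pragmas are ignored and the function runs sequentially with the same result. With a static mapping of rows to threads, a few long rows can dominate the running time, which is the load-balancing problem that the runtime iteration-sharing described in the abstract is meant to address.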


Updated: 2020-07-09
