Massively Parallel Rule-Based Interpreter Execution on GPUs Using Thread Compaction
International Journal of Parallel Programming (IF 1.5), Pub Date: 2020-06-24, DOI: 10.1007/s10766-020-00670-2
M. Köster, J. Groß, A. Krüger

Interpreters are well researched in the fields of compiler construction and program generation. They are typically used to execute programs of different programming languages without a compilation step. However, they can also be used to model complex rule-based simulations: the interpreter applies all rules one after another, and these rules can be applied iteratively to a globally updated state to obtain the final simulation result. Many simulations for domain-specific problems already leverage the parallel processing capabilities of Graphics Processing Units (GPUs). They rely on hardware-specific, tuned rule implementations to achieve maximum performance. However, every interpreter-based system requires a high-level algorithm that detects active rules and determines when they are evaluated. A common approach in this context is the use of different interpreter routines for every problem domain. Executing such functions efficiently mainly involves dealing with hardware peculiarities like thread divergence, ALU computations, and memory operations. Furthermore, the interpreter is nowadays often executed on multiple states in parallel, which is particularly important for heuristic search or what-if analyses, for instance. In this paper, we present a novel and easy-to-implement method based on thread compaction to realize generic rule-based interpreters efficiently on GPUs. It is optimized for many states using a specially designed memory layout. Benchmarks on our evaluation scenarios show that performance can be increased significantly in comparison to existing, commonly used implementations.
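To make the idea of thread compaction concrete, the following is a minimal CUDA sketch, not the authors' implementation: a first pass scans all (state, rule) pairs and compacts the active ones into a dense work list, so that a second pass launches only threads that actually evaluate a rule. All names (State, rule_is_active, apply_rule, compact_active, apply_rules) are illustrative assumptions, and the atomic-counter compaction is a simplification of what a prefix-sum-based scheme would do.

```cuda
// Minimal sketch of thread compaction for a rule-based interpreter on GPUs.
// All types and function names below are illustrative placeholders,
// not the paper's API.
#include <cuda_runtime.h>

struct State { float value; };

__device__ bool rule_is_active(const State& s, int rule) {
    // Placeholder activity test: the rule fires while the state value is small.
    return s.value < (float)(rule + 1) * 10.0f;
}

__device__ void apply_rule(State& s, int rule) {
    s.value += 1.0f;  // Placeholder rule effect.
}

// Pass 1: scan all (state, rule) pairs and compact the active ones into a
// dense work list. An atomic counter is used here for brevity; a parallel
// prefix sum would avoid the atomic and keep the work list ordered.
__global__ void compact_active(const State* states, int num_states,
                               int num_rules, int* work_list, int* work_count) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = num_states * num_rules;
    if (idx >= total) return;
    int state_id = idx / num_rules;
    int rule_id  = idx % num_rules;
    if (rule_is_active(states[state_id], rule_id)) {
        int slot = atomicAdd(work_count, 1);
        work_list[slot] = idx;  // Encode (state, rule) as a flat index.
    }
}

// Pass 2: only the compacted entries are processed, so every launched thread
// performs useful work and warps do not diverge on inactive rules.
__global__ void apply_rules(State* states, int num_rules,
                            const int* work_list, int work_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= work_count) return;
    int idx = work_list[i];
    apply_rule(states[idx / num_rules], idx % num_rules);
}
```

Host code would launch compact_active over all state-rule pairs, read back (or keep on-device) the counter, and then launch apply_rules with exactly work_count threads; iterating the two passes until no rule remains active corresponds to the iterative application of rules on a globally updated state described in the abstract.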

Updated: 2020-06-24