Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines.
Parallel Computing (IF 2.0), Pub Date: 2013-04-01, DOI: 10.1016/j.parco.2013.03.001
George Teodoro, Tony Pan, Tahsin Kurc, Jun Kong, Lee Cooper, Joel Saltz

We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU-accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The cooperative use of multiple CPUs and GPUs attains speedups of 50× and 85× with respect to single-core CPU executions for morphological reconstruction and Euclidean distance transform, respectively.
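The wavefront pattern described in the abstract can be illustrated with a minimal sequential sketch. It is not the authors' implementation: it uses a single FIFO queue on one CPU core rather than the multi-level GPU queue or the tile-based scheme, and it assumes grayscale morphological reconstruction by dilation with 4-connectivity. The function name `morph_reconstruct` and the choice to seed the queue with every pixel are assumptions made for brevity.

```python
from collections import deque


def morph_reconstruct(marker, mask):
    """Illustrative queue-based wavefront propagation (hypothetical helper).

    Computes grayscale morphological reconstruction by dilation of `marker`
    under `mask` on a 2D list-of-lists grid, using 4-connected neighbors.
    """
    rows, cols = len(mask), len(mask[0])
    # Clamp the marker so that it never exceeds the mask.
    out = [[min(marker[r][c], mask[r][c]) for c in range(cols)]
           for r in range(rows)]
    # Assumption: seed the wavefront with every pixel. A tuned version would
    # seed only pixels that can actually propagate to a neighbor.
    wavefront = deque((r, c) for r in range(rows) for c in range(cols))
    while wavefront:
        r, c = wavefront.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Propagation condition: the neighbor can be raised by the
                # current element without exceeding its own mask value.
                candidate = min(out[r][c], mask[nr][nc])
                if out[nr][nc] < candidate:
                    out[nr][nc] = candidate
                    # The updated neighbor joins the wavefront.
                    wavefront.append((nr, nc))
    return out


mask = [[3, 3, 1, 4, 4]]
marker = [[3, 0, 0, 0, 0]]
result = morph_reconstruct(marker, mask)
```

Which elements enter the queue, and in what order, depends on the image content. That data dependence is the source of the irregular accesses the abstract mentions.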

Updated: 2019-11-01