On-GPU thread-data remapping for nested branch divergence
Journal of Parallel and Distributed Computing ( IF 3.4 ) Pub Date : 2020-02-10 , DOI: 10.1016/j.jpdc.2020.02.003
Huanxin Lin , Cho-Li Wang

Nested branches are common in applications built on decision trees. The deeper the branch nest, the larger the slowdown caused by nested branch divergence on GPUs. Since inner branches are impractical to evaluate on the host side, thread-data remapping via GPU shared memory is so far the most suitable solution. However, existing solutions cannot handle inner branches directly, because the behavior of the GPU barrier function is undefined when it is executed inside branch statements; race conditions must therefore be prevented without using the barrier function. Targeting nested divergence, we propose NeX, a nested extension scheme featuring an inter-thread protocol that supports sub-workgroup synchronization. We further exploit the on-the-fly nature of the Head-or-Tail (HoT) algorithm and propose HoT2, which offers enhanced flexibility in wavefront scheduling. Evaluated on four GPU models, including NVIDIA Volta and Turing, HoT2 proves more efficient. For benchmarks with branch nests up to five layers deep, NeX further boosts performance by up to 1.56x.
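To make the problem concrete, the following is a minimal CUDA sketch of the abstract's setting. It is not the paper's code: all kernel and variable names are hypothetical, and the second kernel shows only the classic barrier-based remapping for an outer branch, with a comment marking the exact point where a barrier would be needed for the inner branch but is undefined — the gap NeX addresses with barrier-free sub-workgroup synchronization.

```cuda
#include <math.h>

// Hypothetical kernel with a two-layer branch nest. A warp whose lanes
// disagree at a branch serializes both paths, and each extra nesting
// layer compounds the serialization (nested branch divergence).
__global__ void nested_branch(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    if (x > 0.0f) {                     // outer branch: possible divergence
        if (x > 10.0f)                  // inner branch: divergence compounds
            out[i] = sqrtf(x);
        else
            out[i] = 2.0f * x;
    } else {
        out[i] = 0.0f;
    }
}

// Classic thread-data remapping for the OUTER branch, via shared memory:
// threads publish their branch outcome, synchronize, then each thread
// processes a remapped element so that consecutive threads (one
// warp/wavefront) take the same path. Assumes blockDim.x <= 256.
__global__ void remapped_outer(const float *in, float *out, int n) {
    __shared__ int remap[256];          // one slot per thread in the block
    __shared__ int head, tail;
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    if (t == 0) { head = 0; tail = blockDim.x; }
    __syncthreads();                    // uniformly executed: well defined
    // Pack "branch taken" indices from the front, others from the back.
    int slot = (i < n && in[i] > 0.0f) ? atomicAdd(&head, 1)
                                       : atomicSub(&tail, 1) - 1;
    remap[slot] = i;
    __syncthreads();                    // still uniform, still well defined
    int j = remap[t];                   // remapped element for this thread
    if (j < n) {
        float x = in[j];
        if (x > 0.0f) {
            // Remapping the INNER branch the same way would require another
            // __syncthreads() here -- inside a branch statement, where its
            // behavior is undefined. Preventing the race on shared memory
            // without a barrier is what NeX's inter-thread protocol does.
            out[j] = (x > 10.0f) ? sqrtf(x) : 2.0f * x;
        } else {
            out[j] = 0.0f;
        }
    }
}
```

Note that both barriers above are executed by every thread of the block, which is why the outer remapping is legal; the inner branch has no such uniform point, motivating the sub-workgroup synchronization proposed in the paper.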


