Don't forget about synchronization! Guidelines for using locks on graphics processing units
Concurrency and Computation: Practice and Experience (IF 2), Pub Date: 2020-04-13, DOI: 10.1002/cpe.5757
Jacob Nelson, dePaul Miller, Roberto Palmieri

Heterogeneous devices are becoming necessary components of high performance computing infrastructures, and the graphics processing unit (GPU) plays an important role in this landscape. The established approach for exploiting the GPU is to design solutions that are parallel and free of data dependencies, and then to offload them to the GPU's massively parallel hardware. This design principle often leads to applications that cannot maximize GPU hardware utilization. The goal of this article is to challenge this common belief by empirically showing that allowing even simple forms of synchronization enables programmers to design solutions that admit conflicts and achieve better performance. Our experience shows that lock-based solutions to the k-means clustering problem, implemented using two well-known locking strategies, outperform the well-engineered and parallel KMCUDA on both synthetic and real datasets. Averaged across all locking algorithms, runtimes are 8× faster on a synthetic dataset and 1.7× faster on a real-world dataset (with maximum speedups of 71.3× and 2.75×, respectively). We validate these results using a more sophisticated clustering algorithm, namely fuzzy c-means, and summarize our findings by identifying three guidelines to help make concurrency effective when programming GPU applications.
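
To make the idea of a lock-based k-means step concrete, the following is a minimal sketch of the centroid-accumulation phase in which threads are allowed to conflict on shared per-cluster sums and resolve those conflicts with one lock per cluster. It is a CPU-thread analog written in standard C++, not the authors' GPU implementation and not one of the two locking strategies evaluated in the paper. The names SpinLock, ClusterSums and accumulate are hypothetical, and the simple test-and-set spinlock is an assumption chosen for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical test-and-set spinlock (an assumption, not taken from the paper).
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Spin until the flag was previously clear, yielding while we wait.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};

// Per-cluster running sums and point counts for one k-means iteration.
struct ClusterSums {
    std::vector<double> sum;  // k * dim coordinate sums, row-major by cluster
    std::vector<int> count;   // number of points assigned to each cluster
};

// Accumulate points into their assigned clusters using one lock per cluster.
// points: n * dim coordinates (row-major); labels: cluster index of each point.
ClusterSums accumulate(const std::vector<double>& points,
                       const std::vector<int>& labels,
                       std::size_t dim, std::size_t k, unsigned num_threads) {
    assert(dim > 0 && num_threads > 0);
    assert(points.size() == labels.size() * dim);

    ClusterSums out{std::vector<double>(k * dim, 0.0), std::vector<int>(k, 0)};
    std::vector<SpinLock> locks(k);
    const std::size_t n = labels.size();

    // Each thread handles a strided subset of the points.
    auto worker = [&](unsigned tid) {
        for (std::size_t i = tid; i < n; i += num_threads) {
            const std::size_t c = static_cast<std::size_t>(labels[i]);
            locks[c].lock();  // threads updating the same cluster serialize here
            for (std::size_t d = 0; d < dim; ++d) {
                out.sum[c * dim + d] += points[i * dim + d];
            }
            ++out.count[c];
            locks[c].unlock();
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker, t);
    }
    for (auto& th : threads) {
        th.join();
    }
    return out;
}
```

Dividing each cluster's sum by its count would give the updated centroid. The point of the sketch is only that the update is written directly against shared state, with conflicts handled by locks, instead of being restructured to avoid all data dependencies.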

Updated: 2020-04-13