EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal In GPUs

arXiv - CS - Databases, Pub Date: 2020-06-12, DOI: arxiv-2006.06890
Seung Won Min, Vikram Sharma Mailthody, Zaid Qureshi, Jinjun Xiong, Eiman Ebrahimi, Wen-mei Hwu

Modern analytics and recommendation systems are increasingly based on graph data that capture the relations between the entities being analyzed. Practical graphs are huge, offer massive parallelism, and are stored in sparse-matrix formats such as compressed sparse row (CSR). To exploit this parallelism, developers are increasingly interested in using GPUs for graph traversal. However, because of their size, graphs often do not fit into GPU memory. Prior work has used either input data pre-processing/partitioning or unified virtual memory (UVM) to migrate chunks of data from host memory to GPU memory. However, the large, multi-dimensional, and sparse nature of graph data presents a major challenge to these schemes, resulting in significant amplification of data movement and reduced effective data throughput. In this work, we propose EMOGI, an alternative approach that traverses graphs that do not fit in GPU memory using direct cacheline-sized accesses to data stored in host memory. This paper addresses the open question of whether a sufficiently large number of overlapping cacheline-sized accesses can be sustained to 1) tolerate the long latency to host memory, 2) fully utilize the available bandwidth, and 3) achieve favorable execution performance. Using an FPGA, we analyze the data access patterns of several graph traversal applications running on a GPU over PCIe to understand the cause of poor external bandwidth utilization. By carefully coalescing and aligning external memory requests, we show that we can minimize the number of PCIe transactions and nearly fully utilize the PCIe bandwidth even with direct cacheline-sized accesses to host memory. EMOGI achieves a 2.92$\times$ speedup on average over optimized UVM implementations across various graph traversal applications. We also show that EMOGI scales better than a UVM-based solution when the system uses higher-bandwidth interconnects such as PCIe 4.0.
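To illustrate why aligning memory requests can reduce the number of transactions, the following is a toy model, not the paper's implementation. It counts how many cacheline-sized requests a group of threads issues while scanning one vertex's neighbor list in a CSR edge array. The 128-byte cacheline, the 32-thread warp, the 4-byte edge entries, and all function names are assumptions made for illustration; none of them comes from the abstract.

```python
# Toy model: count cacheline requests for a warp scanning a CSR neighbor list.
# All three constants below are illustrative assumptions.
CACHELINE_BYTES = 128  # assumed cacheline size
WARP_SIZE = 32         # assumed number of threads accessing memory together
ELEM_BYTES = 4         # assumed size of one edge-list entry


def lines_touched(first_elem, end_elem):
    """Distinct cachelines covered by one warp-wide access to elements [first_elem, end_elem)."""
    if first_elem >= end_elem:
        return 0
    first_line = (first_elem * ELEM_BYTES) // CACHELINE_BYTES
    last_line = (end_elem * ELEM_BYTES - 1) // CACHELINE_BYTES
    return last_line - first_line + 1


def count_requests(start, end, align):
    """Total cacheline requests while a warp scans edge-list elements [start, end)."""
    elems_per_line = CACHELINE_BYTES // ELEM_BYTES
    # Aligned variant: begin at the cacheline boundary at or below `start`.
    # Lanes that fall before `start` are treated as masked off (no load issued).
    base = (start // elems_per_line) * elems_per_line if align else start
    total = 0
    for chunk in range(base, end, WARP_SIZE):
        lo = max(chunk, start)
        hi = min(chunk + WARP_SIZE, end)
        total += lines_touched(lo, hi)
    return total


def requests_for_graph(offsets, align):
    """Sum requests over every vertex of a CSR graph given its offsets array."""
    return sum(
        count_requests(offsets[v], offsets[v + 1], align)
        for v in range(len(offsets) - 1)
    )


if __name__ == "__main__":
    # Hypothetical CSR offsets for a three-vertex graph.
    offsets = [0, 5, 70, 100]
    print("unaligned:", requests_for_graph(offsets, align=False))
    print("aligned:  ", requests_for_graph(offsets, align=True))
```

In this model, a neighbor list spanning elements 5 to 70 costs 5 requests when each warp-wide access starts at the list's first element, because most accesses straddle two cachelines, and 3 requests when accesses start at a cacheline boundary. A real GPU's memory system and PCIe behavior are more involved, so the counts only indicate the direction of the effect.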


Updated: 2020-06-15
