Processing Grid-format Real-world Graphs on DRAM-based FPGA Accelerators with Application-specific Caching Mechanisms
ACM Transactions on Reconfigurable Technology and Systems (IF 3.1). Pub Date: 2020-06-03. DOI: 10.1145/3391920
Zhiyuan Shao, Chenhao Liu, Ruoshi Li, Xiaofei Liao, Hai Jin

Graph processing is one of the important research topics in the big-data era. To build a general graph-processing framework on a DRAM-based FPGA board with a deep memory hierarchy, one reasonable method is to partition a given big graph into multiple small subgraphs, represent the graph as a two-dimensional grid, and then process the subgraphs one after another to divide and conquer the whole problem. Such a method (grid-graph processing) stores the graph data in off-chip memory devices (e.g., on-board or host DRAM), which have large storage capacities but relatively small bandwidths, and processes the individual small subgraphs one after another using on-chip memory devices (e.g., FFs, BRAM, and URAM), which have small storage capacities but superior random-access performance. However, directly exchanging graph (vertex and edge) data between the processing units in the FPGA chip and the slow off-chip DRAMs during grid-graph processing leads to limited performance and excessive data transmission between the FPGA chip and the off-chip memory devices. In this article, we show that the performance of grid-graph processing on DRAM-based FPGA hardware accelerators can be effectively improved by leveraging the flexibility and programmability of FPGAs to build application-specific caching mechanisms, which bridge the performance gap between on-chip and off-chip memory devices and reduce the amount of data transmitted by exploiting the locality of data accesses. We design two application-specific caching mechanisms (i.e., vertex caching and edge caching) to exploit two types of locality (i.e., vertex locality and subgraph locality) that exist in grid-graph processing, respectively. Experimental results show that with the vertex caching mechanism, our system (named FabGraph) achieves up to 3.1× and 2.5× speedups for BFS and PageRank, respectively, over ForeGraph when processing medium-sized graphs stored in on-board DRAM.
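The two-dimensional grid representation described in the abstract can be illustrated with a minimal sketch: vertices are split into equal-sized intervals, and each edge (u, v) is bucketed into the grid block indexed by the intervals of its endpoints. The function and parameter names below are illustrative, not taken from the paper.

```python
from collections import defaultdict

def partition_to_grid(edges, num_vertices, p):
    """Partition an edge list into a p x p grid of subgraphs.

    Vertices are split into p equal intervals; edge (u, v) lands in
    block (u // interval, v // interval). Illustrative sketch only.
    """
    interval = (num_vertices + p - 1) // p  # ceiling division
    grid = defaultdict(list)
    for u, v in edges:
        grid[(u // interval, v // interval)].append((u, v))
    return grid

# Toy graph with 8 vertices partitioned into a 2 x 2 grid (interval = 4):
edges = [(0, 5), (1, 2), (6, 7), (3, 6)]
grid = partition_to_grid(edges, num_vertices=8, p=2)
# (0, 5) and (3, 6) fall into block (0, 1); (1, 2) into (0, 0); (6, 7) into (1, 1)
```

Each grid block is then small enough to be staged in on-chip memory while its edges are streamed from off-chip DRAM.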
With the edge caching mechanism, the extension of FabGraph (named FabGraph+) achieves up to 9.96× speedups for BFS over FPGP when processing large graphs stored in host DRAM.
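The vertex locality the abstract refers to can be modeled in a toy software sketch: when grid blocks are processed in row-major order, all blocks in a row share the same source-vertex interval, so caching that interval on-chip means it is loaded from DRAM once per row rather than once per block. The class and counter below are hypothetical illustrations, not the paper's hardware design.

```python
class VertexIntervalCache:
    """Toy model of an on-chip vertex cache: holds one source-vertex
    interval at a time. Processing grid blocks in row-major order
    reuses the cached interval across a whole row, so off-chip loads
    happen once per row instead of once per block."""

    def __init__(self, vertex_values, interval):
        self.vertex_values = vertex_values  # stands in for off-chip DRAM
        self.interval = interval
        self.cached_row = None
        self.block = []
        self.loads = 0  # counts simulated off-chip interval fetches

    def read(self, row, vid):
        if self.cached_row != row:  # miss: fetch the interval from "DRAM"
            lo = row * self.interval
            self.block = self.vertex_values[lo:lo + self.interval]
            self.cached_row = row
            self.loads += 1
        return self.block[vid - row * self.interval]

# 8 vertices, interval size 4, blocks visited in row-major order:
values = list(range(8))
cache = VertexIntervalCache(values, interval=4)
for (i, j), srcs in [((0, 0), [1]), ((0, 1), [0, 3]),
                     ((1, 0), [6]), ((1, 1), [7])]:
    for u in srcs:
        cache.read(i, u)
# cache.loads == 2: one interval load per row, not one per block
```

The edge caching mechanism of FabGraph+ follows the same idea at the subgraph level, keeping recently used grid blocks resident to exploit subgraph locality.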

Updated: 2020-06-03